Clustering, or cluster analysis, is the process of grouping objects so that the objects within a group are more similar to each other than to the objects in other groups. Each group is referred to as a cluster. Clusters can differ in size, and the number of clusters that will be generated is not known in advance. Clustering can also be used to discover relationships between clusters. It has numerous applications in computational biology, including sequence analysis, grouping of similar genes based on microarray data, and gene expression analysis. In this paper, we focus on using cluster analysis to group similar protein sequences.
Our work is based on the serial pclust algorithm. Although clustering is a powerful technique for bioinformatics, its use is limited in practice and it cannot be applied to every project, because clustering is a data-intensive process that can easily become compute-intensive as well. Serial implementations of clustering algorithms are therefore generally limited in performance, and they also face scalability issues.
For example, because of its memory requirements, the serial pclust algorithm does not scale beyond 15K-20K sequences on a desktop computer with 2 GB of RAM. Parallelization techniques can be used to improve such algorithms. Parallelization can not only improve run-time performance but also help achieve higher scalability with better results.
We leverage multi-core computing architectures to solve the protein clustering problem in parallel. In this project we use OpenMP, a shared-memory parallel programming API. OpenMP lets the programmer explicitly mark regions of code to be executed by multiple threads. A thread is a basic unit of execution, and multiple threads can be scheduled onto multiple cores so that several tasks execute simultaneously. We chose OpenMP because it is easier to use than conventional alternatives such as POSIX threads (Pthreads), a lower-level multi-threading library, and MPI, which is based on message passing between processes rather than on shared-memory threads. When writing parallel programs, care must be taken that all threads are properly synchronized.
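As a minimal sketch of the directive-based style described above, and not the pclust implementation itself, the following C fragment distributes an all-pairs comparison across threads with a single OpenMP directive. The function names `toy_identity` and `count_similar_pairs`, the equal-length requirement, and the identity measure are assumptions made purely for illustration. Compiled without OpenMP support, the directive is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical similarity measure: the fraction of positions at which two
   equal-length sequences carry the same residue. A stand-in only; real
   protein comparison would use an alignment-based score. */
static double toy_identity(const char *a, const char *b)
{
    size_t len = strlen(a);
    assert(len > 0 && len == strlen(b));
    size_t same = 0;
    for (size_t i = 0; i < len; i++)
        if (a[i] == b[i])
            same++;
    return (double)same / (double)len;
}

/* Count the sequence pairs whose identity is at least `threshold`. */
long count_similar_pairs(const char *const *seqs, int n, double threshold)
{
    long hits = 0;
    /* Iterations of the outer loop are divided among the threads.
       reduction(+:hits) gives each thread a private counter and sums the
       counters when the loop ends, so no explicit locking is needed.
       schedule(dynamic) evens out the triangular workload. */
    #pragma omp parallel for schedule(dynamic) reduction(+:hits)
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (toy_identity(seqs[i], seqs[j]) >= threshold)
                hits++;
    return hits;
}
```

With GCC or Clang, the directive takes effect when the file is compiled with `-fopenmp`; the result is the same with or without it, only the run time differs.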
Improper or incorrect synchronization may cause race conditions that lead to incorrect results. OpenMP provides various synchronization constructs, such as barrier, atomic, and critical. However, these constructs carry a certain amount of overhead, so their use within the code should be kept to a minimum. The modified pclust algorithm stands out from the conventional pclust algorithm not only because it provides better performance and better output, but also because it offers better output visualization through bar graphs and pie charts. We have deployed our code on the cloud, which helps us achieve better security and flexibility of use. The software can be accessed from any client device at any location.
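The atomic and critical constructs mentioned above can be illustrated with a short, hedged C sketch. It is not taken from our pclust code: the function `tally_clusters` and its parameters are hypothetical, and it assumes every entry of `labels` lies in the range 0 to k-1. Several threads update shared data, so each update is protected by the lightest construct that suffices.

```c
#include <assert.h>

/* Tally how many sequences fall into each of k clusters and record the
   largest cluster id that occurs. labels[i] is the (hypothetical) cluster
   id assigned to sequence i and must lie in [0, k). */
void tally_clusters(const int *labels, int n, int k, long *sizes, int *max_label)
{
    assert(n >= 0 && k > 0);
    for (int c = 0; c < k; c++)
        sizes[c] = 0;
    *max_label = -1;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int c = labels[i];

        /* atomic: sufficient for a single scalar update and usually
           cheaper than a critical section. */
        #pragma omp atomic
        sizes[c]++;

        /* critical: the compare-and-update spans more than one memory
           operation, so only one thread at a time may execute it. */
        #pragma omp critical
        {
            if (c > *max_label)
                *max_label = c;
        }
    }
    /* The parallel loop ends with an implicit barrier, so all updates
       are complete when the function returns. */
}
```

Removing either directive would leave a race on the shared data. Where the pattern allows it, a reduction clause avoids the per-iteration synchronization cost altogether, which is one way to keep the overhead noted above small.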