Accelerated Multiple Sequence Alignment
Biologists and other researchers use multiple sequence alignment (MSA) as a fundamental analysis method to find similarities among nucleotide (DNA/RNA) or amino acid (protein) sequences. The compute time for an optimal MSA grows exponentially with respect to the number of sequences. Consequently, producing timely results on large problems requires more efficient algorithms and the use of parallel computing resources. Reconfigurable computing hardware, such as Field-Programmable Gate Arrays (FPGAs), provides one approach to the acceleration of biological sequence alignment. Other acceleration methods typically encounter scaling problems that arise from the overhead of inter-process communication and from the lack of parallelism. Reconfigurable computing allows a greater scale of parallelism using many fine-grained custom processing elements that have a low-overhead interconnect.
In this work, a new method for accelerating the third stage of progressive alignment is used that reduces subgroups of aligned sequences into discrete profiles before they are pairwise aligned on the accelerator. Our pairwise alignment algorithm produces the required traceback information and does not limit the sequence length by the number of processing elements (PEs) or by the amount of block RAM on the FPGA accelerator. Other hardware acceleration methods are inadequate for use in the third stage because the sequence length is severely limited or only similarity scores are computed. Using an FPGA accelerator, an overall speedup of 150 has been demonstrated on a large data set when compared to a 2.4 GHz Core2 processor [BMC Bioinformatics]. A version of MUSCLE with discrete profile alignment is shown to have comparable quality to other popular MSA programs. Our parallel algorithm and architecture accelerates large-scale MSA with reconfigurable computing and allows researchers to solve the larger problems that confront biologists today.
More information and source code is available at the following links. The bibliography includes parallel methods that use multiprocessor, vector, Cell BE, GPU, and FPGA systems.
Documents and Bibliography
Accelerated Large-Scale Multiple Sequence Alignment
Parallel Multiple Sequence Alignment: An Overview
Bibliography on Parallel Multiple Sequence Alignment
Source Code and Data Sets
MSA Programs - download v3.8.31b
- MUSCLE - enhanced version (changes)
- MUDISC - forms discrete profiles before pairwise alignment in software
- MUFPGA - forms discrete profiles before FPGA accelerated pairwise alignment
MSA Utilities - download v1.1
- mscore - Multiple sequence alignment scoring utility
- fac - FASTA file change utility (e.g. reformat, strip gaps, upper/lower case)
- fas - FASTA file statistics utility (e.g. symbol counts, avg. & N50 lengths)
Large Virus Data Sets - download (16MB)
Data Set | Sequences | Avg. Length |
---|---|---|
Influenza HA | 12,104 | 1,717 |
HIV | 2,144 | 9,019 |
Corona | 400 | 29,531 |
Herpes | 142 | 167,043 |