AlignAlign: Software for optimally aligning alignments

`Align ----- ----- Align`

AlignAlign: Software for optimally aligning alignments

Travis Wheeler
Dean Starrett
John Kececioglu

Department of Computer Science
The University of Arizona

Alert

The functionality of AlignAlign is now being provided by the software tool Opal. Because Opal is the active implementation of the algorithm for optimally aligning alignments, AlignAlign is no longer being maintained.

Opal can optimally align two alignments, and is faster than the implementation in AlignAlign because of a number of more recent speedups to the algorithm. To align alignments with Opal, run the command:

java -jar Opal.x.jar --in alignment1.fasta --in2 alignment2.fasta

Description

The software AlignAlign takes as input two alignments (of disjoint sequences) and outputs one alignment (of all the sequences) that contains both input alignments and has optimal score. In the multiple sequence alignment that is output, the columns of the two input alignments are optimally aligned under the sum-of-pairs scoring function, correctly accounting for gaps. This software is free for noncommercial use. (The terms of use are below.)

AlignAlign implements the algorithm of Kececioglu and Starrett from RECOMB 2004. (The citation is below.) This algorithm computes an optimal alignment of two multiple-sequence alignments under the sum-of-pairs objective with linear gap-costs, which is NP-complete. (In the sum-of-pairs objective, the score of a multiple-sequence alignment is the sum of the scores of the two-sequence alignments that it induces on all pairs of sequences. With linear gap costs, a run of x insertions or deletions in a pairwise alignment costs ax + b, where a and b are constants.) Extensive experiments show that on biological data the algorithm is fast and runs in quadratic time. More precisely, the theoretical worst-case time for the algorithm is O(5^min(k,n) max(k,n)²) for two alignments each with k sequences and n columns, while in practice the observed running time is O(k²n²).

AlignAlign implements two versions of the algorithm: one optimized for space, the other optimized for time. The space-optimized version computes an optimal alignment in space linear in the number of input columns. The time-optimized version computes an optimal alignment in quadratic space, but runs an order of magnitude faster in practice. Both can be invoked from AlignAlign. In addition, AlignAlign supports non-uniform sequence-pair weights, as well as terminal gap penalties that differ from the default penalties.

Example

Below is an example illustrating AlignAlign on a multiple alignment from the PALI database of structural protein alignments. The example shows part of an optimal alignment of two 10-sequence subalignments from the Kunitz STI inhibitor dataset (accession number 409). The alignment was computed using optimal parameter values (for both amino acid substitution scores and gap-open and -extension penalties) obtained by the algorithm of Kececioglu and Kim (RECOMB 2006). The optimal alignment of the two alignments (which costs 528533) improves on the score of the original PALI benchmark (which costs 529280). Computing the alignment took 0.2 seconds on a 2.4 GHz Pentium processor with 512 MB RAM.

... 1avwb_409 31 AA-PTGNE--R-CPLTVVQSRNELDKGIGTIISSP----Y---R-IRFIAEGHPLSLKFD-SFAVIMLCVG-I--P--TE-WSVV--E-D-LP 100 ... ... 1tie_409 31 LA-KTGEE--T-CPLTVVQSPNELSDGKPIRIESR----L---R-SAFIPDDDKVRIGFA-Y-A----PKCAP--S--PW-WTVV-------- 92 ... ... 1eyla_409 35 TA-KTGNE--P-CPLTVVRSPNEVSKGEPIRISSQ----F---L-SLFIPRGSLVALGFA-N-P----PSCAA--S--PW-WTVV----D-SP 99 ... ... SPR1_IPOBA/4 32 LA-HLDM--MSKCATDVIVSPNDLDNGDPITITPA----TADPE-STVVMASTYQTFRFN-I-ATNKL--CVN--N--VN-WGIQ--H-D-SA 104 ... ... KTI1_SOYBN/2 31 VD-STGKE--I-CPLTVVQSPNELDKGIGLVFTSP----L---H-ALFIAERYPLSIKFG-SFAVITLCAG-M--P--TE-WAIV----E-RE 99 ... ... ITRA_SOYBN/2 30 AA-PTGNE--R-CPLTVVQSRNELDKGIGTIISSP----Y---R-IRFIAEGHPLSLKFD-SFAVIMLCVG-I--P--TE-WSVV--E-D-LP 99 ... ... IT1A_PSOTE/2 31 AA-ATGTE--T-CPLTVVRSPNEVSVGEPLRISSQ----L---R-SGFIPDYSVVRIGFA-N-P----PKCAP--S--PW-WTVV--E-D-QP 96 ... ... IDE3_ERYCA/1 35 LA-KTGEE--T-CPLTVVQSPNELSDGKPIRIESR----L---R-SAFIPDDDKVRIGFA-Y-A----PKCAP--S--PW-WTVV--E-D-EQ 96 ... ... IECI_ERYVA/2 31 AA-RTGKE--T-CPLTVVQSPFEVSNGEPIRIASQ----F---L-STFIPDGSPYAIGFA-N-P----PSCAA--S--PW-WTVV----E-TS 95 ... ... ICW3_PSOTE/2 31 TA-KTGNE--P-CPLTVVRSPNEVSKGEPIRISSQ----F---L-SLFIPRGSLVALGFA-N-P----PSCAA--S--PW-WTVV----D-SP 95 ... || ||||| |||||||||||||||||||||||||| | | ||||||||||||||| | |||||||||| | || |||| | | || ... 1avac_409 35 MA-PGHG--RH-CPLFVSQDPNGQHDGFPVRITPYGV--A---PSDKIIRLSTDVRISFR-A-Y---TTC-LQ--S--TE-WHID--S-ELAA 105 ... ... 1wba_409 31 VV-ATGNENPE-DPLSIVKST-RNI-MYATSISSEDKTPP---Q-PRNILENMRLKINFA-T-D---P--H-K--G--DV-WSVV----DFQP 99 ... ... WIN3_POPSP/3 25 VVNATIN--PI-CNSDVILST-GIE-GLPVTFSPV--INS---T-DGVIREGTLITVSFD-A--S---TCGMA--GVTPM-WKIG--F-NSTA 95 ... ... IAAS_ORYSA/4 31 MA-PRRV--IP-CPLLVAQETDERRKGFPVRFTPWGG--A---ASPRTLRVSTDVRIRFN-A--A---TLCVQ--S--TE-WHVG--D-EPLT 100 ... ... IAAS_HORVU/2 31 MA-PGHG--RH-CPLFVSQDPNGQHDGFPVRITPYGV--A---PSDKIIRLSTDVRISFR-A-Y---TTC-LQ--S--TE-WHID--S-ELAA 100 ... ... ASP_THECC/30 32 LGRATGQ---S-CPEIVVQRRSDLDNGTPVIFSNADS------K-DDVVRVSTDVNIEFV-P---IRDRLCST--S--TV-WRLD--NYDNSA 102 ... ... MIRA_RICDU/3 31 VS-ATTPNGTFVCPPRVVQTRKEVDHDRPLAFFPE-N--P---K-EDVVRVSTDLNINFS-A-F--MPCRWTS--S--TV-WRLDKYD-E-ST 104 ... ... IT2_PSOTE/2- 31 AA-KVGNE--D-CPLTVVKSLDENSNGEPIRIASR--L-----R-STFIPEYSLVNLGFA-D--P---PKCAP--S--PF-WTVV--K-D-QS 96 ... ... ITRY_ACACO/2 31 LA-KTGDE--S-CPLTVVQAQSETKRGLPAVIWTP----P---K-IAILTPGFYLNFEFQPR-D---LPACLQKYS--TLPWKV---E-G--- 98 ... ... ALB1_PSOTE/4 29 VV-ATGNENPE-DPLSIVKST-RNI-MYATSISSEDKTPP---Q-PRNILENMRLKINFA-T-D---P--H-K--G--DV-WSVV----DFQP 96 ...

Complete alignment available here

The input alignments and parameter values for this example are provided with the software distribution below.

Distribution

The source code is in C, distributed as Unix tar and zip files. All noteworthy uses of AlignAlign should cite the paper referenced below.

Terms of use Please read the terms of use before downloading the software

Version    1.0.0 (17 May 2007)

Source code alignalign-1.0.0.tar.gz alignalign-1.0.0.zip
To unpackage the files, run
   gzip -dc alignalign-1.0.0.tar.gz | tar -xvf -
or
   unzip alignalign-1.0.0.zip

Documentation A brief man page

Citation John Kececioglu and Dean Starrett,
"Aligning alignments exactly,"
Proceedings of the 8th ACM Conference on Research in Computational Molecular Biology (RECOMB), pages 85-96, 2004.

Contact Send comments to alignalign@cs.arizona.edu

Acknowledgements

This research was supported by the US National Science Foundation through a grant from the Biological Databases and Informatics Program, Grant DBI-0317498, "Robust Tools for Biological Sequence Analysis."

17 May 2007 | John Kececioglu | Travis Wheeler | alignalign@cs.arizona.edu | alignalign.cs.arizona.edu