Succinct representation of massive alignment of highly similar sequences
A FASTA alignment can be processed using the launch.py
script:
The construction takes about 1 h and 450 MB for 200k sequences of SARS-CoV-2 genomes. The resulting file is 10-15MB. The construction time and memory grows linearly with the number of sequences.
Now the processed file can be loaded directly in an interactive Python shell to explore the alignment.
In [1]: msa = SuccinctMultipleAlignment.load("srmsa")
In [2]: len(msa)
Out[2]: 31021
In [3]: msa.get_nb_sequences()
Out[3]: 199716
In [4]: consensus = msa.get_consensus()
In [5]: consensus[1000:1200]
Out[5]: 'TGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGCAAAGAAATTTGACACCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAACCAAGGGTTGAAAAGAAAAAGCTTG'
In [6]: msa[1000].nt_pos()
Out[6]:
defaultdict(list,
{'C': [(37908L, 37908L),
(37910L, 37910L),
(37919L, 37919L),
(54110L, 54110L),
(93196L, 93196L),
(98107L, 98107L),
(98737L, 98737L)],
'T': [(0, 37907L),
(37909L, 37909L),
(37911L, 37918L),
(37920L, 54109L),
(54111L, 93195L),
(93197L, 98106L),
(98108L, 98736L),
(98738L, 199715)]})
In [7]: msa[3200].nt_frequency()
Out[7]:
{'-': 2.0028440385347193e-05,
'A': 0.9998998577980732,
'G': 8.011376154138877e-05}
The documentation describe the available functions.