Succinct representation of massive alignment of highly similar sequences

A FASTA alignment can be processed using the launch.py script:

python launch.py --file sequences.fasta --ncols 10000 --save sequences.srmsa

The construction takes about 1 h and 450 MB for 200k sequences of SARS-CoV-2 genomes. The resulting file is 10-15MB. The construction time and memory grows linearly with the number of sequences.

Now the processed file can be loaded directly in an interactive Python shell to explore the alignment.

In [1]: msa = SuccinctMultipleAlignment.load("srmsa")

In [2]: len(msa)
Out[2]: 31021

In [3]: msa.get_nb_sequences()
Out[3]: 199716

In [4]: consensus = msa.get_consensus()

In [5]: consensus[1000:1200]
Out[5]: 'TGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGCAAAGAAATTTGACACCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAACCAAGGGTTGAAAAGAAAAAGCTTG'

In [6]: msa[1000].nt_pos()
Out[6]: 
defaultdict(list,
            {'C': [(37908L, 37908L),
              (37910L, 37910L),
              (37919L, 37919L),
              (54110L, 54110L),
              (93196L, 93196L),
              (98107L, 98107L),
              (98737L, 98737L)],
             'T': [(0, 37907L),
              (37909L, 37909L),
              (37911L, 37918L),
              (37920L, 54109L),
              (54111L, 93195L),
              (93197L, 98106L),
              (98108L, 98736L),
              (98738L, 199715)]})

In [7]: msa[3200].nt_frequency()
Out[7]: 
{'-': 2.0028440385347193e-05,
 'A': 0.9998998577980732,
 'G': 8.011376154138877e-05}

The documentation describe the available functions.