Question

Removing unmatched Ns from query sequence based on alignment

0

Entering edit mode

6.6 years ago

Michael G ▴ 80

Assume a contig of draft genome assembly that contains several regions of unknown sequence and length, which are indicated by stretches of 'N'. Further assume the presence of a contig of the same genome region, whose source is a complete and trustworthy genome assembly of the same organism. Finally, assume that the contig of the draft genome and the contig of the final genome have been properly aligned (i.e., positional homology is established).

The result would be something like this:

$ cat alignm.fas
>contig_draft
ACGTNNNNNNNNNNACGTNNNNNNNNNNACGT
>contig_final
ACGTACGT------ACGT------ACGTACGT

I would like to remove those 'N' from the draft contig that are matched by a gap character in the alignment (n=12 in the above example), but retain those that are matched by a nucleotide (n=8 in the above example).

In the above example, the desired output would be a new draft contig sequence with a length of 20 nucleotides, 8 of which are 'N'.

$ cat alignm.fas | commands_sought
>contig_draft
ACGTNNNNACGTNNNNACGT

Is there any code (e.g., on GitHub) already available to conduct the above-mentioned procedure?

alignment sequence contig genome • 1.5k views

ADD COMMENT • link 6.6 years ago by Michael G ▴ 80

0

Entering edit mode

You might have a look at some of the MSA trimming tools? eg. trimAl or such?

I assume they will not be default give what you want , but with a bit of parameter tuning you might get there

ADD REPLY • link 6.6 years ago by lieven.sterck 15k

0

Entering edit mode

Hi, a novice attempt with t_coffee would be

t_coffee -other_pg seq_reformat -in alignm.fas -action +rm_gap 1
>contig_draft
ACGTNNNNACGTNNNNACGT

contig_final ACGTACGTACGTACGTACGT

But this will just remove all gapped columns also those with a non-N character which is matched by a gap.

ADD REPLY • link 6.6 years ago by microfuge ★ 1.9k

0

Entering edit mode

Nice, I was not aware t_coffee could do that himself.

I would give this a try and see where you end up. If the diff would be only the newly added sequence it will work perfectly. In the other case I think it might still work in most cases, given that there is only a few bases difference they will be aligned as a mismatch rather than a gap (tuning the alignment params might anyway give this result is not already the case)

ADD REPLY • link 6.6 years ago by lieven.sterck 15k