Assume a contig of draft genome assembly that contains several regions of unknown sequence and length, which are indicated by stretches of 'N'. Further assume the presence of a contig of the same genome region, whose source is a complete and trustworthy genome assembly of the same organism. Finally, assume that the contig of the draft genome and the contig of the final genome have been properly aligned (i.e., positional homology is established).
The result would be something like this:
$ cat alignm.fas
>contig_draft
ACGTNNNNNNNNNNACGTNNNNNNNNNNACGT
>contig_final
ACGTACGT------ACGT------ACGTACGT
I would like to remove those 'N' from the draft contig that are matched by a gap character in the alignment (n=12 in the above example), but retain those that are matched by a nucleotide (n=8 in the above example).
In the above example, the desired output would be a new draft contig sequence with a length of 20 nucleotides, 8 of which are 'N'.
$ cat alignm.fas | commands_sought
>contig_draft
ACGTNNNNACGTNNNNACGT
Is there any code (e.g., on GitHub) already available to conduct the above-mentioned procedure?
You might have a look at some of the MSA trimming tools? eg. trimAl or such?
I assume they will not be default give what you want , but with a bit of parameter tuning you might get there
Hi, a novice attempt with t_coffee would be
But this will just remove all gapped columns also those with a non-N character which is matched by a gap.
Nice, I was not aware t_coffee could do that himself.
I would give this a try and see where you end up. If the diff would be only the newly added sequence it will work perfectly. In the other case I think it might still work in most cases, given that there is only a few bases difference they will be aligned as a mismatch rather than a gap (tuning the alignment params might anyway give this result is not already the case)