How do de novo genome/transcriptome assemblers treat ambiguous bases?
10.1 years ago
Rohit ★ 1.5k

Dear Biostarists,

I have a basic yet important question: how do assemblers treat ambiguous bases (N's) so as to avoid erroneous contigs?

I read that Velvet treats each N as an A, but what about other de novo genome assemblers such as SOAPdenovo2 (open-source black box), CLC (commercial mystique), ABySS, ALLPATHS, Minia and others?

How do de novo transcriptome assemblers such as Trinity, Trans-ABySS, SOAPdenovo-Trans, Rnnotator and others treat them?

NGS Transcriptome-assembly Genome-assembly denovo • 3.2k views

I suspect that this will depend entirely on the tool and that you'll have to ask the authors or read the code (this may not be mentioned in the papers) to find out.

10.1 years ago
Rohit ★ 1.5k

Dear all,

I can try to answer the question now. Note that I have written "probably" where a tool depends on another tool.

Velvet (probably Oases too) - Replace N's with A

Abyss (probably Trans-Abyss too) - Replace N's based on consensus sequences that fill that base, consensus sequence of 90% identity through DIALIGN-TX aligner

SOAPdenovo2, SOAPdenovo-Trans - Replace N's with G

ALLPATHS - Ambiguous bases are saved as random bases

Rnnotator - Uses Velvet, so it probably replaces N's with A

IDBA - According to the authors, sequencing depth is taken into account during assembly: "Basically, we try to correct the graph based on the sequencing depth. It identifies similar paths and removes paths with very low sequencing depth compared to their neighbors. Note that it doesn't introduce new k-mers in this process. The assumption is that the actual sequence must appear in the graph and have higher depth."

Minia - If there are ambiguous bases in the input, i.e. N's in reads, then Minia cuts reads around them: precisely, it discards any k-mer containing at least one N.
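
As a rough illustration of the k-mer filtering described for Minia, here is a minimal Python sketch. It is not Minia's actual code; the function name `kmers_without_n` and the example read are made up for illustration, and it only shows the idea of dropping every k-mer that contains an N.

```python
def kmers_without_n(read: str, k: int) -> list:
    """Return the k-mers of `read`, in order, skipping any k-mer with an N.

    Hypothetical helper for illustration only; not taken from Minia.
    """
    read = read.upper()
    kept = []
    # Slide a window of width k across the read.
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Discard the k-mer if it contains at least one ambiguous base.
        if "N" not in kmer:
            kept.append(kmer)
    return kept


if __name__ == "__main__":
    # The single N removes every window that overlaps it, which
    # effectively cuts the read into the parts on either side.
    print(kmers_without_n("ACGTNACGTA", 4))
```

With k=4, the one N in the example read removes the four windows that overlap it, so only k-mers from the flanking segments survive and no k-mer spans the ambiguous position.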

Trinity - Ignored first, later treated as mismatches

Non-[GATC] characters will be ignored during the early phases of Trinity (Jellyfish, Inchworm and Chrysalis, I think), and then likely treated as mismatches during the final Butterfly phase. Trinity simply isn't compatible with them for the most part, though it shouldn't error out as a result of such characters.

CLCbio - I do not have a commercial license, so I do not receive support. Those with a commercial license should try asking them, as their code is unreadable.
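
To make the difference between the substitution policies listed above concrete (fixed A as reported for Velvet, fixed G as reported for SOAPdenovo2, random bases as reported for ALLPATHS), here is a small Python sketch. It is not taken from any of these assemblers; the function `replace_n`, its `policy` argument and the `seed` parameter are hypothetical names used only for illustration.

```python
import random
from typing import Optional


def replace_n(read: str, policy: str = "A", seed: Optional[int] = None) -> str:
    """Replace every N in `read` according to `policy`.

    policy: "A", "C", "G" or "T" for a fixed base, or "random".
    Hypothetical helper for illustration; not code from any assembler.
    """
    if policy == "random":
        rng = random.Random(seed)  # seeded so results can be reproduced
        return "".join(rng.choice("ACGT") if b in "Nn" else b for b in read)
    if policy in ("A", "C", "G", "T"):
        return "".join(policy if b in "Nn" else b for b in read)
    raise ValueError("unknown policy: " + repr(policy))


if __name__ == "__main__":
    read = "ACNNT"
    print(replace_n(read, "A"))          # fixed-base policy, N -> A
    print(replace_n(read, "G"))          # fixed-base policy, N -> G
    print(replace_n(read, "random", 1))  # random policy with a fixed seed
```

A fixed-base policy turns a stretch of N's into a homopolymer run, whereas the random policy does not, which is one way to guess from the output which kind of replacement a tool applied.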


Hi, when looking at an Abyss assembly there are lots of cases where there is a long run of a single base, where I presume N's should be, so I assumed Abyss replaced N's with a random base. Could you explain what is meant by "Replace N's based on consensus sequences that fill that base, consensus sequence of 90% identity through DIALIGN-TX aligner" - how could this result in what I am seeing? Hope you can help, thanks!


DIALIGN-TX is first mentioned when the algorithm runs PopBubbles. This is already at the assembly stage, where N's are replaced based on other sequences that are 90% similar to that particular path.
I guess that if there are no similar sequences, random bases are assigned.
