Question

Use MEGAN To Parse SAM File

12.7 years ago

Shuixia100 ▴ 120

Dear all,

I have a question about using MEGAN4 to parse a SAM file.

What I want to do is get taxonomic and functional annotation of my raw reads against the nr database. The raw read set is too big (11 million reads in total, 100 bp each) to BLAST directly against nr. So I took a different approach: I first assembled my reads and predicted ORFs, which I could BLAST easily, and then aligned my reads to the ORFs. Now I want to use MEGAN to parse the alignment of reads to ORFs and thus get the annotation of the raw reads.

Here is what I did exactly:

- first assembled the reads into contigs
- then used MetaGeneMark to find open reading frames (ORFs) whose size is suitable for a BLAST search against the nr database
- ran BLAST with the ORFs against the nr database
- imported the ORF BLAST result into MEGAN using default parameters and successfully got the RMA file
- used the Export-Assignments To CSV function of MEGAN4 to generate a synonyms file with two tab-separated columns: the first is the ORF name and the second is the taxonomy ID
- used bowtie to align my raw reads to the ORFs and got the SAM file that I want to parse
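For reference, the last step (transferring each ORF's taxon to the reads aligned to it) amounts to a join between the SAM file and the two-column synonyms file. The following is only a minimal sketch of that idea in Python, not what MEGAN does internally; the function names and the `unmapped` / `no_taxon` labels are made up for illustration, and it assumes a standard tab-separated SAM file with the ORF name in the third column.

```python
import csv
import io
from collections import Counter

# SAM FLAG bits (from the SAM specification)
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def load_orf_to_taxon(handle):
    """Read a two-column, tab-separated file: ORF name, taxon ID."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


def count_reads_per_taxon(sam_handle, orf_to_taxon):
    """Count primary alignments per taxon ID of the ORF they hit.

    Unmapped reads are counted under the hypothetical label 'unmapped',
    reads hitting an ORF without a taxon under 'no_taxon'.
    """
    counts = Counter()
    for line in sam_handle:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag, ref_name = int(fields[1]), fields[2]
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # avoid counting the same read twice
        if flag & UNMAPPED or ref_name == "*":
            counts["unmapped"] += 1
        else:
            counts[orf_to_taxon.get(ref_name, "no_taxon")] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory example; real input would be opened with open(...).
    synonyms = io.StringIO("orf_1\t562\norf_2\t1280\n")
    sam = io.StringIO(
        "@SQ\tSN:orf_1\tLN:900\n"
        "r1\t0\torf_1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    )
    print(count_reads_per_taxon(sam, load_orf_to_taxon(synonyms)))
```

With paired-end data each mate is counted separately in this sketch, so the totals are alignments rather than fragments.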

My problem: the user manual says that when a SAM file is imported together with the synonyms file, MEGAN should parse the SAM file. But what I get is that all my reads are assigned to two big groups, one called "No hits" and the other "Low complexity", like this:

[screenshot of the MEGAN result]

I have tried it several times and always get the same result. Does anyone know how to fix this? Or is there an alternative method to parse the SAM file?

updated 12.7 years ago by Michael 55k • written 12.7 years ago by Shuixia100 ▴ 120

Just make sure that the ORF-to-taxonomy mapping (synonyms) is actually used during the SAM file import.

From your description it is not clear that you specified the synonyms file as a parameter during the import phase.

12.7 years ago by Istvan Albert 102k

Thanks for your comment. Yes, I did use the synonyms file during parsing. I have also written to the MEGAN developers; they told me that a bug was causing this kind of problem and that they have released an updated version of MEGAN which can now parse SAM files.

12.6 years ago by Shuixia100 ▴ 120

score 0 · Answer 1 · 2012-04-10

There are errors in your workflow. First, MEGAN is meant for raw reads, so don't assemble the reads; this avoids chimeric assemblies and lets MEGAN count the number of reads. Second, I wouldn't bias your result towards predicted coding regions, because you will be losing a lot of information, and I cannot imagine gene prediction on fragments working well. Then, if you are comparing DNA against NR you need to use blastx; that is probably the reason for not getting any taxa. However, if you are using bacterial sequences, better use NT with blastn, otherwise you are not picking up intergenic and non-coding sequences. In my experience, a comparison at the amino-acid level (blastx against a protein database, or tblastx) is best for viral metagenomes, where the coverage of the natural variability is so low that the closest related genome is too distant at the nt level to find anything.

Thus, try with raw reads and BLAST against nt, then add the SAM file, and it should work much better. I just saw that you consider your reads file too big. Well, that is not exactly true: you just need enough compute power, split up your files to run BLAST in multiple processes, and wait. Alternatively, you may reduce the database to bacterial taxa only, since it is not necessary to have the full nt, or even use a database of 16S sequences only, e.g. SILVA. Even drawing a manageable random subset of the input reads and discarding the rest will give you a less skewed analysis than the non-standard workflow you are proposing.
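Both suggestions (random subsampling and splitting the input for parallel BLAST jobs) are easy to script. Below is a minimal Python sketch, assuming plain 4-line FASTQ records; the function names are made up for illustration, and the subset is drawn with reservoir sampling so the 11 million reads never have to be held in memory at once.

```python
import io
import itertools
import random


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    while True:
        record = tuple(line.rstrip("\n") for line in itertools.islice(handle, 4))
        if len(record) < 4:  # end of file (or a truncated last record)
            return
        yield record


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def chunked(records, size):
    """Yield lists of at most `size` records, e.g. one list per BLAST job."""
    iterator = iter(records)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Tiny in-memory example; a real run would use open("reads.fastq").
    fastq = io.StringIO("".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(10)))
    subset = reservoir_sample(read_fastq(fastq), k=4, seed=42)
    for n, chunk in enumerate(chunked(subset, 2)):
        print(f"chunk {n}: {[rec[0] for rec in chunk]}")
```

Each chunk can then be written to its own FASTA file and passed to a separate BLAST process; fixing the seed keeps the subset reproducible.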

Metagenomics using BLAST is a very resource-intensive analysis; you can try CARMA instead and see if it uses fewer resources. Assembling the reads is not a viable workflow IMO, because of the problems mentioned above.