Hi,
I'm having a problem with MEGAN 5. I'm working with some quite large RMA files (approx. 40 GB), built from the trimmed reads and a BLAST run. The problem is that my (poor little) PC always hangs when I try to open them. So, I was wondering if there's a way to parse a file via the MEGAN command line, dividing it into "Bacteria", "Archaea" & "Viruses" or whatever, so the files become a little bit smaller. Thanks!
Thanks Charles for replying. Yes, I have the 16S sequences and I've already used QIIME. But now I'm working on the WGS reads and I wanted to do another taxonomic analysis (because by using 16S, you leave behind viruses and eukaryotes). That's why I tried MEGAN. For the BLAST part, I used DIAMOND, which is waaay faster than regular BLAST (although each search still takes almost a day). I got that one covered, but the results I get are killing my PC (still waiting for budget approval to buy a new one...)
Ok - I haven't tested any of the following programs, but it might help to use a different method to quantify species abundances that doesn't depend upon your BLAST file:
http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x
http://www.ccb.jhu.edu/software/centrifuge/
This one is really for transcriptomes, but it might still work:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0969-1
http://taxonomer.iobio.io/
Hi, did you find a solution to this problem? I am facing the same issue. My RMA files are only about 2 to 3 GB in size, but my PC still hangs. If you found a solution, please explain it, because that would be really helpful.
Thanks
I am not sure what to tell you about RMA files, since I don't typically work with MEGAN.
In general, there was some public eDNA re-analysis where I tried out various options:
https://github.com/cwarden45/PRJNA513845-eDNA_reanalysis/blob/master/metagenomics/README.md
Running MEGABLAST does take a while, even when prioritizing the more highly expressed sequences (unless you restrict the set even further). In that situation, I was specifically looking for artifacts, so I was trying to look at less common things.
However, in other situations, maybe looking only at sequences present at >1% (or even >1/10,000, for identical sequences) might help?
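A minimal sketch of that kind of abundance filter, assuming the reads are available as plain sequence strings (for example from a FASTA file). The function names and the `reads.fasta` path are hypothetical, and the thresholds are just the two mentioned above:

```python
from collections import Counter


def read_fasta(path):
    """Yield one sequence string per FASTA record (headers are ignored)."""
    seq = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)


def abundant_sequences(seqs, min_fraction=0.01):
    """Collapse identical sequences and keep those above min_fraction.

    Returns (sequence, count, fraction) tuples, most abundant first.
    """
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]


# Toy in-memory example; for real data, pass read_fasta("reads.fasta") instead.
reads = ["ACGT"] * 97 + ["TTTT"] * 2 + ["GGGG"]
for seq, n, frac in abundant_sequences(reads, min_fraction=0.01):
    print(seq, n, f"{frac:.4f}")
```

Only the sequences that pass the filter would then need to go through the slow search step, which should also shrink whatever gets loaded downstream. Passing `min_fraction=1/10000` gives the looser cutoff.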
Also, the SRA provides some taxonomy assignments for deposited data (assuming that you don't have human reads without consent for public deposit, or that you filter out the human reads). For that eDNA project, you can see some selected notes here:
https://github.com/cwarden45/PRJNA513845-eDNA_reanalysis/blob/master/extended_summary.xlsx (if you download the file to view locally).
In other words, if you haven't already deposited your data in the SRA, doing so is generally good practice and might help with the analysis in some situations.