I am working around to perform meta genomic analyses, for which I need to align my paired end reads (dna) to bacterial genomes using blast and based on the identity perc filter need to extract sequences with 90% or higher identity and use it for downstream analyses. My approach concatenate paired end reads and align. Any thoughts?
I tried BWA mem, but I see only 0.01% reads aligning to the bacterial genome.
-I am aligning it to all the bacterial genomes available.
- I am looking for bacterial, viral reads (aligned original fastq which are from human, I am looking for bacterial and viral hits in the reads that did not map to human).
- Yes I believe ecoli (no sure which strain- supposedly pathogenic)
-I still have not got in here yet..I am thinking to use PEAR.
Lastly--BBmap is an awesome tool...:)
@genomax2
-yes i have filtered adaptors , not sure contaminants (SeqClean is on my mind--any thoughts)
-probably yes, I did push it through KRAKEN from JHU, the ratio of classified to unclassified is to less. So my question is what are the sequences that did not map has to be something...
Do you have any reason to think that your pairs are in the same orientation and the 5-end of the second read continues exactly from the 3-end of the first read? No? Then concatenating the pairs is most certainly a false approach. Joining pairs is not the same thing than concatenation. Moreover, I think it's pretty likely that your pairs don't even overlap (if they do, that was rather likely a bad decision).
In my opinion the best approach is to first assemble a metagenome (IDBA-UD always worked for me), then predict proteins from the metagenome (prodigal always worked for me), and then blastp the predicted proteins against a database that suits your needs (e.g. nr, refseq protein or kegg).
Is there a way I can check if 5-end of the second read starts from 3 end of the first read?
Assembling a meta genome - not sure if it would work in my case like i said these are reads that did not map to human (nmapped reads from bam files using sam flags -b -f 13 ) .
Please use ADD COMMENT button to add this comment to existing post by @5heikki. SUBMIT ANSWER should only be used for new answers for the original question.
No... if very few of your reads merge, it's best to not bother with the merging process. Reads can fail to merge for a couple of reasons, though - typically it's because they do not overlap, but it's also possible that they just have too high of an error rate. If you have sufficient depth, it's possible to merge non-overlapping reads with BBMerge. But I think in this case, if all you want to do is see which genomes reads map to best, you should just skip merging.
I don't really know. But, BLAST is useful in that it can search huge databases (like refseq bacterial + refseq archaea + refseq viral) for relatively distant homologies with a reasonable amount of memory. It's way too slow to use on all of your reads, and it does not scale well with multiple threads, but you can use it on a couple thousand reads to see what you get.
You can use only the forward read (the other read should give you more or less the same info).
You can download pre-made blast indexes for nt database from http://ftp.ncbi.nih.gov/blast/db/. This is a multiple file (get all files that start with nt*.gz)/multi gigabit download so you don't want to do this on a slow connection/over wireless. NCBI has a command line user manual available here. You would ideally do this on a cluster or on a computer with beefy hardware specs.
Edit: Since you are only interested in bacterial genomes you may be able to get away with using Representative_Genomes.*tar.gz indexes (which would be a much smaller download). There are no viruses in this reduced databases and not all bacterial genomes are going to be present.
Have the adapters/other contaminants been filtered out? Are the reads overlapping?
You may not gain much by aligning both reads using blast. Either sequence should pretty much give same answer.
@Brian--
-I am aligning it to all the bacterial genomes available. - I am looking for bacterial, viral reads (aligned original fastq which are from human, I am looking for bacterial and viral hits in the reads that did not map to human). - Yes I believe ecoli (no sure which strain- supposedly pathogenic) -I still have not got in here yet..I am thinking to use PEAR.
Lastly--BBmap is an awesome tool...:)
@genomax2
-yes i have filtered adaptors , not sure contaminants (SeqClean is on my mind--any thoughts)
-probably yes, I did push it through KRAKEN from JHU, the ratio of classified to unclassified is to less. So my question is what are the sequences that did not map has to be something...
Thanks
What are some of those unaligned reads aligning to if you just did a blast @NCBI?
Any reason you are not using BBMerge, instead of PEAR :-)