Question

blastn for paired end reads

2

Entering edit mode

8.5 years ago

badribio ▴ 290

I am working around to perform meta genomic analyses, for which I need to align my paired end reads (dna) to bacterial genomes using blast and based on the identity perc filter need to extract sequences with 90% or higher identity and use it for downstream analyses. My approach concatenate paired end reads and align. Any thoughts?

I tried BWA mem, but I see only 0.01% reads aligning to the bacterial genome.

Thanks Badri

genome alignment blast metagenomics • 5.2k views

ADD COMMENT • link 8.5 years ago by badribio ▴ 290

0

Entering edit mode

What are you aligning the reads to?
What kind of metagenome is it?
Do you have any reason to believe the reference genomes are in your metagenomic samples?
What command did you use for concatenating the reads?

ADD REPLY • link 8.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Have the adapters/other contaminants been filtered out? Are the reads overlapping?

You may not gain much by aligning both reads using blast. Either sequence should pretty much give same answer.

ADD REPLY • link 8.5 years ago by GenoMax 148k

0

Entering edit mode

@Brian--

-I am aligning it to all the bacterial genomes available. - I am looking for bacterial, viral reads (aligned original fastq which are from human, I am looking for bacterial and viral hits in the reads that did not map to human). - Yes I believe ecoli (no sure which strain- supposedly pathogenic) -I still have not got in here yet..I am thinking to use PEAR.

Lastly--BBmap is an awesome tool...:)

@genomax2

-yes i have filtered adaptors , not sure contaminants (SeqClean is on my mind--any thoughts)

-probably yes, I did push it through KRAKEN from JHU, the ratio of classified to unclassified is to less. So my question is what are the sequences that did not map has to be something...

Thanks

ADD REPLY • link 8.5 years ago by badribio ▴ 290

0

Entering edit mode

What are some of those unaligned reads aligning to if you just did a blast @NCBI?

Any reason you are not using BBMerge, instead of PEAR :-)

ADD REPLY • link 8.5 years ago by GenoMax 148k

score 1 · Answer 1 · 2016-07-06

Do you have any reason to think that your pairs are in the same orientation and the 5-end of the second read continues exactly from the 3-end of the first read? No? Then concatenating the pairs is most certainly a false approach. Joining pairs is not the same thing than concatenation. Moreover, I think it's pretty likely that your pairs don't even overlap (if they do, that was rather likely a bad decision).

In my opinion the best approach is to first assemble a metagenome (IDBA-UD always worked for me), then predict proteins from the metagenome (prodigal always worked for me), and then blastp the predicted proteins against a database that suits your needs (e.g. nr, refseq protein or kegg).

score 0 · Answer 2 · 2016-07-06

0

Entering edit mode

8.5 years ago

badribio ▴ 290

@5heikki

That's a good point. Thank you

Is there a way I can check if 5-end of the second read starts from 3 end of the first read?

Assembling a meta genome - not sure if it would work in my case like i said these are reads that did not map to human (nmapped reads from bam files using sam flags -b -f 13 ) .

ADD COMMENT • link 8.5 years ago by badribio ▴ 290

0

Entering edit mode

Please use ADD COMMENT button to add this comment to existing post by @5heikki. SUBMIT ANSWER should only be used for new answers for the original question.

You can use BBMerge to see if the reads overlap.

ADD REPLY • link 8.5 years ago by GenoMax 148k

0

Entering edit mode

I used FLASH from jhu to merge my reads and see below

Read combination statistics: [FLASH] Total pairs: 2801070 [FLASH] Combined pairs: 145382 [FLASH] Uncombined pairs: 2655688 [FLASH] Percent combined: 5.19%

That is not very promising is it?

ADD REPLY • link 8.5 years ago by badribio ▴ 290

1

Entering edit mode

No... if very few of your reads merge, it's best to not bother with the merging process. Reads can fail to merge for a couple of reasons, though - typically it's because they do not overlap, but it's also possible that they just have too high of an error rate. If you have sufficient depth, it's possible to merge non-overlapping reads with BBMerge. But I think in this case, if all you want to do is see which genomes reads map to best, you should just skip merging.

ADD REPLY • link 8.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks Brian, is blast an optimum approach?

ADD REPLY • link 8.5 years ago by badribio ▴ 290

0

Entering edit mode

I thought you had already done Kraken analysis. Has that not been productive/conclusive?

ADD REPLY • link 8.5 years ago by GenoMax 148k

0

Entering edit mode

I did but the amount of sequences classified are 0.5% rest are unclassified.

ADD REPLY • link 8.5 years ago by badribio ▴ 290

0

Entering edit mode

Have you taken some of those reads and done a blast at NCBI via the web interface? What do you get there?

ADD REPLY • link 8.5 years ago by GenoMax 148k

0

Entering edit mode

I don't really know. But, BLAST is useful in that it can search huge databases (like refseq bacterial + refseq archaea + refseq viral) for relatively distant homologies with a reasonable amount of memory. It's way too slow to use on all of your reads, and it does not scale well with multiple threads, but you can use it on a couple thousand reads to see what you get.

ADD REPLY • link 8.5 years ago by Brian Bushnell 20k

0

Entering edit mode

True now the question is how do I use blast for my paired end data...also never used command line blast how does their indexing work?

Thanks.

ADD REPLY • link 8.5 years ago by badribio ▴ 290

1

Entering edit mode

You can use only the forward read (the other read should give you more or less the same info).

You can download pre-made blast indexes for nt database from http://ftp.ncbi.nih.gov/blast/db/. This is a multiple file (get all files that start with nt*.gz)/multi gigabit download so you don't want to do this on a slow connection/over wireless. NCBI has a command line user manual available here. You would ideally do this on a cluster or on a computer with beefy hardware specs.

Edit: Since you are only interested in bacterial genomes you may be able to get away with using Representative_Genomes.*tar.gz indexes (which would be a much smaller download). There are no viruses in this reduced databases and not all bacterial genomes are going to be present.

ADD REPLY • link 8.5 years ago by GenoMax 148k