Question

Identify Contamination With Blast

12.6 years ago

John St. John ★ 1.2k

Many reads from a recent RNA-seq run of mine are not mapping to the reference sequence they should have come from. I converted some of the sequences to FASTA format and BLASTed them against NR allowing everything, and indeed the few that I have looked at appear to sit on a whole different part of the phylogenetic tree than the organism I was going for.

The problem is that right now, using NCBI BLAST, I can only view a tree of results for one sequence at a time, out of the hundreds that I want to test. Is there a way to view a single tree for hundreds of results? What I had in mind was maybe just pulling the top 5 hits from each of the 100 results and adding those to a phylogenetic tree, along with counts of how many times each species showed up in the list.

Of course this is just an idea. What I really want to answer is "what is this stuff?". So any ideas you have for visualizing blast results on a subset of my unmapped reads would be much appreciated.
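The top-5-hits tally described above could be sketched roughly as follows. This is only an illustration under assumptions that are not from the thread: it assumes the BLAST results were saved as tab-separated text with three columns (query ID, subject scientific name, bit score), for example from BLAST+ with `-outfmt "6 qseqid sscinames bitscore"`. The function name `tally_species` and the demo data are made up for the example.

```python
from collections import Counter, defaultdict
import io


def tally_species(handle, top_n=5):
    """Count how many reads have each species among their top_n hits.

    Assumed input (not from the thread): tab-separated lines of
    query id, subject scientific name, bit score.
    """
    hits = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        query, species, bitscore = line.split("\t")[:3]
        hits[query].append((float(bitscore), species))

    counts = Counter()
    for query_hits in hits.values():
        # Best bit score first, then keep only the top_n hits for this read.
        best = sorted(query_hits, key=lambda h: h[0], reverse=True)[:top_n]
        # Count each species once per read so one read cannot dominate.
        counts.update({species for _, species in best})
    return counts


if __name__ == "__main__":
    # Made-up demo data, purely for illustration.
    demo = (
        "read1\tHomo sapiens\t200\n"
        "read1\tPan troglodytes\t190\n"
        "read2\tEscherichia coli\t150\n"
        "read2\tEscherichia coli\t140\n"
        "read3\tHomo sapiens\t90\n"
    )
    for species, n in tally_species(io.StringIO(demo)).most_common():
        print(species, n)
```

The resulting counts could then be attached to the leaves of whatever tree is drawn from the species list. Counting each species once per read is a design choice; counting every hit instead would weight reads with many near-identical hits more heavily.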

Thanks for your time!

blast rna-seq tree • 5.9k views

updated 12.6 years ago by Mary 11k • written 12.6 years ago by John St. John ★ 1.2k

score 4 · Answer 1 · 2012-10-01

This is my grumpy curmudgeonly two cents, but I get frustrated when I see people show a MEGAN output tree to confirm they had contamination in their sequencing run. It seems like you have a handle on the situation and are using lots of different means to figure out where the reads place.

I'm not discouraging you from using MEGAN, but because it is a BLAST parser, you inherit the issues of using BLAST in the first place, and then there are all the issues with the NCBI NR database and taxonomy (one of the fungi I study, Cryptococcus, is located in three places in the taxonomy: two fungal positions, and also listed in the bacteria). Depending on the length of your sequence reads, it can be difficult to place the samples however you do it, but particularly when using BLAST rather than phylogenetic methods. There are so many unknown samples from the environment.

I would suggest that you continue doing what you are already doing: adding a phylogenetic component to the identification of the sequences. At the least, consider other metagenomic platforms, such as MetaPhlAn, which is nearly instantaneous on the Galaxy platform, or MG-RAST, which will take a few days to run your samples through the pipeline but has options for parsing out host-associated reads. Once you have an assessment of your sequence diversity, you can predict OTUs and use something like TopiaryExplorer to show where the sequences fall phylogenetically.

Sorry if my point of view here came out a little negative. I'm just frustrated with the state of identifying mystery reads and wish I had a better answer to the problem myself.

score 1 · Answer 2 · 2012-10-01

12.6 years ago

JC 13k

Maybe you can convert your reads to regular FASTA format, BLAST them against NR, and parse and view the results with MEGAN.
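For the conversion step, a minimal sketch might look like the following. It assumes plain four-line FASTQ records (header, sequence, `+`, quality) with no line wrapping, which is an assumption about the input and not something stated in the thread; the function name `fastq_to_fasta` is made up for the example.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write each 4-line FASTQ record as a 2-line FASTA record.

    Assumes unwrapped 4-line FASTQ records. Returns the number of records.
    """
    n = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: %r" % header)
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        n += 1
    return n


if __name__ == "__main__":
    # Made-up demo record, purely for illustration.
    src = io.StringIO("@read1 demo\nACGTACGT\n+\nIIIIIIII\n")
    out = io.StringIO()
    fastq_to_fasta(src, out)
    print(out.getvalue(), end="")
```

With real files, the two handles would simply be `open("reads.fastq")` and `open("reads.fasta", "w")`, where both file names are placeholders.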

Thanks for pointing out MEGAN! This works pretty well, but it looks like they do not want to show any reads that map to human (where my reads should be coming from). There are some reads marked as "Metazoan" which could be reads that match human? I kind of wish that I could just see the underlying tree rather than just their computationally assigned nodes.

written 12.6 years ago by John St. John ★ 1.2k

You already removed all reads that come from human, didn't you? Another problem is that if a read maps to multiple species, MEGAN cannot resolve its origin.

written 12.6 years ago by JC 13k

score 1 · Answer 3 · 2012-10-02

Funny, just last night I was chatting with some people about this paper that's proposing contamination in public data sets. I am still trying to figure out what's going on, but I was able to see urchin data in a data set it shouldn't have been in: http://www.biomedcentral.com/1471-2164/13/381/abstract