Question

BLASTX results - some making no sense

8.4 years ago

Biogeek ▴ 470

Hi friends,

I'm annotating a higher plant that sits well within the higher plant lineage; it is a pond species. I ran BLASTX locally against the UniProt Viridiplantae database and I'm getting a lot of top hits to basal lineages such as the unicellular green algae. These results don't make much sense to me, given how evolutionarily derived this plant is compared to lower plants. If I look further down the hit list for the same queries, there is always a more appropriate hit among the next-best hits.

As a result, I then ran BLASTX against just the embryophyte lineage of UniProt and TrEMBL (land plants and aquatic plants), and the results make much more sense; however, the % identity is low for some transcripts. What are people's takes on this? Is it better to use a more specific database in such a case, given that the evolutionary position of my non-model organism is clear to me, on a branch of the higher plants far removed from the early branches of the Viridiplantae? Or should I just use the entire Viridiplantae lineage?

I've additionally run BLASTN to detect any contamination, with a 95% cut-off and a low e-value. The database for this was built from the unicellular microbial algae and eukaryotes that appear in the first set of BLASTX hits. There were very few hits, which I removed.

What are people's opinions? Thanks.
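For anyone wanting to reproduce the contamination screen described above, here is a minimal Python sketch. It assumes the BLASTN results are in the standard 12-column tabular format (`-outfmt 6`), that the 95% cut-off refers to percent identity, and it uses a placeholder e-value threshold of `1e-10` because the post only says "low". The function name `flag_contaminants` and the example rows are made up for illustration.

```python
def flag_contaminants(blast_lines, min_pident=95.0, max_evalue=1e-10):
    """Return the query IDs with at least one hit passing both thresholds.

    Expects standard BLAST tabular rows (outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    The thresholds are assumptions; adjust them to your own screen.
    """
    flagged = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        qseqid = fields[0]
        pident = float(fields[2])
        evalue = float(fields[10])
        if pident >= min_pident and evalue <= max_evalue:
            flagged.add(qseqid)
    return flagged


# Made-up rows: only tx1 passes both the identity and the e-value filter.
example = [
    "tx1\talga_scaffold_7\t98.5\t400\t6\t0\t1\t400\t10\t409\t1e-150\t700",
    "tx2\talga_scaffold_9\t81.0\t200\t38\t0\t1\t200\t5\t204\t2e-20\t150",
    "tx3\tbact_contig_2\t96.2\t350\t13\t0\t1\t350\t1\t350\t0.5\t30",
]
print(sorted(flag_contaminants(example)))  # prints ['tx1']
```

The returned IDs can then be dropped from the assembly FASTA. Adding an alignment-length or query-coverage filter would likely reduce false positives from short conserved regions, but that is a design choice not mentioned in the post.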

blastx • 1.8k views

updated 8.4 years ago by Chris Fields ★ 2.2k • written 8.4 years ago by Biogeek ▴ 470

score 0 · Answer 1 · 2016-07-02

8.4 years ago

Chris Fields ★ 2.2k

I'm guessing this is from an assembly; is it a transcriptome or full genome?

It might be worth doing an overall unbiased analysis, maybe something like a blobplot against a larger database, just to see if there are any oddities in the data that might indicate problems (e.g. contaminating organisms, which are very common). We've done this using BLASTN, and DIAMOND in place of BLASTX, and have found this helps considerably (you can also use the results to help identify and filter the problematic sequences). You did say this was a pond plant and your hits are against algae...
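As a rough way to see which organisms the best hits come from, something like the following Python sketch could tally the best hit per transcript. It assumes 12-column tabular output (BLAST `-outfmt 6`, which DIAMOND can also produce) and UniProt-style subject IDs such as `sp|ACCESSION|NAME_SPECIES`, where the part after the last underscore is the UniProt organism mnemonic (e.g. `ARATH`, `CHLRE`). The function name and the example rows, including the accessions, are made up.

```python
from collections import Counter


def best_hit_species(blast_lines):
    """Count UniProt organism mnemonics over the best hit of each query.

    The best hit is taken as the row with the highest bitscore (column 12).
    """
    best = {}  # qseqid -> (bitscore, sseqid)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid, bitscore = fields[0], fields[1], float(fields[11])
        if qseqid not in best or bitscore > best[qseqid][0]:
            best[qseqid] = (bitscore, sseqid)

    counts = Counter()
    for _, sseqid in best.values():
        entry_name = sseqid.split("|")[-1]  # e.g. "RBL_ARATH"
        if "_" in entry_name:
            counts[entry_name.rsplit("_", 1)[-1]] += 1
        else:
            counts["UNKNOWN"] += 1  # subject ID is not in UniProt style
    return counts


# Made-up rows: tx1's best hit is algal, tx2's best hit is a land plant.
hits = [
    "tx1\tsp|X00001|RBL_CHLRE\t62.0\t300\t114\t0\t1\t900\t1\t300\t1e-90\t300",
    "tx1\tsp|X00002|RBL_ARATH\t60.5\t300\t118\t0\t1\t900\t1\t300\t1e-88\t290",
    "tx2\tsp|X00003|ATPB_ARATH\t88.0\t450\t54\t0\t1\t1350\t1\t450\t0.0\t500",
]
for species, n in best_hit_species(hits).most_common():
    print(species, n)
```

Note that many TrEMBL entries carry generic group mnemonics rather than species-specific ones, so mapping hits to taxonomy IDs would be more robust if your database provides them.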


Hey Chris, thanks for this. It's a de novo transcriptome assembly. Essentially what files are needed for this? Will a fasta of the assembly suffice? Thanks

8.4 years ago by Biogeek ▴ 470

Not sure how well this would work with a transcriptome assembly, as the coverage varies so much (coverage is one critical component of a blobplot). But you could probably look at variation in GC content and, overall, at which taxonomic groups are found via BLASTX.

EDIT: 'phylogenetic' -> 'taxonomic'
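The GC-content check mentioned above could be sketched in Python roughly as follows, assuming the assembly is a plain FASTA file. The helper names `read_fasta` and `gc_fraction` and the two example records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # empty or fully ambiguous sequence
    return (seq.count("G") + seq.count("C")) / acgt


# Made-up records; for a real file, pass an open file handle instead.
fasta_lines = [
    ">tx1 len=8",
    "ATATATGC",
    ">tx2",
    "GCGCGGCC",
    "GCAT",
]
for name, seq in read_fasta(fasta_lines):
    print(name, round(gc_fraction(seq), 3))
```

Transcripts whose GC fraction sits far from the bulk of the assembly are candidates for a closer look, ideally in combination with the taxonomic groups of their BLASTX hits. GC content alone is only a hint, since it also varies between genes within one organism.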

8.4 years ago by Chris Fields ★ 2.2k

An area which I agree needs some improvement ;-). I might stick with the Embryophyta database and then the BLASTN screen. Would someone be criticised for this approach?

8.4 years ago by Biogeek ▴ 470