How to deal with many uncharacterized proteins in blastx results?
9.2 years ago
seta ★ 1.9k

Hi all,

I recently assembled a de novo transcriptome of a non-model plant and ran blastx with the assembly against UniProt (Viridiplantae). Although about 70% of the contigs got a best hit, most of the hits were uncharacterized proteins, which are not informative. I used the command

/blastx -query file1.fasta -db uniprot -out file1_uni.txt -evalue 1e-3 -max_target_seqs 20 -outfmt '6 std sscinames scomnames stitle' -num_threads 9

and then, using the command

export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastout.txt | sort -u -k1,1 --merge > bestHits

tried to get the best hit for each contig. Could you please give me your opinion on the results and help me reduce the number of uncharacterized protein hits?

RNA-Seq sequencing Assembly blast • 2.8k views
ADD COMMENT
4

In my opinion, 1e-3 is not a very stringent e-value threshold, and the best hit alone is not a suitable basis for annotation transfer. How do you handle a BLAST best hit with an e-value of 0.9e-3? Do you annotate that transcript with the hit's function/name? Such an e-value can come from a very short match.

For a reliable annotation I would also include, for example, protein domain information, and only transfer the best hit's annotation if the match between query and subject/database sequence spans most of the transcript. But I think there is no clear "best practice" for this, because the annotation process depends on too many variables, e.g., you are working with a non-model plant that likely lacks comparable sequences in the databases. Such stringent filtering will reduce the fraction of contigs with an annotation, but if you want high-quality annotations you can be sure about, this is the path I would follow.
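A minimal sketch of that kind of filtering, assuming the tabular output from the question ('6 std ...', so columns 7, 8 and 11 are qstart, qend and evalue). The function name, the `query_lengths` dictionary (contig ID to nucleotide length, which you would have to build from the FASTA file yourself) and the default thresholds are illustrative assumptions, not recommended values.

```python
def filter_hits(lines, query_lengths, max_evalue=1e-6, min_query_cov=0.8):
    """Keep BLAST tabular rows that pass an e-value and a query-coverage cutoff.

    lines: iterable of tab-separated BLAST lines (outfmt 6, standard columns first)
    query_lengths: dict of contig ID -> contig length in nucleotides (assumed input)
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        qseqid = fields[0]
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        # blastx reports query coordinates in nucleotides; on reverse
        # frames qstart is larger than qend, hence the abs().
        aligned = abs(qend - qstart) + 1
        coverage = aligned / query_lengths[qseqid]
        if evalue <= max_evalue and coverage >= min_query_cov:
            kept.append(fields)
    return kept


# Toy input: c1 is a long, strong match; c2 is a short, weak one.
demo = [
    "c1\tP1\t85.0\t100\t15\t0\t1\t300\t1\t100\t1e-50\t200",
    "c2\tP2\t90.0\t20\t2\t0\t61\t120\t5\t24\t9e-4\t35",
]
lengths = {"c1": 330, "c2": 900}
strong = filter_hits(demo, lengths)
```

On the toy input only the c1 row survives: the c2 hit fails the e-value cutoff and also covers only a small part of its contig.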

That said, having a large number of uncharacterized contigs is normal in my opinion. A large number of proteins in public databases are uncharacterized, and that's it. What you can probably do is use an alternative database, e.g., the KEGG Orthology (KO) groups. But here you definitely need to use more stringent thresholds (e.g., the query sequence has to match the protein over 80% of its length with 75% positives/identity, or something similar).

ADD REPLY
1

I agree that the e-value threshold is far too relaxed. In my opinion, significant hits start around 1e-6. In another post, I have already explained the shortcomings of sorting blastx results like this. A simple way to reduce "uncharacterized protein" hits is to filter them out of the blastx output first (if the sequence titles are included in the output, then simply with grep -v "uncharacterized protein" input | ..). A more sophisticated approach might consider only alternative hits that are, e.g., within 0.X bit score of the highest-scoring hit.
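A rough sketch of the second idea, assuming the output format from the question (bit score in column 12, stitle as the last column). Reading the tolerance as a fraction of the top bit score is my assumption, as are the function name, the 0.9 default and the list of uninformative title patterns.

```python
from collections import defaultdict

# Title fragments treated as uninformative (illustrative list, extend as needed).
UNINFORMATIVE = ("uncharacterized protein", "predicted protein", "hypothetical protein")


def pick_informative_hits(lines, min_fraction=0.9):
    """For each query, prefer the best-scoring informative hit whose bit score
    is at least min_fraction of the top hit's; otherwise keep the top hit."""
    hits = defaultdict(list)
    for line in lines:
        f = line.rstrip("\n").split("\t")
        # (bit score, subject ID, subject title)
        hits[f[0]].append((float(f[11]), f[1], f[-1]))

    chosen = {}
    for query, qhits in hits.items():
        qhits.sort(key=lambda h: h[0], reverse=True)
        top = qhits[0]
        chosen[query] = top  # fallback: the plain best hit
        for hit in qhits:
            if hit[0] < min_fraction * top[0]:
                break  # remaining hits score too far below the top hit
            if not any(term in hit[2].lower() for term in UNINFORMATIVE):
                chosen[query] = hit
                break
    return chosen


def _row(q, s, bits, title):
    # Build a 15-column toy line; only qseqid, sseqid, bitscore and stitle matter here.
    return "\t".join([q, s, "80.0", "100", "20", "0", "1", "300", "1", "100",
                      "1e-40", str(bits), "N/A", "N/A", title])


demo_hits = [
    _row("c1", "P1", 200, "Uncharacterized protein"),
    _row("c1", "P2", 190, "Heat shock protein 70"),
    _row("c2", "Q1", 150, "Uncharacterized protein"),
    _row("c2", "Q2", 90, "Actin"),
]
best = pick_informative_hits(demo_hits)
```

Here c1 is assigned P2 (190 is within 90% of 200), while c2 keeps its uncharacterized top hit because the only named alternative scores far lower. Unlike a plain grep -v, this does not promote a weak named hit over a much stronger unnamed one.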

ADD REPLY
0

Thanks for the comments. I agree with you about the e-value; however, I prefer not to be too strict, as I'm working on a non-model plant with almost no available information. I recently ran blastx against other databases, and many of the "uncharacterized protein" hits turned out to match known proteins there, but I'm not sure how an "uncharacterized protein" hit can be justifiably replaced with a known protein from another database. Could you please let me know if there is a way to integrate the various blastx results (obtained from several databases) into a single informative result?

ADD REPLY
