How do I get description in BLAST?
1
0
Entering edit mode
3.2 years ago
Riku ▴ 80

Hi, all.

I'm trying to get a description to the gene list using BLAST. I created the original database with the following command and did BLAST, but my output shows a lot of N/A.

$ makeblastdb \
-in caenorhabditis_elegans.PRJNA13758.WBPS16.protein.fa \
-out elegansdb \
-dbtype prot \
-parse_seqids


$ blastx \
-query genes.fasta \
-db /home/nkarim/blast/db/elegansdb \
-outfmt "6 qseqid salltitles sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore" \
-evalue 1e-3 \
-out results

I have tried all the commonly suggested solutions.

1) Download taxdb.tar.gz in the same directory as the database, and unzip it.

$ update_blastdb.pl taxdb
$ tar -xvf taxdb.tar.gz

2) Add export BLASTDB=$BLASTDB:/home/nkarim/blast/db to ~/.bashrc and reload.

Are there any other problems for me? I'm facing this problem for few days but can't solve it.

I really appreciate for your help in advance!

blastx RNA-seq taxonomy BLAST • 3.2k views
ADD COMMENT
0
Entering edit mode

with exactly do you mean with " shows a lot of N/A " ? That you don't get any hit or that there is no usable functional description for the hits reported?

ADD REPLY
0
Entering edit mode

I share some of results. It means partial N/A is displayed.

$ head results
TRINITY_DN3139_c0_g2_i2 transcript=K12D12.2.1 gene=WBGene00003789   K12D12.2.1  N/A N/A N/A 25.747  870 546 24  1196    3667    498 1313    44.02   5.73e-60    229
TRINITY_DN3139_c0_g2_i2 transcript=K12D12.2.1 gene=WBGene00003789   K12D12.2.1  N/A N/A N/A 22.599  177 117 4   434 949 210 371 40.11   0.025   40.0
TRINITY_DN3139_c0_g2_i2 transcript=ZK970.4.1 gene=WBGene00006918    ZK970.4.1   N/A N/A N/A 40.196  102 59  2   59  361 1   101 59.80   2.64e-16    77.4
TRINITY_DN772_c0_g1_i16 transcript=ZK675.2.1 gene=WBGene00014066    ZK675.2.1   N/A N/A N/A 32.480  625 368 5   263 2122    222 797 53.44   7.08e-97    338
TRINITY_DN772_c0_g1_i16 transcript=T08H10.1.1 gene=WBGene00020369   T08H10.1.1  N/A N/A N/A 42.857  196 93  4   5660    5085    6   186 57.65   1.93e-36    142
TRINITY_DN772_c0_g1_i16 transcript=T08H10.1.1 gene=WBGene00020369   T08H10.1.1  N/A N/A N/A 37.313  67  42  0   4961    4761    253 319 61.19   2.04e-07    55.5
TRINITY_DN772_c0_g1_i16 transcript=C07D8.6.1 gene=WBGene00015565    C07D8.6.1   N/A N/A N/A 29.825  342 173 9   5681    4767    1   312 45.61   1.30e-29    121
TRINITY_DN772_c0_g1_i16 transcript=ZC443.1.1 gene=WBGene00013896    ZC443.1.1   N/A N/A N/A 30.330  333 171 8   5657    4767    9   316 45.35   4.08e-28    117
TRINITY_DN772_c0_g1_i16 transcript=Y39G8B.1a.2 gene=WBGene00012722  Y39G8B.1a.2 N/A N/A N/A 31.804  327 179 8   5660    4761    5   314 45.87   1.89e-25    109
TRINITY_DN772_c0_g1_i16 transcript=Y39G8B.1a.1 gene=WBGene00012722  Y39G8B.1a.1 N/A N/A N/A 31.804  327 179 8   5660    4761    5   314 45.87   1.89e-25    109
ADD REPLY
0
Entering edit mode

Can you show us what the fasta headers of the input database fasta looked like?

grep "^>" caenorhabditis_elegans.PRJNA13758.WBPS16.protein.fa | head -5 
ADD REPLY
0
Entering edit mode

Sure.

$ grep "^>" caenorhabditis_elegans.PRJNA13758.WBPS16.protein.fa | head -5
>Y110A7A.10.1 transcript=Y110A7A.10.1 gene=WBGene00000001
>F27C8.1.1 transcript=F27C8.1.1 gene=WBGene00000002
>F07C3.7.1 transcript=F07C3.7.1 gene=WBGene00000003
>F52H2.2a.1 transcript=F52H2.2a.1 gene=WBGene00000004
>F52H2.2b.1 transcript=F52H2.2b.1 gene=WBGene00000004
ADD REPLY
1
Entering edit mode
3.2 years ago
GenoMax 148k

You don't have scientific names of the organism in these headers so you are getting N/A in that location. Since you are using a custom database you will also need to provide a taxID map to get staxid's.

ADD COMMENT
0
Entering edit mode

I understand I need further steps. You means I have to use "get_species_taxids.sh", and get taxID map, aren't you?

Finally can I get description if I finish this steps?

ADD REPLY
0
Entering edit mode

You will need a file that connects each of your sequences to a taxonomic id as the page indicated by GenoMax suggests:

cat test_map.txt
seq1 68287
seq2 2382161
seq3 68287
seq4 382
seq5 382
seq6 382
ADD REPLY
0
Entering edit mode

Thank you for your advice.

Yes, I got it, Do I have to use "get_species_taxids.sh" to get tax ID map? Sorry for I'm beginner, so I don't know how to make tax ID map.

Then I created a DB for C. elegans (tacID:6239). In this case, I think all taxIDs become 6239.

By the way, I'm sorry for late reply. Thank you.

ADD REPLY
0
Entering edit mode

If you only have C. elegans proteins in your database why do you need a taxID column in your blast results? taxID would be the same for each hit.

ADD REPLY
0
Entering edit mode

I wolud like to get description. I mean description is protein's name; for example, Nuclear Pore complex Protein [Caenorhabditis elegans] as following.

enter image description here

We would like to run it in a terminal because it takes a long time in a browser when there are many genes. Do you mean that taxID is not what I need?

ADD REPLY
1
Entering edit mode

If you need description of the "hit" then you do not need taxID's. It is the reason I asked you to show the fasta headers for your input file. I don't know where you got that file from but it does not have the descriptions you need.

If you are simply looking for C. elegans proteins then you can download the protein sequence file for C. elegans from NCBI here. In this file you will find description names for the proteins. That will allow you to get the descriptions like example you show above. You will need to re-index and re-do the blast search though.

Example headers from protein file below.

>NP_507836.1 C-type lectin domain-containing protein [Caenorhabditis elegans]
>NP_507838.3 WSN domain-containing protein [Caenorhabditis elegans]
>NP_507839.3 Uncharacterized protein CELE_Y116F11B.2 [Caenorhabditis elegans]
>NP_507840.1 Uncharacterized protein CELE_Y116F11B.1 [Caenorhabditis elegans]
>NP_507841.1 Prolyl Carboxy Peptidase like [Caenorhabditis elegans]
>NP_507842.1 G_PROTEIN_RECEP_F1_2 domain-containing protein [Caenorhabditis elegans]
ADD REPLY
0
Entering edit mode

I see!

I have downloaded C. elegans database from the wrong. I didn't know some information were needed in the database for BLAST.

For future reference, could you please tell me how you find the place if you would like?

ADD REPLY
0
Entering edit mode

You don't need the descriptions for blast per se but clearly that is information you want since you don't want to have to look up what gene=WBGene00000002 is.

Most common model organism genomes can be found at NCBI's genomes section. This is the page for C. elegans where I got the above file from.

ADD REPLY
0
Entering edit mode

Thank you for all the detailed explanation! I have learned a lot from you. I was able to add a description to my results with your help.

I really appreciated your kind help!

ADD REPLY
0
Entering edit mode

that is not the taxid but the subject title, which you seem to already be producing with salltitles

basically that is where

transcript=F27C8.1.1 gene=WBGene00000002

comes from. Your input sequences to the BLAST database need to contain proper subject titles.

ADD REPLY
0
Entering edit mode

Yes. It seemed that I have downloaded datamase from the wrong. For that reason, I didn't get description.

I apologize for the confusion. Thank you for your kindness!

ADD REPLY

Login before adding your answer.

Traffic: 1945 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6