Question

blast results: grepping and replacing

0

Entering edit mode

8.1 years ago

kinggang • 0

Hi, I am stuck with my blast result (outfmt 6). In first column i got the contig id;like second column also i got subject id. But i need subjectid protein and organism.

eg (tabular format)

blast output: (tabular format)

TRINITY_DN2892_c0_g1_i1 gi|743827192|ref|XP_010933525.1|    94.958  238 12  0   742 29  34  271 2.15e-148   433

TRINITY_DN2892_c0_g1_i2 gi|743827192|ref|XP_010933525.1|    95.062  243 12  0   730 2   34  276 7.07e-153   444

TRINITY_DN2873_c0_g1_i1 gi|743848015|ref|XP_010939248.1|    91.447  456 39  0   2   1369    5   460 0.0 850

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1|    90.344  901 85  2   170 2872    2   900 0.0 1504

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1|    91.176  238 21  0   2904    3617    898 1135    2.14e-137   456

etc...

**my expected output: (Tabular format)**

TRINITY_DN2892_c0_g1_i1 gi|743827192|ref|XP_010933525.1| PREDICTED: aspartic proteinase nepenthesin-2-like isoform X2 [Elaeis guineensis]   94.958  238 12  0   742 29  34  271 2.15e-148   433

TRINITY_DN2892_c0_g1_i2 gi|743827192|ref|XP_010933525.1| PREDICTED: aspartic proteinase nepenthesin-2-like isoform X2 [Elaeis guineensis]   95.062  243 12  0   730 2   34  276 7.07e-153   444

TRINITY_DN2873_c0_g1_i1 gi|743848015|ref|XP_010939248.1| PREDICTED: oligopeptide transporter 3-like [Elaeis guineensis] 91.447  456 39  0   2   1369    5   460 0.0 850

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1| PREDICTED: uncharacterized protein LOC105041089 isoform X1 [Elaeis guineensis] 90.344  901 85  2   170 2872    2   900 0.0 1504

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1| PREDICTED: uncharacterized protein LOC105041089 isoform X1 [Elaeis guineensis] 91.176  238 21  0   2904    3617    898 1135    2.14e-137   456

etc...

I have around one lakh sequences, so it is difficult doing again blast and manual editing. For that I have taken all the subject id and greped with standalone database and i got the name list in another text file .

eg textfile

gi|743771814|ref|XP_010916197.1| PREDICTED: uncharacterized protein LOC105041093 [Elaeis guineensis] gi|743771805|ref|XP_010916192.1| PREDICTED: uncharacterized protein LOC105041089 isoform X1 [Elaeis guineensis] gi|743771786|ref|XP_010916183.1| PREDICTED: uncharacterized protein LOC105041085 isoform X2 [Elaeis guineensis]

now i want to incorporate this text file to my blast result like my expected output (greping and replacing in second column ). Is der any script or awk command for doing this job ??

thank you

blast GREP AWK • 2.0k views

ADD COMMENT • link updated 8.1 years ago by 5heikki 11k • written 8.1 years ago by kinggang • 0

0

Entering edit mode

did you add "stitle" in your blast command ?

ADD REPLY • link 8.1 years ago by ashish ▴ 680

score 0 · Answer 1 · 2017-10-18

I think you have a better source for the annotation, and that is the blastdb you used to search in the first place. Try

cat myblastresultsfile.txt | cut -f2 | cut -f4 -d'|' | \
blastdbcmd -db myblastdb -entry_batch - -outfmt "%t %L"

you need to experiment a little with the -outfmt parameter to get the fields you need because they depend on how the db was generated.

In general, it is advisable not to use tabular blast output as primary output format, as it is just lacking exactly those fields you need most (following Murphy), use ASN1 instead and generate output as needed.

score 0 · Answer 2 · 2017-10-18

I had the same problem so I wrote a workaround for it in python. It fetches the descriptions and appends them to the second column of the input file.

import subprocess, sys, os

# getdescriptors.py
# usage: getdescriptors.py infile

infile = sys.argv[1]
outfile = '%s_edited.txt' %(infile[:-4])

openedin = open(infile, 'r')
openedout = open(outfile, 'w')

print "fetching descriptors, this takes a while"
for line in openedin:
    lineParsed = line.split('\t')
    print lineParsed[0]
    entry = lineParsed[1].split('|')[3]
    output = subprocess.check_output("blastdbcmd -db /scratch/db/nr/blastDB/nr -dbtype prot -entry '" + entry + "'", shell=True)
    record2 = output.split(' ',1)[1].split('\n',1)[0]
    record2 = record2.split(']',1)
    if len(record2) > 1:
        record2 = record2[0] + "]"
    else:
        record2 = record2[0]
    results = (lineParsed[0] + lineParsed[1] + record2 + '\t'+ lineParsed[2] + 
    '\t'+ lineParsed[3] + '\t' + lineParsed[4] + '\t' + lineParsed[5] + 
    '\t'+ lineParsed[6] + '\t' + lineParsed[7] + '\t' + lineParsed[8] + 
    '\t'+ lineParsed[9] + '\t' + lineParsed[10] + '\t'+ lineParsed[11] + '\n')
    openedout.write(results)
print "done"

openedin.close() 
openedout.close()

score 0 · Answer 3 · 2017-10-18

0

Entering edit mode

8.1 years ago

5heikki 11k

This should work as long as there's no pipe (|) in stitle:

join -t "|" -o 1.1,1.2,1.3,1.4,2.5,1.5 -1 2 -2 2 <(sort -t "|" -k2,2 BlastFile) <(sort -t "|" -k2,2 StitleFile)

ADD COMMENT • link 8.1 years ago by 5heikki 11k