Hey guys,
I have downloaded uniprot_sprot.fasta, and now I want to BLAST my transcripts against its protein sequences.
Format of the uniprot_sprot.fasta file:
kurban@kurban-X550VC:~/Desktop/Uniprot$ more uniprot_sprot.fasta
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
Format of my query FASTA file:
kurban@kurban-X550VC:~/Desktop/Uniprot$ more truncated_cd-hit-est-Trinity_CD_and_CK.fasta
>TR1|c0_g1_i1
TAAGAGGTAAGAAAGCTAGAAAAGAGGAAATATTTTTAATAAAAATAATAAAACTTAATA
ATATAATAATAAGTATCTTTTTATAATATTATAATAAATAAAATAAGGTAGAAATTATAT
AAATTTATAAGAAAGTAATATTCTTATAATAAGAATTAACTTTTATTAATATTAAACTAG
CTAAAGTAAAAATATAAATTTAAAAAAAAGATAATAATAATAAAGATTTTAAAAAATA
I then ran blastx:
blastx -db uniprot_sprot.fasta -query truncated_cd-hit-est-Trinity_CD_and_CK.fasta -out uniprot_sprot_truncated_cd-hit-est-Trinity_CD_and_CK_blastx_tabular -evalue 1e-5 -num_threads 3 -num_alignments 1 -outfmt 6
The output file I got:
kurban@kurban-X550VC:~/Desktop/Uniprot$ more uniprot_sprot_truncated_cd-hit-est-Trinity_CD_and_CK_blastx_tabular
TR4|c0_g1_i1 sp|Q9WVJ0|KCNH3_MOUSE 76.54 81 19 0 243 1 2 82 8e-40 144
TR21|c0_g1_i1 sp|Q99315|YG31B_YEAST 34.09 88 58 0 1 264 708 795 2e-06 49.3
TR22|c0_g1_i1 sp|Q06559|RS3_DROME 62.67 75 28 0 2 226 146 220 3e-28 107
TR51|c0_g1_i1 sp|Q9M4T8|PSA5_SOYBN 50.00 78 38 1 239 6 40 116 1e-21 89.4
TR52|c0_g1_i1 sp|Q9UBS5|GABR1_HUMAN 50.00 102 36 4 3 299 377 466 8e-24 99.8
TR70|c0_g1_i1 sp|Q9H5L6|THAP9_HUMAN 31.36 169 108 5 499 2 322 485 5e-17 82.8
TR72|c0_g1_i1 sp|Q13200|PSMD2_HUMAN 51.95 77 37 0 1 231 666 742 5e-20 88.2
TR81|c0_g1_i1 sp|Q12296|MAM3_YEAST 32.00 125 82 2 3 374 204 326 3e-14 73.9
TR82|c0_g1_i1 sp|Q6BSS8|APTH1_DEBHA 50.68 73 34 2 20 235 161 232 4e-16 73.9
TR84|c0_g1_i1 sp|P20825|POL2_DROME 54.17 72 33 0 6 221 300 371 4e-20 88.2
TR97|c0_g1_i1 sp|Q921I9|EXOS4_MOUSE 36.67 90 55 2 280 14 101 189 4e-10 58.2
The second column of the output file contains only the sequence ID, with no protein information. It would be awesome if I could get the full header of each hit sequence, or at least the protein name, next to the ID in the second column. The BLAST output I would like to get might look like this:
TR4|c0_g1_i1 sp|Q9WVJ0|KCNH3_MOUSE Uncharacterized protein 009R 76.54 81 19 0 243 1 2 82 8e-40 144
TR21|c0_g1_i1 sp|Q99315|YG31B_YEAST Uncharacterized protein 042L 34.09 88 58 0 1 264 708 795 2e-06 49.3
or something like that.
Could you give me some suggestions on how to do that?
I'm not sure makeblastdb can parse the info correctly from uniprot.fasta. One option would be to create a tab-separated map file with two columns: the UniProt ID (e.g. sp|Q9WVJ0|KCNH3_MOUSE) in the first column and the other info OP wants in the second column. OP could then use join to combine the BLAST output file (on its column 2) with the map file (on its column 1) and print the result in the desired format. Note that join expects both files to be sorted on the join field.
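The map-and-join idea above can also be sketched in a few lines of Python, which sidesteps the sorting requirement of join. This is only a rough sketch, not a tested pipeline: the function names build_header_map and annotate_hits are made up for illustration, the BLAST row in the demo is hypothetical, and it assumes the -outfmt 6 file is tab-separated with the subject ID in column 2.

```python
def build_header_map(fasta_lines):
    """Map each sequence ID (first word after '>') to the rest of its header line."""
    header_map = {}
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id, _, description = line[1:].strip().partition(" ")
            header_map[seq_id] = description
    return header_map


def annotate_hits(blast_lines, header_map, missing="NA"):
    """Yield tab-separated BLAST rows with the description inserted as a new third column."""
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        description = header_map.get(fields[1], missing)
        yield "\t".join(fields[:2] + [description] + fields[2:])


# Tiny in-memory demo. The FASTA header is the one shown in the question;
# the BLAST row is hypothetical (made-up query ID and scores).
fasta = [
    ">sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1",
    "MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS",
]
blast = ["TR0|c0_g1_i1\tsp|Q6GZX4|001R_FRG3G\t50.00\t60\t30\t0\t1\t180\t1\t60\t1e-10\t60.0"]

for row in annotate_hits(blast, build_header_map(fasta)):
    print(row)
```

For the real files, the lists can be replaced by open file handles, e.g. build_header_map(open("uniprot_sprot.fasta")). To keep only the protein name rather than the whole header, the description could be cut at " OS=".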