I have completed a blastx run on my samples and have obtained the following result (example):
$ head blastx_result.txt
NS500162:172:HG5CJBGXX:1:11101:2522 ZWIP2_ARATH 52.500 40 19 0 2 121 25 64 8.26e-07 44.3
I would like to take the ACC_ID (in this case, ZWIP2_ARATH) and find the taxonomic information for it. After searching, I found this site: UniProtKB/Swiss-Prot entries.
Here is what this .txt file (from link) looks like:
ENTRY NAME AC nb AA Description - Biological Source
ZWIP2_ARATH Q9SVY1 383 Zinc finger protein WIP2 (Protein TRANSMITTING TRACT) (WIP-
domain protein 2) (AtWIP2) [Gene: WIP2 or NTT or At3g57670
or F15B8.140] - Arabidopsis thaliana (Mouse-ear cress)
023R_IIV3 Q197D7 106 Uncharacterized protein 023R [Gene: IIV3-023R] -
Invertebrate iridescent virus 3 (IIV-3) (Mosquito virus)
This text file contains all of the ACC_IDs and links each one to its function and taxonomy. The taxonomy comes after the final '-' delimiter (a record can contain more than one '-'). However, a simple grep command (grep -e 001R_FRG3G shortdes.txt) will not work, because a single ACC_ID can span 1, 2, or 3 lines, and grep only returns the line that contains the ID.
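One way to handle the multi-line records is to rejoin each entry before splitting on the final '-'. The following is only a minimal Python sketch, not a tested parser for the real file. It assumes that a new record always starts with an entry name containing an underscore, followed by an accession and a numeric length, and that every other non-empty line is a continuation of the previous record. The names `RECORD_START`, `taxonomy_from_description` and `parse_taxonomy` are made up for illustration.

```python
import re

# Assumption: a record starts with "<NAME>_<SPECIES> <ACCESSION> <LENGTH> <text>".
RECORD_START = re.compile(r"^(\w+_\w+)\s+([A-Z0-9]+)\s+(\d+)\s+(.*)$")


def taxonomy_from_description(description):
    """Return the text after the final ' - ' delimiter, or '' if there is none."""
    if " - " not in description:
        return ""
    return description.rsplit(" - ", 1)[-1].strip()


def parse_taxonomy(lines):
    """Build {entry_name: taxonomy} from shortdes-style lines."""
    records = {}
    name, parts = None, []
    for raw in lines:
        line = raw.strip()
        match = RECORD_START.match(line)
        if match:
            # A new record begins: store the one collected so far.
            if name is not None:
                records[name] = taxonomy_from_description(" ".join(parts))
            name, parts = match.group(1), [match.group(4)]
        elif name is not None and line:
            # Continuation line of the current record.
            parts.append(line)
    if name is not None:
        records[name] = taxonomy_from_description(" ".join(parts))
    return records


sample = [
    "ENTRY NAME AC nb AA Description - Biological Source",
    "ZWIP2_ARATH Q9SVY1 383 Zinc finger protein WIP2 (Protein TRANSMITTING TRACT) (WIP-",
    "domain protein 2) (AtWIP2) [Gene: WIP2 or NTT or At3g57670",
    "or F15B8.140] - Arabidopsis thaliana (Mouse-ear cress)",
    "023R_IIV3 Q197D7 106 Uncharacterized protein 023R [Gene: IIV3-023R] -",
    "Invertebrate iridescent virus 3 (IIV-3) (Mosquito virus)",
]
taxonomy = parse_taxonomy(sample)
print(taxonomy["ZWIP2_ARATH"])  # Arabidopsis thaliana (Mouse-ear cress)
```

Because the result is a dictionary, each lookup is a constant-time operation, so mapping several hundred thousand IDs should not be a problem once the file has been parsed once. A species name that itself contains ' - ' would be cut short by this approach, so spot-checking a few results is advisable.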
So, I thought about removing new lines:
awk '{ printf "%s", $0 }' shortdes.txt
but this makes a mess of the file: it keeps all the tabs and wide spacing, but everything ends up on one line, which is not practical.
I should also add that I have > 500,000 of these ACC_IDs to look up and map to taxonomy!
There must be a simple solution to just extracting the taxonomy from this file or by any other means. Any inkling of light on a much more practical way to do this would be incredibly appreciated, indeed!
Thanks a ton!
Pierre: Thank you so much!!! May I ask - is there a way to give a file of these ACC_IDs? Because I have > 500,000 to look up and I cannot input each one manually.
There is a UniProt batch query: http://www.uniprot.org/help/uploadlists
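To use a batch upload, the IDs first have to be collected into a file. Here is a rough Python sketch that pulls the unique IDs out of the second column of the tabular blastx output. It assumes whitespace-separated columns, as in the example line in the question, and that the upload accepts one ID per line. The function names and the output file name are hypothetical.

```python
def unique_subject_ids(blast_lines):
    """Return the distinct second-column IDs, in order of first appearance."""
    seen = {}
    for line in blast_lines:
        fields = line.split()
        if len(fields) >= 2:
            # dict keys keep insertion order and drop duplicates.
            seen.setdefault(fields[1], None)
    return list(seen)


def write_id_list(ids, path):
    """Write one ID per line to the given path."""
    with open(path, "w") as handle:
        for identifier in ids:
            handle.write(identifier + "\n")


example = [
    "NS500162:172:HG5CJBGXX:1:11101:2522 ZWIP2_ARATH 52.500 40 19 0 2 121 25 64 8.26e-07 44.3",
    "NS500162:172:HG5CJBGXX:1:11101:2522 ZWIP2_ARATH 52.500 40 19 0 2 121 25 64 8.26e-07 44.3",
]
ids = unique_subject_ids(example)
print(ids)  # ['ZWIP2_ARATH']

# For a real run (file names are placeholders):
# with open("blastx_result.txt") as handle:
#     write_id_list(unique_subject_ids(handle), "ids_for_upload.txt")
```

Deduplicating first matters here, since many reads usually hit the same protein and the list to upload may be far shorter than the number of BLAST hits.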