Creating Table From Fasta File In Python
1
1
Entering edit mode
11.8 years ago
Peiska ▴ 30

I have a fasta file with more than 500 sequences, and from that file I need to build a table, so I am trying to write a script to do it instead of doing it by hand, with copy-paste. I am reading the file using Biopython : seq=SeqIO.parse(handle, "fasta")

From each sequence I want to know the specie the protein sequence belongs, the name of the protein and the Uniprot ID. when I use the SeqIO to parse a fasta file I noticed that there is not much info I can parse from it.

Here is an subset of my fasta file:

>gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae] >gi|190622196|gb|EDV37720.1| GF11270 [Drosophila ananassae] MSAARTSQDCDCTAKCRLRQHGNTITAALTKRSISSQNLAAFVYKTCGNFANILDDLGRSAVHMSASTGRYEILEWLLNH GAYINGQDYESGSSPLHRALYYGSIDCAVLLLRYGASMELLDEDTCCPLQAICRKCDVDDFATDSQNDVLVWGSNKNYNL GVGSEQNTNAPQSVDFFRKSNIWIEQVALGAYHSLFLDKKGHLYAVGHGKGGRLGTGGENTLPAPKRVKVSSKLGSEDSI RCISVSRQHSLVLTHRSLVFACGLNSDCQLGVRDAPEHLAQFKEVVALRDKGASDLVRVIACDQHSIAYGSRCVYVWGAN QGQFGISANIASIVVPTLIKLPARTTIRFVEANNAATVIYSEEKMIYLYYAEKTRAIKTPNYEDLKSISVMGGHIKNSAK GSAAALKLLMLTETNVVYLWYENTQQFYRCNFLPIRLPQIKKILYKCNQVMVLSEDGCVYRGKCNQIALPASELQEKSRP NLDIWQNNDQNRTEISREHVIRIELQRVPNIDRAVDISCDEGFSSFAVLQESQGKYFRKPTLPRKEHSFKKLLHDTSDCD AVHDVVFHVDGEKYPAHKYIIYSRAPGLRELVRMYLDKDIYLNFENLTGKMFELVLKHIYTNYWPTEDDIDCIQQSLGPA NPQNRSRTCQMFLPHLEKFQLTELAKYVKSYVQDHQFPLPSARQRLPRLHRSDYPELYDVKIKCEDGQVLQAHKCMLVAR LEYFEMMFMHSWAERSSVTMEGVPAEYMEPVLDYLYSLEAEAFCKQAYLETFLYNMITICDQYFIESLQNLCELLILDKI SIRKCGEMLEFATMYNCKLLLKGCMDFICQNLARVLCYRSIEQCDGETLKCLNDHYRNMFSRVFDYRQITPFSEAIEDEL LLSFIDGLEVDLEYRMDAESKAKQAAKTKQKDLRKLNARHQYEQRAISSMMRSISISESNPAPEVATSPQESARSETNNW SRVIDKKEQKRKQAETALKVNKTLKQETSPEPEMVPIERTPVNEQTPPPLSPETEPSTPLNKSYNLDFSSLTPQSQKLSQ KQRKRLSSESKSWRGNSSALLESPTTPVPVPNAWGVTTTPSSSFNDSYTSPTTGSSSDPTSFANMMRSQAASSSATSKDQ SQNFSKILADERRQRESYERMRNKSLVHTQIEETAIAELREFYNVDNIDDEKITIARKSRPSDINFSTWIRQ
>gi|198456847|ref|XP_001360463.2| GA20796 [Drosophila pseudoobscura pseudoobscura] >gi|198135774|gb|EAL25038.2| GA20796 [Drosophila pseudoobscura pseudoobscura] MSTAKAQEYDCTAKCTCRQHGNSITAALTKRSIDNQNLGAFIFKTCGNFANIIDDLGRSAVHMSASVARYEILEWLLNHG AYINGLDYESGSSPLHRALYYGSIDCAVLLLRYGASLELLDEDTRCPLQAICRKCDEDFTTESQNDVLVWGSNKNYNLGI GNEQNTNAPQAVDFFRKSNIWIEQVALGAYHSLFCDKKGHLYAVGHGKGGRLGIGVENSLPAPKRVKVSSKLNDDSIMCI SVSRQHSLLLTRRSLVFACGINTDHQLGVRDAPENLTQFREVVALRDKGASDLLRVIACDQHSIAYSTKCVYVWGANQGQ FGISRTTDTIMAPTLIKLPARTSIRFVEANNAATVIYTEEKMITLFYGDKTRYIKTPNYEDLKSIAVIGGHLKSSTKGSA AALKLLMLTETNVVFLWYENTQQFYRCNFSPIRLPEIKKILYKCNQVLILSLDGCVYRGKCNQIALPAGILEEKSKPNMD IWHNNDQNRTEISREHVIRIELQRVPNIDRATDIFCDESFSSFAVLQESHMKYFRKPPLPRREHNFKKLYHDTCESDAVH DVVFHVDGERFAAHKFILYSRAPGLRELTRIYLDKDVYLNFENLTGKMFELILKYIYTSYWPTEDDIDCIQESLGPANPR ERSRACEMFIPHLEMFQLVDLARYLQSYVRDNQFPIPSTRQRFNRLHRSDYPELYDVRIVCEDSKVLEAHKCMLVSRLEY FEMMFTHSWAERTTVNMEGVPAEYMEPVLDYLYSLDTEAFCKQNYTETFLYNMVTFCDQYFIESLQNVCESLILDKISIR KCGEMLDFAAMYNCKLLHKGCMDFICHNLARVLCYRSIEQCDEATLKCLNDHYRKMFSNVFDYRQITPFSEAIEDELLLS FVVDCDIDLDYRMDPETKLKAAAKHKQKDLRRQDARHYYEQQAISSMMRSLSVSESASGPEATTGPQESTRSEGKNWSRV VDKKEQKRKLADTALKVNNTLKLEEPPRPELEVIERALMKEQTPPPTSPAEETSTPLSKSYNLDLSSLTPQSQKLSQKQR KRLSSESKSWRSPLVEQEPTTPVAVPNAWGLPPATPSSSSFTDSPATGSISDPTSFANMMRGQAAAATTPTEKGQSFSRI LADERRQRESFERMRNKSLAHTQIEETAIAELREFYNVDNTDDETITIERKSRPTDINFSTWLKH
>gi|355695434|gb|AES00009.1| inhibitor of Bruton agammaglobulinemia tyrosine kinase [Mustela putorius furo] KPGNKLKLNQKKCSFLCDVTMKSVDGKEFTCHKCVLCARLEYFHSMLSSSWIEASTCTALEMPIHSDILKVILDYLYTDE AVVIKESQNVDFVCSVLVVADQLLITRLKGMCEVALTEKLTLKNAAMILEFAAMYNAEQLKLSCLQFIGLNM

Is there any way I get the Protein name, Uniprot ID and organism from those sequences? For example I thought on parsing the genebank id from the seq.description and then search in genebank with that ID but I think it cannot be made, not all the sequences have genebank ID. Any suggestions how to do this? Any help would be really appreciated.

example of desired output:

name    organism    uniprot id  family
GF11270 Sophophora  B3MFN0  
GA20796 Sophophora  Q291S4
python • 5.1k views
ADD COMMENT
0
Entering edit mode

What you pasted there is not valid Fasta.

ADD REPLY
0
Entering edit mode

Check out the different string split functions in the Python and other like startswith, endswith.

ADD REPLY
2
Entering edit mode
11.8 years ago
Peiska ▴ 30

Solution found:

handle = open("seq.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
for key in orchid_dict:
    # print key
    #print orchid_dict[key].description
    if orchid_dict[key].description.find("|gb|") !=-1:
        gb_id=re.findall(r'gb\|([^\|]+)', orchid_dict[key].description)
        newstr = str(gb_id).replace("'", "")
        newstr = newstr.replace("[", "")
        newstr = newstr.replace("]", "")

        params = ( ('query',newstr),

                  ('format','tab'),
                  ('columns','protein names,organism,id,families,existence'  )          
                  )

        data = urllib.urlencode(params)

        request = urllib2.Request(url, data)
        contact = "" # Please set your email address here to help us debug in case of problems.
        request.add_header('User-Agent', 'Python contact')
        response = urllib2.urlopen(request)
        page = response.read(200000)
        f.writelines( page)
ADD COMMENT
0
Entering edit mode

You should probably quote where you found this solution. From the orchid reference I guess on the Biopython tutorial?

ADD REPLY
0
Entering edit mode

I used the code from the Biopython tutorial to read the fast file, and to access the Uniprot I found the information here: http://www.uniprot.org/faq/28#downloading

ADD REPLY
0
Entering edit mode

thanks for following up with an answer!

ADD REPLY

Login before adding your answer.

Traffic: 2543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6