Hi,
First of all, I am a bioinformatics newbie, so sorry for any obvious mistakes or syntax use. My aim is to get programmatic access to UniProt ID database to download multiple sequences and storing those in a fasta fila. With this file I would like to do multiple sequence alignment.
So far I am able to retrieve a fasta file with multiple sequences in the file. However, whenever I want to align them I don't get an error, but when I use the output file to read it for example it cannot be found. (I can also not find the file on my computer). I think the alignment is not done properly (or at least maybe I use the wrong format in the input file), but I don't know how to solve it.
This is my code.
import urllib.parse
import urllib.request
url = 'https://www.uniprot.org/uploadlists/'
UniprotID = 'Q9Y6I4 Q9Y5T5 Q9Y2K6'
#Transfer Uniprot ID to ACC to get sequence information
params = {
'from': 'ACC+ID',
'to': 'ACC',
'format': 'fasta',
'query': UniprotID
}
data = urllib.parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
response = f.read()
response = response.decode('utf-8')
print(response)
#to store data in fasta file
with open("in_filename.fasta", "w") as f:
for x in response:
f.write(str(x))
#multiple sequence alignment
from Bio.Align.Applications import ClustalOmegaCommandline
in_file = "in_filename.fasta"
out_file = "out_filename.phylip"
clustalomega_cline = ClustalOmegaCommandline(infile = in_file, outfile = out_file, outfmt = 'phylip', verbose = True, auto = False)
print(clustalomega_cline)
from Bio import AlignIO
alignment = AlignIO.read("out_filename.phylip", "clustalo")
The printed response and also fasta file looks like:
>sp|Q9Y6I4|UBP3_HUMAN Ubiquitin carboxyl-terminal hydrolase 3 OS=Homo sapiens OX=9606 GN=USP3 PE=1 SV=2
MECPHLSSSVCIAPDSAKFPNGSPSSWCCSVCRSNKSPWVCLTCSSVHCGRYVNGHAKKH
YEDAQVPLTNHKKSEKQDKVQHTVCMDCSSYSTYCYRCDDFVVNDTKLGLVQKVREHLQN
LENSAFTADRHKKRKLLENSTLNSKLLKVNGSTTAICATGLRNLGNTCFMNAILQSLSNI
EQFCCYFKELPAVELRNGKTAGRRTYHTRSQGDNNVSLVEEFRKTLCALWQGSQTAFSPE
SLFYVVWKIMPNFRGYQQQDAHEFMRYLLDHLHLELQGGFNGVSRSAILQENSTLSASNK
CCINGASTVVTAIFGGILQNEVNCLICGTESRKFDPFLDLSLDIPSQFRSKRSKNQENGP
VCSLRDCLRSFTDLEELDETELYMCHKCKKKQKSTKKFWIQKLPKVLCLHLKRFHWTAYL
RNKVDTYVEFPLRGLDMKCYLLEPENSGPESCLYDLAAVVVHHGSGVGSGHYTAYATHEG
RWFHFNDSTVTLTDEETVVKAKAYILFYVEHQAKAGSDKL
>sp|Q9Y5T5|UBP16_HUMAN Ubiquitin carboxyl-terminal hydrolase 16 OS=Homo sapiens OX=9606 GN=USP16 PE=1 SV=1
MGKKRTKGKTVPIDDSSETLEPVCRHIRKGLEQGNLKKALVNVEWNICQDCKTDNKVKDK
AEEETEEKPSVWLCLKCGHQGCGRNSQEQHALKHYLTPRSEPHCLVLSLDNWSVWCYVCD
NEVQYCSSNQLGQVVDYVRKQASITTPKPAEKDNGNIELENKKLEKESKNEQEREKKENM
AKENPPMNSPCQITVKGLSNLGNTCFFNAVMQNLSQTPVLRELLKEVKMSGTIVKIEPPD
LALTEPLEINLEPPGPLTLAMSQFLNEMQETKKGVVTPKELFSQVCKKAVRFKGYQQQDS
QELLRYLLDGMRAEEHQRVSKGILKAFGNSTEKLDEELKNKVKDYEKKKSMPSFVDRIFG
GELTSMIMCDQCRTVSLVHESFLDLSLPVLDDQSGKKSVNDKNLKKTVEDEDQDSEEEKD
NDSYIKERSDIPSGTSKHLQKKAKKQAKKQAKNQRRQQKIQGKVLHLNDICTIDHPEDSE
YEAEMSLQGEVNIKSNHISQEGVMHKEYCVNQKDLNGQAKMIESVTDNQKSTEEVDMKNI
NMDNDLEVLTSSPTRNLNGAYLTEGSNGEVDISNGFKNLNLNAALHPDEINIEILNDSHT
PGTKVYEVVNEDPETAFCTLANREVFNTDECSIQHCLYQFTRNEKLRDANKLLCEVCTRR
QCNGPKANIKGERKHVYTNAKKQMLISLAPPVLTLHLKRFQQAGFNLRKVNKHIKFPEIL
DLAPFCTLKCKNVAEENTRVLYSLYGVVEHSGTMRSGHYTAYAKARTANSHLSNLVLHGD
IPQDFEMESKGQWFHISDTHVQAVPTTKVLNSQAYLLFYERIL
>sp|Q9Y2K6|UBP20_HUMAN Ubiquitin carboxyl-terminal hydrolase 20 OS=Homo sapiens OX=9606 GN=USP20 PE=1 SV=2
MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADH
STIHAQAKKHNLTVNLTTFRLWCYACEKEVFLEQRLAAPLLGSSSKFSEQDSPPPSHPLK
AVPIAVADEGESESEDDDLKPRGLTGMKNLGNSCYMNAALQALSNCPPLTQFFLECGGLV
RTDKKPALCKSYQKLVSEVWHKKRPSYVVPTSLSHGIKLVNPMFRGYAQQDTQEFLRCLM
DQLHEELKEPVVATVALTEARDSDSSDTDEKREGDRSPSEDEFLSCDSSSDRGEGDGQGR
GGGSSQAETELLIPDEAGRAISEKERMKDRKFSWGQQRTNSEQVDEDADVDTAMAALDDQ
PAEAQPPSPRSSSPCRTPEPDNDAHLRSSSRPCSPVHHHEGHAKLSSSPPRASPVRMAPS
YVLKKAQVLSAGSRRRKEQRYRSVISDIFDGSILSLVQCLTCDRVSTTVETFQDLSLPIP
GKEDLAKLHSAIYQNVPAKPGACGDSYAAQGWLAFIVEYIRRFVVSCTPSWFWGPVVTLE
DCLAAFFAADELKGDNMYSCERCKKLRNGVKYCKVLRLPEILCIHLKRFRHEVMYSFKIN
SHVSFPLEGLDLRPFLAKECTSQITTYDLLSVICHHGTAGSGHYIAYCQNVINGQWYEFD
DQYVTEVHETVVQNAEGYVLFYRKSSEEAMRERQQVVSLAAMREPSLLRFYVSREWLNKF
NTFAEPGPITNQTFLCSHGGIPPHKYHYIDDLVVILPQNVWEHLYNRFGGGPAVNHLYVC
SICQVEIEALAKRRRIEIDTFIKLNKAFQAEESPGVIYCISMQWFREWEAFVKGKDNEPP
GPIDNSRIAQVKGSGHVQLKQGADYGQISEETWTYLNSLYGGGPEIAIRQSVAQPLGPEN
LHGEQKIEAETRAV
The printed cline is:
clustalo -i in_filename.fasta -o out_filename.phylip --outfmt phylip -v
However, the error I get when reading the out_filename.phylip I get: (I also cannot find it on my computer)
FileNotFoundError: [Errno 2] No such file or directory: 'out_filename.phylip'
I'm not sure where the problem is and I even question if the alignment is performed. If I upload the "in_filename.fasta" to the Clustal website for alignment, there is no problem and sequences are aligned. So therefore I am not really sure how to solve this. Any help/tips/recommendations are really welcome!!
Thanks in advance! Eve
Does that command complete successfully if you run the
clustalomega
directly rather than through BioPython?From the documentation (https://biopython.org/DIST/docs/api/Bio.Align.Applications._ClustalOmega.ClustalOmegaCommandline-class.html)
You need to execute the clustalomega_cline (imagine it as if you've formatted parameters to a string, and you need to paste it to terminal). Use the
clustalomega_cline()
to actually run the program.Ooh thank you that makes more sense!
Whenever I do that I get the error message:
Would that mean that Clustal-Omega is not downloaded properly?
This is why I asked if you are able to run the command correctly without biopython. It seems you don't have it installed, or it isn't available in
$PATH
under the correct name.what is the output of the command:
which -a clustalo
?