too many characters in strings containing long protein sequence ?
1
0
Entering edit mode
5.8 years ago
gaiboyan23 ▴ 30

Using Biopython, I parsed a FASTA file for a list of protein sequences. Everything works, except when the sequence is too long (some are over 1000 characters). A majority of the sequence gets replaced by ..., How can I obtain the entire sequence in my output?

Currently my output look like this (I pasted the first 3 lines)

MDSTLTASEIRQRFIDFFKRNEHTYVHSSATIPLDDPTLLFANAGMNQFKPIFL...VKN MAAYKLVLIRHGESTWNLENRFSCWYDADLSPAGHEEAKRGGQALRDAGYEFDI...AKK MATLSLTVNSGDPPLGALLAVEHVKDDVSISVEEGKENILHVSENVIFTDVNSI...RSY

My code (shortened version):

from Bio import SeqIO
every=[]
length=[]    
for seq_record in SeqIO.parse("Y100.fasta.", "fasta"):   
#going through each item in Y100.fasta  
     every.append (repr(seq_record.seq))   
#creating my list of protein sequence    
      length.append (len(seq_record))    
#length of every sequence

print (every)
sequencing fasta protein biopython • 1.5k views
ADD COMMENT
0
Entering edit mode
5.8 years ago
Joe 21k

The sequence doesn't get replaced with ....

All that's happening is that when you print it, the method that governs how its displayed in a terminal truncates it. You do not need to use the repr method. That is intended for debugging mostly. It's sufficient to just print/access the record.seq object (alternatively you can use str(record.seq)).

See the following example with the massive titin protein:

>>> from Bio import SeqIO
>>> titin = SeqIO.read('titin.fasta', 'fasta')
>>> titin
SeqRecord(seq=Seq('MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGV...TLT', SingleLetterAlphabet()), id='Titin', name='Titin', description='Titin', dbxrefs=[])
>>> print(titin)
ID: Titin
Name: Titin
Description: Titin
Number of features: 0
Seq('MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGV...TLT', SingleLetterAlphabet())
>>> print(titin.seq)
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSDGRAKLTIPAVTKANSGRYSLKATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQVRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGDLYSLLIAEAYPEDSGTYSVNATNSVGRATSTAELLVQGEEEVPAKKTKTIVSTAQISESRQTRIEKKIEAHFDARSIATVEMVIDGAAGQQLPHKTPPRIPPKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPTPSPVRSVSPAARISTSPIRSVRSPLLMRKTQASTVATGPEVPPPWKQEGYVASSSEAEMRETTLTTSTQIRTEERWEGRYGVQEQVTISGAAGAAASVSASASYAAEAVATGAKEVKQDADKSAAVATVVAAVDMARVREPVISAVEQTAQRTTTTAVHIQPAQEQVRKEAEKTAVTKVVVAADKAKEQELKSRTKEVITTKQEQMHVTHEQIRKETEKTFVPKVVISAAKAKEQETRISEEITKKQKQVTQEAIRQETEITAASMVVVATAKSTKLETVPGAQEETTTQQDQMHLSYEKIMKETRKTVVPKVIVATPKVKEQDLVSRGREGITTKREQVQITQEKMRKEAEKTALSTIAVATAKAKEQETILRTRETMATRQEQIQVTHGKVDVGKKAEAVATVVAAVDQARVREPREPGHLEESYAQQTTLEYGYKERISAAKVAEPPQRPASEPHVVPKAVKPRVIQAPSETHIKTTDQKGMHISSQIKKTTDLTTERLVHVDKRPRTASPHFTVSKISVPKTEHGYEASIAGSAIATLQKELSATSSAQKITKSVKAPTVKPSETRVRAEPTPLPQFPFADTPDTYKSEAGVEVKKEVGVSITGTTVREERFEVLHGREAKVTETARVPAPVEIPVTPPTLVSGLKNVTVIEGESVTLECHISGYPSPTVTWYREDYQIESSIDFQITFQSGIARLMIREAFAEDSGRFTCSAVNEAGTVSTSCYLAVQVSEEFEKETTAVTEKFTTEEKRFVESRDVVMTDTSLTEEQAGPGEPAAPYFITKPVVQKLVEGGSVVFGCQVGGNPKPHVYWKKSGVPLTTGYRYKVSYNKQTGECKLVISMTFADDAGEYTIVVRNKHGETSASASLLEEADYELLMKSQQEMLYQTQVTAFVQEPKVGETAPGFVYSEYEKEYEKEQALIRKKMAKDTVVVRTYVEDQEFHISSFEERLIKEIEYRIIKTTLEELLEEDGEEKMAVDISESEAVESGFDSRIKNYRILEGMGVTFHCKMSGYPLPKIAWYKDGKRIKHGERYQMDFLQDGRASLRIPVVLPEDEGIYTAFASNIKGNAICSGKLYVEPAAPLGAPTYIPTLEPVSRIRSLSPRSVSRSPIRMSPARMSPARMSPARMSPARMSPGRRLEETDESQLERLYKPVFVLKPVSFKCLEGQTARFDLKVVGRPMPETFWFHDGQQIVNDYTHKVVIKEDGTQSLIIVPATPSDSGEWTVVAQNRAGRSSISVILTVEAVEHQVKPMFVEKLKNVNIKEGSRLEMKVRATGNPNPDIVWLKNSDIIVPHKYPKIRIEGTKGEAALKIDSTVSQDSAWYTATAINKAGRDTTRCKVNVEVEFAEPEPERKLIIPRGTYRAKEIAAPELEPLHLRYGQEQWEEGDLYDKEKQQKPFFKKKLTSLRLKRFGPAHFECRLTPIGDPTMVVEWLHDGKPLEAANRLRMINEFGYCSLDYGVAYSRDSGIITCRATNKYGTDHTSATLIVKDEKSLVEESQLPEGRKGLQRIEELERMAHEGALTGVTTDQKEKQKPDIVLYPEPVRVLEGETARFRCRVTGYPQPKVNWYLNGQLIRKSKRFRVRYDGIHYLDIVDCKSYDTGEVKVTAENPEGVIEHKVKLEIQQREDFRSVLRRAPEPRPEFHVHEPGKLQFEVQKVDRPVDTTETKEVVKLKRAERITHEKVPEESEELRSKFKRRTEEGYYEAITAVELKSRKKDESYEELLRKTKDELLHWTKELTEEEKKALAEEGKITIPTFKPDKIELSPSMEAPKIFERIQSQTVGQGSDAHFRVRVVGKPDPECEWYKNGVKIERSDRIYWYWPEDNVCELVIRDVTAEDSASIMVKAINIAGETSSHAFLLVQAKQLITFTQELQDVVAKEKDTMATFECETSEPFVKVKWYKDGMEVHEGDKYRMHSDRKVHFLSILTIDTSDAEDYSCVLVEDENVKTTAKLIVEGAVVEFVKELQDIEVPESYSGELECIVSPENIEGKWYHNDVELKSNGKYTITSRRGRQNLTVKDVTKEDQGEYSFVIDGKKTTCKLKMKPRPIAILQGLSDQKVCEGDIVQLEVKVSLESVEGVWMKDGQEVQPSDRVHIVIDKQSHMLLIEDMTKEDAGNYSFTIPALGLSTSGRVSVYSVDVITPLKDVNVIEGTKAVLECKVSVPDVTSVKWYLNDEQIKPDDRVQAIVKGTKQRLVINRTHASDEGPYKLIVGRVETNCNLSVEKIKIIRGLRDLTCTETQNVVFEVELSHSGIDVLWNFKDKEIKPSSKYKIEAHGKIYKLTVLNMMKDDEGKYTFYAGENMTSGKLTVAGGAISKPLTDQTVAESQEAVFECEVANPDSKGEWLRDGKHLPLTNNIRSESDGHKRRLIIAATKLDDIGEYTYKVATSKTSAKLKVEAVKIKKTLKNLTVTETQDAVFTVELTHPNVKGVQWIKNGVVLESNEKYAISVKGTIYSLRIKNCAIVDESVYGFRLGRLGASARLHVETVKIIKKPKDVTALENATVAFEVSVSHDTVPVKWFHKSVEIKPSDKHRLVSERKVHKLMLQNISPSDAGEYTAVVGQLECKAKLFVETLHITKTMKNIEVPETKTASFECEVSHFNVPSMWLKNGVEIEMSEKFKIVVQGKLHQLIIMNTSTEDSAEYTFVCGNDQVSATLTVTPIMITSMLKDINAEEKDTITFEVTVNYEGISYKWLKNGVEIKSTDKCQMRTKKLTHSLNIRNVHFGDAADYTFVAGKATSTATLYVEARHIEFRKHIKDIKVLEKKRAMFECEVSEPDITVQWMKDDQELQITDRIKIQKEKYVHRLLIPSTRMSDAGKYTVVAGGNVSTAKLFVEGRDVRIRSIKKEVQVIEKQRAVVEFEVNEDDVDAHWYKDGIEINFQVQERHKYVVERRIHRMFISETRQSDAGEYTFVAGRNRSSVTLYVNAPEPPQVLQELQPVTVQSGKPARFCAVISGRPQPKISWYKEEQLLSTGFKCKFLHDGQEYTLLLIEAFPEDAAVYTCEAKNDYGVATTSASLSVEVPEVVSPDQEMPVYPPAIITPLQDTVTSEGQPARFQCRVSGTDLKVSWYSKDKKIKPSRFFRMTQFEDTYQLEIAEAYPEDEGTYTFVASNAVGQVSSTANLSLEAPESILHERIEQEIEMEMKELFSEGESEHSERDTRDAFSDSEDIDHKSMAAKRYASRISSTSSWPEYFKPSFTQKLTFKYVLEGEPVVFTCRLIACPTPEMTWFHNNRPIPTGLRRIIKAESDLHHHSSSLEIKRVQDRDSGSYRLLAINSEGSAESTASLLVIQKGQDEKYLEFLKRAERTHENVEALVERGEDRIKVDLRFTGSPFNKKQDVEQKGMMRTIHFKTMSSAKKTDYMYDEEYLESKSDIRGWLNVGESFLDKETKVKLQRLREARKTLMEKKKLSLLDTSSEISSRTLRSEASDKDILFSREDMKIRSMSDLAESYKVDHSAESIVQNPHALSNQMDQNIESEELPTSFQTIVDEEIFQTEIRMSQEALVKESLPKDHLYGEILVNENTQARGQLEEIMANTTIGESSTYITNVCEKEEVYETPENVSQAITPHASESFGTLVNVEESEEIASERIKKDDLRELQLSASTRIDEFKTEQKEENMRFFENSFRKRPQRCPPSFLQEIESQEVYEGD...
ADD COMMENT
0
Entering edit mode

(Ironically, titin is so big, I had to artificially truncate the protein when pasting it in to BioStars as there's a limit of 5000 characters!)

Here's a screengrab just in case you don't believe me haha

Screenshot-2019-01-30-at-18-18-57

ADD REPLY
0
Entering edit mode

Ah interesting to see someone else working with titin, and thank you for the response.

I tried your code and my protein sequence still gets truncated. Is there a way to get the non-truncated sequence (i'm putting them in an excel file)? My proteins are not nearly as long as titin (1000 some residues is the highest)

ADD REPLY
0
Entering edit mode

Can you update your post with the full code you’re using? You must be making other errors as BioPython is a very mature package at this stage and it’s highly unlikely to be an issue with the codebase.

ADD REPLY

Login before adding your answer.

Traffic: 1658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6