How to use blastp between isoforms of the same gene?
5.4 years ago
star ▴ 350

I have a FASTA file like the one below that contains the longest and the shortest isoforms of several genes. Is there any way to compare the isoforms that belong to a specific gene within this file (e.g. compare the longest isoform of the VPS54 gene against the shortest one)? Or do I have to save each isoform of a gene in a separate file and then compare them?

>ENSG00000143952|VPS54|ENST00000409558|2|63892146|64019072
MASSHSSSPVPQGSSSDVFFKIEVDPSKHIRPVPSLPDVCPKEPTVTDQHRWTVYHSKVN
LPAALNDPRLAKRESDFFTKTWGLDFVDTEVIPSFYLPQISKEHFTVYQQEISQREKIHE
RCKNICPPKDTFERTLLHTHDKSRTDLEQVPKIFMKPDFALDDSLTFNSVLPWSHFNTAG
GKGNRDAASSKLLQEKLSHYLDIVEVNIAHQISLRSEAFFHAMTSQHELQDYLRKTSQAV
KMLRDKIAQIDKVMCEGSLHILRLALTRNNCVKVYNKLKLMATVHQTQPTVQVLLSTSEF
VGALDLIATTQEVLQQELQGIHSFRHLGSQLCELEKLIDKMMIAEFSTYSHSDLNRPLED
DCQVLEEERLISLVFGLLKQRKLNFLEIYGEKMVITAKNIIKQCVINKVSQTEEIDTDVV
VKLADQMRMLNFPQWFDLLKDIFSKFTIFLQRVKATLNIIHSVVLSVLDKNQRTRELEEI
SQQKNAAKDNSLDTEVAYLIHEGMFISDAFGEGELTPIAVDTTSQRNASPNSEPCSSDSV
SEPECTTDSSSSKEHTSSSAIPGGVDIMVSEDMKLTDSELGKLANNIQELLYSASDICHD
RAVKFLMSRAKDGFLEKLNSMEFITLSRLMETFILDTEQICGRKSTSLLGALQSQAIKFV
NRFHEERKTKLSLLLDNERWKQADVPAEFQDLVDSLSDGKIALPEKKSGATEERKPAEVL
IVEGQQYAVVGTVLLLIRIILEYCQCVDNIPSVTTDMLTRLSDLLKYFNSRSCQLVLGAG
ALQVVGLKTITTKNLALSSRCLQLIVHYIPVIRAHFEARLPPKQYSMLRHFDHITKDYHD
HIAEISAKLVAIMDSLFDKLLSKYEVKAPVPSACFRNICKQMTKMHEAIFDLLPEEQTQM
LFLRINASYKLHLKKQLSHLNVINDGGPQNGLVTADVAFYTGNLQALKGLKDLDLNMAEI
WEQKR*

>ENSG00000143952|VPS54|ENST00000272322|2|63892150|64019428
MASSHSSSPVPQGSSSDVFFKIEVDPSKHIRPVPSLPDVCPKEPTGDSHSLYVAPSLVTD
QHRWTVYHSKVNLPAALNDPRLAKRESDFFTKTWGLDFVDTEVIPSFYLPQISKEHFTVY
QQEISQREKIHERCKNICPPKDTFERTLLHTHDKSRTDLEQVPKIFMKPDFALDDSLTFN
SVLPWSHFNTAGGKGNRDAASSKLLQEKLSHYLDIVEVNIAHQISLRSEAFFHAMTSQHE
LQDYLRKTSQAVKMLRDKIAQIDKVMCEGSLHILRLALTRNNCVKVYNKLKLMATVHQTQ
PTVQVLLSTSEFVGALDLIATTQEVLQQELQGIHSFRHLGSQLCELEKLIDKMMIAEFST
YSHSDLNRPLEDDCQVLEEERLISLVFGLLKQRKLNFLEIYGEKMVITAKNIIKQCVINK
VSQTEEIDTDVVVKLADQMRMLNFPQWFDLLKDIFSKFTIFLQRVKATLNIIHSVVLSVL
DKNQRTRELEEISQQKNAAKDNSLDTEVAYLIHEGMFISDAFGEGELTPIAVDTTSQRNA
SPNSEPCSSDSVSEPECTTDSSSSKEHTSSSAIPGGVDIMVSEDMKLTDSELGKLANNIQ
ELLYSASDICHDRAVKFLMSRAKDGFLEKLNSMEFITLSRLMETFILDTEQICGRKSTSL
LGALQSQAIKFVNRFHEERKTKLSLLLDNERWKQADVPAEFQDLVDSLSDGKIALPEKKS
GATEERKPAEVLIVEGQQYAVVGTVLLLIRIILEYCQCVDNIPSVTTDMLTRLSDLLKYF
NSRSCQLVLGAGALQVVGLKTITTKNLALSSRCLQLIVHYIPVIRAHFEARLPPKQYSML
RHFDHITKDYHDHIAEISAKLVAIMDSLFDKLLSKYEVKAPVPSACFRNICKQMTKMHEA
IFDLLPEEQTQMLFLRINASYKLHLKKQLSHLNVINDGGPQNGLVTADVAFYTGNLQALK
GLKDLDLNMAEIWEQKR*

>ENSG00000014641|MDH1|ENST00000539945|2|63589151|63607194
MRRCSYFPKDVTVFDKDDKSEPIRVLVTGAAGQIAYSLLYSIGNGSVFGKDQPIILVLLD
ITPMMGVLDGVLMELQDCALPLLKDVIATDKEDVAFKDLDVAILVGSMPRREGMERKDLL
KANVKIFKSQGAALDKYAKKSVKVIVVGNPANTNCLTASKSAPSIPKENFSCLTRLDHNR
AKAQIALKLGVTANDVKNVIIWGNHSSTQYPDVNHAKVKLQGKEVGVYEALKDDSWLKGE
FVTTVQQRGAAVIKARKLSSAMSAAKAICDHVRDIWFGTPEGEFVSMGVISDGNSYGVPD
DLLYSFPVVIKNKTWKFVEGLPINDFSREKMDLTAKELTEEKESAFEFLSSA*

>ENSG00000014641|MDH1|ENST00000233114|2|63588963|63607197
MSEPIRVLVTGAAGQIAYSLLYSIGNGSVFGKDQPIILVLLDITPMMGVLDGVLMELQDC
ALPLLKDVIATDKEDVAFKDLDVAILVGSMPRREGMERKDLLKANVKIFKSQGAALDKYA
KKSVKVIVVGNPANTNCLTASKSAPSIPKENFSCLTRLDHNRAKAQIALKLGVTANDVKN
VIIWGNHSSTQYPDVNHAKVKLQGKEVGVYEALKDDSWLKGEFVTTVQQRGAAVIKARKL
SSAMSAAKAICDHVRDIWFGTPEGEFVSMGVISDGNSYGVPDDLLYSFPVVIKNKTWKFV
EGLPINDFSREKMDLTAKELTEEKESAFEFLSSA*
blast homology • 896 views

You could simply blat the same file against itself. It will give you all possible combinations automatically.
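A self-against-self search reports every pair, including each sequence against itself and isoforms of unrelated genes. The sketch below is one possible way to keep only the hits between different isoforms of the same gene. It assumes tab-separated output whose first two columns are the query and subject IDs, as BLAST's tabular format produces (for example from `blastp -query sequences.fas -subject sequences.fas -outfmt 6`), and that the IDs keep the pipe-delimited header layout shown in the question. The function name `same_gene_pairs` is made up for illustration.

```python
def same_gene_pairs(tabular_lines):
    """Yield hits between different isoforms of the same gene.

    Assumes each line is tab-separated with the query ID in column 1 and
    the subject ID in column 2, and that IDs look like
    ENSG...|SYMBOL|ENST...|chrom|start|end (as in the question's FASTA).
    """
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # not a hit line
        q_fields = cols[0].split("|")
        s_fields = cols[1].split("|")
        if len(q_fields) < 3 or len(s_fields) < 3:
            continue  # header does not follow the assumed layout
        same_gene = q_fields[0] == s_fields[0]            # Ensembl gene ID
        different_isoform = q_fields[2] != s_fields[2]    # transcript ID
        if same_gene and different_isoform:
            # gene symbol, query transcript, subject transcript, other columns
            yield q_fields[1], q_fields[2], s_fields[2], cols[2:]

# Usage (hits.tsv is a hypothetical file name):
# with open("hits.tsv") as handle:
#     for symbol, query_tx, subject_tx, rest in same_gene_pairs(handle):
#         print(symbol, query_tx, subject_tx, *rest, sep="\t")
```

Each isoform pair normally shows up twice (A against B and B against A), so you may want to keep only one direction.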


Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

5.3 years ago
Mensur Dlakic ★ 28k

If you want a quick answer to your question, I suggest cd-hit. It is meant for removing redundancy in a set of sequences, but it also writes a cluster file that summarizes which sequences are similar and what their level of identity is. Here is a command to try:

cd-hit -i sequences.fas -o sequences.90 -c 0.9 -d 100

This asks the program to find sequences that are 90% or more identical, to remove all but the longest among them, and to save the output in the file sequences.90. The -d 100 switch is normally unnecessary, but it is important here because your sequences have extra-long headers: it keeps up to 100 characters of each header in the cluster file instead of cutting it short.

The program also creates a secondary file called sequences.90.clstr, which lists the clusters built from the sequences you provided:

>Cluster 0
0       965aa, >ENSG00000143952|VPS54|ENST00000409558|2|63892146|64019072... at 100.00%
1       977aa, >ENSG00000143952|VPS54|ENST00000272322|2|63892150|64019428... *
>Cluster 1
0       352aa, >ENSG00000014641|MDH1|ENST00000539945|2|63589151|63607194... *
1       334aa, >ENSG00000014641|MDH1|ENST00000233114|2|63588963|63607197... at 99.70%

The asterisk marks the longest sequence in each cluster, and the percentage on the other lines is the identity of that cluster member to it. Hopefully that is what you wanted.
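If you have many genes, reading the cluster file by eye gets tedious. Here is a minimal sketch of a parser for it, assuming the protein .clstr layout shown above (a ">Cluster N" line followed by member lines ending in either "*" or "at NN.NN%"). The names `parse_clstr` and `MEMBER_RE` are made up for illustration and are not part of cd-hit.

```python
import re

# Matches member lines such as:
#   0       965aa, >ENSG...|64019072... at 100.00%
#   1       977aa, >ENSG...|64019428... *
MEMBER_RE = re.compile(
    r"^\d+\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(\*|at\s+([\d.]+)%)\s*$"
)

def parse_clstr(lines):
    """Return {cluster name: [member dicts]} from cd-hit .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue                    # skip lines that do not fit the layout
        length, name, marker, identity = match.groups()
        clusters[current].append({
            "name": name,
            "length": int(length),
            "representative": marker == "*",
            # identity to the representative; None for the representative itself
            "identity": float(identity) if identity else None,
        })
    return clusters

# Usage:
# with open("sequences.90.clstr") as handle:
#     for cluster, members in parse_clstr(handle).items():
#         for m in members:
#             print(cluster, m["name"], m["length"], m["identity"], sep="\t")
```

Because the headers carry the gene symbol as their second pipe-delimited field, you can split `name` on "|" to check that each cluster contains isoforms of a single gene.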
