concat fasta sequences based on some common header string
1
0
Entering edit mode
5.1 years ago
mforthman ▴ 50

I have a single fasta file with multiple exons from various genes. What I would like to do is use the GeneID in each header to find those exon sequences that have the same GeneID and concatenate them into a single sequence. Along with this, I would like to alter the sequence header so that it includes just the GeneID and one of product descriptors (doesn't matter which one where multiple are given). I figure, for the headers some way of delimiting based on ';' and using specific strings to produce the new headers. I also suspect this would have to be a two step approach, probably using python or perl instead of linux.

For example, input fasta file:

>ID=exon-XM_014420980.1-1;GeneID:106680938;product=zinc finger protein 391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$
>ID=exon-XM_014420980.1-2;GeneID:106680938;product=zinc finger protein 391
atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaag
>ID=exon-XM_014420980.1-3;GeneID:106680938;product=zinc finger protein 391
GCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaag
>ID=exon-XM_014420980.1-4;GeneID:106680938;product=zinc finger protein 391
ACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$
>ID=exon-XM_014420980.1-5;GeneID:106680938;product=zinc finger protein 391
gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAG
>ID=exon-XM_014420980.1-6;GeneID:106680938;product=zinc finger protein 391
GAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCG
>ID=exon-XM_014420980.1-7;GeneID:106680938;product=zinc finger protein 391
GTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAG
>ID=exon-XM_014420980.1-8;GeneID:106680938;product=zinc finger protein 391
GAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>ID=exon-XM_024358664.1-7;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-6;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$
>ID=exon-XM_024358664.1-6;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-5;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactc
>ID=exon-XM_024358664.1-5;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-4;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTC
>ID=exon-XM_024358664.1-4;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-3;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$
>ID=exon-XM_024358664.1-3;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-2;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCC
>ID=exon-XM_024358505.1-1;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X2
CTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$
>ID=exon-XM_024358664.1-2;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-1;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC$

Desired outcome in a new fasta file:

>GeneID106680938_zinc_finger_protein_391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaagGCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaagACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAGGAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCGGTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAGGAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>GeneID106680947_uncharacterized_LOC106680947%2C_transcript_variant_X3
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactcCTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTCCATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCCCTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC$
sequence DNA fasta • 1.8k views
ADD COMMENT
0
Entering edit mode

a good start is going through the biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html it explains how to read in a fasta file . Although you would maybe also need to understand the basics of python -> split the header , look up the relevant parts in a dictionary etc ..

ADD REPLY
0
Entering edit mode

Yes, python, perl or any other program languages could solve this problem.

ADD REPLY
3
Entering edit mode
5.1 years ago
cat jeter.fa  | paste - - | cut -f 1 | tr ";" "\n" | grep -F 'GeneID:' | cut -d ':' -f 2  | sort | uniq | while read G; do cat jeter.fa | paste  - -  | grep -F "GeneID:${G};" | tr "\t" "\n" | awk -v G=$G -F '[,;]' 'NR==1 {printf(">%s %s\n",G,$3);} (NR%2==0) {print}' ; done

>106680938 product=zinc finger protein 391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$
atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaag
GCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaag
ACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$
gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAG
GAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCG
GTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAG
GAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>106680947 product=uncharacterized LOC106680947%2C transcript variant X3
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$
CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactc
CTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTC
CATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$
ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCC
CTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$
CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC
ADD COMMENT
0
Entering edit mode

Thanks Pierre for the recommended commands. I tried to use it, adding a command to direct output to a new file, but it resulted in an empty file. Without that included, it prints sequences it concatenates to the screen (it does not print sequences that do not get concatenated with anything else). Perhaps I'm not providing this command in the correct place?

cat jeter.fa  | paste - - | cut -f 1 | tr ";" "\n" | grep -F 'GeneID:' | cut -d ':' -f 2  | sort | uniq | while read G; do cat jeter.fa | paste  - -  | grep -F "GeneID:${G};" | tr "\t" "\n" | awk -v G=$G -F '[,;]' 'NR==1 {printf(">%s %s\n",G,$3);} (NR%2==0) {print}'  > jeter_merged.fa; done
ADD REPLY
0
Entering edit mode

actually, I need to make it 'append' (>>) to the output file or else it just overwrites the output file with the last set of exons it concatenated.

ADD REPLY

Login before adding your answer.

Traffic: 1744 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6