I have a single fasta file with multiple exons from various genes. What I would like to do is use the GeneID in each header to find those exon sequences that have the same GeneID and concatenate them into a single sequence. Along with this, I would like to alter the sequence header so that it includes just the GeneID and one of product descriptors (doesn't matter which one where multiple are given). I figure, for the headers some way of delimiting based on ';' and using specific strings to produce the new headers. I also suspect this would have to be a two step approach, probably using python or perl instead of linux.
For example, input fasta file:
>ID=exon-XM_014420980.1-1;GeneID:106680938;product=zinc finger protein 391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$
>ID=exon-XM_014420980.1-2;GeneID:106680938;product=zinc finger protein 391
atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaag
>ID=exon-XM_014420980.1-3;GeneID:106680938;product=zinc finger protein 391
GCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaag
>ID=exon-XM_014420980.1-4;GeneID:106680938;product=zinc finger protein 391
ACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$
>ID=exon-XM_014420980.1-5;GeneID:106680938;product=zinc finger protein 391
gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAG
>ID=exon-XM_014420980.1-6;GeneID:106680938;product=zinc finger protein 391
GAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCG
>ID=exon-XM_014420980.1-7;GeneID:106680938;product=zinc finger protein 391
GTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAG
>ID=exon-XM_014420980.1-8;GeneID:106680938;product=zinc finger protein 391
GAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>ID=exon-XM_024358664.1-7;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-6;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$
>ID=exon-XM_024358664.1-6;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-5;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactc
>ID=exon-XM_024358664.1-5;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-4;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTC
>ID=exon-XM_024358664.1-4;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-3;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$
>ID=exon-XM_024358664.1-3;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-2;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCC
>ID=exon-XM_024358505.1-1;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X2
CTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$
>ID=exon-XM_024358664.1-2;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-1;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC$
Desired outcome in a new fasta file:
>GeneID106680938_zinc_finger_protein_391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaagGCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaagACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAGGAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCGGTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAGGAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>GeneID106680947_uncharacterized_LOC106680947%2C_transcript_variant_X3
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactcCTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTCCATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCCCTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC$
a good start is going through the biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html it explains how to read in a fasta file . Although you would maybe also need to understand the basics of python -> split the header , look up the relevant parts in a dictionary etc ..
Yes, python, perl or any other program languages could solve this problem.