I had a similar situation posted here, which was resolved, but I'm now dealing with differently formatted headers. I have two DNA sequence files I'm working with. One file (e.g., 'company.fasta') has headers that I need to change based on corresponding headers from another file (e.g., 'original.fasta'). Notes: 1) The company.fasta file has some DNA sequences reverse complemented from the original.fasta file, but this shouldn't matter since I'm only wanting to modify the headers. 2) The company.fasta files have sequences (and thus, headers) from different sources, which means the headers display various formats. I'll indicate which header formats I'm targeting below.
The company.fasta file has several different headers, but the ones I'm targeting follow a format similar to the first two headers in the example below. You will notice that some of the headers are very similar to the ones targeted, but have an extra underscore and either a letter A or B. I do not want to modify these headers at this time. Any headers that begin with uce
or ENSOFAS
should also remain unmodified.
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_103_rc
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>Anasa_tristis_comp16311_c0_seq1_0
CTTCTGCAAATGCGCAAGCTGTGTGAAGCTCTTTGTGCACACATTGCACTTAAATGGTCTTTCGCCACTGTGAGTCCTCAGATGCACCTTCAAGTTAGAGAGCTGACCAAACGTTTTGAA
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA1
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG
The original.fasta file that is to be used for replacing the corresponding company.fasta headers. Within this file, all headers follow the same format, except three parts differ in each header: the OFAS######-RA-EXON## strings that show up twice and the probe-source ID. The probe-source ID overlaps a lot with the company.fasta headers.
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS019593-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS019593-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp16311_c0_seq1
CTTCTGCAAATGCGCAAGCTGTGTGAAGCTCTTTGTGCACACATTGCACTTAAATGGTCTTTCGCCACTGTGAGTCCTCAGATGCACCTTCAAGTTAGAGAGCTGACCAAACGTTTTGAAGCAAACATTACATTCATAGTGCATTTTTCCATCTTTTTTCTTCAGCGGATATGGGAGTGACCC
What I expect the output to look like:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS019593-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS019593-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp16311_c0_seq1
CTTCTGCAAATGCGCAAGCTGTGTGAAGCTCTTTGTGCACACATTGCACTTAAATGGTCTTTCGCCACTGTGAGTCCTCAGATGCACCTTCAAGTTAGAGAGCTGACCAAACGTTTTGAA
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA1
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG
I have a python script provided by a member from the linked post I provided at the beginning, which I've modified slightly with a regex pattern as seen below; I need help in modifying the rest to target and modify the new header format (I'm not experienced with python).
#!/usr/bin/env python
import sys
import re
original_fn = sys.argv[1]
company_fn = sys.argv[2]
pattern = '(uce | ENSOFAS | _[AB]_[0-9]+$)'
map = {}
with open(original_fn, "r") as original_fh:
for line in original_fh:
if line.startswith('>'):
try:
(k, v) = line.strip().split('|')
# remove trailing space from key
k = k[:-1]
map[k] = v
except ValueError as err:
k = line.strip()
map[k] = None
with open(company_fn, "r") as company_fh:
for line in company_fh:
if line.startswith('>') and not re.search(pattern, line.strip()):
try:
(k, v) = line.strip().split('|')
# remove trailing character from key
k = k[:-1]
except ValueError as err:
k = line.strip()
if k not in map:
sys.stdout.write("%s\n" % (k))
else:
sys.stdout.write("%s |%s\n" % (k, map[k]))
else:
sys.stdout.write("%s" % (line))