I have two DNA sequence files, one of which was generated by a company based on a data file I sent them. In the company's file, some sequence headers differ from those in the data file I sent, and it's important that all headers conform to the format I sent so that certain programs can use the file. Is there a way to search for part or all of each sequence header in the company's file and then replace the entire line with the corresponding header from my original data file? Additional notes: 1) The company reverse complemented some of the sequences, and this was necessary. Thus, I do not want to alter the sequences in the company's file, just make the headers look like those in the original one. 2) The sequences that were reverse complemented have _rc appended to the end of their headers.
For example, the company's header:
>uce-265_p7-|design:hemiptera-v1,designer:faircloth_rc
TGAGTATTCAATATTCCCTGCGCAATATTCAATGGACATACATGGCTATGTTCTTGTTTATTCTATTACATCACTTAAGTCATTCGAAGTTGTGCAGGTCATTTATGAAAAGTTGCTCGA
Original file:
>uce-265_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-265,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold391,probes-global-start:41871,probes-global-end:41991,probes-local-start:0,probes-local-end:120
TCGAGCAACTTTTCATAAATGACCTGCACAACTTCGAATGACTTAAGTGATGTAATAGAATAAACAAGAACATAGCCATGTATGTCCATTGAATATTGCGCAGGGAATATTGAATACTCA
The company's headers should look like the original. My initial thought is an if/then loop with grep, but I'm having trouble imagining how this would work in this case.
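One possible approach, sketched in Python rather than grep: because some sequences were reverse complemented, the records cannot be matched on the sequence itself, so the sketch below matches on the probe ID at the start of each header (e.g. uce-265_p7). It assumes the ID is everything before the first "|", with a trailing space (original file) or a trailing "-" (company file) removed; that rule is inferred from the single example above and may need adjusting. The function names (header_key, load_original_headers, restore_headers, rewrite_fasta) are made up for illustration. Sequence lines are passed through unchanged, and since the whole header line is replaced, the _rc suffix disappears along with it.

```python
def header_key(header_line):
    """Return the probe ID (e.g. 'uce-265_p7') from a FASTA header line.

    Assumption: the ID is the text before the first '|', minus the leading
    '>' and any trailing whitespace or '-' (company file: 'uce-265_p7-|',
    original file: 'uce-265_p7 |').
    """
    name = header_line.lstrip(">").split("|", 1)[0]
    return name.strip().rstrip("-")


def load_original_headers(lines):
    """Map probe ID -> full header line, taken from the original file."""
    headers = {}
    for line in lines:
        if line.startswith(">"):
            headers[header_key(line)] = line.rstrip("\n")
    return headers


def restore_headers(company_lines, original_headers):
    """Yield the company's lines with each header swapped for the original.

    Sequence lines are left untouched. A header whose ID is not found in
    the original file is kept as it is, so nothing is silently lost.
    """
    for line in company_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield original_headers.get(header_key(line), line)
        else:
            yield line


def rewrite_fasta(original_path, company_path, out_path):
    """Read both FASTA files and write the company's file with restored headers."""
    with open(original_path) as handle:
        original_headers = load_original_headers(handle)
    with open(company_path) as src, open(out_path, "w") as dst:
        for line in restore_headers(src, original_headers):
            dst.write(line + "\n")
```

Calling rewrite_fasta("original.fasta", "company.fasta", "company_fixed.fasta") (file names are placeholders) would write a new file and leave both inputs alone. If two probes in the original file ever reduced to the same ID under the rule above, the later header would overwrite the earlier one in the lookup, so it is worth checking that the number of unique IDs equals the number of headers before trusting the output.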
Correct, sequences were reverse complemented by the company as well in many cases.
What was the reason for reverse complementing the sequences?
To reduce the potential for cross-hybridization.
Not sure what you are referring to?
Posted a new solution to my initial answer, after you gave extra information - thanks!