Replace matching sections of two fastas
4.3 years ago
jamie.pike ▴ 80

I have a master fasta file (File_1.fasta) and another fasta file (File_2.fasta). For every header in File_2.fasta that matches a header in File_1.fasta (apart from the "/rc" suffix), I would like the header and its sequence in File_1.fasta to be replaced with the header and sequence from File_2.fasta.

E.g.:

File_1.fasta

>header1 
ATGCCTTCCTCAAAGGGATACG
>header2 
ATTGGAATTTGCATCCGAGGGC

File_2.fasta

>header2/rc
GCCCTCGGATGCAAATTCCAAT

Output file

>header1
ATGCCTTCCTCAAAGGGATACG
>header2/rc
GCCCTCGGATGCAAATTCCAAT

Are there any tools which will do this? I imagine it can be done with awk but I am not competent enough with awk to do it.

Thank you

fasta awk • 717 views
4.3 years ago
  • linearize both fasta files
  • remove the '/rc' suffix with sed
  • use sort to sort both linearized files on the sequence name
  • use join to select the sequences present in linearized1 but not in linearized2
  • use join to select the sequences in linearized1 and in linearized2, use cut to only select the 2nd sequence
  • convert back to fasta using tr
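
The first three steps might look like the sketch below. It is only one possible reading of the list, not the answerer's own commands. The three-column layout (key, original header, sequence) is an assumption added here so that the "/rc" header from File_2.fasta can be restored at the end; `linearize` is a hypothetical helper name, and the sample files are copied from the question.

```shell
TAB=$(printf '\t')

# Sample inputs copied from the question
printf '>header1\nATGCCTTCCTCAAAGGGATACG\n>header2\nATTGGAATTTGCATCCGAGGGC\n' > File_1.fasta
printf '>header2/rc\nGCCCTCGGATGCAAATTCCAAT\n' > File_2.fasta

# linearize (hypothetical helper): one record per line as
#   header <TAB> header <TAB> sequence
# The header is printed twice so the first copy can become the join key
# while the second keeps the original text. Multi-line sequences are joined.
linearize() {
  awk '
    /^>/ {
      if (name != "") print name "\t" name "\t" seq
      name = substr($0, 2)          # drop the leading ">"
      sub(/[ \t\r]+$/, "", name)    # drop trailing whitespace
      seq = ""
      next
    }
    { sub(/\r$/, ""); seq = seq $0 }
    END { if (name != "") print name "\t" name "\t" seq }
  ' "$1"
}

for n in 1 2; do
  linearize "File_${n}.fasta" |
    # strip "/rc" from the key column only, leaving the original header intact
    sed "s#^\([^${TAB}]*\)/rc${TAB}#\1${TAB}#" |
    # join needs both inputs sorted on the key with the same collation
    LC_ALL=C sort -t "$TAB" -k1,1 > "file${n}.tsv"
done
```

After this, file1.tsv and file2.tsv are both sorted on the first column and can be passed to join.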

Great, thank you. Could you please elaborate on the join and cut steps? How do I use join to select the sequences present in linearized1 but not in linearized2, join to select the sequences in both linearized1 and linearized2, and then cut to select only the 2nd sequence? I have had a look at the manual and I don't fully understand.

join -t $'\t' -v 1 -1 1 -2 1 file1.tsv  file2.tsv > only_in_1.tsv
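
For the remaining steps, a minimal sketch follows. It assumes the linearized files have three tab-separated columns (key without "/rc", original header, sequence) and are already sorted on the key; that layout is an assumption of this example, not something stated in the thread. The two .tsv files are created inline from the question's data so the snippet runs on its own, and `replaced.tsv` and `output.fasta` are hypothetical file names.

```shell
TAB=$(printf '\t')

# Assumed sorted, linearized inputs: key, original header, sequence
printf 'header1\theader1\tATGCCTTCCTCAAAGGGATACG\nheader2\theader2\tATTGGAATTTGCATCCGAGGGC\n' > file1.tsv
printf 'header2\theader2/rc\tGCCCTCGGATGCAAATTCCAAT\n' > file2.tsv

# -v 1 prints only the lines of file 1 that have no partner in file 2
LC_ALL=C join -t "$TAB" -v 1 -1 1 -2 1 file1.tsv file2.tsv > only_in_1.tsv

# Without -v, join prints paired lines as:
#   key, header1, seq1, header2, seq2
# cut keeps the key plus the file-2 header and sequence (fields 4 and 5)
LC_ALL=C join -t "$TAB" -1 1 -2 1 file1.tsv file2.tsv |
  cut -f 1,4,5 > replaced.tsv

# Merge both sets, re-sort on the key, drop the key column,
# then turn "header<TAB>sequence" back into two-line FASTA records
cat only_in_1.tsv replaced.tsv |
  LC_ALL=C sort -t "$TAB" -k1,1 |
  cut -f 2,3 |
  sed 's/^/>/' |
  tr '\t' '\n' > output.fasta
```

With the question's data, output.fasta should contain header1 with its original sequence followed by header2/rc with the sequence from File_2.fasta. Records come out sorted by key rather than in the original File_1.fasta order, and records present only in File_2.fasta are not added.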