Bash solution to replace part of FASTA headers which contain matching values in another file
18 months ago
clmattson • 0

Hi all,

This is my first time posting on BioStars! I thought I'd easily find a bash one-liner to take care of this task, but I've been searching with no luck.

I have a fasta, samples.fasta, that looks like this:

>sample01/contig002
ATCG
>sample02/contig001
GCTA
>sample11/contig003
CAGT

I have a text file, sample_key.txt, that has samples (always format 'sampleXX') paired with isolate names. Isolate names have a variety of formats, but none of them contain spaces. sample_key.txt looks like this:

sample01     AAA
sample02     def456
sample03     F7
.....
sample11     H-10

I'm trying to do two things: 1) replace the sample name with the isolate name (i.e. the value in the key file) and 2) replace the '/' in the original header with a '_'. I want to keep the second part of the original header, the contig number. My ideal output looks like this:

>AAA_contig002
ATCG
>def456_contig001
GCTA
>H-10_contig003
CAGT

I've tried seqkit replace, but it doesn't seem to work unless the keys match the existing headers exactly, which won't be the case here because sample_key.txt only contains part of each header. Unless you have a really simple way to do it, I don't think simply making a new key file is a useful option.

Thanks!!

fasta awk bash seqkit sed • 1.3k views

FASTA header editing is a widely discussed topic on the forum. Have you searched the forum for existing posts?

Yes, but for some reason none of the suggested answers in the posts I found worked, or they were exclusively for cases with exact matching, which was surprising to me because I usually find awesome solutions on BioStars within 10 mins or so.... I figured out a sort of janky workaround, but after like an hour of trying to get old solutions to work I was ready to give up. Will post said janky workaround in a bit lol

18 months ago

"I've tried seqkit replace, but it doesn't seem to work unless the keys match the existing headers exactly"

$ seqkit replace -p '^(.+?)/(.+)$' -r '{kv}_$2' -k sample_key.txt samples.fasta 
[INFO] read key-value file: sample_key.txt
[INFO] 4 pairs of key-value loaded
>AAA_contig002
ATCG
>def456_contig001
GCTA
>H-10_contig003
CAGT
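
In the command above, the pattern '^(.+?)/(.+)$' captures the text before the first slash as group 1 and the rest as group 2, and {kv} is replaced by the value that sample_key.txt holds for group 1. The thread does not show what happens when a header's sample ID has no entry in the key file, so it may be worth checking for that first. A minimal sketch in plain awk, assuming the two-column key file and the '>sampleXX/contigNNN' header layout from the question (the function name missing_keys is hypothetical, not part of seqkit):

```shell
# missing_keys KEY FASTA
# Print each sample ID that appears in a FASTA header (the text between ">"
# and the first "/") but has no entry in the first column of KEY.
missing_keys() {
  awk '
    # First file (KEY): remember every sample ID that has a value next to it.
    NR == FNR { if (NF >= 2) known[$1] = 1; next }
    # Second file (FASTA): inspect header lines only.
    /^>/ {
      h = substr($0, 2)
      pos = index(h, "/")
      sample = (pos > 0) ? substr(h, 1, pos - 1) : h
      # Report each unknown sample ID once.
      if (!(sample in known) && !(sample in reported)) {
        reported[sample] = 1
        print sample
      }
    }
  ' "$1" "$2"
}
```

Running missing_keys sample_key.txt samples.fasta prints nothing when every header has a matching key.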
18 months ago
Mensur Dlakic ★ 28k

Someone here probably knows how to do this in a single line or two, but this may still be helpful. I suggest you make a copy of samples.fasta before running any of these commands, because they edit the file in place. Everything must be pasted exactly as below, because single and double quotes are not interchangeable.

Replacing the slash with an underscore is simple:

perl -pi -e 's/\//_/g' samples.fasta

Create a script that will take first-column values from sample_key.txt and replace them with second-column values (the ^ characters are placeholders that the second command turns into single quotes):

awk '{print "perl -pi -e ^s/"$1"/"$2"/g^ samples.fasta"}' sample_key.txt > script.sh
perl -pi -e "s/\^/\'/g" script.sh

Run the script and delete it when done:

source script.sh
rm script.sh
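
The same result can probably be had in one pass without a generated script or in-place edits. The following is a minimal sketch rather than a tested drop-in, assuming a whitespace-separated two-column key file and headers of the form '>sampleXX/contigNNN' as in the question; the function name rename_headers is hypothetical. It matches the whole sample ID instead of a substring, touches only header lines, and leaves headers with an unknown sample ID under their original name.

```shell
# rename_headers KEY FASTA
# Print FASTA with ">sample/contig" headers rewritten to ">isolate_contig",
# using the two-column KEY file as a lookup table. Output goes to stdout.
rename_headers() {
  awk '
    # First file (KEY): map sample ID to isolate name; skip lines without two columns.
    NR == FNR { if (NF >= 2) map[$1] = $2; next }
    # Second file (FASTA): rewrite header lines that contain a slash.
    /^>/ {
      h = substr($0, 2)
      pos = index(h, "/")
      if (pos > 0) {
        sample = substr(h, 1, pos - 1)   # text before the first slash
        rest = substr(h, pos + 1)        # contig part, kept as is
        name = (sample in map) ? map[sample] : sample
        print ">" name "_" rest
        next
      }
    }
    # Sequence lines and headers without a slash pass through unchanged.
    { print }
  ' "$1" "$2"
}
```

Usage would be along the lines of rename_headers sample_key.txt samples.fasta > renamed.fasta, which leaves samples.fasta itself untouched.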