Rename fasta headers from CSV
1
0
Entering edit mode
5.5 years ago
SaltedPork ▴ 170

I have a CSV file that looks like this:

   201200175,A/name1/175/2012
   201200287,A/name2/287/2012
   201200845,A/name3/845/2012

Currently my fasta headers look like:

>201200175_AA
>201200175_AB
>201200175_BB

and I want to change it to:

>A/name1/175/2012_AA
>A/name1/175/2012_AB
>A/name1/175/2012_BB

I want to preserve the suffix (_AA etc..). I have multiple fasta files, and they are all multifastas.

I was wondering if there is a quicker way in bash rather than writing out some Perl...

bash perl python • 3.7k views
ADD COMMENT
1
Entering edit mode

Do the fasta files have linebreaks in seqs? If not, then perhaps e.g.

join -1 1 -2 1 -t $'\t' \
    <(sed 's/,/\t/' file.csv | sort -k1,1) \
    <(paste - - <file.fa | sed -e 's/_/\t/' -e 's/>//' | sort -k1,1) \
    | awk 'BEGIN{FS="\t"}{print ">"$2"_"$3"\n"$4}'
ADD REPLY
1
Entering edit mode

can you explain the commands, it would be so helpful to me

ADD REPLY
2
Entering edit mode

man join, man sed, man sort, man paste

You can try them out yourself. e.g. sed 's/,/\t/' file.csv you will see that it replaces the first commas with tabs, paste - - <file.fa etc. you can see yourself..

ADD REPLY
0
Entering edit mode

Thanks, no they don't.

ADD REPLY
5
Entering edit mode
5.5 years ago
AK ★ 2.2k

Hi SaltedPork,

Try seqkit:

$ cat changes.csv
201200175,A/name1/175/2012
201200287,A/name2/287/2012
201200845,A/name3/845/2012

$ cat sample.fasta
>201200175_AA
AGCTAGCTAGCTGCATGCTGCATGCTACG
>201200175_AB
AGCTGCATGCTAGCTGATCGTAGCTAGCT
>201200175_BB
GCTAGCTAGCTGCATGCTAGCTAGCTGCT

$ seqkit replace --kv-file <(sed "s/,/\t/g" changes.csv) --pattern "^(\d+)_(\w+)" --replacement "{kv}_\${2}" sample.fasta
[INFO] read key-value file: /dev/fd/63
[INFO] 3 pairs of key-value loaded
>A/name1/175/2012_AA
AGCTAGCTAGCTGCATGCTGCATGCTACG
>A/name1/175/2012_AB
AGCTGCATGCTAGCTGATCGTAGCTAGCT
>A/name1/175/2012_BB
GCTAGCTAGCTGCATGCTAGCTAGCTGCT
ADD COMMENT
0
Entering edit mode

Thanks for this answer, very clear.

I am getting fasta headers like >_AA, >_BB. The bit from the CSV is not there.

Could you explain what the sed and regex bits are doing. Sed is replacing commas with tabs in the csv, is this because seqkit doesn't handle the commas? Also what is the _ doing in the pattern bit?

ADD REPLY
2
Entering edit mode

Hi SaltedPork,

Sed is replacing commas with tabs in the csv, is this because seqkit doesn't handle the commas?

Yes, as seqkit replace -h shows: -k, --kv-file string tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)

"^(\d+)_(\w+)" is the regex for 201200175_AA, 201200175_AB, and 201200175_BB. It's the part that you want to replace. _ is the symbol between 201200175 and AA.

ADD REPLY
0
Entering edit mode

Thanks, but my headers don't have the second value in them, just the _AA. Any ideas? I've tried playing with the regex but no luck.

ADD REPLY
2
Entering edit mode

Ah... Sorry didn't realize that you got leading spaces in your csv. Try this: seqkit replace --kv-file <(sed -r "s/^\s+//g; s/,/\t/g" changes.csv) --pattern "^(\d+)_(\w+)" --replacement '{kv}_${2}' sample.fasta

ADD REPLY
2
Entering edit mode

i have similar task and i am having difficulty in getting the key values out of my key-value file , at the replacement. i have tried several versions of this seqkit replacement command but not able to get the right result. my key-value file:

WP_000014594.1 WP_000014594.1#0001

WP_000025662.1 WP_000025662.1#0001

so the column1 has accession numbers/ids that will find in my fasta file header and i need to replace these with values given in column2.

but im either getting blank space there (command a.)or partial ids like this >_000014594 (command b.) command a. $ seqkit replace -p "^(\w+)_(\d+).(\d+)" --replacement "{kv}" -k accesskey_idsvalue1 GCF_000016305.1_ASM1630v1_protein.faa

command b. seqkit replace -p "^(\w+)_(\d+).(\d+)" --replacement '{kv}_${2}' -k accesskey_idsvalue1 GCF_000016305.1_ASM1630v1_protein.faa

ADD REPLY
1
Entering edit mode

Presuming your input files are:

$ cat accesskey_idsvalue1
WP_000014594.1 WP_000014594.1#0001
WP_000025662.1 WP_000025662.1#0001

$ cat GCF_000016305.1_ASM1630v1_protein.faa
>WP_000014594.1 Description
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
>WP_000025662.1 Description
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK

Make sure that (1) the key-value file is tab-delimited (see below) and (2) the argument for pattern -p is correctly bracketed.

$ seqkit replace -h | grep 'kv-file'
  -k, --kv-file string         tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)

Guess your command can be modified as below:

$ seqkit replace -p "^(\w+_\d+\.\d+) " --replacement '{kv} ' -k <(sed -r "s/ /\t/g" accesskey_idsvalue1) GCF_000016305.1_ASM1630v1_protein.faa
[INFO] read key-value file: /dev/fd/63
[INFO] 2 pairs of key-value loaded
>WP_000014594.1#0001 Description
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_000025662.1#0001 Description
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
ADD REPLY

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6