How to match fasta header with the first column of a text file and append the second column of the text file to the end of the fasta header
2
2
Entering edit mode
6.1 years ago
timmers ▴ 30

Aloha!

I have a fasta file that looks like the following:

>FFSA34B_100_M7_ID10014
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>ART01B_100_M7_ID10021
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>PHR01B_100_M7_ID100236
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG

I have a text file that looks like the following:

FFSA34B_100_M7_ID10014  2
ART01B_100_M7_ID10021 454
PHR01B_100_M7_ID100236 43

And I would like to append the second column of the text file to the matching fasta header to produce the following output:

>FFSA34B_100_M7_ID10014_2
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>ART01B_100_M7_ID10021_454
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>PHR01B_100_M7_ID100236_43
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG

The fasta headers and the text file in the real example will not be in the same order so the command will have to match the fasta header to the first column of the text file and then append the second column value to the fasta header. Anyone know of a why to easily do this?

Thanks in advance!

sequence • 2.6k views
ADD COMMENT
3
Entering edit mode
6.1 years ago

linearize, sort and join (assuming delimiter is tab)

$ join -t $'\t' -1 1 -2 1 \
   <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa | cut -c 2- | sort -t $'\t' -k1,1) \
   <(sort -t $'\t' -k1,1 list.txt) |\
   awk -F '\t' '{printf(">%s_%s\n%s\n",$1,$3,$2);}'

>ART01B_100_M7_ID10021_454
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>FFSA34B_100_M7_ID10014_2
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>PHR01B_100_M7_ID100236_43
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG
ADD COMMENT
0
Entering edit mode

This is great if tab delimited! Question, how would you modify this if the text delimiter is a space?

ADD REPLY
2
Entering edit mode

Please run man join, man sort and man awk and look for the options to change delimiters. Or even easier, given that Pierre has explicitly mentioned those options and set the delimiter in all of his commands, read the commands carefully.

ADD REPLY
1
Entering edit mode
6.1 years ago
$  awk '/>/ {getline seq} {sub (">","",$0);print $0, seq}' test.fa | sort -k1 | join -1 1 -2 1 - <(sort -k1 head.txt) -o 1.1,2.2,1.2 | awk '{print ">"$1"_"$2"\n"$3}'

>ART01B_100_M7_ID10021_454
TGTCTTCTAATATTTCCCATGGAGGGGCCTCTGTTGATATGGCTACTTTTTCTCTCCACCTGGCTGGTGTGTCCTCTATTTTGGGTGCTATCAATTTTATAACTACTATTATTAATATGCGGAGTAATTCCTTGTCTTTTGTGCGGGTTCCTTTGTTTGT
>FFSA34B_100_M7_ID10014_2
ATCTAACAATGTTGCTCATGCAGGCCCTGCAGTAGATTTAACCATTCTATCCCTTCACCTAGCAGGTGTATCCTCCTTAATAGGAGCCATCAATTTTACAACTACTATTGCTAACAGACGTTTAGAAGGTATACCTACAGAAAAAATACCCTTATTTATT
>PHR01B_100_M7_ID100236_43
TAAGAGGAAATGTAGCCCATAGTGGACCTGCAGTAGATTTAGGAATTTTTTCTTTACATTTAGCTGGGGTTTCTTCTATTTTAGGGGCTATTAACTTTATTACTACAATTATTAGTATACGGGAGAATATGGGGATAGAAAAAGTTACTTTATTTAGTTG
ADD COMMENT
0
Entering edit mode

How would you modify this command if the fasta is formatted as below to avoid returning just the first line of the sequence?

 >FFSA34B_100_M7_ID10014
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGGTTCAGTGGATTTGGTAATTT
TTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGCTATGAATTTTATTACGACAATAATAAATATGAGGGCGCCA
GGGATAACTTTTGATAGAATGCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGATCCTGCAGGAGGAGGGGACC
CCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATTTAGCTGGGGTTTCAAGTATT
CTTGGTGCTATTAATTTTATCACTACAACTATCAACATACGTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGT
CTGGTCTGTTAAATTAACTGCTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT
>PHR01B_100_M7_ID100236
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATTTAGCGGGTATTTCTTCTCTA
TTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGCGCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCC
TTGATCCGTGCTAATTACTGCAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACCAGCATCTGTTT
ADD REPLY
0
Entering edit mode

I would linearize the fasta sequence. If that is not an option, one can use seqkit to linearize and then format it back to 60 chars per line.

code:

$ seqkit seq -w 0 test.fa | awk '/>/ {getline seq} {sub (">","",$0);print $0, seq}' | sort -k1 | join -1 1 -2 1 - <(sort -k1 ids.txt) -o 1.1,2.2,1.2| awk '{print ">"$1"_"$2"\n"$3}' | seqkit seq -w 60

output:

>ART01B_100_M7_ID10021_454
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATT
TAGCTGGGGTTTCAAGTATTCTTGGTGCTATTAATTTTATCACTACAACTATCAACATAC
GTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGTCTGGTCTGTTAAATTAACTG
CTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATC
AACATTTATTT
>FFSA34B_100_M7_ID10014_2
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGG
TTCAGTGGATTTGGTAATTTTTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGC
TATGAATTTTATTACGACAATAATAAATATGAGGGCGCCAGGGATAACTTTTGATAGAAT
GCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGA
TCCTGCAGGAGGAGGGGACCCCATCTTATATCAACATTTATTT
>PHR01B_100_M7_ID100236_43
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATT
TAGCGGGTATTTCTTCTCTATTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGC
GCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCCTTGATCCGTGCTAATTACTG
CAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACC
AGCATCTGTTT

Input:

$ cat test.fa 
>FFSA34B_100_M7_ID10014
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGGTTCAGTGGATTTGGTAATTT
TTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGCTATGAATTTTATTACGACAATAATAAATATGAGGGCGCCA
GGGATAACTTTTGATAGAATGCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGATCCTGCAGGAGGAGGGGACC
CCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATTTAGCTGGGGTTTCAAGTATT
CTTGGTGCTATTAATTTTATCACTACAACTATCAACATACGTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGT
CTGGTCTGTTAAATTAACTGCTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT
>PHR01B_100_M7_ID100236
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATTTAGCGGGTATTTCTTCTCTA
TTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGCGCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCC
TTGATCCGTGCTAATTACTGCAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACCAGCATCTGTTT

$ cat ids.txt 
FFSA34B_100_M7_ID10014  2
ART01B_100_M7_ID10021   454
PHR01B_100_M7_ID100236  43
ADD REPLY
0
Entering edit mode

If you are okay with 2 step solution, here is one (a simpler one) with seqkit without linearization of sequence:

$ awk -v OFS="\t" '{print $1,$1"_"$2}' ids.txt > ids2.txt
$ seqkit replace  -p '(.+)$' -r '{kv}' test.fa -k ids2.txt -o test2.fa

output:

$ cat test2.fa 
>FFSA34B_100_M7_ID10014_2
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGG
TTCAGTGGATTTGGTAATTTTTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGC
TATGAATTTTATTACGACAATAATAAATATGAGGGCGCCAGGGATAACTTTTGATAGAAT
GCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGA
TCCTGCAGGAGGAGGGGACCCCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021_454
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATT
TAGCTGGGGTTTCAAGTATTCTTGGTGCTATTAATTTTATCACTACAACTATCAACATAC
GTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGTCTGGTCTGTTAAATTAACTG
CTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATC
AACATTTATTT
>PHR01B_100_M7_ID100236_43
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATT
TAGCGGGTATTTCTTCTCTATTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGC
GCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCCTTGATCCGTGCTAATTACTG
CAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACC
AGCATCTGTTT

$ cat ids2.txt 
FFSA34B_100_M7_ID10014  FFSA34B_100_M7_ID10014_2
ART01B_100_M7_ID10021   ART01B_100_M7_ID10021_454
PHR01B_100_M7_ID100236  PHR01B_100_M7_ID100236_43

input:

$ cat test.fa 
>FFSA34B_100_M7_ID10014
AGTCGGTACTGGATGAACTGGTTACCCCCCTTTATCCGGGATTCAGGCTCATTCGGGAGGTTCAGTGGATTTGGTAATTT
TTAGTTTACATTTGGCAGGAATTTCTTCTATATTAGGCGCTATGAATTTTATTACGACAATAATAAATATGAGGGCGCCA
GGGATAACTTTTGATAGAATGCCATTATTTGTATGATCTATTTTAGTAACTGCTGTTTTATTATTATTGTCTTTACCTGT
TTTAGCTGGGGCTATAACTATGTTATTAACAGATAGAAATTTTAATACTGCATTTTTTGATCCTGCAGGAGGAGGGGACC
CCATCTTATATCAACATTTATTT
>ART01B_100_M7_ID10021
TTTCGGGGAATTTAGCTCATTCTGGAGGTTCAGTTGACCTAGCTATTTTTTCTTTACATTTAGCTGGGGTTTCAAGTATT
CTTGGTGCTATTAATTTTATCACTACAACTATCAACATACGTTGGAATGGTTATCAGTTTGAACACCTTCCATTGTTTGT
CTGGTCTGTTAAATTAACTGCTATATTACTTCTTCTTTCACTACCCGTTTTAGCAGGTGCGATTACTATGTTGCTAACAG
ATCGAAATTTTAATACGTCATTTTTTGATCCTTGTGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT
>PHR01B_100_M7_ID100236
TAAGCGCAAATATCTCCCATAGGGGTATGTCAGTTGATTTTGCAATCTTTTCTCTCCATTTAGCGGGTATTTCTTCTCTA
TTGGGGGCTGTAAATTTTATCGGAACTTTAGGAAATTTGCGCACATTGGGCATGTCTATAGAAAAAATACCCTTATTTCC
TTGATCCGTGCTAATTACTGCAATTTTGTTACTTCTTTCTTTACCCGTATTGGCGGGAGCCATTACTATATTATTGACTG
ATCGTAACCTGAATACCAGGTTTTACGATACTAGGGGGGGAGGTGACCCTGTCTTATACCAGCATCTGTTT

$ cat ids.txt 
FFSA34B_100_M7_ID10014  2
ART01B_100_M7_ID10021   454
PHR01B_100_M7_ID100236  43
ADD REPLY
0
Entering edit mode

One step is enough:

seqkit replace -p '(.+)' -r '${1}_{kv}' -k ids.txt test.fa > out.fa

$ grep '>' out.fa 
>FFSA34B_100_M7_ID10014_2
>ART01B_100_M7_ID10021_454
>PHR01B_100_M7_ID100236_43
ADD REPLY

Login before adding your answer.

Traffic: 1706 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6