combine headers from two fasta files
2.2 years ago

Dear all, I am in the following situation. I have two files: (1) a collection of sequences in FASTA format for emu, and (2) a list of those sequences with additional taxonomic information. Below is a "fake" example of each:

  1. genome.txt
 > 2591237:ncbi:1 [MK211378]
mammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammammamammammmammammammammammammammammmammammammammammammamma

 >11120:ncbi:1011 [MG021194]
banananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabananabanananavananabananabanana
  2. lista.txt
1120   ncbi    1011 [MG021194] 11120   Infectious bronchitis virus             scientific name

1237 ncbi    1 [MK211378] 2591237    Coronavirus BtRs-BetaCoV/YN2018D                scientific name

What I want to obtain is an "extended" version of genome.txt in which the header of each sequence is combined with the matching information from lista.txt. The join could be done on the sequence ID, which is already unique (e.g. MK211378). I have already tried the join (bash) command and awk, but without success.

Please, can someone help me?

Thank you very much.

Emilio

fasta bash awk header • 823 views
2.2 years ago
iraun 6.2k

Hi! Try this and check if it does what you need.

It is obviously not the most elegant solution, but maybe you can take it from here and adapt it to your needs.

awk 'FNR==NR{split($0,a,"[");split(a[2],b,"]"); c[b[1]]=$0; next}NR>1{split($0,a,"[");split(a[2],b,"]"); if (b[1] in c) { print $0"\t"c[b[1]]} else {print}}' lista.txt genome.txt
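
In case it helps anyone adapting the one-liner, here is a sketch of the same idea spread over several lines and restricted to header lines (those starting with `>`), so that sequence lines are never modified. The sample files are shortened stand-ins for the ones in the question, the three-column tab-separated layout of lista.txt is an assumption, `XX000000` is a made-up accession included only to show an unmatched header, and the `accession` helper is a name introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: append the matching lista.txt row to each FASTA header, keyed on the
# accession in square brackets. All file contents below are illustrative.
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Shortened stand-in for genome.txt; XX000000 is a made-up, unmatched accession.
cat > "$workdir/genome.txt" <<'EOF'
>2591237:ncbi:1 [MK211378]
ACGTACGT
>11120:ncbi:1011 [MG021194]
TTGACCA
>999:ncbi:7 [XX000000]
GGGG
EOF

# Stand-in for lista.txt (assumed layout: tab-separated, accession in brackets).
printf '%s\t%s\t%s\n' \
  '1011 [MG021194]' 11120 'Infectious bronchitis virus' \
  '1 [MK211378]' 2591237 'Coronavirus BtRs-BetaCoV/YN2018D' \
  > "$workdir/lista.txt"

merged=$(awk '
  # Return the text between the first "[" and the following "]", or "" if absent.
  function accession(line) {
    if (match(line, /\[[^]]+\]/)) return substr(line, RSTART + 1, RLENGTH - 2)
    return ""
  }
  FNR == NR {                 # first file (lista.txt): index each row by accession
    id = accession($0)
    if (id != "") info[id] = $0
    next
  }
  /^>/ {                      # second file (genome.txt): header lines only
    id = accession($0)
    if (id in info) { print $0 "\t" info[id]; next }
  }
  { print }                   # sequence lines and unmatched headers pass through
' "$workdir/lista.txt" "$workdir/genome.txt")

printf '%s\n' "$merged"
```

To try it on real data, drop the sample-file setup and point the awk call at your own lista.txt and genome.txt. Headers whose accession is not found in lista.txt are printed unchanged.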

Thanks iraun. The merger line is perfect.

No worries, please consider marking the answer as accepted if it fixed your problem :).
