Question

Remove reference sequence from multifasta file

5 weeks ago

FJCF • 0

Hi everyone.

I'm a complete beginner in this field and I'm having a hard time with something that looked very simple. I'm working with a large bacterial genome dataset and I need to remove a reference sequence from a multifasta file (both its ID and its sequence). It is the first sequence in the file, but I can't find a simple way to do this. I've heard that seqkit could be useful, but I don't really know how to use it.

I'd be grateful if someone could help me with this.

Thanks in advance!

fasta multifasta • 289 views

updated 5 weeks ago by lieven.sterck 15k • written 5 weeks ago by FJCF • 0

You might get some inspiration from this post: How To Remove Certain Sequences From A Fasta File, or from this one: How do I remove certain sequences in fast based on header?

5 weeks ago by lieven.sterck 15k

score 0 · Answer 1 · 2024-11-12

As long as your file really is a multi-fasta file and the sequence you want to remove is the first one in the file, the following will work:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < multifasta.fa | awk 'NR>1' | tr "\t" "\n"  > new.fa

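Explanation: The first awk command linearizes the multi-line multi-fasta file so that each record occupies a single line (the header, a tab, then the whole sequence joined together). The second awk command (after the pipe) drops the first line, which is now the entire first record. The final tr command turns each tab back into a newline, restoring the FASTA layout with every sequence on one unwrapped line. Note that this assumes the headers contain no tab characters, since tr would split those as well.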

Replace multifasta.fa with the name of your input file, and replace new.fa with whatever name you want for the output file.
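
If you would rather keep the original line wrapping of the remaining sequences, the same "drop the first record" idea can probably be done in a single awk pass. This is only a minimal sketch under the same assumption (the reference is the first record in the file). The file names demo.fa and demo_noref.fa and the IDs in the toy input are made up for illustration.

```shell
# Build a tiny demo file (hypothetical IDs; use your real multifasta file instead).
printf '>ref\nACGT\nACGT\n>seq1\nTTTT\nGGGG\n>seq2\nCCCC\n' > demo.fa

# n counts header lines. The bare condition "n > 1" prints a line only once
# the second header has been seen, so the whole first record (its header and
# all of its sequence lines) is skipped and everything else passes through unchanged.
awk '/^>/ {n++} n > 1' demo.fa > demo_noref.fa

cat demo_noref.fa
```

Because nothing is linearized here, tabs in headers are not an issue and multi-line sequences stay multi-line. It is worth checking the result with something like grep -c '^>' on both files: the output should contain exactly one header fewer than the input.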