Remove reference sequence from multifasta file
5 weeks ago
FJCF • 0

Hi everyone.

I'm a complete beginner in this field and I'm having a hard time with something that looked very simple. I'm working with a large bacterial genome dataset and I need to remove a reference sequence from a multifasta file (both its ID and its sequence). It is the first sequence in the file, but I can't find a simple way to do this. I've heard that seqkit could be useful, but I don't really know how to use it.

I'd be grateful if someone could help me with this.

Thanks in advance!

fasta multifasta • 291 views
5 weeks ago
GenoMax 148k

As long as your file is actually "multi-fasta" and the sequence you want to remove is the first one in the file, the following will work:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < multifasta.fa | awk 'NR>1' | tr "\t" "\n"  > new.fa

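Explanation: The first awk command linearizes the multi-line multi-fasta file, so that each record sits on a single line as header, tab, sequence. The second awk command (after the pipe) drops the first line, which is the first record. The final tr command turns each tab back into a newline, restoring FASTA format with every sequence on one line.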

Replace multifasta.fa with your input file, and new.fa with whatever name you want for the output file.
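If you would rather keep the original line wrapping of the sequences, a shorter awk one-liner may do the same job without linearizing. This is only a minimal sketch: it assumes the reference is the first record and that every header line starts with ">". The sample records written by printf are made up for illustration.

```shell
# Build a tiny example file (hypothetical records, for illustration only).
printf '>ref\nAAAA\nCCCC\n>seq1\nGGGG\nTTTT\n>seq2\nACGT\n' > multifasta.fa

# n counts header lines seen so far; a line is printed only when n > 1,
# i.e. from the second header onwards, so the first record is skipped.
awk '/^>/ {n++} n > 1' multifasta.fa > new.fa

cat new.fa
```

Because nothing is joined or split, the remaining records come out exactly as they were in the input, multi-line sequences included.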
