How to delete the last 5 characters from a FASTA header?
4.7 years ago
angela1 • 0

Hi,

I am trying to remove the last 5 characters from each FASTA header in my sequencing data. I have ≈400,000 sequences and have tried to use the sed command in the terminal to do this.

Input text:

>1-4-8.45  
TAGGGAGA

Expected Output:

>1-4           
TAGGGAGA

How can I use the sed command to remove the last 5 characters from my FASTA headers?

FASTA header sed • 3.5k views
4.7 years ago
wm ▴ 570

Using sed. Note that this solution does not account for trailing whitespace in the header.

$ sed '/^>/s/.\{5\}$//' in.fa
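
If the headers do carry trailing whitespace, as the example input in the question appears to, one possible variant is to strip that whitespace first and only then drop the last 5 characters. This is a minimal sketch under that assumption; the sample file and the variable names `tmp_fa` and `trimmed` are made up for illustration.

```shell
# Build a small sample file; the first header has two trailing spaces.
tmp_fa=$(mktemp)
printf '>1-4-8.45  \nTAGGGAGA\n>2-7-3.10\nACGT\n' > "$tmp_fa"

# On header lines only (/^>/):
#   1. remove any trailing whitespace
#   2. remove the last 5 characters
trimmed=$(sed -e '/^>/s/[[:space:]]*$//' -e '/^>/s/.\{5\}$//' "$tmp_fa")
printf '%s\n' "$trimmed"

rm -f "$tmp_fa"
```

Headers with fewer than 5 characters after the `>` would lose the `>` itself, so it may be worth checking for such cases before running this on real data.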

For FASTA and FASTQ files, bioawk (https://github.com/lh3/bioawk) is also a good option; it separates the header into $name and $comment.

$ bioawk -cfastx '{id=substr($name, 1, length($name) - 5); print ">"id"\n"$seq}' in.fa
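
Where bioawk is not installed, roughly the same idea can be sketched with plain awk. This assumes the sequence name is the first whitespace-delimited field of the header and, like the bioawk command, it drops anything after the name. The sample file and the names `tmp_fa` and `awk_out` are illustrative only.

```shell
# Sample record whose header has a name followed by a comment.
tmp_fa=$(mktemp)
printf '>1-4-8.45 sample comment\nTAGGGAGA\n' > "$tmp_fa"

awk_out=$(awk '
/^>/ {
    name = substr($1, 2)                          # first field without the leading ">"
    print ">" substr(name, 1, length(name) - 5)   # keep all but the last 5 characters
    next
}
{ print }                                         # sequence lines pass through unchanged
' "$tmp_fa")
printf '%s\n' "$awk_out"

rm -f "$tmp_fa"
```

Unlike the bioawk version, this leaves sequence lines exactly as they are in the input rather than joining them onto one line.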
4.7 years ago
Ram 44k

What have you tried? This sort of problem has been addressed on the site multiple times.

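sed can match the first character of each line to pick the lines on which an operation is performed, so you can use that to restrict the operation to header lines only. You can also match the last five characters with the regex (.{5})$. That is extended regex syntax, so it needs sed -E; in sed's default basic syntax it is written \(.\{5\}\)$.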

Please use these hints to get to the solution.
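
Put together, the two hints might look like the sketch below. It assumes a sed that accepts `-E` for extended regular expressions (GNU and BSD sed both do) and headers without trailing whitespace; the sample file and the names `tmp_fa` and `ere_out` are illustrative only.

```shell
# Sample input matching the question, without trailing whitespace.
tmp_fa=$(mktemp)
printf '>1-4-8.45\nTAGGGAGA\n' > "$tmp_fa"

# /^>/       : act only on lines that start with ">"
# (.{5})$    : the last five characters of the line (extended regex via -E)
# replacement is empty, so those five characters are deleted
ere_out=$(sed -E '/^>/s/(.{5})$//' "$tmp_fa")
printf '%s\n' "$ere_out"

rm -f "$tmp_fa"
```

The sequence line is left alone because it does not start with `>`.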
