How to trim a FASTA header down to only the desired part (the OTU ID) on Ubuntu?
3.2 years ago
Jelo • 0

Hi all,

I have a fasta file with this header

>10005_M12.fastq    Otu0001|242290|M1.fastq-M12.fastq-M5.fastq-URTM6.fastq-M7.fastq-M9.fastq

I want to remove everything in the header except the OTU (with its number). I used this command: sed 's/>M.*Otu/>Otu/g' rep.fasta | sed -e 's/|.*//g' > rep.otu.fasta but it removed only the part after the OTU, as follows:

>10005_M12.fastq    Otu0001

I want the header to look like this: >Otu0001

Any advice would be appreciated.

Thank you

microbiome fasta NGS • 1.8k views

Thank you all for the help.


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

3.2 years ago
 sed '/^>/s/.*[ \t]*\(Otu[0-9]*\).*/>\1/' in.fa
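
For anyone comparing this with the command in the question, here is a small sketch of the difference. The file names demo.fa and demo.otu.fa and the sequence line are made up for illustration. The pattern in the question, >M.*Otu, only matches a header that begins with ">M", and the example header begins with ">1", so the first sed never fires and only the second one (the part after |) takes effect.

```shell
# Tiny test file; the sequence line is invented for illustration.
printf '>10005_M12.fastq    Otu0001|242290|M1.fastq-M12.fastq\nACGTACGT\n' > demo.fa

# The question's first pattern needs ">M" at the start of the match,
# but this header starts with ">1", so the header is printed unchanged.
sed 's/>M.*Otu/>Otu/g' demo.fa | head -n 1

# The command from this answer: act only on header lines, capture the
# OTU ID, and rewrite the whole line as ">" plus the captured ID.
sed '/^>/s/.*[ \t]*\(Otu[0-9]*\).*/>\1/' demo.fa > demo.otu.fa
cat demo.otu.fa
```

This assumes each header contains exactly one "Otu" followed by digits; sequence lines are left untouched because of the /^>/ address.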
3.2 years ago

seqkit answer also for posterity

seqkit replace -p "\|.*" in.fa
seqkit replace -p "^.+\s|\|.*" foo.fasta

or

seqkit replace -p ".+\s(\w+)\|.+" -r "\$1" foo.fasta

or just

seqkit seq -i --id-regexp "\s(\w+)\|" foo.fasta
3.2 years ago

If the sequence lines contain no |, try this (it keeps everything before the first | on every line):

$ awk -F "|" '{print $1}' test.fa 

If you are not sure, you can use this, which splits only the header lines:

$ awk -F "|" '/^>/ {print $1}; !/^>/' test.fa

or this:

$ awk -F "|" '{print ($0 ~ /^>/)?$1:$0}' test.fa
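
Note that the awk commands above keep everything before the first |, so for the header in the question they would still leave the 10005_M12.fastq prefix in place. A possible variant that also drops the prefix is sketched below. It assumes the OTU ID is the last whitespace-separated token before the first |, and the file names test_demo.fa and test_demo.otu.fa are made up for illustration.

```shell
# Tiny test file; the sequence line is invented for illustration.
printf '>10005_M12.fastq    Otu0001|242290|M1.fastq\nACGT\n' > test_demo.fa

# Header lines: drop the leading ">", cut at the first "|", split what is
# left on whitespace, and print only the last token with ">" put back.
# All other lines (the sequence) are printed unchanged.
awk '
/^>/ {
    sub(/^>/, "")
    sub(/\|.*/, "")
    n = split($0, parts, /[ \t]+/)
    print ">" parts[n]
    next
}
{ print }
' test_demo.fa > test_demo.otu.fa

cat test_demo.otu.fa
```

If a header has whitespace directly before the |, the last token would be empty, so it is worth checking the output headers before using the file downstream.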