How do I remove certain sequences in fasta based on header?
4.4 years ago
tianshenbio ▴ 180

I have a fasta file like this:

>XM_0000001.1 
actact
>XR_0000001.1
atcatc

How do I remove all the sequences with an XR header?

I only want to keep:

>XM_0000001.1
actact
fasta sequence RNA-Seq • 4.2k views
4.4 years ago
shiyeyishang ▴ 10

This is easy to do on Linux.

  1. Step 1: grep '^>' file.fa | sed 's/>//g' > file.fa.id
  2. Step 2: grep -v 'XR_' file.fa.id > file.fa.id.final
  3. Step 3: seqtk subseq file.fa file.fa.id.final > final.fa

PS: seqtk is a separate tool that you need to install.
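
If seqtk is not available, the three steps above can be approximated with standard tools only. This is a minimal sketch, not how seqtk itself works: it assumes the sequence ID is the first word of each header line, reuses the file names from the steps, and writes a small file.fa from the question's example so that it runs as-is.

```shell
# Build a tiny example input (the two records from the question).
printf '>XM_0000001.1\nactact\n>XR_0000001.1\natcatc\n' > file.fa

# Steps 1 and 2 in one pipeline: list the IDs, then drop those starting with XR_.
grep '^>' file.fa | sed 's/^>//' | grep -v '^XR_' > file.fa.id.final

# Step 3 without seqtk: awk reads the ID list first, then prints every
# record whose first header word is in that list (multi-line records too).
awk 'NR == FNR { keep[$1] = 1; next }
     /^>/      { id = substr($1, 2); printing = (id in keep) }
     printing' file.fa.id.final file.fa > final.fa

cat final.fa
```

The awk step switches printing on or off at each header line, so wrapped sequences are kept or dropped together with their header.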

4.4 years ago

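Try GNU sed (e.g. on Ubuntu/Mint). This deletes each XR header plus the one line after it, so it only works when every sequence is on a single line:

$ sed -e '/^>XR/,+1d' test.fa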

If your FASTA has multi-line sequences, use seqkit:

$ seqkit grep -rvip "^XR" test.fa
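
For multi-line FASTA on a machine without seqkit, a plain awk filter is one possible alternative. This is only a sketch, not something from the answer above: the file name filtered.fa and the sample records are made up for illustration, and the match is case-sensitive, unlike the -i flag in the seqkit command.

```shell
# Example input with sequences wrapped over several lines.
printf '>XM_0000001.1\nact\nact\n>XR_0000001.1\natc\natc\n>XM_0000002.1\nggg\n' > test.fa

# At each header line, decide whether the record is kept;
# every line is then printed only while "keep" is true.
awk '/^>/ { keep = ($0 !~ /^>XR/) } keep' test.fa > filtered.fa

cat filtered.fa
```

Because the decision is made once per header and applied to all following lines, it does not matter how many lines each sequence spans.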
4.4 years ago
Hugo ▴ 380

You can try SEDA (https://www.sing-group.org/seda/). The Pattern filtering operation (https://www.sing-group.org/seda/manual/operations.html#pattern-filtering) would allow you to do this if you configure a Not contains pattern with the "^XR_" text.
