Change fasta header sequence ID's
1
0
Entering edit mode
10.3 years ago
thhaverk • 0

Hi All,

I think there must be a solution somewhere to my issue, but until now I could not find it.

I have a list of fasta headers that I want to use to select a subset of genes from a fasta file that was created using the RAST annotation pipeline.

The headers look like this:

160798.5.peg.2
160798.5.peg.12
160798.5.peg.123
160798.5.peg.1234

My problem is that if I use my script fastagrep.pl that when it searches for the first sequence, it not only selects

160798.5.peg.2

but also selects

160798.5.peg.20
160798.5.peg.201
160798.5.peg.2001
 etc...

Now I was trying to see if I could make the numbers more specific e.g

  • change 160798.5.peg.2 into 160798.5.peg.0002
  • change 160798.5.peg.20 into 160798.5.peg.0020

I think I should be able to do this with sed or awk, but I am not sure how to insert text before the 2, and the dots in the header are also a problem since they are wildcards in a my sed command

So I started with a sed command but that is not working:

sed -e 's/peg../peg.0/g' headers.txt > new_headers.txt

Any suggestions are welcome.

grep oneliner fasta • 4.8k views
ADD COMMENT
1
Entering edit mode

Just add -w to your grep. See man grep for definition.

ADD REPLY
0
Entering edit mode

Thanks, that sort of worked.

I now used the following:

grep -w "peg.." headers.txt | sed 's/peg./peg.000/g' > new_headers.txt

This only selects those lines with a peg.2 and turns it into peg.0002.

For lines with number between 10 and 100 and from 100 to 1000,I can rerun these commands by adding more dots to the expression e.g

grep -w "peg..." headers.txt | sed 's/peg./peg.00/g' > new_headers2.txt

It is not perfect but it is workable.

ADD REPLY
1
Entering edit mode

If this is the script you're using, you should just use the -X flag which I take is like -w for grep.

ADD REPLY
0
Entering edit mode

That did not work in first instance.

However, I changed the headers in the fasta file

they were:

>fig|160798.5.peg.1

into

>160798.5.peg.1

The "fig|" is repeated for every sequence and I had the feeling the " | " was creating trouble here.

Removing that made fastagrep.pl -X work and I got my 576 sequences

That is good. Thanks for the suggestion and the help

ADD REPLY
0
Entering edit mode

Have you tried to end your regular expression with "$" in your fastagrep.pl? It should only match lines containing exactly your pattern, and nothing else.

ADD REPLY
0
Entering edit mode

I have written my sed command now like this:

sed -e 's/peg..$/peg.000/g' headers.txt > new_headers.txt

That gives me a change of my header...

change 160798.5.peg.2 into 160798.5.peg.000

That means the 2 is replaced by 000.

So how can I insert the three 0's without removing the two?

ADD REPLY
0
Entering edit mode

The general idea is to use a regular expression to match the 2 and then use printf with a preset size, which can 0 pad numbers. For example, echo "2" | awk '{printf("%05i\n",$1)}' would produce 00002.

ADD REPLY
1
Entering edit mode
10.3 years ago
hugorody ▴ 20

just type

cat fasta_file.fasta | grep -m 1 --file=fasta_headers.txt --after-context=1 > output.txt
ADD COMMENT

Login before adding your answer.

Traffic: 2503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6