Hi All,
I think there must be a solution somewhere to my issue, but until now I could not find it.
I have a list of fasta headers that I want to use to select a subset of genes from a fasta file that was created using the RAST annotation pipeline.
The headers look like this:
160798.5.peg.2
160798.5.peg.12
160798.5.peg.123
160798.5.peg.1234
My problem is that if I use my script fastagrep.pl that when it searches for the first sequence, it not only selects
160798.5.peg.2
but also selects
160798.5.peg.20
160798.5.peg.201
160798.5.peg.2001
etc...
Now I was trying to see if I could make the numbers more specific e.g
- change
160798.5.peg.2
into160798.5.peg.0002
- change
160798.5.peg.20
into160798.5.peg.0020
I think I should be able to do this with sed or awk, but I am not sure how to insert text before the 2, and the dots in the header are also a problem since they are wildcards in a my sed command
So I started with a sed command but that is not working:
sed -e 's/peg../peg.0/g' headers.txt > new_headers.txt
Any suggestions are welcome.
Just add -w to your grep. See man grep for definition.
Thanks, that sort of worked.
I now used the following:
This only selects those lines with a
peg.2
and turns it intopeg.0002
.For lines with number between 10 and 100 and from 100 to 1000,I can rerun these commands by adding more dots to the expression e.g
It is not perfect but it is workable.
If this is the script you're using, you should just use the
-X
flag which I take is like-w
for grep.That did not work in first instance.
However, I changed the headers in the fasta file
they were:
into
The "fig|" is repeated for every sequence and I had the feeling the " | " was creating trouble here.
Removing that made
fastagrep.pl -X
work and I got my 576 sequencesThat is good. Thanks for the suggestion and the help
Have you tried to end your regular expression with "$" in your fastagrep.pl? It should only match lines containing exactly your pattern, and nothing else.
I have written my sed command now like this:
That gives me a change of my header...
change
160798.5.peg.2
into160798.5.peg.000
That means the
2
is replaced by000
.So how can I insert the three 0's without removing the two?
The general idea is to use a regular expression to match the 2 and then use printf with a preset size, which can 0 pad numbers. For example,
echo "2" | awk '{printf("%05i\n",$1)}'
would produce00002
.