How to replace header in FASTA file in incremental order?
3
0
Entering edit mode
2.8 years ago
Ap1438 ▴ 50

If I have a file where I want to replace something in the middle of the header and the file is in proper order or sequence i.e. peg no is sequential peg_1,peg_2,peg_3 and so on.

>S_griseus__Contig_0001__peg_1__799__35__negative            
atgcatgc
>S_griseus__Contig_0001__peg_2__1655__3444__posetive            
gtcgtacg

Previously I used seqkit replace to solve a similar issue it but this time I am not able to solve this issue because I couldn't understand the previous command fully. The command was
seqkit replace -p '.+' -r 'Contig_{nr}' --nr-width 3 assembly.fasta -o renamed.fasta

And want to change the peg_1 to peg_0001 in this header? What command will help me solve it?
Thank you for your valuable time. Please let me know how can it be done. Please explain the command that you are using so that I can understand and learn it.

FASTA seqkit awk • 2.7k views
ADD COMMENT
2
Entering edit mode

see if this is okay with you:

$ awk -v OFS="_" -F "_|__" '/>/ {$6=sprintf("%05d",$6)}1' test.fa

>S_griseus_Contig_0001_peg_00001_799_35_negative            
atgcatgc
>S_griseus_Contig_0001_peg_00002_1655_3444_posetive            
gtcgtacg

if you want exactly the way it was, but with padding, try this:

$ awk -v OFS="_" -F "_|__" '/>/ {$6=sprintf("%05d",$6)}1' test.fa| sed '/>/ s/_/__/2g;s/peg__/peg_/;s/Contig__/Contig_/'

>S_griseus__Contig_0001__peg_00001__799__35__negative            
atgcatgc
>S_griseus__Contig_0001__peg_00002__1655__3444__posetive            
gtcgtacg

Since it's 10000, you need padding by 4 zeros, not 3.

ADD REPLY
0
Entering edit mode

Thank you for your valuable time and suggestion. The command works perfectly. Can you explain the command please?

ADD REPLY
1
Entering edit mode
awk -v OFS="_" -F "_|__" '/>/ {$6=sprintf("%05d",$6)}1' test.fa 

Splits the lines starting with ">" into fields (columns) with "_ and __" as delimiters. 6th field/column (from splitting) is then padded with zeros and current 6th field is replaced with padded 6th field. Then awk prints entire file.

sed '/>/ s/_/__/2g;s/peg__/peg_/;s/Contig__/Contig_/'

First replacement replaces all the _ with __ from 2nd occurrence till last occurrence. Second replacement replaces peg__ with peg_. Third replacement replaces Contig__ with Contig_

ADD REPLY
0
Entering edit mode

Thanks for the explanation

ADD REPLY
1
Entering edit mode
2.8 years ago
jena ▴ 320

In case you want the second effect I described (i.e. converting peg_999 to peg_0999) I can think of one way to do it. It's a bit round-about, but it should work.

First, you can create a "dictionary" for sed to use as an external definition for the conversions. Here I will assume you have peg numbers up to 10000:

paste <(seq 1 10000) <(seq -w 1 10000) | awk '{print "s/peg_"$1"/peg_"$2"/"}' > fix_peg_numbers.sed

The first command creates two columns of numbers, first one just simple 1, 2, 3,... while the second column has the same numbers padded with 0s, so for our case it would start with 00001 and go all the way to 10000. The second command (awk) will reformat this into a "sed script" or "dictionary" as I called it above.

Now you can use sed to load this script and perform replacements in your input, like this:

sed -i -f fix_peg_numbers.sed input.fasta

The -f argument reads the created "dictionary", while the -i argument performs the conversion in-place (overwriting the original file).

If some peg numbers are missing, it is not a problem - sed will just skip numbers that it didn't find.

ADD COMMENT
0
Entering edit mode

Thank you for your valuable time and suggestion. This command works perfectly.

ADD REPLY
1
Entering edit mode

I realized there might be a problem with partial matching. For example a new pattern peg_00001 as a replacement for peg_1 would also match peg_10, so you could end up with peg_000010, which is wrong. And there are many numbers with partial matches like that.

It is rather easy to fix though - just modify the awk script to this form:

awk '{print "s/peg_"$1"_/peg_"$2"_/"}'

This makes sure that the number you are changing matches completely, because you require it to be followed by an underscore "_".

ADD REPLY
0
Entering edit mode

Thankyou for your valuable time and suggestion to make it more accurate.

ADD REPLY
0
Entering edit mode
2.8 years ago

if you just want to change peg_1 to peg_0001 you can use simple sed command.

sed -i 's/old-text/new-text/g' input.txt
ADD COMMENT
0
Entering edit mode

Its not just replacing one peg no. , there are almost 10000 peg no. to be replaced.

ADD REPLY
0
Entering edit mode

Sorry, but do you want to just put 000 in front of any peg number? Like

sed -i 's/peg_/peg_000/g' input.txt

would convert peg_1 to peg_0001 and peg_999 to peg_000999. Is this what you want? Or rather peg_0999 for the second case?

ADD REPLY
1
Entering edit mode

I want it to be peg_0999 for 999 and peg_0009 for 9.

ADD REPLY
0
Entering edit mode
2.8 years ago
jena ▴ 320

I don't know seqkit, but you can also use seqmagick for this. Specifically, you could use the --pattern-replace argument to either seqmagick convert (create new modified file) or seqmagick mogrify (change original file in-place). But it mostly depends on what change exactly you want to do (see my comment above).

ADD COMMENT

Login before adding your answer.

Traffic: 878 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6