Renaming headers of a contig file with awk
9.0 years ago
waqasnayab ▴ 250

Hi,

I have a contig file:

>NODE_1_length_248_cov_3.157258
AAGGACTTGAGGGGCCTAACCTACCCTCAAGCATGCTCCCCGAAAGATTCCATCCATCCT
AGTCTTTTGAGGACAAATCCTACTGTGTAGACGAGTCATAGGGCAGACATTCGCGACGAA
TGGATCCGCCGGCCTCATCAGATAATTGAGACCGTCAACTGCCAGGTGCTCAAGAGGTTC
CTGGTTAAGTCTCCCTAGGCGTGGGAACTCTTTATGCATCGTTAACGTCCATCGGCTGAG
TGCCCACAGCGTTACTCAAGGCAGATTATACTGGGgag
>NODE_2_length_89_cov_4.494382
GTCGATAGATCTATGTGTTTAGACATGTAGATCAGTGGTCGTTGTGATGAGCGTAGCGCT
TGCGGAACGTGCACGAGTATACTATCACCGCCGGATTTTAATGCAGAGAGGTTCCCGAg
>NODE_3_length_79_cov_3.227848

and so on ........

I need to change the header in the following way:

>Contig1.1
AAGGACTTGAGGGGCCTAACCTACCCTCAAGCATGCTCCCCGAAAGATTCCATCCATCCT
AGTCTTTTGAGGACAAATCCTACTGTGTAGACGAGTCATAGGGCAGACATTCGCGACGAA
TGGATCCGCCGGCCTCATCAGATAATTGAGACCGTCAACTGCCAGGTGCTCAAGAGGTTC
CTGGTTAAGTCTCCCTAGGCGTGGGAACTCTTTATGCATCGTTAACGTCCATCGGCTGAG
TGCCCACAGCGTTACTCAAGGCAGATTATACTGGGgag
>Contig1.2
GTCGATAGATCTATGTGTTTAGACATGTAGATCAGTGGTCGTTGTGATGAGCGTAGCGCT
TGCGGAACGTGCACGAGTATACTATCACCGCCGGATTTTAATGCAGAGAGGTTCCCGAg
>Contig1.3

and so on........

I tried this awk command:

cat contig_1.fa | awk '{print (NR%4 == 1) ? ">Contig1." ++i : $0}' > contig_1_rename.fa

the output is:

contig_1_rename.fa

>Contig1.1
AAGGACTTGAGGGGCCTAACCTACCCTCAAGCATGCTCCCCGAAAGATTCCATCCATCCT
AGTCTTTTGAGGACAAATCCTACTGTGTAGACGAGTCATAGGGCAGACATTCGCGACGAA
TGGATCCGCCGGCCTCATCAGATAATTGAGACCGTCAACTGCCAGGTGCTCAAGAGGTTC
>Contig1.2
TGCCCACAGCGTTACTCAAGGCAGATTATACTGGGgag
>NODE_2_length_89_cov_4.494382
GTCGATAGATCTATGTGTTTAGACATGTAGATCAGTGGTCGTTGTGATGAGCGTAGCGCT
>Contig1.3

It seems to be writing a new header on every fourth line instead of replacing the existing headers. How can I match a pattern and replace it in the awk command, rather than relying on the line number (NR)?

Thanks,
Waqas.

next-gen Assembly awk • 2.4k views
9.0 years ago

Another approach is to modify the header line and leave the sequence lines untouched:

$ awk '
    BEGIN {
        contigIdx = 1    # running contig number
    }
    {
        if ($0 ~ /^>/) {
            # header line: replace it with the new name
            print ">Contig1." contigIdx
            contigIdx++
        } else {
            # sequence line: print it unchanged
            print $0
        }
    }' sequences.fa > sequences_renamed.fa

The pattern /^>/ matches lines which start with the character >.
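If the prefix needs to change from file to file, the same pattern test can be wrapped in a small shell function that passes the prefix to awk with -v. This is only a sketch: the function name rename_headers and the awk variable names prefix and n are made up for illustration, and it assumes every header starts with > in the first column.

```shell
# rename_headers PREFIX FILE
# Hypothetical helper: replaces each FASTA header with ">" + PREFIX + a
# running number, and prints every other line unchanged.
rename_headers() {
    awk -v prefix="$1" '
        /^>/ { n++; print ">" prefix n; next }   # header line: emit new name
        { print }                                # sequence line: pass through
    ' "$2"
}
```

For the file in the question it could be called as: rename_headers "Contig1." contig_1.fa > contig_1_rename.fa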


Thanks, it's great, it worked fine.

9.0 years ago
seta ★ 1.9k

Also, you can use the following awk command:

awk '/^>/{print ">Contig1." ++i; next}{print}' < file.fasta > output.fasta
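After renaming, it may be worth confirming that only the headers changed, since the NR-based attempt in the question silently overwrote sequence lines. A rough check, assuming a hypothetical helper named check_rename and that headers are the only lines starting with >, could look like this:

```shell
# check_rename ORIGINAL RENAMED
# Hypothetical helper: succeeds only if both files contain the same number
# of header lines and identical non-header (sequence) lines.
check_rename() {
    before=$(grep -c '^>' "$1")
    after=$(grep -c '^>' "$2")
    [ "$before" -eq "$after" ] || return 1
    # Compare checksums of the sequence lines of each file.
    [ "$(grep -v '^>' "$1" | cksum)" = "$(grep -v '^>' "$2" | cksum)" ]
}
```

Example use: check_rename file.fasta output.fasta && echo "headers renamed, sequences unchanged"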