Question

How to split the header line into its components

3.7 years ago

Inayat • 0

I have a text file containing several header lines like this one:

>lcl|NC_001133.9_cds_NP_009332.1_1 [gene=PAU8] [locus_tag=YAL068C] [db_xref=SGD:S000002142,GeneID:851229] [protein=seripauperin PAU8] [protein_id=NP_009332.1] [location=complement(1807..2169)] [gbkey=CDS]

I want to extract the two values, 1807 and 2169, given in the "location" field. I have tried Python's split() and strip() methods, but they don't work as expected. Can you please suggest how to do this? Any help will be appreciated.

Thank you

score 1 · Answer 1 · 2021-04-09

3.7 years ago

5heikki 11k

awk 'BEGIN{FS="\\[location="}{print $2}' input.txt | awk 'BEGIN{FS="("}{print $2}' | awk 'BEGIN{FS=")"}{print $1}'

score 1 · Answer 2 · 2021-04-09

$ awk -v OFS="\t" -F "=|\(|\..|\)" '/^>/ {print $11,$12}' test.fa
1807    2169

$ awk -v OFS="\t" -F "complement|\(|\..|\)" ' /^>/ {print $6,$7}' test.fa

input:

$ cat test.fa
>lcl|NC_001133.9_cds_NP_009332.1_1 [gene=PAU8] [locus_tag=YAL068C] [db_xref=SGD:S000002142,GeneID:851229] [protein=seripauperin PAU8] [protein_id=NP_009332.1] [location=complement(1807..2169)] [gbkey=CDS]

score 1 · Answer 3 · 2021-04-09

echo ">lcl|NC_001133.9_cds_NP_009332.1_1 [gene=PAU8] [locus_tag=YAL068C] [db_xref=SGD:S000002142,GeneID:851229] [protein=seripauperin PAU8] [protein_id=NP_009332.1] [location=complement(1807..2169)] [gbkey=CDS]" \
| awk -F "location=" '{print $2}' \
| cut -d "(" -f2 | cut -d ")" -f1 \
| awk -F "." -v OFS="\t" '{print $1, $3}'
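
Since the question mentions Python, the same extraction can also be sketched there with a regular expression instead of split() and strip(). This is only a minimal sketch: the function name parse_location and the pattern are my own assumptions, not something from the posts above. It returns only the first start..end pair, so multi-part locations such as join(...) would need more handling.

```python
import re

# Assumed pattern: skip any non-digit prefix after "[location="
# (for example "complement("), then capture the two numbers around "..".
LOCATION_RE = re.compile(r"\[location=\D*?(\d+)\.\.(\d+)")


def parse_location(header):
    """Return (start, end) as ints, or None if no start..end range is found."""
    match = LOCATION_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


header = (
    ">lcl|NC_001133.9_cds_NP_009332.1_1 [gene=PAU8] [locus_tag=YAL068C] "
    "[db_xref=SGD:S000002142,GeneID:851229] [protein=seripauperin PAU8] "
    "[protein_id=NP_009332.1] [location=complement(1807..2169)] [gbkey=CDS]"
)

print(parse_location(header))  # (1807, 2169)
```

For a whole file, one would loop over the lines that start with ">" and call parse_location on each, skipping any line for which it returns None.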