How to extract the two genomic location numbers from the following FASTA header?
3.2 years ago
mrj ▴ 180

I am wondering how to extract the two numbers (start and end coordinates) from the location field of the following FASTA header.

>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]
fasta extract location genomic bash • 1.2k views
$ awk -F 'location=|]|[.]{2}' '/^>/ {print $5,$6}' test.fa
$ sed -rn '/^>/ s/(.*location=)([0-9]+)\.\.([0-9]+)].*$/\2\t\3/p' test.fa
$ grep -Po "(?<=location=).*(?=]\s.*)" test.fa | tr -s '.' '\t'
$ seqkit replace -p '.*location=(.*)]\s.*' -r '${1}' test.fa | seqkit seq -n  | sed -r 's/\.{2}/\t/'

Thank you so much for this solution. It works for me. I am learning a lot from your solution.


In Python

Suppose your header is stored in a variable named header:

header.partition("location=")[2].partition("]")[0].split('..')

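This returns the list ['1885267', '1887939'], which you can easily manipulate. Note that the values are strings, not integers.

It only works if the header contains the location= keyword. Otherwise it returns [''], a list holding a single empty string, rather than an empty list.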
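
If a missing location field or a wrapped location is a concern, a regex-based variant is one option. NCBI CDS headers can also carry values such as location=complement(10..20) or location=join(1..5,8..12), which the partition/split one-liner would not split cleanly. The following is a minimal sketch, not a definitive parser: the function name extract_location and the two pattern names are made up for illustration, and it assumes each coordinate pair is written as start..end, optionally prefixed with < or >.

```python
import re

# Captures the contents of the [location=...] field of a header line.
LOCATION_FIELD = re.compile(r"\[location=([^\]]+)\]")
# Captures each start..end pair inside that field; a leading < or >
# (partial-feature marker) is skipped rather than captured.
COORD_PAIR = re.compile(r"[<>]?(\d+)\.\.[<>]?(\d+)")


def extract_location(header):
    """Return a list of (start, end) integer pairs.

    Returns an empty list when the header has no [location=...] field.
    (Hypothetical helper, written for this example only.)
    """
    field = LOCATION_FIELD.search(header)
    if field is None:
        return []
    return [(int(start), int(end))
            for start, end in COORD_PAIR.findall(field.group(1))]


header = (">lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] "
          "[protein=copper oxidase] [protein_id=AYW77996.1] "
          "[location=1885267..1887939] [gbkey=CDS]")
print(extract_location(header))  # [(1885267, 1887939)]
```

Returning a list of pairs keeps the simple case and the join(...) case in the same shape, so the caller can always loop over the result.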


Hello Renesh, thanks. This is much simpler and does the task perfectly.

Thank you so much.
