How to parse sequences from a gbk file, based on the coordinates specified in given file?
2
0
Entering edit mode
4.8 years ago
Kumar ▴ 120

I need to parse sequences from a gbk file based on the coordinates specified in a given file. For which I have tried to use samtools, but it fails to serve my purpose. Kindly suggest me any tool which can be used to parse the same. Thank you

perl python bash gbk • 2.0k views
ADD COMMENT
0
Entering edit mode

Can you clarify?

Are these arbitrary DNA sections, or specific features?

Is this a multi-genbank file?

What does the input file look like?

ADD REPLY
0
Entering edit mode

Dear Joe, It is an output of genomic island gbk file obtained from islandviewer4 web server. It is a complete genome sequence of Xanthomonas citri. My coordinate file as follows,

 638483-646063
 638483-646063
1137523-1143403
1182782-1250536

sample gbk file

ADD REPLY
2
Entering edit mode
4.8 years ago
Joe 21k

If its a single continguous genbank file, you can slice out genbank sections, complete with the features they contain, using some code I wrote a while back here:

https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

You'll need to create a loop over your coordinate file, something like:

Assuming a coordinates file of the form:

$ cat coords.txt
638483-646063
638483-646063
1137523-1143403
1182782-1250536

Perform a loop and feed the coordinates in, writing a new genbank file for each range:

while IFS="-" read -r -a array ; do
    python genbank_slicer.py -g /path/to/reference.gbk -s "${array[0]}" -e "${array[1]}" -o "reference_${array[0]}:${array[1]}.gbk" 
done < coords.txt

# one line form:
while IFS="-" read -r -a array ; do python genbank_slicer.py -g /path/to/reference.gbk -s "${array[0]}" -e "${array[1]}" -o "reference_${array[0]}:${array[1]}.gbk" ; done < coords.txt
ADD COMMENT
2
Entering edit mode
4.8 years ago
vkkodali_ncbi ★ 3.8k

If you need the sequence for that list of coordinates in FASTA format and based on your screenshot it appears to be a sequence available at NCBI, you can use Entrez Direct as follows:

$ cat temp.txt 
638483-646063
1137523-1143403
1182782-1250536

$ cat temp.txt \
  | sed 's/-/ /g' \
  | while read -r start stop ; do 
    efetch -db nuccore \
      -id NC_020815.1 \
      -seq_start $start \
      -seq_stop $stop \
      -format fasta ; 
    done

A few things to consider:

  • I assumed that your coordinates are 1-based. If they are 0-based, you should either change your coordinates or use chr_start and chr_stop instead of seq_start and seq_stop. Check efetch -help for more details.
  • I assumed that you want sequence for fwd strand. You can specify strand passing strand information to strand option of efetch.
  • This approach does not scale well. If you are doing this for several thousands of coordinate ranges and sequences, you may want to check out bedtools getfasta or other similar tools that index a FASTA file and extract specific sequence regions.
ADD COMMENT

Login before adding your answer.

Traffic: 1680 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6