I need to parse sequences from a gbk file based on the coordinates specified in a given file. For which I have tried to use samtools, but it fails to serve my purpose. Kindly suggest me any tool which can be used to parse the same. Thank you
I need to parse sequences from a gbk file based on the coordinates specified in a given file. For which I have tried to use samtools, but it fails to serve my purpose. Kindly suggest me any tool which can be used to parse the same. Thank you
If its a single continguous genbank file, you can slice out genbank sections, complete with the features they contain, using some code I wrote a while back here:
https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py
You'll need to create a loop over your coordinate file, something like:
$ cat coords.txt
638483-646063
638483-646063
1137523-1143403
1182782-1250536
while IFS="-" read -r -a array ; do
python genbank_slicer.py -g /path/to/reference.gbk -s "${array[0]}" -e "${array[1]}" -o "reference_${array[0]}:${array[1]}.gbk"
done < coords.txt
# one line form:
while IFS="-" read -r -a array ; do python genbank_slicer.py -g /path/to/reference.gbk -s "${array[0]}" -e "${array[1]}" -o "reference_${array[0]}:${array[1]}.gbk" ; done < coords.txt
If you need the sequence for that list of coordinates in FASTA format and based on your screenshot it appears to be a sequence available at NCBI, you can use Entrez Direct as follows:
$ cat temp.txt
638483-646063
1137523-1143403
1182782-1250536
$ cat temp.txt \
| sed 's/-/ /g' \
| while read -r start stop ; do
efetch -db nuccore \
-id NC_020815.1 \
-seq_start $start \
-seq_stop $stop \
-format fasta ;
done
A few things to consider:
chr_start
and chr_stop
instead of seq_start
and seq_stop
. Check efetch -help
for more details.strand
option of efetch
. bedtools getfasta
or other similar tools that index a FASTA file and extract specific sequence regions.Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Can you clarify?
Are these arbitrary DNA sections, or specific features?
Is this a multi-genbank file?
What does the input file look like?
Dear Joe, It is an output of genomic island gbk file obtained from
islandviewer4
web server. It is a complete genome sequence of Xanthomonas citri. My coordinate file as follows,sample gbk file