Question

Extracting exact location of interest from genbank file

2.8 years ago

Sasha ▴ 10

Hi everyone,

I am trying to find any information on how to extract known coordinates from a GenBank file, but have had no luck so far.

This is what my results look like: each header line is a chromosome, and for each chromosome I want to extract the sequence at the given coordinates.

gnl|BL_ORD_ID|21 CM017728.1 Coilia nasus isolate PX2019 linkage group LG22, whole genome shotgun sequence

9825652 9826524

9854872 9855648

9866528 9867421

gnl|BL_ORD_ID|14 CM017720.1 Coilia nasus isolate PX2019 linkage group LG15, whole genome shotgun sequence

28861074 28862216

gnl|BL_ORD_ID|23 CM017730.1 Coilia nasus isolate PX2019 linkage group LG24, whole genome shotgun sequence

9828100 9829197

14400268 14401302

16620220 16621236

If anyone knows where I can find some documentation or an example of how to do it, I would be really grateful!

Biopython Genbank • 1.3k views

updated 2.8 years ago by Joe 21k • written 2.8 years ago by Sasha ▴ 10

score 2 · Answer 1 · 2022-03-16

Get your data as FASTA instead of GenBank; then you can extract sequences from it in different ways.

Wherever you get GenBank files, you can usually get FASTA files as well. You can also convert GenBank to FASTA with various tools; for example, you could use the bio package (see https://www.bioinfo.help/) to transform it into FASTA:

bio fetch NC_045512 | bio fasta > genome.fa

Now index the FASTA file:

samtools faidx genome.fa

After that, you can extract any subsequence of it with:

samtools faidx genome.fa NC_045512.2:100-120

prints:

>NC_045512.2:100-120
CGGCTGCATGCTTAGTGCACT
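
For the listing in the question, the region strings that `samtools faidx` expects can be generated rather than typed by hand. The following is only a rough sketch: `parse_regions` and `to_faidx_region` are hypothetical helper names, the header pattern is guessed from the three header lines shown in the question, and it assumes the sequence names in the FASTA file match the accessions (e.g. `CM017728.1`).

```python
import re

# Header lines look like "gnl|BL_ORD_ID|21 CM017728.1 Coilia nasus ...";
# the accession is assumed to be the second whitespace-separated token.
HEADER = re.compile(r"^gnl\|\S+\s+(\S+)")
COORDS = re.compile(r"^(\d+)\s+(\d+)$")


def parse_regions(text):
    """Return (accession, start, end) tuples from a header/coordinate listing."""
    regions = []
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            accession = header.group(1)
            continue
        coords = COORDS.match(line)
        if coords and accession is not None:
            # Sorting drops strand information if start > end.
            start, end = sorted(int(x) for x in coords.groups())
            regions.append((accession, start, end))
    return regions


def to_faidx_region(accession, start, end):
    """Format a 1-based inclusive region as name:start-end."""
    return f"{accession}:{start}-{end}"


listing = """\
gnl|BL_ORD_ID|21 CM017728.1 Coilia nasus isolate PX2019 linkage group LG22, whole genome shotgun sequence
9825652 9826524
9854872 9855648
gnl|BL_ORD_ID|14 CM017720.1 Coilia nasus isolate PX2019 linkage group LG15, whole genome shotgun sequence
28861074 28862216
"""

for region in parse_regions(listing):
    print(to_faidx_region(*region))
```

Each printed line can then be passed to `samtools faidx genome.fa` as a region argument, in the same way as `NC_045512.2:100-120` above.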

score 1 · Answer 2 · 2022-03-16

Using EntrezDirect (output truncated to save space):

$ efetch -db nuccore -id CM017728.1 -seq_start 9825652 -seq_stop 9826524 -format fasta
>CM017728.1:9825652-9826524 Coilia nasus isolate PX2019 linkage group LG22, whole genome shotgun sequence
TCGCTCTGTTGTGTTTTTGCCCCAAATGGCTCGGTAGCACTTGGGTACAAGGAGACAGAATATGATGCCG
TAGTTGGAGACCAGGATGGCGGAGGCCTGTACAATGGGGCGAGACTCGTTCCTGGTGATGTAGATAGGAA

Use the following additional options when applicable:

  -strand        1 = forward DNA strand, 2 = reverse complement
                   (otherwise strand minus is set if start > stop)
  -forward       Force strand 1
  -revcomp       Force strand 2
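
With several regions to fetch, the same `efetch` call can be assembled in a loop. This is a minimal sketch that only builds and prints the command lines, using the flags shown above; `efetch_command` is a hypothetical helper name, and the region tuples are copied from the question.

```python
import shlex


def efetch_command(accession, start, end):
    """Build an efetch argument list for one 1-based inclusive region."""
    return [
        "efetch", "-db", "nuccore", "-id", accession,
        "-seq_start", str(start), "-seq_stop", str(end),
        "-format", "fasta",
    ]


# Coordinates taken from the question's listing.
regions = [
    ("CM017728.1", 9825652, 9826524),
    ("CM017728.1", 9854872, 9855648),
    ("CM017720.1", 28861074, 28862216),
]

for accession, start, end in regions:
    # Print a shell-ready command line instead of running it.
    print(shlex.join(efetch_command(accession, start, end)))
```

To run the commands from Python instead of printing them, each argument list could be handed to `subprocess.run(cmd, check=True)`, assuming EntrezDirect is installed and on the `PATH`.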

score 1 · Answer 3 · 2022-03-16

2.8 years ago

Joe 21k

"Slicing" out regions of a Genbank is super easy with Biopython:

Slicing Genbank File by 'gene_id' range
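
As a rough illustration of the slicing idea (not the code from the linked post), the sketch below slices each record with Biopython. It assumes Biopython is installed, that the GenBank file contains the full sequence for each record, and that `record.id` matches the accession (e.g. `CM017728.1`). The names `to_python_slice` and `extract_regions` are hypothetical. Coordinates are treated as 1-based and inclusive, so they are shifted to Python's 0-based, end-exclusive slices.

```python
def to_python_slice(start, end):
    """Convert 1-based inclusive coordinates to a 0-based, end-exclusive slice."""
    return slice(start - 1, end)


def extract_regions(genbank_path, regions):
    """Yield a sliced SeqRecord for each (accession, start, end) tuple."""
    # Imported here so the coordinate helper above works without Biopython.
    from Bio import SeqIO

    # Group the requested coordinates by accession.
    wanted = {}
    for accession, start, end in regions:
        wanted.setdefault(accession, []).append((start, end))

    for record in SeqIO.parse(genbank_path, "genbank"):
        for start, end in wanted.get(record.id, []):
            # Slicing a SeqRecord keeps the features that lie fully inside the slice.
            piece = record[to_python_slice(start, end)]
            piece.id = f"{record.id}:{start}-{end}"
            yield piece
```

With a hypothetical input file `genome.gb`, the slices could be written out with `SeqIO.write(extract_regions("genome.gb", regions), "regions.fasta", "fasta")`, where `regions` is a list such as `[("CM017728.1", 9825652, 9826524)]`.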
