Extracting exact location of interest from genbank file
2.8 years ago
Sasha ▴ 10

Hi everyone,

I am trying to find information on how to extract sequences at known coordinates from a GenBank file, but have had no luck so far.

This is what my results look like: each header line names a chromosome, and the lines below it give the start and end coordinates of the sequences I want to extract from that chromosome.

gnl|BL_ORD_ID|21 CM017728.1 Coilia nasus isolate PX2019 linkage group LG22, whole genome shotgun sequence

9825652 9826524

9854872 9855648

9866528 9867421

gnl|BL_ORD_ID|14 CM017720.1 Coilia nasus isolate PX2019 linkage group LG15, whole genome shotgun sequence

28861074 28862216

gnl|BL_ORD_ID|23 CM017730.1 Coilia nasus isolate PX2019 linkage group LG24, whole genome shotgun sequence

9828100 9829197

14400268 14401302

16620220 16621236

If anyone knows where I can find some documentation or an example of how to do it, I would be really grateful!

Biopython Genbank • 1.3k views
2.8 years ago

Get your data as FASTA instead of GenBank; you can then extract sequences from it in different ways.

Wherever you get GenBank files, you can usually get FASTA files as well. You can also convert GenBank to FASTA with various tools; for example, you could use the bio package (see https://www.bioinfo.help/):

bio fetch NC_045512 | bio fasta > genome.fa

Now index the FASTA file:

samtools faidx genome.fa

After that, you can extract any subsequence of it with:

samtools faidx genome.fa NC_045512.2:100-120

prints:

>NC_045512.2:100-120
CGGCTGCATGCTTAGTGCACT
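
For the listing in the question, the region strings can be generated rather than typed by hand. Below is a minimal sketch, assuming the listing keeps the layout shown in the question (a `gnl|...` header line followed by lines with a start and an end coordinate), that the coordinates are 1-based and inclusive with start <= end, and that the sequence names in the FASTA file are the accessions (e.g. `CM017728.1`). The function names `parse_hits` and `to_faidx_region` are made up for this illustration.

```python
import re

def parse_hits(text):
    """Return a list of (accession, start, end) tuples from the listing."""
    regions = []
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("gnl|"):
            # Header line: the accession is the second whitespace-separated field.
            accession = line.split()[1]
        elif accession is not None and re.fullmatch(r"\d+\s+\d+", line):
            # Coordinate line: start and end belong to the most recent header.
            start, end = map(int, line.split())
            regions.append((accession, start, end))
    return regions

def to_faidx_region(accession, start, end):
    # Region string in the same name:start-end form used with samtools faidx above.
    return f"{accession}:{start}-{end}"

# The listing from the question.
hits = """\
gnl|BL_ORD_ID|21 CM017728.1 Coilia nasus isolate PX2019 linkage group LG22, whole genome shotgun sequence
9825652 9826524
9854872 9855648
9866528 9867421
gnl|BL_ORD_ID|14 CM017720.1 Coilia nasus isolate PX2019 linkage group LG15, whole genome shotgun sequence
28861074 28862216
gnl|BL_ORD_ID|23 CM017730.1 Coilia nasus isolate PX2019 linkage group LG24, whole genome shotgun sequence
9828100 9829197
14400268 14401302
16620220 16621236
"""

for accession, start, end in parse_hits(hits):
    print(to_faidx_region(accession, start, end))
```

Each printed line (for example `CM017728.1:9825652-9826524`) can be passed as a region argument to `samtools faidx genome.fa`, provided `genome.fa` contains those chromosomes under those names.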
2.8 years ago
GenoMax 148k

Using EntrezDirect (output truncated to save space):

$ efetch -db nuccore -id CM017728.1 -seq_start 9825652 -seq_stop 9826524 -format fasta
>CM017728.1:9825652-9826524 Coilia nasus isolate PX2019 linkage group LG22, whole genome shotgun sequence
TCGCTCTGTTGTGTTTTTGCCCCAAATGGCTCGGTAGCACTTGGGTACAAGGAGACAGAATATGATGCCG
TAGTTGGAGACCAGGATGGCGGAGGCCTGTACAATGGGGCGAGACTCGTTCCTGGTGATGTAGATAGGAA

Use the following additional options when applicable:

  -strand        1 = forward DNA strand, 2 = reverse complement
                   (otherwise strand minus is set if start > stop)
  -forward       Force strand 1
  -revcomp       Force strand 2
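
With several intervals, the same `efetch` call can be assembled in a loop. This is only a sketch: it builds the command lines from the flags shown above and prints them without running anything. The helper name `efetch_command` and the hard-coded region list are assumptions for illustration.

```python
import shlex

def efetch_command(accession, start, stop, revcomp=False):
    """Build the efetch argument list for one region (flags as shown above)."""
    cmd = [
        "efetch", "-db", "nuccore", "-id", accession,
        "-seq_start", str(start), "-seq_stop", str(stop),
        "-format", "fasta",
    ]
    if revcomp:
        # Optional: request the reverse complement of the region.
        cmd.append("-revcomp")
    return cmd

# Regions for the first chromosome in the question.
regions = [
    ("CM017728.1", 9825652, 9826524),
    ("CM017728.1", 9854872, 9855648),
    ("CM017728.1", 9866528, 9867421),
]

for accession, start, stop in regions:
    # shlex.join (Python 3.8+) renders the list as a shell-safe command line.
    print(shlex.join(efetch_command(accession, start, stop)))
```

To actually fetch the sequences, each list could be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`, assuming EntrezDirect is installed and on the `PATH`.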
2.8 years ago
Joe 21k

"Slicing" out regions of a GenBank record is super easy with Biopython:

Slicing Genbank File by 'gene_id' range
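
As a rough sketch of what that slicing looks like (not the code from the linked post): a Biopython `SeqRecord` can be sliced like a Python string, so the only care needed is converting 1-based, inclusive coordinates into Python's 0-based, end-exclusive slice. The helper names and the file name `CM017728.1.gb` are hypothetical, and this assumes the GenBank file contains a single record that includes the sequence itself.

```python
def load_genbank(path):
    """Read a single-record GenBank file into a Biopython SeqRecord."""
    # Imported here so Biopython is only required when a file is actually read.
    from Bio import SeqIO
    return SeqIO.read(path, "genbank")

def slice_region(record, start, end):
    """Return the region start..end (1-based, inclusive) of a sliceable record."""
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {start}-{end}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return record[start - 1:end]

# Demonstration on a plain string, which slices the same way as a SeqRecord.
print(slice_region("ACGTACGT", 2, 4))  # CGT

# With a real file (hypothetical name), the usage would be along these lines:
# record = load_genbank("CM017728.1.gb")
# region = slice_region(record, 9825652, 9826524)
# print(region.seq)
```

Slicing a `SeqRecord` returns a new `SeqRecord`, so the result can be written out with `SeqIO.write(region, "region.fasta", "fasta")` if a FASTA file is wanted.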
