Question

How to extract upstream and downstream sequence of a set of genes using python

0

Entering edit mode

5.1 years ago

t.reza • 0

I want to extract the upstream and downstream sequence (for example 1000 bp up and 1000 bp down) of a set of genes (chromosome number with start and end position). Could you please suggest me a suitable package or module?

Thanks in advance

biopython python sequence sequencing RNA-Seq • 3.4k views

ADD COMMENT • link updated 5.1 years ago by A. Domingues ★ 2.7k • written 5.1 years ago by t.reza • 0

0

Entering edit mode

How are you defining their genes? Just by coordinate?

This is reasonably easy to do with biopython if you want to code something. Otherwise, I think seqkit can do slicing/subsetting.

ADD REPLY • link 5.1 years ago by Joe 21k

0

Entering edit mode

I have the locus locations of each gene. For example: chr1:7990339-7990408 I would like to do it with biopython. Could you suggest any built-in module available?

ADD REPLY • link 5.1 years ago by t.reza • 0

0

Entering edit mode

There's no such built in module, but the syntax is quite easy.

If you take a look at my code here for example: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py#L118

That script is specifically for slicing genbank files, as slicing the record preserves the format, but you can apply exactly the same logic to fastas or whatever data you have.

Essentially you need to do something like (pseudocode):

from Bio import SeqIO

for record in SeqIO.parse('infile.ext', 'format'):
    if record.description.startswith('chr1'):  # test which record you're using
        subrecord = record[start - upstream_offset: end + downstream_offset]

This is made more complicated if your file has all the chromosomes in it as you'll need to check all of the records for your chromosome index, then slice out the right bit if its the right sequence, but at its simplest that can be done with one or two loops.

ADD REPLY • link 5.1 years ago by Joe 21k

0

Entering edit mode

I recently did the same work. And the lesson I learned is get the fasta and annotation files from the SAME source (eg, gencode). Do not fully trust the version number... And even though, I found the sequences I extracted from some genes are not completely consistent with the annotation from NCBI genebank files (sometimes they may have couple bp shift, and I don't know why...).

I believe there are some tools could do this extraction work (seqkit, bedtools?), although I coded it by myself.

ADD REPLY • link 5.1 years ago by shoujun.gu ▴ 350

score 1 · Answer 1 · 2019-11-06

Edit: this answer presumes you will want to do this in python as suggested by the tags.

pybedtools, which is a python implementation of bedtools should do the trick. You will need a combination of:

flank to get the up and downstram location in relation to the inut coordinates, https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.flank.html?highlight=flank#pybedtools.bedtool.BedTool.flank
sequence to retrieve the actual sequences, https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.sequence.html

Bonus: there is a five_prime, and three_prime function to retrieve the start and end coordinates. It might be useful in case you want to split the up and downstream sequences. https://daler.github.io/pybedtools/pybedtools.featurefuncs.five_prime.html?highlight=five%20prime#pybedtools.featurefuncs.five_prime