I want to extract the upstream and downstream sequence (for example 1000 bp up and 1000 bp down) of a set of genes (chromosome number with start and end position). Could you please suggest me a suitable package or module?
Thanks in advance
I want to extract the upstream and downstream sequence (for example 1000 bp up and 1000 bp down) of a set of genes (chromosome number with start and end position). Could you please suggest me a suitable package or module?
Thanks in advance
Edit: this answer presumes you will want to do this in python
as suggested by the tags.
pybedtools
, which is a python implementation of bedtools
should do the trick. You will need a combination of:
flank
to get the up and downstram location in relation to the inut coordinates, https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.flank.html?highlight=flank#pybedtools.bedtool.BedTool.flank
sequence
to retrieve the actual sequences, https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.sequence.html
Bonus: there is a five_prime
, and three_prime
function to retrieve the start and end coordinates. It might be useful in case you want to split the up and downstream sequences. https://daler.github.io/pybedtools/pybedtools.featurefuncs.five_prime.html?highlight=five%20prime#pybedtools.featurefuncs.five_prime
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
How are you defining their genes? Just by coordinate?
This is reasonably easy to do with biopython if you want to code something. Otherwise, I think
seqkit
can do slicing/subsetting.I have the locus locations of each gene. For example: chr1:7990339-7990408 I would like to do it with biopython. Could you suggest any built-in module available?
There's no such built in module, but the syntax is quite easy.
If you take a look at my code here for example: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py#L118
That script is specifically for slicing genbank files, as slicing the record preserves the format, but you can apply exactly the same logic to fastas or whatever data you have.
Essentially you need to do something like (pseudocode):
This is made more complicated if your file has all the chromosomes in it as you'll need to check all of the records for your chromosome index, then slice out the right bit if its the right sequence, but at its simplest that can be done with one or two loops.
I recently did the same work. And the lesson I learned is get the fasta and annotation files from the SAME source (eg, gencode). Do not fully trust the version number... And even though, I found the sequences I extracted from some genes are not completely consistent with the annotation from NCBI genebank files (sometimes they may have couple bp shift, and I don't know why...).
I believe there are some tools could do this extraction work (seqkit, bedtools?), although I coded it by myself.