5.6 years ago
zhangdengwei
Hi, how can I extract a gene's sequence from the reference genome, given its coordinates? Thanks!
I found a Python module named "pyfaidx" that satisfies my needs. It makes it simple to fetch a sequence from a FASTA file. Here is the link: https://pypi.org/project/pyfaidx/#description
What have you tried? I would suggest using BioPython and string slicing notation, of which there are many examples on the forum.
You also haven’t told us what format your data is in.
This is a very briefly formulated question! Perhaps read this first: How To Ask Good Questions On Technical And Scientific Forums
What kind of input files do you have? Do you want to do this for a single gene, multiple genes, all genes...? Please take the effort to include a bit more info to get more suitable answers.
And do you want the whole gene (introns and exons), cDNA, CDS, protein sequences?
I am sorry I didn't state my question clearly. I am writing a Python script to integrate my pipeline, and in one step I need to get the DNA sequence between an arbitrary pair of start and end positions from a quite large FASTA file. So what I want to ask is which tool or approach can handle this quickly: Biopython or something else?
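To make the "quickly, from a large file" requirement concrete, here is a rough standard-library sketch of the idea behind indexed access: scan the file once to record where each sequence starts, then seek directly to the requested bases. The function names build_index and fetch are hypothetical, and the sketch assumes every sequence line within a record has the same width (except the last one), which is also what faidx-style indexes require.

```python
import os
import tempfile


def build_index(fasta_path):
    """Map record name -> [length, byte offset of first base, bases per line, bytes per line]."""
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # The first sequence line defines the line width.
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) without reading the whole record."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError("coordinates fall outside the sequence")

    def byte_pos(i):
        # i is a 0-based base index; skip the newline bytes of all full lines before it.
        return offset + (i // bases_per_line) * bytes_per_line + i % bases_per_line

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Toy data (hypothetical) to exercise the two functions.
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "toy_genome.fa")
with open(fasta_path, "w", newline="\n") as handle:
    handle.write(">chr1\nACGTACGTAC\nGGGTTTAAAC\n>chr2\nTTTTCCCCGG\n")

index = build_index(fasta_path)
region = fetch(fasta_path, index, "chr1", 5, 12)
print(region)
```

In practice an existing tool that maintains such an index is probably the safer choice; the sketch is only meant to show why random access does not have to scale with file size.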
If it's within a Python pipeline or project, then yes, Biopython is likely the more sensible option. Otherwise you could, for instance, also get this through BLAST if you have a blastdb-formatted version of your FASTA file.
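A minimal sketch of the Biopython route, assuming 1-based, inclusive gene coordinates as found in most annotation files. The helper name extract, the toy FASTA and the coordinates are made up for illustration.

```python
import os
import tempfile

from Bio import SeqIO  # third-party: pip install biopython

# Toy reference written to a temporary file (hypothetical data and file name).
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "toy_genome.fa")
with open(fasta_path, "w", newline="\n") as handle:
    handle.write(">chr1\nACGTACGTAC\nGGGTTTAAAC\n>chr2\nTTTTCCCCGG\n")

# SeqIO.index stores file offsets per record instead of loading everything into memory.
records = SeqIO.index(fasta_path, "fasta")


def extract(records, name, start, end, strand="+"):
    """Return the sequence for 1-based, inclusive coordinates on the given strand."""
    sub = records[name].seq[start - 1:end]  # convert to a 0-based, end-exclusive slice
    if strand == "-":
        sub = sub.reverse_complement()
    return str(sub)


gene_plus = extract(records, "chr1", 5, 12)
gene_minus = extract(records, "chr1", 5, 12, strand="-")
records.close()

print(gene_plus, gene_minus)
```

One caveat: each lookup through SeqIO.index parses the whole record before slicing, so for chromosome-sized records and many lookups it may be noticeably slower than a faidx-style index.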
Take a look at my code here: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py
The same approach would work for FASTA files as well as GenBank files etc.
Thank you very much for your time!