4.4 years ago
akashbala0
Hi! I have an Excel file with thousands of chromosome names, each with a transcription start and end position. Could someone please develop a Python program that retrieves the sequences from a genome file according to the start and end positions listed in the Excel file?
The Excel file looks like this:
chromosome start end
KB317696.1 1361 1376
KB317696.1 1594 1929
KB317697.1 2033 2101
KB317697.1 2159 2265
KB317698.1 2319 2421
KB317699.1 2513 2736
KB317700.1 2789 2903
KB317700.1 3157 3279
That is not what the forum is here for. There is an expectation that you demonstrate some effort toward solving the problem yourself first. Moreover, this is not an uncommon task, so please search the forum; there will undoubtedly be existing solutions you can try.
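This can be done in the following steps:
- import the sequences using Biopython. Note that SeqIO.read() only accepts a file with exactly one record; since your table lists several scaffolds, use SeqIO.parse() (for example wrapped in SeqIO.to_dict()) so that each sequence can be looked up by its chromosome name
- import the Excel file as a table using pandas, and subset the table to keep the chromosome, start and end columns
- go through the rows of this table in a for loop, and slice the matching sequence using the start and end column entries (example: seq_output = sequence[start_position:end_position])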
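P.S. Python's index starts from 0 and a slice excludes its end index, so if the coordinates in your table are 1-based and inclusive, your slice start should really be start_position-1 (that is, sequence[start_position-1:end_position]).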
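The steps above can be sketched roughly as follows. This is only a minimal, standard-library illustration and not a tested pipeline: the helper names `read_fasta` and `extract_regions`, the toy sequences, and the tab-separated table are all made up for the example, and it assumes the coordinates are 1-based and inclusive, so the slice is `[start - 1:end]`. With real data you would most likely load the genome with Biopython (`SeqIO.to_dict(SeqIO.parse(path, "fasta"))`) and the table with `pandas.read_excel` instead of these helpers.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record ID to sequence string."""
    sequences = {}
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            # The record ID is the first whitespace-delimited token after '>'.
            current_id = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


def extract_regions(genome, rows):
    """Yield (chromosome, start, end, subsequence) for each coordinate row.

    Assumption: coordinates are 1-based and inclusive.
    """
    for row in rows:
        chrom = row["chromosome"]
        start = int(row["start"])
        end = int(row["end"])
        # Convert 1-based inclusive coordinates to a 0-based, end-exclusive slice.
        yield chrom, start, end, genome[chrom][start - 1:end]


# Toy data standing in for the genome FASTA and the exported coordinate table.
fasta_text = ">chrA toy scaffold\nACGTACGTAC\nGGGTTT\n>chrB\nTTTTCCCCAA\n"
table_text = "chromosome\tstart\tend\nchrA\t1\t4\nchrA\t9\t12\nchrB\t5\t8\n"

genome = read_fasta(io.StringIO(fasta_text))
rows = csv.DictReader(io.StringIO(table_text), delimiter="\t")
results = list(extract_regions(genome, rows))

# Print the extracted regions in FASTA format.
for chrom, start, end, seq in results:
    print(f">{chrom}:{start}-{end}\n{seq}")
```

If the coordinates turn out to be 0-based and end-exclusive (as in BED files), the slice would be `[start:end]` instead, so it is worth checking a few extracted regions by hand against the genome.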
duplicate: How to use Bed file to extract sequence from FASTA file? ; Extract several sequences from genome in FASTA format with genomic coordinates. ; how to quickly extract sequence from genome positions ; ... etc... etc...