Extracting substring from fasta file
7.5 years ago
Burgenix • 0

I have an Excel sheet and extract some values from it with openpyxl. I want to use two of those values, say start and end, as boundaries for extracting a substring from a fasta file.

For example, if start is 34 and end is 4000 (as read from two cells in Excel, FILE A), I want to write the corresponding string of characters from the fasta file (FILE B) to another file (FILE C).

Any ideas?

python • 3.6k views

And please, don't use Excel.


Did you search this site for a similar question?


Are you asking someone to do it for you in python?

7.5 years ago
Benn 8.4k

You'll need samtools for this, not Excel.

For example:

samtools faidx file.fasta name_of_seq:34-4000 > another_file.fasta

name_of_seq is the name of your sequence in the fasta file.

Try to figure out for yourself first how to get your coordinates from file A. If it does not work, show us what you have tried and someone will help you further.
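If the coordinates end up in a tab-delimited text file, the region strings for samtools can be built with a few lines of Python. This is only a sketch: the function name faidx_commands and the three-column layout (name, start, end) are assumptions, not something from the original post. Note that samtools reads the region as 1-based and inclusive.

```python
def faidx_commands(lines, fasta_path):
    """Build one `samtools faidx` argument list per coordinate line.

    `lines` is any iterable of tab-delimited strings (for example an open
    file) with the assumed columns: sequence name, start, end.
    """
    commands = []
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        # samtools region syntax: name:start-end (1-based, inclusive)
        region = f"{name}:{start}-{end}"
        commands.append(["samtools", "faidx", fasta_path, region])
    return commands
```

Each list can then be passed to subprocess.run(cmd, stdout=out_handle, check=True), provided samtools is installed and on your PATH.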

7.5 years ago
st.ph.n ★ 2.7k

Suppose your Excel file looks like this (ideally, remove all columns other than the ids and start/end values, and save it as tab-delimited text, my_coords.txt; as Pierre said, don't use Excel):

id1    34    4000
id2    45    3156
id3    33    3764

And your fasta looks like this (if you have a multi-line fasta, linearize it):

>id1
sequence
>id2
sequence
>id3
sequence

#!/usr/bin/env python

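# Read coordinates: id -> (start, end), one tab-delimited record per line
with open('my_coords.txt', 'r') as f1:
    pos = {}
    for line in f1:
        fields = line.strip().split('\t')
        pos[fields[0]] = (int(fields[1]), int(fields[2]))

# Read the linearized fasta: id -> sequence (the line after each header)
with open('my_fasta.fasta', 'r') as f2:
    seqs = {}
    for line in f2:
        if line.startswith('>'):
            seqs[line.strip().split('>')[1]] = next(f2).strip()

# Write each trimmed record; the slice is 0-based and excludes the end index
with open('my_fasta_trimmed.fasta', 'w') as out:
    for i in seqs:
        out.write('>' + i + '\n' + seqs[i][pos[i][0]:pos[i][1]] + '\n')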

Condensed, write directly to output:

#!/usr/bin/env python

# Read coordinates: id -> (start, end)
with open('my_coords.txt', 'r') as f1:
    pos = {}
    for line in f1:
        fields = line.strip().split('\t')
        pos[fields[0]] = (int(fields[1]), int(fields[2]))

# Stream the fasta and write each trimmed record immediately
with open('my_fasta.fasta', 'r') as f2:
    with open('my_fasta_trimmed.fasta', 'w') as out:
        for line in f2:
            if line.startswith('>'):
                start, end = pos[line.strip().split('>')[1]]
                out.write(line.strip() + '\n' + next(f2).strip()[start:end] + '\n')
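If linearizing the fasta beforehand is inconvenient, the wrapped sequence lines can be joined while reading. The following is a rough sketch rather than a drop-in replacement: read_fasta and trim_records are hypothetical helper names, and the id is assumed to be the first word of the header line. It handles one record at a time, so the whole file never has to sit in memory.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines,
    joining wrapped sequence lines into a single string per record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def trim_records(lines, pos):
    """Yield FASTA text for every record whose id has coordinates in pos.

    pos maps id -> (start, end) and is applied as a plain Python slice
    (0-based, end excluded), the same way as in the scripts above.
    """
    for header, seq in read_fasta(lines):
        seq_id = header.split()[0] if header else header
        if seq_id in pos:
            start, end = pos[seq_id]
            yield f">{seq_id}\n{seq[start:end]}\n"
```

With pos built as above, something like out.writelines(trim_records(f2, pos)) writes the trimmed records. If your coordinates are 1-based and inclusive (as in samtools), use start - 1 as the slice start.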

Small comment: rather than putting all sequences in seqs, you could write the output directly, without keeping everything in memory. That way, when the data is very large (or your RAM very small), you don't run into memory trouble.


@WouterDeCoster, I originally was going to write them directly, but chose to break it up so the OP can see the operations on each file.


That's alright, but it's a nice habit to be memory-efficient (also when not necessary).
