Question

Extract pairs of sequences from a fasta file based on a CSV list of headers

0

Entering edit mode

7.9 years ago

ajsakla • 0

I have a fasta file of primer sequences and I have a csv file with the sequence ids of primer sequence pairs that I need to parse out into their own fasta files. What I need to do is extract each pair of sequences from the fasta and put them into their own fasta files.

So my files look something like this:

My fasta file of primer sequences:

>01
ACGTACGT
>02
ACGTACGT
>03
ACGTACGT
>04
ACGTACGT

My CSV list of primer pairs:

column1,column2
01,03
02,04

Based on this example data set, what I need to end up with is 2 fasta files of the pairs of fasta sequences i.e., File1.fa will contain sequences 01 and 03 and File2.fa will contain sequences 02 and 04.

Thanks in advance!

fasta extract • 3.8k views

ADD COMMENT • link updated 7.9 years ago by st.ph.n ★ 2.7k • written 7.9 years ago by ajsakla • 0

score 1 · Answer 1 · 2017-09-13

1

Entering edit mode

7.9 years ago

st.ph.n ★ 2.7k

#!/usr/bin/env python
with open('pairs.txt', 'r') as f:
    # Get pairs, ',' separated
    pairs = []
    for line in f:
        pairs.append((line.strip().split(',')[0], line.strip().split(',')[1]))

with open('file.fasta', 'r') as fa:
    #open fasta to dictionary 
    primer = {}
    for line in fa:
        if line.startswith(">"):
            primer[line.strip().split('>')[1]] = next(fa).strip()

#For each pair, open new file, and print pair in fasta format
for p in pairs:
    with open('file' + str(pairs.index(p)) + '.fasta', 'w') as out:
        out.write('>' + p[0] + '\n' + primer[p[0]])
        out.write('>' + p[1] + '\n' + primer[p[1]])

ADD COMMENT • link 7.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thanks. I'm having some trouble getting this to run.

I've copied and pasted this script into a text file as extract.sh
I've chmod u+x extract.sh
I've replaced the 'pairs.txt' and 'file.fasta' with the appropriate file names
I've run ./extract.sh within the folder containing the two files.

But I get this error:

Traceback (most recent call last): File "./parse.sh", line 11, in <module> for line in f: ValueError: I/O operation on closed file

ADD REPLY • link 7.9 years ago by ajsakla • 0

0

Entering edit mode

This is python code, not bash. mv extract.sh extract.py, run as python extract.py

ADD REPLY • link 7.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I still get the same error:

Traceback (most recent call last): File "parse.py", line 11, in <module> for line in f: ValueError: I/O operation on closed file

ADD REPLY • link 7.9 years ago by ajsakla • 0

0

Entering edit mode

I edited the error.

ADD REPLY • link 7.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

So now I get this error:

Traceback (most recent call last): File "extract.py", line 18, in <module> with open('file' + str(pairs.index(p)) + '.fasta') as out: IOError: [Errno 2] No such file or directory: 'file0.fasta'

It looks like it's looking for the file or directory file0.fasta so when I make an empty file0.fasta I get this error:

Traceback (most recent call last): File "extract.py", line 19, in <module> out.write('>' + p[0] + '\n' + primer[p[0]]) KeyError: 'ccRepeat-46449'

ADD REPLY • link 7.9 years ago by ajsakla • 0

0

Entering edit mode

Should be fixed now.

ADD REPLY • link 7.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I'm getting a similar error as before. Key error is the id of the first sequence in the first pair:

Traceback (most recent call last): File "extract.py", line 18, in <module> file.write('>' + p[0] + '\n' + primer[p[0]]) KeyError: 'ccRepeat-46449'

If it helps, here's a sampling of my primer pairs list.

ccRepeat-46449,ge00012
ccRepeat-52142,ge00379
ccRepeat-48210,ge00387
cc00375,ge00431
ccRepeat-46825,ge01057
ccRepeat-49062,ge01469
cc15291,ge01941
ccRepeat-46651,ge03344
cc22391,ge03673
ccRepeat-81763,ge03838

It's just occurred to me that python may not be reading this file as csv formatted. Could that be an issue?

ADD REPLY • link 7.9 years ago by ajsakla • 0

0

Entering edit mode

this line is separating the ids by ',' into a tuple:

pairs.append((line.strip().split(',')[0], line.strip().split(',')[1]))

The error you're getting is that the keys in the dictionary (headers from your fasta), are not matching the primers. Double check that the headers match the primers, and if there is anything else in the header, this needs parsed out. The script was based on your OP example, so there may be further things to consider where the script needs to be modified. If you're unable to troubleshoot on your own, it's best to find another method, ideally by your own testing and come back with errors. Biostars is not a coding service.

ADD REPLY • link 7.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I realize now I should provide exact examples of the data I'm working with. I was able to get your script to work with my data which saved me many hours of parsing thousands of pairs manually. Thank you for your time.

ADD REPLY • link 7.9 years ago by ajsakla • 0