Question

Extract a subset of fasta file based on header in python

0

Entering edit mode

6.2 years ago

xupbuy ▴ 30

I have a fasta file like this:

>Contig1
aattata
>Contig2
ttgtgtgt
>Contig3
cgcgtgtgt
>Contig4
tgtgttgtttttt
>Contig5
cccgcgctg
>Contig6
tgtgtgtgtgtgt
>Contig7
aacgctg
>Contig8
acgtgtgc

I have a 2nd txt file:

1
5
7

I want extract a subset from the fasta file if the header contains EXACTLY the number in the txt file, like this:

    >Contig1
    aattata
    >Contig5
    cccgcgctg
    >Contig7
    aacgctg

I don't want >Contig10 even it contains the number 1.

How to do this in python? I tried scripts other people provided but they both give me empty output file. I don't know what I did wrong: script1:

"""                                                                                 
%prog some.fasta wanted-list.txt                                                    
"""                                                                                 
from Bio import SeqIO                                                               
import sys                                                                          

wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')      
sys.stdout = open('output.fasta', 'w')                               
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")
sys.stdout.close()

Script2:

from Bio import SeqIO
import sys                                                                          

fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
            print(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

python fasta • 5.1k views

ADD COMMENT • link updated 6.2 years ago by Siya Diya ▴ 10 • written 6.2 years ago by xupbuy ▴ 30

1

Entering edit mode

Why are you reinventing the wheel? There already exist a bunch of tools that do this!

ADD REPLY • link 6.2 years ago by Ram 44k

1

Entering edit mode

You can try this:

rename 1,2,3 to Contig1, contig2, contig3 (using awk)
samtools index file.fasta,
xargs samtools faidx fasta.file < contigs.txt >> output.fasta

ADD REPLY • link 6.2 years ago by Alice ▴ 320

score 0 · Answer 1 · 2018-10-01

0

Entering edit mode

6.2 years ago

WouterDeCoster 47k

Print your seq.id, you'll see it will never be in wanted since it looks like... Contig1

ADD COMMENT • link 6.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Indeed! Thank you finally! Worked!

ADD REPLY • link 6.2 years ago by xupbuy ▴ 30

0

Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLY • link 6.2 years ago by GenoMax 147k

score 0 · Answer 2 · 2018-10-03

Try these updated scripts

Script 1:

"""
%prog some.fasta wanted-list.txt
"""
from Bio import SeqIO
import sys, re

wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
SeqIO.write((seq for seq in seqiter if any(item == re.findall('\d+',seq.id)[-1] for item in wanted)), sys.stdout, "fasta")
sys.stdout.close()

Script 2:

from Bio import SeqIO
import sys, re

fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
            print(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if any(item == re.findall('\d+',seq.id)[-1] for item in wanted):
            SeqIO.write([seq], f, "fasta")