Extract a subset of fasta file based on header in python
2
0
Entering edit mode
6.1 years ago
xupbuy ▴ 30

I have a fasta file like this:

>Contig1
aattata
>Contig2
ttgtgtgt
>Contig3
cgcgtgtgt
>Contig4
tgtgttgtttttt
>Contig5
cccgcgctg
>Contig6
tgtgtgtgtgtgt
>Contig7
aacgctg
>Contig8
acgtgtgc

I have a 2nd txt file:

1
5
7

I want extract a subset from the fasta file if the header contains EXACTLY the number in the txt file, like this:

    >Contig1
    aattata
    >Contig5
    cccgcgctg
    >Contig7
    aacgctg

I don't want >Contig10 even it contains the number 1.

How to do this in python? I tried scripts other people provided but they both give me empty output file. I don't know what I did wrong: script1:

"""                                                                                 
%prog some.fasta wanted-list.txt                                                    
"""                                                                                 
from Bio import SeqIO                                                               
import sys                                                                          

wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')      
sys.stdout = open('output.fasta', 'w')                               
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")
sys.stdout.close()

Script2:

from Bio import SeqIO
import sys                                                                          

fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
            print(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")
python fasta • 5.1k views
ADD COMMENT
1
Entering edit mode

Why are you reinventing the wheel? There already exist a bunch of tools that do this!

ADD REPLY
1
Entering edit mode

You can try this:

  • rename 1,2,3 to Contig1, contig2, contig3 (using awk)
  • samtools index file.fasta,
  • xargs samtools faidx fasta.file < contigs.txt >> output.fasta
ADD REPLY
0
Entering edit mode
6.1 years ago

Print your seq.id, you'll see it will never be in wanted since it looks like... Contig1

ADD COMMENT
0
Entering edit mode

Indeed! Thank you finally! Worked!

ADD REPLY
0
Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode
6.1 years ago
Siya Diya ▴ 10

Try these updated scripts

Script 1:

"""
%prog some.fasta wanted-list.txt
"""
from Bio import SeqIO
import sys, re

wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
SeqIO.write((seq for seq in seqiter if any(item == re.findall('\d+',seq.id)[-1] for item in wanted)), sys.stdout, "fasta")
sys.stdout.close()

Script 2:

from Bio import SeqIO
import sys, re

fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
            print(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if any(item == re.findall('\d+',seq.id)[-1] for item in wanted):
            SeqIO.write([seq], f, "fasta")
ADD COMMENT

Login before adding your answer.

Traffic: 2078 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6