Extracting fasta sequences with identical header
2
0
Entering edit mode
7.7 years ago
seta ★ 1.9k

Hi all friends,

I have a large fasta file that most sequences have a identical header (they differ from the length). I usually extracted the sequences of interest with the following python codes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
%program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
from Bio import SeqIO
except:
print "This program requires the Biopython library"
sys.exit(0)

try:
fasta_file = sys.argv[1]  # Input fasta file
wanted_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file
except:
print __doc__
sys.exit(0)

wanted = set()
with open(wanted_file) as f:
for line in f:
    line = line.strip()
    if line != "":
        wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
for seq in fasta_sequences:
    name = seq.id
    if name in wanted and len(seq.seq.tostring()) > 0:
        wanted.remove(name) # Output only the first appearance for a name
        SeqIO.write([seq], f, "fasta")

But the code just extracted one of the sequences with identical header, while I want to all sequences of interest that may have identical headers and adding the sequential number at the beginning or end of identical headers in the extracted fasta file? something like:

header_1
header_2
header_3
etc.

Many thanks

fasta extracting identical header • 3.2k views
ADD COMMENT
3
Entering edit mode
7.7 years ago

I didn't test it:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
%program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
    from Bio import SeqIO
except:
    print("This program requires the Biopython library")
    sys.exit(0)

try:
    fasta_file = sys.argv[1]  # Input fasta file
    wanted_file = sys.argv[2] # Input wanted file, one gene name per line
    result_file = sys.argv[3] # Output fasta file
except:
    print(__doc__)
    sys.exit(0)

wanted = {} # CHANGE
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted[line] = 0 # CHANGE

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted and len(seq.seq.tostring()) > 0:
            wanted[seq.id] += 1     # CHANGE
            seq.id += '_{0}'.format(wanted[seq.id]) # CHANGE
            #wanted.remove(name) # Output only the first appearance for a name
            SeqIO.write([seq], f, "fasta")
ADD COMMENT
0
Entering edit mode

Thanks, I name it py_ext.py, but it retutned the below error:

File "py_ext.py", line 21
    fasta_file = sys.argv[1] # Input fasta file
             ^
IndentationError: expected an indented block

I tried to convert tab to space, but the error still remain. Would you please check the code?

ADD REPLY
1
Entering edit mode

I edited my answer. The code you pasted in your question has the problem with indentations (no idents after try: or with:) so it doesn't work at all. I would test my code if you provide some sample lines of your input files.

ADD REPLY
1
Entering edit mode

Thank you very much. It worked well and I accepted it.

ADD REPLY
0
Entering edit mode
File "extractentr.py", line 17
    print "This program requires the Biopython library"
                                                      ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("This program requires the Biopython library")?

and after changing it

 File "extractentr.py", line 25
    print __doc__
                ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(__doc__)?
ADD REPLY
0
Entering edit mode

I updated the code to Python3.

ADD REPLY
1
Entering edit mode
7.7 years ago

You can use BBMap's FilterByName tool like this:

filterbyname.sh in=x.fa out=y.fa names=z.fa include=t substring

That will send everything to z.fa with headers matching those between x.fa and y.fa. The flag "substring" allows substring matches, which seems to be what you are seeking.

ADD COMMENT

Login before adding your answer.

Traffic: 2537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6