Extracting specific FASTA sequences from a complex FASTA file
6
0
Entering edit mode
9.7 years ago
boileau • 0

I have been trying to troubleshoot this for the past couple days but I can't seem to find the issue, any help you could provide is greatly appreciated. I am a noob but will be writing similar scripts to this in the future- so any added resources are welcomed.

I have a large file with a bunch of FASTA sequences and I am trying to extract specific ones based on a list of IDs. While I would like the output to be a collection of whatever ID == FASTA sequences it won't give me an output with results. I know that if I change t to a definite string (ie. 'NM_xxxxx') then it will give a desired output so I suspect the issue is in reading from the csv file.

import re
import sys
import csv
from Bio import SeqIO

output = open("matches.fasta", "w")
x = SeqIO.parse(open(sys.argv[1], 'rU'),'fasta')
y = csv.reader(open(sys.argv[2], 'rU'))

for row in y:
    print str(row[1])
    t = str(row[1])
    for fasta in x:
            name, sequence = fasta.id, fasta.seq.tostring()
            searchObj = re.search( t, fasta.id, re.M|re.I)
            if searchObj:
                SeqIO.write(fasta, output, "fasta")

output.close()
fasta python regex • 15k views
ADD COMMENT
2
Entering edit mode

I hope you read through the dozens of ID matching posts here.

ADD REPLY
0
Entering edit mode

I did, but the fact my experience is almost exclusively python and very limited at that means I am unable to take advantage of related posts.

ADD REPLY
1
Entering edit mode

From your description, there could be one problem with the data that is a bit undetectable unless you look for it. For a few sequences, after reading them with Bio.SeqIO, try printing their ID, name and description attributes. Check if your expectations of what should be printed are in fact what are printed.

If the FASTA header has white spaces, the SeqIO parser can be a bit crazy. I'd usually say that the part before spaces goes into id and the part after goes into description, but this assumption was recently proven wrong.

Check out if the data conforms to expectations, and then dig into your code for changes.

ADD REPLY
0
Entering edit mode

The main issue is that the FASTA header has a wierd header (ie. >NM_xxxxxxx_2000_from_chromosome_spongebob) which is why I was trying to use RE to match with my candidate list containing just the RefSeq Ids (NM_xxxxxx). Else, Bio.SeqIO is behaving like it should given the control tests I've run.

ADD REPLY
1
Entering edit mode

Ah, I see. Have you tried printing str(row[1]), as plain text as well as an ASCII sequence? Maybe a new line or a black space at the end (or a weird unicode character) is messing up your RE search.

ADD REPLY
0
Entering edit mode

You were right, looks like the export .txt left some weird ascii characters. Thanks for the help Ram!

ADD REPLY
0
Entering edit mode

In linux

#exract fasta sequences with a reference list ALL SEQUENCE LENGTH

#LIST is your list as text and IN is your fasta
>awk 'BEGIN{while((getline<"LIST")>0)l[">"$1]=1}/^>/{f=l[$1]?1:0}f' IN.fa > OUT.fa

ADD REPLY
0
Entering edit mode

Couple of notes:

a. LIST is a file, not text. substituting that can be a bit non-trivial

b. Heng Li's bioawk is awk for biological formats. It will obviate the need for multi-line checks and header line awareness in your awk script.

ADD REPLY
3
Entering edit mode
9.7 years ago
dago ★ 2.8k

Try pyfaidx, it is really easy to use.

pip install pyfaidx

then to extract your seq

xargs faidx -d "<here the separator used in the IDs of your fasta>" <Your fastafile.fasta> < <Here the file containing the IDs>

Example

xargs faidx -d "|" My_seq.fasta < My_Ids.txt

PS: this is a clear example of the great "teaching power" of Biostars. extract sequences from multi fasta using partial ID

ADD COMMENT
2
Entering edit mode
9.7 years ago

What you are missing is seek() in biopython. Assuming the seq_ids in a file called "names"

How to use seek() with Biopython SeqIO object

from Bio import SeqIO

fasta=SeqIO.parse("fasta.fa","fasta")
seq_dict={}

for record in fasta:
        seq_dict[record.id]=record.seq
# If you feel its memory intensive, try next solution

for line in open("names","r"):
        line=line.strip()
        if line in seq_dict.keys():
                print ">"+line+"\n"+seq_dict[line]

Otherwise write a small function like:

from Bio import SeqIO

def getSeqRec(seq_id):
        fasta=SeqIO.parse("fasta.fa","fasta")
        for record in fasta:
                if seq_id==record.id:
                        return record

for line in open("names","r"):
        line=line.strip()
        seq_rec=getSeqRec(line)
        print ">"+seq_rec.id
        print seq_rec.seq
ADD COMMENT
1
Entering edit mode
9.7 years ago
mark.ziemann ★ 1.9k

Does it need to be Python? samtools faidx does it.

samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa chr22
ADD COMMENT
1
Entering edit mode
9.7 years ago
barvazduck ▴ 20

In linux

#exract fasta sequences with a reference list ALL SEQUENCE LENGTH
#LIST is your list as text and IN is your fasta

awk 'BEGIN{while((getline<"LIST")>0)l[">"$1]=1}/^>/{f=l[$1]?1:0}f' IN.fa > OUT.fa
ADD COMMENT
0
Entering edit mode
9.7 years ago

Here is a script I wrote. Before using it, Perl and perl module BioUtil should be install.

fasta_extract_by_pattern.pl could extract FASTA sequences by header or sequence, exactly matching or regular expression matching are both supported. The query pattern could read from files. And negation of the result is also easy to get. What's the most important, it could read from STDIN.

Combining fasta2tab and tab2fasta with cvs_grep could also have the same function.

Examples

1. sequences WITH "bacteria" in header

fasta_extract_by_pattern.pl -r -p Bacteria *.fa > result.fa

2. sequences WITHOUT "bacteria" in header

fasta_extract_by_pattern.pl -r -n -p Bacteria seq1.fa seq2.fa > result.fa

3. sequences with TTSAA (AgsI digest site) in SEQUENCE. Base S stands for C or G.

fasta_extract_by_pattern.pl -r -s -p 'TT[C|G]AA' seq.fa > result.fa

4. sequences (read from STDIN ) with header that matches any patterns in list file

zcat seq.fa.gz | fasta_extract_by_pattern.pl -pf name_list.txt > result.fa

See more: Manipulation on FASTA/Q format file

ADD COMMENT
0
Entering edit mode
9.7 years ago

To extract sequences with specific names from a large file efficiently, I suggest you use my filterbyname tool:

A: Extract sequence with header from a fasta file with specific ID given in another

ADD COMMENT

Login before adding your answer.

Traffic: 1886 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6