I have been trying to troubleshoot this for the past couple days but I can't seem to find the issue, any help you could provide is greatly appreciated. I am a noob but will be writing similar scripts to this in the future- so any added resources are welcomed.
I have a large file with a bunch of FASTA sequences and I am trying to extract specific ones based on a list of IDs. While I would like the output to be a collection of whatever ID == FASTA sequences it won't give me an output with results. I know that if I change t to a definite string (ie. 'NM_xxxxx') then it will give a desired output so I suspect the issue is in reading from the csv file.
import re import sys import csv from Bio import SeqIO output = open("matches.fasta", "w") x = SeqIO.parse(open(sys.argv[1], 'rU'),'fasta') y = csv.reader(open(sys.argv[2], 'rU')) for row in y: print str(row[1]) t = str(row[1]) for fasta in x: name, sequence = fasta.id, fasta.seq.tostring() searchObj = re.search( t, fasta.id, re.M|re.I) if searchObj: SeqIO.write(fasta, output, "fasta") output.close()
I hope you read through the dozens of ID matching posts here.
I did, but the fact my experience is almost exclusively python and very limited at that means I am unable to take advantage of related posts.
From your description, there could be one problem with the data that is a bit undetectable unless you look for it. For a few sequences, after reading them with Bio.SeqIO, try printing their
ID
,name
anddescription
attributes. Check if your expectations of what should be printed are in fact what are printed.If the FASTA header has white spaces, the SeqIO parser can be a bit crazy. I'd usually say that the part before spaces goes into id and the part after goes into description, but this assumption was recently proven wrong.
Check out if the data conforms to expectations, and then dig into your code for changes.
The main issue is that the FASTA header has a wierd header (ie. >NM_xxxxxxx_2000_from_chromosome_spongebob) which is why I was trying to use RE to match with my candidate list containing just the RefSeq Ids (NM_xxxxxx). Else, Bio.SeqIO is behaving like it should given the control tests I've run.
Ah, I see. Have you tried printing
str(row[1])
, as plain text as well as an ASCII sequence? Maybe a new line or a black space at the end (or a weird unicode character) is messing up your RE search.You were right, looks like the export .txt left some weird ascii characters. Thanks for the help Ram!
In linux
#exract fasta sequences with a reference list ALL SEQUENCE LENGTH
#LIST is your list as text and IN is your fasta
>awk 'BEGIN{while((getline<"LIST")>0)l[">"$1]=1}/^>/{f=l[$1]?1:0}f' IN.fa > OUT.fa
Couple of notes:
a. LIST is a file, not text. substituting that can be a bit non-trivial
b. Heng Li's bioawk is awk for biological formats. It will obviate the need for multi-line checks and header line awareness in your awk script.