How to limit/filter BLAST results using NcbiblastnCommandline
7.1 years ago by yarmda ▴ 40

I am using a command like

NcbiblastnCommandline(query="fasta.fasta", db="nt", outfmt=5, out="out.xml", show_gis=True)

And that is returning results.

However, I want to limit the results to just the GI number/taxID (not the alignment or the bulk of the data) and to cap the hitlist size (as you would with NCBIWWW) at some arbitrary number. How can this be done?

Ultimately, I am trying to find species related to a given target (but not the target itself) and download their sequences. Since BLAST can't return the complete genomes, I just want their identifiers, so I don't need most of the BLAST output.
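To illustrate, this is roughly what extracting just the hit IDs from the current out.xml looks like (a rough sketch using Bio.Blast.NCBIXML; parsing the full XML is exactly the overhead I'd like to avoid):

from Bio.Blast import NCBIXML

with open("out.xml") as handle:
    for record in NCBIXML.parse(handle):        # one record per query sequence
        for alignment in record.alignments:     # one entry per hit
            # hit_id is typically of the form "gi|123456|ref|XM_000001.1|"
            print(alignment.hit_id)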

biopython ncbiblastncommandline blast

I'm assuming you're running BLAST from Python because you have upstream/downstream code that does other processing. However, it would be much easier to run BLAST on the CLI, or to call your blast command from Python with the subprocess module. Output format 6 (tab-delimited) is the easiest to parse, IMO. You can set max_target_seqs so that each query sequence returns at most that number of database hits, as in the sketch below.
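A rough sketch of that with the Biopython wrapper (untested; the sgi/staxids output fields assume a taxonomy-aware copy of the nt database, and the cap of 10 hits is arbitrary):

from Bio.Blast.Applications import NcbiblastnCommandline

cline = NcbiblastnCommandline(
    query="fasta.fasta",
    db="nt",
    # custom tabular output: subject GI, subject taxid(s), e-value;
    # the inner quotes keep the multi-field spec together as one argument
    outfmt='"6 sgi staxids evalue"',
    max_target_seqs=10,  # return at most 10 database hits per query
    out="hits.tsv",
)
stdout, stderr = cline()

# hits.tsv is then a plain tab-delimited file
with open("hits.tsv") as fh:
    for line in fh:
        sgi, staxids, evalue = line.rstrip("\n").split("\t")
        print(sgi, staxids)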

7.1 years ago

However, I want to limit the results to only returning the gi number/taxID

Use bioalcidae to extract the hit IDs, then call efetch to get the XML for each sequence, and use XPath to extract the taxon ID/name:

java -jar dist/bioalcidae.jar -F BLAST -e 'while(iter.hasNext()) {var hit=iter.next(); out.println(hit.getHitId());}' input.blastn.xml |\
grep '^gi|' | cut -d '|' -f 2 | sort | uniq | while read ID
do
    wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&retmode=xml&rettype=fasta" |\
    xmllint --xpath $'concat(TSeqSet/TSeq/TSeq_taxid/text(),"|",TSeqSet/TSeq/TSeq_orgname/text(),"\n")' -
done | uniq | sort | uniq
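
If you want to stay in Python instead of shell, roughly the same efetch lookup can be done with Bio.Entrez (a sketch; the GI list and e-mail address below are placeholders, and the TSeq_* keys assume the TinySeq XML that rettype=fasta with retmode=xml returns):

from Bio import Entrez

Entrez.email = "you@example.com"     # placeholder; NCBI asks for a contact address

gi_numbers = ["123456", "789012"]    # hypothetical GIs taken from the BLAST hit IDs

handle = Entrez.efetch(db="nuccore", id=",".join(gi_numbers),
                       rettype="fasta", retmode="xml")
records = Entrez.read(handle)        # TinySeq XML parsed into a list of dicts
handle.close()

seen = set()                         # mirror the shell pipeline's uniq
for rec in records:
    pair = (rec["TSeq_taxid"], rec["TSeq_orgname"])
    if pair not in seen:
        seen.add(pair)
        print("%s|%s" % pair)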