I am a bioinformatics novice, but I'm learning and managing. I've recently sequenced and am now analysing and annotating 8 transcriptomes from 2 different species. I recently ran a BLASTX against the Drosophila melanogaster proteome (with .xml output).
The next thing I want to do is isolate all of the transcripts from that BLAST that did not hit anything in the Dmel proteome and BLAST them against the SwissProt Invertebrate database. As I said earlier, I am a novice, so please forgive me if this is a really simple thing to do. I would like to know, specifically, how I might approach subsetting the transcriptome fasta file to only contain the transcripts with no BLAST hits from Dmel.
Any and all insight is greatly appreciated.
Hmm, it's a little unfortunate that you ran the BLASTX with XML output; with tabular output you would have been able to process the results much more easily and get to the list of no-hit IDs.
I think this will help: Retrieve nonmatching blast queries. It takes the FASTA and XML files as inputs.
If the BLAST version that you used preserves the queries with "No hits found", you can also get the list of no-hit queries directly from the BLAST output.
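One way to do this from the XML is sketched below in Python. It assumes the standard BLAST+ XML layout, where each query is an `Iteration` element, the FASTA header is stored in `Iteration_query-def`, and hits appear as `Hit` elements under `Iteration_hits`. The function names and file names are made up for illustration. As noted above, it only works if your BLAST version actually writes an `Iteration` for queries with no hits.

```python
import xml.etree.ElementTree as ET


def no_hit_queries(xml_path):
    """Yield the ID of every query in a BLAST XML file that has no <Hit>.

    The ID is taken as the first whitespace-separated word of
    Iteration_query-def, i.e. the part of the FASTA header before any
    description.
    """
    # iterparse streams the file, so large BLAST outputs are not loaded whole.
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != "Iteration":
            continue
        hits = elem.find("Iteration_hits")
        if hits is None or hits.find("Hit") is None:
            words = elem.findtext("Iteration_query-def", default="").split()
            if words:
                yield words[0]
        elem.clear()  # release the memory held by this finished iteration


def write_no_hit_ids(xml_path, out_path):
    """Write one no-hit query ID per line and return how many were written."""
    count = 0
    with open(out_path, "w") as out:
        for query_id in no_hit_queries(xml_path):
            out.write(query_id + "\n")
            count += 1
    return count
```

For example, `write_no_hit_ids("blastx_dmel.xml", "no_hit_ids.txt")` (both file names are placeholders) would produce a plain list of IDs, one per line. If the count comes back as zero, it is worth checking whether your BLAST output omits no-hit queries entirely.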
Then based on the above list, extract the no-hit sequences using SeqKit.
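With SeqKit this step is normally done with `seqkit grep` and a file of IDs (its `-f` option). If SeqKit is not to hand, the same filter can be sketched in plain Python; this is an alternative to SeqKit, not what it does internally. It assumes that the sequence ID is the first word after `>` in each FASTA header and that the ID list has one ID per line. The function and file names are hypothetical.

```python
def read_ids(ids_path):
    """Read one sequence ID per line, ignoring blank lines."""
    with open(ids_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def subset_fasta(fasta_path, ids, out_path):
    """Copy FASTA records whose ID is in `ids`; return the number copied.

    The ID is the first whitespace-separated word after '>' in the header.
    """
    wanted = set(ids)
    written = 0
    keep = False  # whether the record currently being read should be copied
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                words = line[1:].split()
                keep = bool(words) and words[0] in wanted
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written
```

A call such as `subset_fasta("transcriptome.fasta", read_ids("no_hit_ids.txt"), "no_hit.fasta")` (placeholder file names) writes the no-hit transcripts to a new FASTA file, which can then be used as the query for the SwissProt BLAST. Comparing the returned count with the number of IDs in the list is a quick check that the IDs match the FASTA headers.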