I've got 344,964 sequences and did a blastx against the nr database but I got 350,252 hits. There must be some redundancy in the results. How do I explain this redundancy and can I remove it from my results?
This is a misunderstanding of "non-redundant": the term means that no two 100% identical sequences appear in the database under different headers; instead, identical sequences are collapsed into a single entry that keeps the headers and GIs of all of them.
This doesn't guarantee a single hit per query, nor should it: think of highly conserved proteins, which will have multiple hits all over the place, yet those hits are not redundant. Your result contains on average ~1 hit per query; there may be queries with 0 hits and some with many. How you should treat multiple hits depends on your application, and for most applications keeping only the best hit is not recommended.
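If you want to see how the hits are distributed across your queries, you can count occurrences of each query ID in the tabular output. A minimal sketch, assuming default -outfmt 6 with the query ID in column 1 (the file name is a placeholder):

# count hits per query ID, most-hit queries first; queries with 0 hits won't appear
cut -f1 blastx_result.txt | sort | uniq -c | sort -rn | head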
Just repeat the search against SwissProt or UniRef90. Fewer hits, but cleaner results.
This works:

blastx -query $1 -db data_base.fasta -out blastx_result.txt -evalue 1e-3 -outfmt 6 -max_target_seqs 1

$1: your file with sequences in FASTA format
-db: the database to search against
-out: results file in plain-text format
-evalue: report only hits at or below this e-value threshold, in this case 1e-3
-outfmt 6: tabular format; omit it to get the default pairwise output
-max_target_seqs 1: keep at most one hit per query, and only if it meets the 1e-3 cutoff
:-)
Strictly speaking, 'nr' is non-identical, not non-redundant. Other databases, such as the UniProt UniRef databases, provide truly non-redundant options.
Another consideration is that, since BLAST uses a local alignment based search method, it will find conserved local features (e.g. domains) as well as sequences that are similar over their whole length. If you have reason to believe that your queries should have end-to-end matches (for example, if you are searching with predicted CDS), then you might want to use an alternative approach: produce the translations directly and use a global alignment search method such as GGSEARCH (part of the FASTA suite).
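If you go that route, a minimal sketch, assuming the FASTA36 package is installed and the predicted CDS have already been translated to protein (e.g. with EMBOSS transeq; file names are placeholders):

# global query:target alignment search of translated CDS against a protein set
ggsearch36 -E 1e-3 predicted_proteins.faa target_proteins.faa > ggsearch_result.txt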
Thanks, guys. I tried to process the results by removing the multiple hits. I removed all the rows with the same query IDs and now I have 319,235 results. Is this an appropriate way to deal with the multiple hits?
Not even close!
The nr database contains sequences from numerous species; it also contains sequences that represent alternative transcripts or protein isoforms of a gene. Assuming you kept the best hit and didn't just arbitrarily delete all but one hit, there's plenty of useful information in this data.
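For reference, one common way to keep only the top-scoring hit per query from -outfmt 6 output — a sketch, assuming the default column order with bitscore in column 12:

# sort by query ID, then by bitscore descending; keep the first (best) row per query
sort -k1,1 -k12,12gr blastx_result.txt | awk '!seen[$1]++' > best_hits.txt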
You might want to update your OP so we have a better idea of what you're doing.
Thank you, Joe. I also did a blast against the nt database. I have 390,292 sequences but I got over 1.2 million results. I know the nt database is not non-redundant, so I removed all the rows with the same query IDs and I have 390,079 results left.
I would like to parse the results by kingdom. Is this a correct way of doing this?
I'm still not exactly sure what you mean. Did you keep only the best hit for each query (by hit statistics), or did you delete all but one hit for each query arbitrarily? Did you track the scores of these hits?
It seems like you lost ~220 query sequences when you removed hits; that means you threw out sequences from your transcriptome, which doesn't make sense.
Is this data from a single species (cell culture/tissue) or from metagenomic/environmental samples?
My sequences are de novo assembled contigs (390,292 in total) from RNA-seq reads. My plan is to blast these contigs against the nt database first to sift out the prokaryotic sequences, and then blast the eukaryotic sequences against the nr, SwissProt and RefSeq databases to get gene names for my sequences. I set the output format so that it only shows the best hit.
I don't think I lost ~200 sequences by removing the redundancy. When you blast against these databases, it is quite normal for some contigs to have no hits at all when you are doing de novo assembly and gene annotation.
I'm just wondering if that was the correct way to remove redundancy when you do local blast.
If you're looking for differential expression and the material came from cell culture, check for common bacterial contaminants. That is a pretty massive transcriptome; are you looking at a mammal or a plant?
cdsouthan is right (see below); you may want to do an initial filtering against the nr database and then use SwissProt for the annotations.
I would also start by removing hits that have expect values greater than 1e-10 or hits with less than 50-70% identity.
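With tabular output that filter is a one-liner — a sketch assuming default -outfmt 6, with percent identity in column 3 and e-value in column 11 (file names are placeholders):

# keep hits with >= 50% identity and e-value <= 1e-10
awk -F'\t' '$3 >= 50 && $11 <= 1e-10' blast_result.txt > filtered_hits.txt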
After that you can either pick the absolute best hit, best hit from a specific reference family/genus/species, or keep the best hit per subject species. Picking a single hit by score is the easiest, but picking the best hit per subject species can help you get a feel for "where" your species is with respect to a particular gene.
One thing I've noticed with mammalian transcriptomes is that you will likely see a large number of hits from human (followed by mouse) regardless of how far away your species is phylogenetically. This is due to the relative completeness of the human/mouse genomes versus other species. This probably holds true elsewhere, so don't be surprised if relatively distant species with high-quality genomes generate more hits than closer but less complete species.
Another recommendation I can make is to annotate for protein families with HMMER. This can increase the overall confidence in your annotations and help you get some idea about the function of things that don't have BLAST hits.
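A minimal sketch of a Pfam scan with HMMER, assuming you have downloaded Pfam-A.hmm and translated your contigs to protein (file names are placeholders):

# prepare the profile database (run once)
hmmpress Pfam-A.hmm
# scan translated contigs against Pfam using the curated gathering cutoffs
hmmscan --cut_ga --domtblout contigs_pfam.domtbl Pfam-A.hmm contig_proteins.faa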
Thank you, Joe. You really helped me a lot. Actually, this is a reptilian transcriptome. The reason I used the nt database as my initial filter is that many of these sequences may not be protein-coding. What I want to do is separate the non-eukaryotic sequences out of my data and then blast the metazoan sequences against the major non-redundant protein databases, and then blast the sequences without any hit in these databases against the InterPro database to get a better grasp of them.
I guess HMMER is a good tool for that job. It is free, isn't it?
As far as I know, the nt database includes non-coding sequences; only the nr database won't have these.
Don't throw out sequences that don't have any sufficiently high-scoring hits. Filtering by taxonomy should only be used to discard potential microbial contamination (either from the experimental procedure or natural microbiota). You could be throwing out novel proteins if you discard sequences without hits, or those whose only hits are in relatively distant eukaryotic species.
HMMER will help with this, though it isn't like BLAST. You may want to read up on it before jumping in, but it is a free and fairly simple tool to use.
If you suspect a large number of your sequences are noncoding, you should look into tools that predict noncoding sequences. Another approach would be to download non-coding transcripts from many species from Ensembl and blast against those. You could also go back to your results from the nt search and see what came up there.
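For the Ensembl route, something like this would work — a sketch, with file names as placeholders for whatever ncRNA FASTA files you download:

# build a nucleotide BLAST database from the downloaded ncRNA sequences
makeblastdb -in ensembl_ncrna.fa -dbtype nucl
# search the contigs against it
blastn -query contigs.fa -db ensembl_ncrna.fa -evalue 1e-3 -outfmt 6 -out ncrna_hits.txt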
From my own vested interest, is this a snake?
Thanks for your kind advice, Joe. It is a turtle transcriptome, in fact. There are not too many people studying turtle transcriptomes, are there?