Entering edit mode
4.3 years ago
Dunois
★
2.8k
Hi all,
I have a transcriptome assembled de novo, and had its read support evaluated using Salmon
. The transcriptome was constructed to find a couple of genes of interest, which were subsequently identified using Blast
searches.
I am looking at the read support for some of these candidates now, and Salmon
indicates that they have 0 reads assigned to them (NumReads == 0.00
).
My questions:
- How can something be assembled from a set of reads, and not have any reads align to it?
- Can I retain these candidate transcripts for further analysis despite them having 0 read support?
I went through the Salmon
documentation, but I was unable to find any clarifications indicating what this means.
one possibility is that when doing the assembly (to get the transcriptome) you end up with transcripts that are not 100% correct. Assembly processes always strive to get to a consensus and it is that consensus that might sometime result in a transcript for which the consensus is too deviating from the reality, giving rise to the fact that the original reads might not align to them anymore because too many wrong bases have been selected for the consensus.
In a more broad overview level this transcript will look ok (== you can have OK blast hits) but on the lower per-base level it might be too deviating
Hmm I must mention these weren't just* okay-ish blast hits, but had really high evalues (e.g., 10^(-200), and had the right domains. Should I just filter out matches with no read support irrespective of that? I am just worried that the gene I'm looking for just so happened to have a very low level of expression at the time of sampling, leading to low read support; but this wouldn't equate to no read support, right?
with those stats why do you than consider them as not ok-ish blast hits? those can't be random things no?
if you do expression analysis, then yes it's common practise to filter those out
I mean that's what I am saying: they were significant matches, but had 0 read support. And that's why I'm concerned. And at this point, I'm not looking to perform expression analysis; all I want to do is identify a suitable candidate for the gene of interest.
if the goal is gene discovery then yes you can keep those in. As mentioned above there are possible reasons why this might occur but that does not mean they might not be valid gene (loci).
Blast works on a much lower specificity (nucleotide-wise) than most aligners will do (== read aligners will be much more susceptible for per-nucleotide mismatches than blast is)
@lieven.sterck, I guess I'll retain the sequence then, but I'd also appreciate your inputs on what @Rob mentioned in their comment down below.
what Rob and I are stating here are complementary angles (or even additive) to explain the no read mapped to transcript issue. For what it's worth it could be both even.
Building on Rob's info: if the transcript in question is only present in a few samples it could very well be assembled but since in the complete pool of all your RNAseq data it can be that it falls beneath some thresholds and thus gets filtered/masked out when analysing all data together.
Seeing the input from others here I think you can safely continue with the result you have (especially given the circumstantial evidence you gathered)
This of course does not mean you should not be critical on the results you get, that is actually a good attitude in science, one that many unfortunately often lack.
@lieven.sterck thank you. I'll follow up on what I mentioned in my reply to @Rob's post here (https://www.biostars.org/p/460562/#460738), and update this thread with my findings. Given that the worst case scenario you're all outlining isn't particularly bad, I think this is a good direction to proceed in.
Dunois : Did the blast result make sense for (more or less) the entire length of the transcript in question? Assembly is not a perfect science and it is possible that this may be an artifact.
Hi @genomax, as far as I can tell, yes. There is at least one
blast
hit with 100% coverage. Most e-values are0
or close to0
. Are you suspecting the sequence of being a chimera?Have you done anything to remove redundancy in your transcript set? Something along the lines of CD-HIT-EST?
I did. But I didn't do the read mapping against this redundancy-reduced transcriptome (see my post here (https://www.biostars.org/p/460562/#460738).
In addition to Lieven’s excellent comments, I wanted to ask an orthogonal question about the process here. Has the assembly been performed from a collection of different samples? And if so, is this 0 read support transcript discovered in each individual sample, or across all samples, or across the data if you pool them together for quantification. Also, is this transcript sequence distinct from others in the assembly (if you look over the salmon logs that were produced when _indexing_ the reference, it will tell you if there were identical transcripts that were discarded). There are a few things that could be happening. The transcript could be present (expressed) in some samples but not others, it could be a sequence duplicate and therefore not quantified, it could be a computational chimera that the assembler identifies by joining reads from different samples but that isn’t well supported in any individual sample (sort of unlikely given the blast result), or it could be something along the lines of what Lieven mentioned.
@Rob I'm not sure what you're referring to as a sample here, so I'll add some context. I think 10-20 individuals of the species (a marine eukaryote) were collected and sacrificed to obtain the RNA. These RNAs were then pooled together and sequenced (paired end), yielding a single (well two
fastq
files) output. This is what was subsequently assembled into the transcriptome. I presume--in this case--this happens to be a single sample, going by your definition? Coming to the second part of your response: the transcript sequence does seem to be unique. I looked at the<samplename>.fasta_index/duplicate_clusters.tsv
file produced bysalmon 1.3.0
, and I could not find this sequence listed there. I presume this means it isn't a duplicate, and thus wasn't discarded from the quantification process. If the computational chimera outcome is unlikely, the multiple samples idea inapplicable, and this sequence is not a duplicate, what is going on? If @lieven.sterck and yourself are interested, I can share the sequence with you, so you could take a look yourselves.Also, does anything from your recent paper here (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02151-8) apply in this context? (I haven't read it yet. I actually just stumbled upon it.)
Hi @Dunois,
Regarding the selective-alignment paper, that is a general improvement to the mapping methodology, so the main application of that paper to this context is just to make sure you're using the most recent version of salmon that implements these improvements (which it seems you are doing; thanks!). However, I'd venture to guess it's probably not the main issue here. Depending on the technique used for assembly, it's possible that the assembler may have produced a number of transcripts that are (very?) similar to each other (though not quite identical). Quantification generally seeks a high-likelihood (and parsimonious) description of the observed sequencing reads. Thus, if it can account for all of the reads in a high-likelihood manner (i.e. with high-quality alignments and without large deviations in coverage) using a subset of the transcripts, then it is likely to do so. In many cases, there are subsets of transcripts whose abundances can be altered (e.g. swapped) while retaining a basically identical likelihood. These are transcripts with a large amount of uncertainty in inference (you can read more about this in e.g. our terminus paper : https://academic.oup.com/bioinformatics/article/36/Supplement_1/i102/5870485). So, the other thing you could consider doing is to ask salmon to produce Gibbs samples (or bootstrap replicates), and then investigate if the uncertainty in this transcript is high.
If you'd like to share the data and the assembled transcriptome, I'd be happy to take a look — though I might not be able to get to it immediately.
Thanks for your input here @Rob (also, congratulations on the paper). Thank you for the link to the Terminus paper also. That looks like an interesting read. (And maybe worth trying out here in my case? Where is the
conda
version?)Thing is, the transcriptome I'm working with is a concatenated one that was constructed from the outputs of three assemblers (
Trinity
,TransAbyss
, andrnaSpades
). I did cluster away the redundant sequences in the concatenated assembly, but I decided to perform the quantification against the non-deduplicated assembly. My (in hindsight misleading) assumption was that reads would get falsely mapped to transcripts that weren't assembled from them if I were to use the redundancy-reduced version. In light of what you've said here, I am inclined to believe that this is the cause behind what I've observed. I will check what the read support looks like when the reads are mapped against the redundancy-reduced assembly; I'll also take a look at what things look like w.r.t the output of the specific assembler that produced this transcript (Trinity
in this case). I'll also rerunsalmon
with the Gibbs option enabled. Any suggestion for how many samples to draw?All this said, it would be nice if you could take a look at this down the road if things don't pan out as hoped. I can send you a preliminary email tomorrow to touch base with you.
(Note: it is the redundancy-reduced assembly that I used for the gene-finding exercise.)