I have an RNA-seq assembly and the corresponding reads form a non-model organism. Those reads map (using RSEM with bowtie via a Trinity wrapper) well against the assembly generally and transcripts of special interest.
However, the subset of transcripts in question that I am very interested in are somewhat fragmented, so I have tried to remove those from the assembly and instead put in PCR clones of the CDS:s from another individual (4-5% difference between individuals on the nucleotide level). However, when I do this, the mapping becomes substantially worse, perhaps 1/100th or 1/1000th of the FPKM compared with before.
I have tested and refuted the following hypotheses:
mapping stringency: tested with a previous version of the software that was more lenient and also tried to reduce seed length to 20 from 25 (both with 2 mismatches).
3'-bias: inspected alignments in IGV, found some 3'-bias, but nowhere near enough to be an important factor.
Paired-end issues: mapped just the left reads as singles.
Server issues: re-ran it on another server, no change.
Made sure reads were sorted.
Command issues: re-mapped against the original assembly with the contigs but without the clones and gotten back the same substantial result, so the command used is not erroneous.
Hidden characters: inspected with vim, found no hidden characters that could explain it.
I initially suspected that the large individual variation would be the clear cause, but even when I use more lenient mapping parameters (allowing 2 mismatches for a 20 nucleotide seed), this does not seem to help.
Any ideas?
How did you clone anything if you don't have a reference transcriptome or genome to work off of?
Put together the fragments (identified by BLAST with a homolog from another species as a query) into a construct, make PCR primers from construct, run PCR, sequence results.
What is causing you to think they're fragmented, the quality/length of the BLAST hits? What is the query coverage for the hits?
Can you clarify what you mean by "put together the fragments"?
Have you aligned the contigs from trinity with the cloned sequences? How well do your reads map to that homolog in the related species?
The bowtie settings in trinity do not allow for gaps (as RSEM doesn't work with gapped alignments). Have you tried using bowtie2 with the
--very-sensitive
option and using eXpress instead?You could also try primer walking using just your trinity sequence(s) to try and get more complete transcript sequences.
You take a reference sequence from a nearby species. You BLAST this sequence against the assembly. You pick up the 2-3 fragments that together match the vast majority of query sequence in a convincing way (with support from e. g. domain searchers and phylogeny). You make a construct from these 2-3 fragments and from this construct you make PCR primers. Then you run the PCR, sequence the results, and you get a clone. By putting together the fragments, I mean create a putative transcriptome construct.
So we know the clones we have are from this sequence of interest. The sequence identity between the contigs and the clone is 94-96% or so on the nucleotide level.
I have not mapped reads against the homolog, because if there is issue with individual variation, I suspect that it will not map that well against the homolog.
I will try your suggestions. Thanks.
Sorry, was a little confused on how you got the material to put into your construct to begin with.
This is a problem I've run into with trinity before, where gaps in coverage might lead to trinity seeing several "genes" when there is only one. You may also have incomplete/improperly assembled transcripts.
When you built the construct, did you only use the regions of the fragments aligned by BLAST, or did you use the whole fragments. If you used whole fragments you may want to run a MSA to see where the regions of the fragment not mapped by BLAST end up.
I'd try mapping your construct again without the gap penalty align_and_estimate_abundance.pl puts in place. With more tolerant mismatch and gap settings you could probably get more reads to map to the reference.
Unless the homolog species is incredibly close phylogenetically, there could be considerable differences in the sequence of the transcript in your species. The idea of annotating with BLAST for de novo assemblies isn't to try and define sequences, but to get a rough idea of what the identity/function of a given transcript is. So when a part of a de novo assembled contig maps to a reference sequence, you can roughly infer that the function/identity of that contig is the function/identity of the reference hit. The more sequence homology between the two, the stronger that putative annotation is. The problem is that you GeneX in species a and GeneX in species B may be paralogs with some degree of conservation. They look similar enough to match by BLAST but they may have very different functions and sequences. Add in improper assembly (incomplete/incorrect/fragmented) and it turns into a headache.
I think primer walking is likely to be the best option for getting complete transcript sequences. You may want to run each round of primer walking out on a gel and extract each band for sequencing. This would allow you to collect isoform sequences.
Well, at the moment we know the clone is a true CDS and we also have functional studies supporting homology above and beyond sequence similarity for some of the sequences of interests. Thank you for your efforts, but I do not think this exchange relates to my mapping question anymore.
A couple of questions that might clarify:
Have you visualized the read alignment to the original assembly transcript fragments, and compared it to read alignment of the CDS? If so, what's different?
Did you rebuild the index after replacing the fragments with CDSs (had to ask; it's a common mistake)?
I do not have the RSEM mapping .bam at the moment (removed it due to storage issues), but I quickly ran a everything-default bwa mapping with the assembly reads against just the clones. This showed a substantial amount of mapping (in accordance with expectations), but with lots of SNPs across reads. So this would suggest that individual variation itself is not a major issue, but perhaps only individual variation in concert with the RSEM mapping parameters set by the Trinity wrapper. But I once set those to two mismatches per 20 nucleotide seed, and 10% should have mapped them, but perhaps it is due to backtracking.
Yes, the file that contains clones but not the assembly contigs got a new name, because I concatenated a cleaned assembly with the clones to a new file.
So right now I am thinking about the backtrack limit in bowtie
--maxbts
. Maybe this setting makes it so that many valid-n 2
alignments are missed. This was run with the default125
. Also looking at--pairtries
, for which the default is100
. Perhaps it is worth to try-y
/--tryhard
. Even though it might be slower, it might be rewarding.