I am looking for sex-biased genes in different species.
RNA-seq data: I am mapping 75 bp SE reads via RSEM with bowtie2 (very-sensitive) onto reference transcriptomes. I have these for 4 species. The data come from a very specific tissue.
For each species, the reference transcriptome was built from different PE reads coming from a mix of different stages and whole-body animals, so it should represent a fairly large proportion of all the genes of the animal. I used Trinity to (re)assemble them, followed by TransDecoder.LongOrfs and TransDecoder.Predict to retain only the protein-coding genes, in which I am mostly interested. I don't have a reference genome, as these are not model species.
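The "longest mRNA per gene" reduction described in this post can be sketched as below. This is only an illustration, not necessarily how the reference was actually built: it assumes Trinity-style transcript IDs in which a trailing `_i<N>` marks the isoform, and the helper names and toy lengths are hypothetical.

```python
import re
from typing import Dict, Tuple

# Assumption: Trinity-style IDs, where the trailing "_i<N>" marks the isoform
# and everything before it identifies the gene.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id: str) -> str:
    """Strip the isoform suffix from a transcript ID to get the gene ID."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def longest_isoform_per_gene(lengths: Dict[str, int]) -> Dict[str, str]:
    """Map each gene ID to its longest transcript (ties keep the first seen)."""
    best: Dict[str, Tuple[str, int]] = {}
    for tid, length in lengths.items():
        gid = gene_id(tid)
        if gid not in best or length > best[gid][1]:
            best[gid] = (tid, length)
    return {gid: tid for gid, (tid, _) in best.items()}


# Hypothetical toy input: transcript ID -> sequence length in bp.
example_lengths = {
    "TRINITY_DN10_c0_g1_i1": 850,
    "TRINITY_DN10_c0_g1_i2": 1900,
    "TRINITY_DN11_c0_g1_i1": 1200,
}
kept = longest_isoform_per_gene(example_lengths)
```

In practice the lengths would come from the FASTA of transcripts that TransDecoder flagged as coding; reads from the discarded isoforms and non-coding contigs no longer have a target, which is one reason the mapping rate drops on the reduced reference.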
I mapped the 75 bp SE reads (sex-specific and tissue-specific) on:
(1) The raw Trinity output: 70% of the reads map at least once on the reference. They cover around 60% of the reference with a minimum coverage of 5X. BUSCO completeness assessment of these raw Trinity outputs gives ~95% of BUSCOs (arthropoda database 9) for each species.
(2) Only the longest mRNA with a coding sequence for each gene (~20,000 per reference): 30-40% of the reads map at least once, with a minimum coverage of 5X. They cover around 70% of that reference. BUSCO completeness assessment of these mRNA references gives ~85% of BUSCOs (arthropoda database 9) for each species.
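To make the two statistics above concrete, here is a minimal sketch of a mapping rate and a "fraction of the reference covered at 5X or more". It assumes the coverage figure is computed per base over all reference sequences (the post does not say whether it is per base or per transcript); the function names and toy numbers are hypothetical.

```python
from typing import Dict, List


def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Fraction of reads that align at least once to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def breadth_of_coverage(depths: Dict[str, List[int]], min_depth: int = 5) -> float:
    """Fraction of reference bases whose read depth is >= min_depth.

    `depths` maps each reference sequence to its per-base depth values.
    """
    total_bases = sum(len(d) for d in depths.values())
    if total_bases == 0:
        raise ValueError("no reference bases given")
    covered = sum(1 for d in depths.values() for x in d if x >= min_depth)
    return covered / total_bases


# Hypothetical toy data: two short "transcripts" with per-base depths.
toy_depths = {
    "t1": [0, 3, 5, 8, 12],
    "t2": [6, 6, 1, 0, 0],
}
rate = mapping_rate(21_000_000, 30_000_000)
breadth = breadth_of_coverage(toy_depths, min_depth=5)
```

Note that the two numbers answer different questions: the mapping rate describes the reads, while the breadth describes the reference, so a smaller reference can show a lower mapping rate and a higher breadth at the same time.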
Polymorphism might lower the mapping rate, because the populations used for the references and for the SE reads are different.
Incompleteness of the reference might also lower the rate, although the BUSCO scores are fairly good.
Other info:
The SE sequencing was done with polyA selector primers “smart-seq cds primer ii a”.
The PE sequencing for the de novo references also used a polyA selector primer. Thus, I shouldn't have ribosomal RNA (rRNA).
I still have transposable elements in my mRNA references (because they have coding sequences).
I am sure I haven’t mixed up species. I map SE reads on the corresponding reference.
Each SE read library has at least 30M reads.
By mRNA I mean the whole transcript: UTRs + CDS.
Species are insects
Should I be concerned about this low mapping rate (30-40%) on the mRNA references?
I would like to use the mapping to reference (2), which contains only the mRNAs, because I only care about sequences that I can identify.
The ultimate goal is to perform a subsequent differential expression analysis and extract candidate genes involved in the development of one trait present in males of certain species but not others.
1) The closest species with a genome is quite distant (~100 Myr). Running RSEM with SE reads of the other species on it gives a mapping rate of ~1%. This species with a genome is actually part of my experimental design, but because of this low between-species alignment rate, I decided to build de novo transcriptomes for each of the other species and look for orthologues across all these references with OrthoFinder.
2) I am not sure a new guided de novo assembly with Cufflinks will help, because this closest genome is actually quite distant. Also, my Trinity assemblies are quite good: the raw Trinity outputs have an N50 from 1000 to 1900, depending on the species. After TransDecoder.LongOrfs and TransDecoder.Predict, I get ~20,000 genes, which is the same as the Official Gene Set (~20,000 genes) of this closest species with a genome. If Oases gives more fragmented transcripts, as you say, it might not help either.
My organisms are non-model systems, and they don't have sequences in any database.
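For readers unfamiliar with the N50 figure quoted above: it is the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly length. A minimal sketch with hypothetical toy lengths:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-negative lengths; kept for safety


# Hypothetical toy assembly: total 6000 bp, so the threshold is 3000 bp.
toy_contigs = [2000, 1500, 1000, 500, 500, 500]
toy_n50 = n50(toy_contigs)  # 2000 + 1500 = 3500 >= 3000, so N50 is 1500
```

N50 describes contiguity only; it says nothing about how many of the reads in a given tissue sample will find a target in the assembly.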
Is a 30-40% mapping rate, with RSEM on solely mRNAs with a coding sequence, too low to continue? You said that, in any case, higher mapping rates don't mean better sequences.
A genome mapping rate of 1% is low, but I would usually expect that kind of figure to be reported from a genome aligner like TopHat2/STAR/HISAT. RSEM with a Bowtie2 alignment is what I would usually expect for a transcriptome.
Even with a transcriptome alignment, I think you could just use Bowtie2 for your alignment. It seems to me like that alone may noticeably increase your alignment rate. To be honest, I have either not gotten great results with RSEM or I found the run-time to be unacceptable. So, I would actually prefer eXpress (from the Bowtie2 alignment, although you can also start with FASTQ files) over RSEM for your transcriptome quantification.
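As a rough sketch of the standalone Bowtie2 step suggested here, the snippet below only builds the command line and does not run it. The index prefix, file names, and thread count are hypothetical; `--very-sensitive` mirrors the preset from the question, and `-a` (report all alignments) is an assumption on my part, chosen so that a downstream quantifier can see multi-mapping reads.

```python
import shlex
from typing import List


def bowtie2_se_command(index_prefix: str, fastq: str, sam_out: str,
                       threads: int = 4) -> List[str]:
    """Build a Bowtie2 command for single-end reads against a transcriptome index."""
    return [
        "bowtie2",
        "--very-sensitive",   # same preset as in the question
        "-a",                 # assumption: keep all alignments for multi-mappers
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # index built beforehand with bowtie2-build
        "-U", fastq,          # unpaired (SE) reads
        "-S", sam_out,        # SAM output file
    ]


# Hypothetical file names, for illustration only.
cmd = bowtie2_se_command("species1_mRNA", "male_rep1.fastq.gz", "male_rep1.sam")
cmd_string = shlex.join(cmd)  # printable form; requires Python 3.8+
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Bowtie2 prints an overall alignment rate to stderr, which gives a quick check of whether the rate differs from what RSEM reports.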
On a side note: Are there actually circumstances, where it would make sense to use the transdecoder-predicted ORFs for quantification, instead of the assembled transcripts? I'm asking, because I noticed quite a few concatenated transcripts, especially for plastids. Transdecoder was able to disentangle these genes, as far as BLAST could tell. However, the mapping rate was quite a bit lower (using Salmon's quasi alignment: 99 % for assembled transcripts, 67 % for ORFs only).