Hi,
I'm trying to do a typical transcriptomics analysis on mRNA samples from several Saccharomyces cerevisiae CEN.PK113-7D strains. By typical I mean:
1) Cleaning raw reads: trimming of adapters and removal of low quality reads. 2) Mapping reads against reference genome/transcriptome and quantification of reads 3) Performing differential gene expression (DGE) analysis from quantified reads in the different samples/conditions 4) Function or pathway enrichment from DGE results
The sequencing company did step 1 already, so my starting point would be to map the "clean" reads into a reference genome/transcriptome. I don't know if I'm overcomplicating everything or if my views are not entirely correct, but I'm already stuck here...
I was initially planning to use salmon to map the reads against the transcriptome. However, I'm struggling to find a good transcriptome for yeast. I can see that in most complex eukaryotes, like human or mouse, if I download the cDNA transcriptome from ENSEMBL, as I would expect I can find several mRNA transcripts belonging to the same gene (alternative splicing), the transcripts contain 5' and 3'UTR regions, etc. In the case of yeast, however, when I download the transcriptome there's just a single transcript per gene (alternative splicing is completely ignored even though it's been demonstrated for several genes) and all the transcripts miss UTR regions, i.e, literally all transcripts start with ATG and end with a stop codon, which is completely wrong. It looks like the transcriptome has been built from computational predictions based on the genome sequence rather than from mapping sequenced cDNA into the genome.
I thought that these issues could be specific to my strain (CEN.PK113-7D, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA398797/). but that's also the case for the S. cerevisiae reference genome! (https://ftp.ensembl.org/pub/release-108/fasta/saccharomyces_cerevisiae/cdna/). Am I missing out something, or there is no "experimental" transcriptome for baker's yeast uploaded to the most common databases? There's some 5' UTR and 3' UTR fasta files on SGD (http://sgd-archive.yeastgenome.org/sequence/S288C_reference/), but they belong to the reference strain S288C. Also, I don't know how correct it would be to just join those UTR regions to transcripts, as still alternative splicing would not be taken into account. For my strain I found this record http://genomebrowser.uams.edu/cenpk1137/ that has the CEN.PK113-7D genome sequence plus the genome annotation in GFF3 or BED that contains info about all exons, CDS, and UTR regions. However, I'm not sure how correct it would be to build a transcriptome from the genome + GFF3 annotation, plus I'm a bioinformatics newbie and don't know how to do it.
So basically, if I use the transcriptome for my strain that I can get from the most common databases, reads that cover only in the UTR region would be ignored, as well as probably some reads derived from splicing variants not taken into account in the single transcript annotation. Not sure if this is something standard, or ideally you should count also all those reads.
As an alternative to this situation, I'm considering mapping against the CEN.PK113-7D genome. There I would use hisat2 for the mapping and featureCounts to count reads that have mapped on exon regions. Again, I don't know if it's a common practice but then reads mapping into potential UTR regions would be ignored?
Could anyone enlighten me? Maybe I'm overthinking about all possible details that are not really relevant in the analysis...
Thanks!!
I'm not gonna lie your question went way over my head but I always just use these prebuilt indices from kallisto (with kallisto, not salmon, however) and they work great for me. Do these have the same problem as those from ensembl?
https://github.com/pachterlab/kallisto-transcriptome-indices/releases
Thanks for your reply. It seems that they are built from ensembl, so they have the same issue. Transcripts miss out UTR regions and do not seem to consider alternative splicing variants.
Hey yeast_rnaseq
did you figure out a solution to your problem - I have a very related problem!
--
To partially answer your question: You can map your reads even in absence of the full transcriptome just using the cDNA file. What's stopping you from doing this and how do your results look like?