I'm working on a project and could use some advice in terms of best practices and tool suggestions! (I'm new to this!)
The Setup
I have:
- 2 technical replicates of RNA seq data for ~50 strains of a pathogenic fungus which has a highly variable and highly repetitive genome.
- 5 genomes in FastA format
- 1 published genome with FastA and GFF3
- 1 transcriptome (has 1 FastA file, and 1 GFF3 file) *more on this shortly
Progress thus far
At first I had aligned my reads to the transcriptome (was published from a former lab member) and found that on average only 75% of reads were aligning though the range was from 25% through 85% (using Tophat). Even for an organism with high genetic variability this seemed strange so I set up a local BLAST database with the FastA genomes and queried using the originally unaligned reads (for all 50 strains) and found that a significant amount of reads were matching with high specificity to the 5 genomes in BLAST. I came to the conclusion that the transcriptome was lacking many (~30%) genes.
At this point I went back to the beginning and aligned each of the 50 strains to each of the 5 genomes (Tophat, creating 250 files), took those reads and created assembled transcript files (Cufflinks), and lastly merged the transcript files (per genome) into 5 transcriptome assemblies (CuffMerge).
At this point I began running CuffDiff to analyze differential gene expression. The original 50 read files were being analyzed 5 times, once for each new transcriptome.
My mentor suggested that it would be better off to kill the process merge all of the 5 new transcriptomes together along with the original transcriptome that was lacking information. I initially looked at CuffMerge again and feed it a list of my 5 transcriptome GTF files along with the original transcriptomes GTF (I'd have to convert GFF3 -> GTF) to make one master reference.
Question
Does this seem OK? I would greatly appreciate any advice that anyone has to offer! Thanks!
Yeah I think that's the point of CuffMerge, you can make one big transcriptome and some transcripts will not be present in some samples, but they all can align and quantify to the same reference. Then run it through RSEM to get TPM at each transcript for each sample. Later you can see which transcripts are present in which samples, and by how much.