First of all I will answer the question you asked on a comment, but it is the most important question here:
I need to do differential expression analysis so I need to assemble the transcriptome right?
You do not need to assemble the transcriptome to perform gene / transcript differential expression analysis. There is a very, very good genome assembly for mouse. Also, there is a very, very good annotation for the mouse genome - and consequently, there is a very good transcriptome for the mouse. You do not need to assemble, it is better to map / count using the existing reference genome / transcrptome. In addition, the reads from the project you linked are single-end and only 36bp long, they will not provide a good transcriptome assembly.
You can perform gene-level differential expression by first mapping the reads to the genome with STAR to get gene expression counts (as I have told you before here), and then using DESeq2 or edgeR to perform the differential expression statistical analysis.
You can perform transcript-level differential expression quantifying transcript counts with Salmon or kallisto, and performing the statistical analysis with sleuth.
Now, for your original question:
I did map one of the files to the mouse genome with STAR, obtaining gene counts:
Sox17 9368 81 9287
Mrpl15 4473 30 4443
Lypla1 3122 20 3102
Tcea1 397 6 391
One can use these counts to infer strandedness of the reads. I know the protocol is stranded, so the first column of counts is automatically excluded, as it represents unstranded counts (check STAR manual). It is easy to see then that the true counts are in the fourth column, which the STAR manual describes as:
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
This means the library prep kit produces what we call "reverse" stranded reads. The Trinity manual says the correct parameter to indicate this library type is --SS_lib_type R
, which is the opposite of what you are using in your call to the align_and_estimate_abundance.pl
script. So maybe if you use align_and_estimate_abundance.pl --SS_lib_type R
you will get good mapping rate.
But remember, you do not need to assemble the transcriptome.
Could you please share the commands used for both alignment as well for getting stat from SAM.
I just used gsnap for the SAM files by aligning trimmed mouse reads to the mouse genome (GRCm38.primary_assembly.genome.fa from GENCODE). I used gmap to build the database first and built intron index with the gencode.vM18.primary_assembly.annotation.gff3 file. I used this for the SAM stat:
For Trinity I did this (they were single-end, strand-specific forward reads):
And for the Trinity-GG.fasta I did this:
What is your library prep kit? Is it really stranded? Is it from Illumina? In general, stranded Illumina kits use dUTP method, with Trinity, you should use
--SS_lib_type R
for single-end reads, and--SS_lib_type RF
for paired-end reads.Are the reads you are trying to align the same used to assemble the transcriptome?
Are you working with Mus musculus? Do you really need to assemble a transcriptome?
Here are the reads' link which gives all the info needed: https://www.ncbi.nlm.nih.gov/bioproject/PRJDB5631. I need to do differential expression analysis so I need to assemble the transcriptome right? Also the merged BAM file was only like 10G while the SAM files in total were around 150G; is that normal?
Yea actually it is still not working after rerunning the sorting and merging; the BAM files are quite small still and I believe that is messing everything up
I see a similar issue here in this post: Sam To Bam - Loss Of Data Or Just Great Compression?. Should I just accept the Trinity results and go ahead with differential expression analysis? These also are cancer derived organoids if that makes a difference.