What I'm trying to do is really straightforward - align ribosome profiling reads (only mRNA fragments protected from RNase degradation are sequenced) to the mouse transcriptome. I've completed this using the following:
I first ran my fastq file through FastQC. I noticed there was a lot of adapter contamination, so I ran my fastq through cutadapt/trim_galore. The output file appeared free of Illumina adapters.
I then aligned to the transcriptome using Hisat2 w/ genome/transcriptome I downloaded from their website (GRCm38 genome_snp_tran files [I think this is what I want in order to align to the transcriptome?]). My command was as follows:
hisat2 -x [genome/transcriptome index] -U [single end read file].fastq -S [output file name].sam
samtools for SAM to sorted BAM conversion
Gene abundance using Cufflinks w/ command:
cufflinks -G [transcriptome annotation].gtf [input sorted bam]
Ultimately, the results look okay. I did get a lot of unmapped reads (~30%) from the Hisat2 alignment. This may have to do with the fact that the mice I'm using are not C57BL/6. I don't know if the hisat2 genome/transcriptome build I'm using accounts for all SNPs.
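To double-check the unmapped fraction independently of the aligner's summary, the SAM flags can be tallied directly. The following is a minimal sketch, not part of the original pipeline: the function name `mapping_summary` is hypothetical, and it assumes the SAM file is read as plain text lines. Secondary and supplementary records are skipped so that each read is counted once.

```python
from collections import Counter

# Standard SAM flag bits
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_summary(sam_lines):
    """Return (mapped, unmapped, unmapped_fraction) for single-end SAM lines.

    Hypothetical helper: counts one record per read by ignoring
    secondary and supplementary alignments.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@')
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        counts["unmapped" if flag & UNMAPPED else "mapped"] += 1
    total = counts["mapped"] + counts["unmapped"]
    fraction = counts["unmapped"] / total if total else 0.0
    return counts["mapped"], counts["unmapped"], fraction
```

Usage would look something like `with open("aligned.sam") as fh: print(mapping_summary(fh))`, where `aligned.sam` is a placeholder for the hisat2 output file.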
Any suggestions?
That's a reasonable enough plan. I've done something similar but used STAR instead, which produces somewhat nicer results since it can soft-clip the alignments. I'm not the world's biggest fan of cufflinks, but with a well-annotated mouse genome it's probably OK. I should note that, if possible, it's really nice to combine a standard mRNA-seq sample or two with your ribosome profiling, mostly because it makes it easier to determine which transcripts are actually being expressed to begin with.
I actually DO have the mRNA sequencing data for this experiment, and not just ribosome footprinting. I asked a question here on how I can compare the two, given that my mRNA-seq data is reported in raw read counts and the ribosome footprinting is reported in FPKM (which is how cufflinks outputs data). But you made a great argument: one is not better than the other, and the mRNA-seq can simply be used to corroborate the RF data.
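On the counts-versus-FPKM mismatch, one rough way to put the two datasets on the same scale is to convert the raw counts to FPKM with the textbook formula (count × 10^9 / (gene length in bp × total counted fragments)). This is only a sketch under stated assumptions: `counts_to_fpkm` is a hypothetical helper, the gene lengths are assumed to be supplied by you, and the total is taken as the sum of the counts table. Cufflinks uses its own model, so these values will not match its output exactly.

```python
def counts_to_fpkm(counts, lengths_bp):
    """Convert raw per-gene counts to FPKM.

    counts: dict mapping gene id -> raw read/fragment count
    lengths_bp: dict mapping gene id -> gene (or transcript) length in bp
    Assumption: the library size is the sum of all counts in `counts`.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no counted fragments; cannot normalise")
    # FPKM = fragments per kilobase of gene per million counted fragments
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }
```

For example, `counts_to_fpkm({"GeneA": 100, "GeneB": 300}, {"GeneA": 1000, "GeneB": 3000})` gives both (made-up) genes the same FPKM, since GeneB has three times the counts but is also three times as long.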
Sounds like a reasonable pipeline. That's indeed quite a lot of unmapped reads. Have you tried blasting some of them to find out where they are derived from?
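To get a handful of unmapped reads into a form you can paste into BLAST, something like the sketch below could work. It assumes the alignments are available as plain-text SAM lines (the hisat2 `.sam` output, before BAM conversion); `unmapped_to_fasta` and `max_reads` are hypothetical names, not part of any tool. In ribosome profiling libraries it may be worth checking whether the hits are rRNA fragments, which are a common contaminant.

```python
def unmapped_to_fasta(sam_lines, max_reads=100):
    """Return up to `max_reads` unmapped reads from SAM lines as FASTA text."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        # Flag bit 0x4 marks a read the aligner could not place
        if flag & 0x4 and seq != "*":
            records.append(f">{name}\n{seq}")
            if len(records) >= max_reads:
                break
    return "\n".join(records) + ("\n" if records else "")
```

A small sample (tens to a few hundred reads) is usually enough to see whether the unmapped reads fall into one obvious category.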
Hi, for the reads that didn't align, you can attempt de novo assembly using Trinity. It works well and runs in reasonable time. 30% is quite a chunk, but it could make sense if your model carries some genetic modification, for example an oncogene insert with a Cre modification, or a particular locus edited with CRISPR. In that scenario the affected genes express transcripts carrying vector backbone sequence (antibiotic selection markers, viral promoters, etc.), which fails to align. If that is your case, Trinity or any other de novo assembler can salvage the affected reads.
You may want to check the Mouse Genomes project for a reference genome for your specific strain if it's there. Might help with the unmapped reads.