Hi,
I am new to bioinformatics and am trying to perform differential expression analysis on some mouse RNA-seq data. We performed Tru-Seq Strand Specific Large Insert RNA Sequencing - High Coverage (50M pairs) on the samples. I am now trying to pseudoalign the reads to the mouse transcriptome using kallisto. I ran bam2fq to obtain two FASTQ files, and also generated mouse reference transcriptome indexes from both Ensembl (Mus_musculus.GRCm38.cdna.all.fa) and the UCSC Genome Browser (refMrna.fa.gz).
I ran kallisto using the following command:

    kallisto quant -i index -o output pairA1.fastq pairA2.fastq

For all the samples, the resulting run_info.json output looks similar to the example below:

    "n_targets": 42184,
    "n_bootstraps": 0,
    "n_processed": 73044298,
    "n_pseudoaligned": 33281349,
    "n_unique": 19777682,
    "p_pseudoaligned": 45.6,
    "p_unique": 27.1,
    "kallisto_version": "0.45.0",
    "index_version": 10,
I would really appreciate any help in troubleshooting this low pseudoalignment rate (about 46%). Is it an issue with the data quality, or should I be running kallisto with additional arguments (strand-specific, etc.)?
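To put the problem in numbers, the percentages reported in run_info.json can be reproduced from the read counts. This is only a minimal Python sketch: the counts are copied from the output quoted above, and the variable and function names are illustrative, not part of kallisto.

```python
# Read counts copied from the run_info.json excerpt quoted in this post.
n_processed = 73_044_298      # read pairs given to kallisto
n_pseudoaligned = 33_281_349  # pairs that pseudoaligned to at least one transcript
n_unique = 19_777_682         # pairs that pseudoaligned to exactly one transcript


def percent(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


p_pseudoaligned = percent(n_pseudoaligned, n_processed)
p_unique = percent(n_unique, n_processed)

# Pairs that did not pseudoalign to any transcript in the index.
n_unaligned = n_processed - n_pseudoaligned

print(f"pseudoaligned: {p_pseudoaligned}%")
print(f"unique:        {p_unique}%")
print(f"not aligned:   {n_unaligned} pairs")
```

So more than half of the read pairs are not assigned to any transcript, which is what I would like to understand.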
Thank you very much for your help and please let me know if I can provide any additional information.
First, I would check for rRNA contamination; there are several threads here discussing methods to do so (e.g. "How to screen for rRNA and gDNA contamination in RNA-seq data?"). RSeQC can also give some useful diagnostics, but you will have to map to the genome to use it.
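On the strand-specific question: kallisto quant accepts `--fr-stranded` and `--rf-stranded`. Below is a small Python sketch that only builds the command line for each setting, so you can run the same sample both ways and compare the resulting run_info.json files. The helper name `kallisto_quant_cmd` and the output directory names are made up for illustration; the index and FASTQ names are the ones from your post. I am assuming a dUTP-based stranded TruSeq library, for which `--rf-stranded` is usually the matching flag, but please verify that against your library prep documentation.

```python
import shlex


def kallisto_quant_cmd(index, out_dir, reads1, reads2, strand=None):
    """Build a paired-end `kallisto quant` argument list.

    strand: None (unstranded), "rf" (--rf-stranded) or "fr" (--fr-stranded).
    """
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir]
    if strand == "rf":
        cmd.append("--rf-stranded")
    elif strand == "fr":
        cmd.append("--fr-stranded")
    elif strand is not None:
        raise ValueError(f"unknown strand setting: {strand!r}")
    cmd += [reads1, reads2]
    return cmd


# One command per strandedness setting, each with its own (hypothetical) output directory.
commands = {
    label: kallisto_quant_cmd("index", f"output_{label}",
                              "pairA1.fastq", "pairA2.fastq", strand=strand)
    for label, strand in [("unstranded", None), ("rf", "rf"), ("fr", "fr")]
}

for label, cmd in commands.items():
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(label, "->", shlex.join(cmd))
```

If one of the stranded runs gives a much lower pseudoalignment rate than the other, that is a good hint about the orientation of the library.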
What are the other possible causes of a low pseudoalignment rate if there is no or minimal contamination and the strandedness option has been used correctly?
Can you provide any follow-up, msubramanian1?