My issue is with recent RNA-seq data we have. I've aligned my RNA-seq reads to the genome with STAR. The animal in question is a cnidarian. We have a control set of reps and a treated set of reps. Only the treated set is showing poor alignment rates. RNA extraction, enrichment and sequencing were performed on all samples at the same time and in the same run.
For the low-aligning samples, STAR reports that 60% of reads are not mapping due to being 'too short' - this seems to be characteristic of all the treatment reps. QC of reads seems fine. I've used minimal trimming with Trimmomatic as I don't want to remove a lot of valuable data. No head cropping or anything that could affect the alignment % was performed.
The same results are also produced using Salmon. So it doesn't seem to be an issue with software. I've noticed in the treatment samples the GC content is up by 2% compared to control samples.
I'm starting to suspect contamination. Or is there something else at play?
Thanks :-)
The "too short" refers to the alignment length rather than the read length (see here). It means that for 60% of the reads the best alignment covers too little of the read to pass STAR's filters, i.e. those reads do not match the reference well.
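To make the "too short" filter concrete, here is a rough sketch of the arithmetic. It assumes STAR's default value of 0.66 for `--outFilterMatchNminOverLread` and `--outFilterScoreMinOverLread`, and that for paired-end data the fraction is taken over the summed length of both mates. The function name and the 2 x 75 bp example are illustrative only.

```python
def min_aligned_bases(read_length, paired=True, min_over_lread=0.66):
    """Approximate number of bases that must align for STAR to keep a read.

    min_over_lread is assumed to be the STAR default (0.66); check the
    parameters actually used in your own run.
    """
    # For paired-end reads the threshold applies to both mates combined.
    total_length = read_length * 2 if paired else read_length
    return min_over_lread * total_length


if __name__ == "__main__":
    # Illustrative read length only.
    print(round(min_aligned_bases(75, paired=True)))
```

Under these assumptions a 2 x 75 bp pair needs roughly 99 of its 150 bases aligned, so a pair in which only one mate matches the reference would be reported as "too short".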
You can store the unmapped reads with
--outReadsUnmapped Fastx
and analyse these further. E.g. you can run FastQC and check the overrepresented sequences. Does the treatment affect the cells too drastically (fragmenting their RNA)? Could your samples have been mixed up in either the lab or the sequencing facility? Did you share the sequencing run with others?
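As a quick complement to FastQC, the most frequent sequences among the unmapped reads can be tallied directly. This is a minimal sketch, assuming plain 4-line FASTQ records (no wrapped sequences); the function name is made up for illustration. With `--outReadsUnmapped Fastx`, STAR writes the unmapped reads to `Unmapped.out.mate1` (and `Unmapped.out.mate2` for paired-end data) under the output prefix, so an open handle on one of those files could be passed in place of the in-memory example.

```python
import io
from collections import Counter
from itertools import islice


def top_sequences(fastq_lines, n=10):
    """Return the n most common read sequences from FASTQ-formatted lines.

    Assumes strict 4-line records, where the sequence is the second line.
    """
    sequences = (line.strip() for line in islice(fastq_lines, 1, None, 4))
    return Counter(sequences).most_common(n)


if __name__ == "__main__":
    # Tiny in-memory example; in practice pass an open file handle instead.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n")
    for sequence, count in top_sequences(demo, n=5):
        print(count, sequence)
```

The top hits can then be searched against a sequence database to see where they come from.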
Hey Michael,
I've done so. I took a proportion of those unmapped reads as FASTQ and converted them to FASTA. It seems a lot are coming back as fungi and bacteria. As such, I'm going to assemble a de novo transcriptome from all those reads and annotate the entirety of the unmapped reads... OR is there an easier alternative to show what contamination each library has?
As for carrying forward with the reads that did align, and as a second opinion: would it be satisfactory to do DE analysis as long as a suitable normalisation is conducted between libraries (i.e. TMM / quartile)?
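For anyone wanting to repeat the subsampling step, a minimal sketch of taking a random fraction of FASTQ records and writing them as FASTA might look like the following. It assumes strict 4-line FASTQ records; the function name, the fraction and the seed are illustrative choices, not what was actually used here.

```python
import random


def sample_fastq_to_fasta(fastq_lines, fraction, seed=0):
    """Yield FASTA-formatted records for a random fraction of FASTQ records."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence = record[0], record[1]
            if rng.random() < fraction:
                # Swap the leading '@' of the FASTQ header for '>'.
                yield ">" + header[1:] + "\n" + sequence + "\n"
            record = []


if __name__ == "__main__":
    demo = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "IIII\n"]
    for fasta_record in sample_fastq_to_fasta(demo, fraction=1.0):
        print(fasta_record, end="")
```

Sampling each record independently keeps memory use constant, at the cost of the output size being only approximately `fraction` of the input.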
Thanks.
Best programs that can delineate contamination from .fastq?
You may try FastQ Screen: "FastQ Screen is a simple application which allows you to search a large sequence dataset against a panel of different genomes to determine from where the sequences in your data originate."
If you know your main contamination source, you can use bbsplit from the bbmap suite to separate the contamination.
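A sketch of how a bbsplit call for paired-end reads could be assembled is shown below. The `in1`/`in2`, `ref`, `basename` and `outu1`/`outu2` parameters are written as I recall them from the BBMap documentation, so please check them against `bbsplit.sh --help` before running anything. All file names and the helper function are hypothetical.

```python
import shlex


def bbsplit_command(reads1, reads2, contaminant_refs, out_prefix):
    """Build a bbsplit.sh command as an argument list (nothing is executed)."""
    return [
        "bbsplit.sh",
        f"in1={reads1}",
        f"in2={reads2}",
        # Comma-separated FASTA files of the suspected contaminants.
        "ref=" + ",".join(contaminant_refs),
        # '%' is replaced by the reference name, '#' by the mate number.
        f"basename={out_prefix}_%_#.fq",
        # Reads matching none of the references, i.e. the cleaned reads.
        f"outu1={out_prefix}_clean_1.fq",
        f"outu2={out_prefix}_clean_2.fq",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = bbsplit_command(
        "treated_1.fq", "treated_2.fq", ["fungi.fa", "bacteria.fa"], "treated"
    )
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list passed to `subprocess.run` once the parameters have been verified.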
Before starting the DE analysis, I'd check the quality of the alignments with RSeQC (geneBody_coverage, read_distribution, ...) to see whether the 40% of reads that do hit the target are OK.
Cheers, Michael
I'm going back to see if sorting the .fastq files has any effect.
Please give some details. You say STAR complains about 'too short' reads. What are the read lengths? Read length has a notable influence on the mapping efficiency. See a recent post of mine that is slightly related to read length and mapping %.
75 bp PE reads, --sjdbOverhang set to read length - 1