Question

Pipeline analysis used to analyze RNA-seq data

0

Entering edit mode

2.9 years ago

Peter ▴ 20

I am analyzing RNA-seq data for the first time and would like an opinion on my choices.

I used fastqc to assess the quality of samples and trim galore to pull out adapters

fastqc -o fastqc fastq/*fastq.gz
trim_galore SRR12816740.fastq.gz (for all my samples)

Then I aligned my samples using histat2 with the following code and got alignment rate between 80-90%:

#!/usr/bin/bash

SAMPLES="SRR12816740_trimmed SRR12816741_trimmed SRR12816742_trimmed SRR12816743_trimmed SRR12816744_trimmed"

for SAMPLE in $SAMPLES; of
hisat2 -p 5 --dta -x /home/referenceData/hisat2_index/GRCh38_index -S /home/mydata/aligned/${SAMPLE}.sam -U /home/trimmed/${SAMPLE}.fq.gz
done

Then I converted the .sam files to .bam files:

samtools view -S -b SRR12816743_trimmed.sam > SRR12816743_trimmed.bam
samtools sort SRR12816743_trimmed.bam -o SRR12816743_trimmed.sorted.bam
samtools index SRR12816743_trimmed.sorted.bam

(For all samples)

Finally, I used feuatureCounts to get the quantification:

featureCounts -T 5 -t exon -g gene_id -a Homo_sapiens.GRCh38.80.gtf -o SRR12816743_trimmed.txt SRR12816743_trimmed.sorted.bam

For the first sample I got this result and it doesn't look good to me:

Assigned 21174635
Unassigned_Unmapped 20419658
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 12896943
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 42374417
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 1354811

What's the best way to improve?

Thanks!

featureCounts histat2 Linux RNA-seq • 1.0k views

ADD COMMENT • link updated 2.9 years ago by swbarnes2 14k • written 2.9 years ago by Peter ▴ 20

score 0 · Answer 1 · 2022-01-20

You might want to run fastqc again after your trimming and QC steps to get a sense of how those steps have affected your read set. You might still have adaptor contamination despite the trimgalore run. There's also no indication that you trimmed for base quality, unless trimgalore has a default that I'm unaware of. Beyond that, maybe you can ask a more specific question.

This is a lot of ask without much supporting information, which probably why you didn't get a timely response on the question. Even explicitly stating that you're working on Human data would be helpful. I hope that you can find a solution to whatever problem you are having. Here is a very detailed guide to reference-based RNAseq using Hisat2:

https://github.com/CBC-UCONN/RNA-seq-with-reference-genome-and-annotation

score 0 · Answer 2 · 2022-01-20

The most common problem at this stage is having a gtf that doesn't match the reference. The fact that you have lots of reads which do match to genes suggest that you at least don't have a massive chromosome naming problem. But it's worth checking to make sure that your gtf and genome do come from the same place and the same version.