I am analyzing RNA-seq data for the first time and would like an opinion on my choices.
I used fastqc to assess the quality of samples and trim galore to pull out adapters
fastqc -o fastqc fastq/*fastq.gz
trim_galore SRR12816740.fastq.gz (for all my samples)
Then I aligned my samples using histat2 with the following code and got alignment rate between 80-90%:
#!/usr/bin/bash
SAMPLES="SRR12816740_trimmed SRR12816741_trimmed SRR12816742_trimmed SRR12816743_trimmed SRR12816744_trimmed"
for SAMPLE in $SAMPLES; of
hisat2 -p 5 --dta -x /home/referenceData/hisat2_index/GRCh38_index -S /home/mydata/aligned/${SAMPLE}.sam -U /home/trimmed/${SAMPLE}.fq.gz
done
Then I converted the .sam files to .bam files:
samtools view -S -b SRR12816743_trimmed.sam > SRR12816743_trimmed.bam
samtools sort SRR12816743_trimmed.bam -o SRR12816743_trimmed.sorted.bam
samtools index SRR12816743_trimmed.sorted.bam
(For all samples)
Finally, I used feuatureCounts to get the quantification:
featureCounts -T 5 -t exon -g gene_id -a Homo_sapiens.GRCh38.80.gtf -o SRR12816743_trimmed.txt SRR12816743_trimmed.sorted.bam
For the first sample I got this result and it doesn't look good to me:
Assigned 21174635
Unassigned_Unmapped 20419658
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 12896943
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 42374417
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 1354811
What's the best way to improve?
Thanks!