Hello,
I am using raw paired-end RNA-seq Fastq files from a public database for gene expression quantification. After aligning the sequences using STAR, I noticed a couple of issues with the generated BAM file:
The BAM file is very small, only about 20 MB (too small for humans). The quality sequences in the BAM file are all displayed as "?". When I proceed to quantify gene expression using RSEM, the vast majority of genes have an expression level of zero, with only a few hundred genes showing non-zero expression.
Here is the workflow I followed:
Pre-processing the data with trim_galore.
Aligning the sequences using STAR. The code used is as follows:
STAR --runThreadN 5 --genomeDir STAR_genome \
--readFilesCommand zcat \
--readFilesIn clean_data/Sample1_1_val_1.fq.gz \
clean_data/Sample1_2_val_2.fq.gz \
--outFileNamePrefix align_out/sample1_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 5 \
--quantMode TranscriptomeSAM GeneCounts
Quantifying gene expression with RSEM.
My questions are:
Why is the BAM file so small, and why are the quality sequences displayed as "?"? How can I resolve these issues to ensure accurate gene expression quantification?
Thank you very much for your help!
Run fastqc on the fastq files.
I have run fastqc on the Fastq file and the results show no big problem, as shown in the picture below.
You could possibly also check if the output log of STAR might have mentioned anything of note?
Thanks, I checked the "ReadsPerGene.out.tab" file and found that the majority of reads were classified as "unmapped". Is this because my reference genome was used incorrectly?
That is a possibility. If this is public/published data you can try to get the reference the paper used and see if that improves things.
OK, I will try it, thank you very much!!