Hello,
I am using raw paired-end RNA-seq Fastq files from a public database for gene expression quantification. After aligning the sequences using STAR, I noticed a couple of issues with the generated BAM file:
The BAM file is very small, only about 20 MB (too small for humans).
The quality sequences in the BAM file are all displayed as "?".
When I proceed to quantify gene expression using RSEM, the vast majority of genes have an expression level of zero, with only a few hundred genes showing non-zero expression.
Here is the workflow I followed:
Pre-processing the data with trim_galore.
Aligning the sequences using STAR. The code used is as follows:
STAR --runThreadN 5 --genomeDir STAR_genome \
--readFilesCommand zcat \
--readFilesIn clean_data/Sample1_1_val_1.fq.gz \
clean_data/Sample1_2_val_2.fq.gz \
--outFileNamePrefix align_out/sample1_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 5 \
--quantMode TranscriptomeSAM GeneCounts
Quantifying gene expression with RSEM.
My questions are:
Why is the BAM file so small, and why are the quality sequences displayed as "?"? How can I resolve these issues to ensure accurate gene expression quantification?
Thank you very much for your help!
Run fastqc on the fastq files.
I have run fastqc on the Fastq file and the results show no big problem, as shown in the picture below.
![Fastqc output2](/media/images/06159124-4962-4330-abe1-297406ea)
You could possibly also check if the output log of STAR might have mentioned anything of note?
Thanks, I checked the "ReadsPerGene.out.tab" file and found that the majority of reads were classified as "unmapped". Is this because my reference genome was used incorrectly?![ReadsPerGene.out.tab](/media/images/11d1d12a-aaac-491a-bb10-ec0f7ac1)
That is a possibility. If this is public/published data you can try to get the reference the paper used and see if that improves things.
OK, I will try it, thank you very much!!