BAM File Quality Sequences Are "?" and File Size is Very Small After STAR Alignment
1
0
Entering edit mode
4 months ago
Tommmm • 0

Hello,

I am using raw paired-end RNA-seq Fastq files from a public database for gene expression quantification. After aligning the sequences using STAR, I noticed a couple of issues with the generated BAM file:

The BAM file is very small, only about 20 MB (too small for humans). The quality sequences in the BAM file are all displayed as "?". The bam file looks like this When I proceed to quantify gene expression using RSEM, the vast majority of genes have an expression level of zero, with only a few hundred genes showing non-zero expression.

Here is the workflow I followed:

Pre-processing the data with trim_galore.

Aligning the sequences using STAR. The code used is as follows:

STAR --runThreadN 5 --genomeDir STAR_genome \
--readFilesCommand zcat \
--readFilesIn clean_data/Sample1_1_val_1.fq.gz \
clean_data/Sample1_2_val_2.fq.gz \
--outFileNamePrefix align_out/sample1_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 5 \
--quantMode TranscriptomeSAM GeneCounts

Quantifying gene expression with RSEM.

My questions are:

Why is the BAM file so small, and why are the quality sequences displayed as "?"? How can I resolve these issues to ensure accurate gene expression quantification?

Thank you very much for your help!

Bam Fastq Gene-expression RNA-seq • 633 views
ADD COMMENT
1
Entering edit mode

Run fastqc on the fastq files.

ADD REPLY
0
Entering edit mode

I have run fastqc on the Fastq file and the results show no big problem, as shown in the picture below. Fastqc output1 Fastqc output2

ADD REPLY
1
Entering edit mode

You could possibly also check if the output log of STAR might have mentioned anything of note?

ADD REPLY
0
Entering edit mode

Thanks, I checked the "ReadsPerGene.out.tab" file and found that the majority of reads were classified as "unmapped". Is this because my reference genome was used incorrectly? ReadsPerGene.out.tab

ADD REPLY
1
Entering edit mode

Is this because my reference genome was used incorrectly?

That is a possibility. If this is public/published data you can try to get the reference the paper used and see if that improves things.

ADD REPLY
0
Entering edit mode

OK, I will try it, thank you very much!!

ADD REPLY
1
Entering edit mode
4 months ago
Darked89 4.7k

Unless you try to map small RNA libraries there is little if anything to gain from trimming the reads prior to mapping. The SRR12544419 run (2x 51bp) is from the COVID patient. Long shot: you may add COVID sequence to human genome and map to such hybrid database. Not that it is likely that RNA from human leukocytes does contain viral sequences, but in this way you can exclude one of the speculative reasons of low number of reads mapped to human genome/genes.

ADD COMMENT
0
Entering edit mode

Thank you for your advice, which really helped me to explore the reasons for the low sequence alignment rate.

ADD REPLY

Login before adding your answer.

Traffic: 1165 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6