Question

Weird output of geneCount for STAR?

1

Entering edit mode

4.0 years ago

FantasticAI ▴ 60

Hi

I'm using STAR to perform alignment and my goal is to do the DEG analysis in the future. The parameter I set up as follows:

Step 1:

                       STAR --runThreadN 12 \
                                 --runMode genomeGenerate \
                                 --genomeDir genomedir \
                                 --genomeFastaFiles ./ref/GRCm39.primary_assembly.genome.fa \
                                 --sjdbGTFfile ./ref/gencode.vM27.primary_assembly.annotation.gtf \
                                 --sjdbOverhang 100

Step 2:

                       STAR --runThreadN 12 \
                                 --runMode alignReads \
                                 --twopassMode None \
                                 --genomeDir genomedir \
                                 --sjdbGTFfile ref/gencode.vM27.primary_assembly.annotation.gtf \
                                 --sjdbOverhang 100 \
                                 --readFilesIn "R1_001_trim.fastq.gz" "R2_001_trim.fastq.gz" \
                                 --readFilesCommand zcat \
                                 --outFileNamePrefix "./align_output/try1_" \
                                 --outSAMtype BAM SortedByCoordinate \
                                 --quantMode GeneCounts

But the output of genecount file is a bit weird:

N_unmapped  1657081 1657081 1657081
N_multimapping  8053375 8053375 8053375
N_noFeature 12418438    43724858    13355695
N_ambiguous 1629488 39643   430809
ENSMUSG00000102693.2    0   0   0
ENSMUSG00000064842.3    0   0   0
ENSMUSG00000051951.6    49  2   47
ENSMUSG00000102851.2    0   0   0
ENSMUSG00000103377.2    10  0   10
ENSMUSG00000104017.2    13  0   13
ENSMUSG00000103025.2    4   1   3
ENSMUSG00000089699.2    0   0   0
ENSMUSG00000103201.2    2   0   2
ENSMUSG00000103147.2    1   1   0
ENSMUSG00000103161.2    7   0   7
ENSMUSG00000102331.2    20  0   20
ENSMUSG00000102348.2    0   0   0

Why the gene index at the first column has ".2" or ".3" in the end? like "ENSMUSG00000102693.2", Shouldn't be just "ENSMUSG00000102693"? I use reference from Gencode by the way

reference STAR rna-seq genome • 2.5k views

ADD COMMENT • link updated 4.0 years ago by rpolicastro 13k • written 4.0 years ago by FantasticAI ▴ 60

score 2 · Accepted Answer · 2021-11-21

2

Entering edit mode

4.0 years ago

rpolicastro 13k

That's the version number of that ensembl gene ID. When a gene model is changed to an appreciable extent the version number increases.

ADD COMMENT • link 4.0 years ago by rpolicastro 13k

0

Entering edit mode

Thank you for reply, will it matter for the downstream analysis? like DEG? or I don't need to worry about that?

ADD REPLY • link 4.0 years ago by FantasticAI ▴ 60

2

Entering edit mode

It won't effect the results of DEG analysis, but keep the version numbers in mind when filtering, joining, or using your data for something like GO enrichment since the gene ID part is stable, but the gene versions can change between releases. You can always use regex to remove the version numbers as well.

ADD REPLY • link 4.0 years ago by rpolicastro 13k

0

Entering edit mode

So if I have multiple samples, there might be the case like "ENSMUSG00000102693.2" in geneCount output for sample 1 and "ENSMUSG00000102693.3" in geneCount output for sample 2. When I merge the geneCount in the end, I need to consider these two are actually referring to the same gene. I have to add them together, am I correct?

ADD REPLY • link 4.0 years ago by FantasticAI ▴ 60

2

Entering edit mode

If all of your samples are aligned and/or quantified using the same annotation release, you won't run into that problem, but you could if different samples were processed with different releases.

What also may happen for example is lets say you query ENSEMBL biomart so you can add gene names to your gene IDs. If the annotation release version of biomart is not the same as the version of the reference you used for alignment and quantification, there might be gene IDs with different version numbers.

ADD REPLY • link 4.0 years ago by rpolicastro 13k

0

Entering edit mode

I see, learned a lot. Thank you so much

One follow up question, My goal is to do DEG analysis, so in practice, like you mentioned earlier, should I just trim off the version number for all genes, so there won't be the version issues?

By the way, just to confirm that based on the geneCount output, my data is reverse stranded data, am I correct?

ADD REPLY • link 4.0 years ago by FantasticAI ▴ 60

2

Entering edit mode

It's safe to trim off the version number, since that would result in the equivalent of the non-versioned ENSEMBL gene IDs.

Your data does look reverse stranded.

ADD REPLY • link 4.0 years ago by rpolicastro 13k