Entering edit mode
3.0 years ago
FantasticAI
▴
60
Hi
I'm using STAR to perform alignment and my goal is to do the DEG analysis in the future. The parameter I set up as follows:
Step 1:
STAR --runThreadN 12 \
--runMode genomeGenerate \
--genomeDir genomedir \
--genomeFastaFiles ./ref/GRCm39.primary_assembly.genome.fa \
--sjdbGTFfile ./ref/gencode.vM27.primary_assembly.annotation.gtf \
--sjdbOverhang 100
Step 2:
STAR --runThreadN 12 \
--runMode alignReads \
--twopassMode None \
--genomeDir genomedir \
--sjdbGTFfile ref/gencode.vM27.primary_assembly.annotation.gtf \
--sjdbOverhang 100 \
--readFilesIn "R1_001_trim.fastq.gz" "R2_001_trim.fastq.gz" \
--readFilesCommand zcat \
--outFileNamePrefix "./align_output/try1_" \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
But the output of genecount file is a bit weird:
N_unmapped 1657081 1657081 1657081
N_multimapping 8053375 8053375 8053375
N_noFeature 12418438 43724858 13355695
N_ambiguous 1629488 39643 430809
ENSMUSG00000102693.2 0 0 0
ENSMUSG00000064842.3 0 0 0
ENSMUSG00000051951.6 49 2 47
ENSMUSG00000102851.2 0 0 0
ENSMUSG00000103377.2 10 0 10
ENSMUSG00000104017.2 13 0 13
ENSMUSG00000103025.2 4 1 3
ENSMUSG00000089699.2 0 0 0
ENSMUSG00000103201.2 2 0 2
ENSMUSG00000103147.2 1 1 0
ENSMUSG00000103161.2 7 0 7
ENSMUSG00000102331.2 20 0 20
ENSMUSG00000102348.2 0 0 0
Why the gene index at the first column has ".2" or ".3" in the end? like "ENSMUSG00000102693.2", Shouldn't be just "ENSMUSG00000102693"? I use reference from Gencode by the way
Thank you for reply, will it matter for the downstream analysis? like DEG? or I don't need to worry about that?
It won't effect the results of DEG analysis, but keep the version numbers in mind when filtering, joining, or using your data for something like GO enrichment since the gene ID part is stable, but the gene versions can change between releases. You can always use regex to remove the version numbers as well.
So if I have multiple samples, there might be the case like "ENSMUSG00000102693.2" in geneCount output for sample 1 and "ENSMUSG00000102693.3" in geneCount output for sample 2. When I merge the geneCount in the end, I need to consider these two are actually referring to the same gene. I have to add them together, am I correct?
If all of your samples are aligned and/or quantified using the same annotation release, you won't run into that problem, but you could if different samples were processed with different releases.
What also may happen for example is lets say you query ENSEMBL biomart so you can add gene names to your gene IDs. If the annotation release version of biomart is not the same as the version of the reference you used for alignment and quantification, there might be gene IDs with different version numbers.
I see, learned a lot. Thank you so much
One follow up question, My goal is to do DEG analysis, so in practice, like you mentioned earlier, should I just trim off the version number for all genes, so there won't be the version issues?
By the way, just to confirm that based on the geneCount output, my data is reverse stranded data, am I correct?
It's safe to trim off the version number, since that would result in the equivalent of the non-versioned ENSEMBL gene IDs.
Your data does look reverse stranded.