Question

[RNA-seq] Extremely Unconsistent Raw Count from Different Tools; a dark mistery

1

Entering edit mode

4.4 years ago

ReWeeda ▴ 120

Good morning fellow bioinformaticians,

It's been a full week since I started struggling with a problem related to reads count extraction from RNA-seq data and I ran out of ideas.

I mainly work with RNA-seq data from FFPE samples (I know... FFPE samples sucks but this is another story) and I developed a dedicated workflow for data analysis.

After a pre-processing phase (quality check, quality filtering and trimming), raw reads are aligned with STAR (v.2.7.3.a; with 2pass mode), then I perform reads count extraction using salmon (v.1.4.0).

Recently, I had to focus on a specific gene (CD99) that is expressed in my samples. I noticed that, while salmon returns me a reasonable value for that specific gene, STAR (option "--quantMode GeneCounts" provided in the alignment phase) gives me 0 counts for CD99.

Given this observation, I tried to produce reads count with different tools such as htseq-count (v.0.13.5) and featurecounts (v.2.4.0). Again, htseq-count and featurecount returned me 0 counts for CD99. Checking the bam file of the tested sample with IGV, CD99 is clearly covered by a good amount of aligned reads.

First, I thought I set wrong some flags related to the strandness of my data in htseq-count and featurecounts so I brutally tried all the possibile flags combination. Unfortunately, no solution popped up (still 0 counts for CD99)

Given that both htseq-counts and featurecounts take bam file in input, I thought about some problem related to the alignment phase so I regenerated the STAR genome index and then I realigned my sample. Again, STAR, htseq-count and featurecounts returned me 0 counts for CD99.

At the same time, uploading original fastq files to BaseSpace, I get counts for that gene as expected. Moreover, trying to produce counts in R with featurecounts (v.1.3.4) I get counts for CD99.

Right now, I've no clue about what's going on. Can anyone help me explaining such behaviour?

Reference Genome is HG38
GTF file is from Gencode v33
Library is Paired-end; reverse strand

Code I use for STAR:

STAR --genomeDir $star_index_path --readFilesIn $R1 $R2 --runThreadN 8 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --readFilesCommand gunzip -c --sjdbGTFfile $gtfFile --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --outSAMstrandField intronMotif --outSAMtype None --outSAMmode None --outFileNamePrefix $output_prefix

STAR --runMode genomeGenerate --genomeDir $idx_path_1pass --genomeFastaFiles $reference_path --sjdbOverhang 100 --runThreadN 8 --sjdbFileChrStartEnd $output_prefix'SJ.out.tab'

STAR --genomeDir $idx_path_1pass --readFilesIn $R1 $R2 --runThreadN 8 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --limitBAMsortRAM 0 --readFilesCommand gunzip -c --sjdbGTFfile $gtfFile --quantMode GeneCounts --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentMin 10 --chimOutType Junctions WithinBAM HardClip --chimJunctionOverhangMin 10 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --chimSegmentReadGapMax 3 --chimMultimapNmax 0 --outSAMstrandField intronMotif --outFileNamePrefix $output_prefix --outSAMunmapped Within --outSAMtype BAM SortedByCoordinate

Code for Salmon:

salmon quant -i $salmon_transcriptome_index -l ISR -1 $R1 -2 $R2 --threads 8 --validateMappings --output $salmon_output_dir

Code for htseq-count:

htseq-count -q -f bam -s reverse -r pos -t exon -i gene_name -n 8 $bam $gtf > './htseq/'$idn'_htseq_name-sort_rw.txt'

Code for featurecounts :

featureCounts --primary -C -t exon -g gene_name -a $GTF --ignoreDup -s 2 -T 8 -o './featurecounts/'$idn'.quant.FC.name-sorted.rw.txt' $bam

RNA-Seq alignment counts • 2.2k views

ADD COMMENT • link updated 4.4 years ago by i.sudbery 22k • written 4.4 years ago by ReWeeda ▴ 120

1

Entering edit mode

Are the reads covering that gene of good mapping quality or multimappers (MAPQ=0, or at least low)?

ADD REPLY • link 4.4 years ago by ATpoint 89k

0

Entering edit mode

samtools view sample_CD99.bam | awk '{sum+=$5} END { print "Mean MAPQ =",sum/NR}'

Mean MAPQ = 2.94356

(*) CD99 (chrX) has got its PAR on chrY

Edit for: copy-pasting error.

ADD REPLY • link 4.4 years ago by ReWeeda ▴ 120

1

Entering edit mode

That is pretty low, guess that's the answer then. I would use salmon, in fact I do :)

ADD REPLY • link 4.4 years ago by ATpoint 89k

1

Entering edit mode

Well, I was expecting such low MAPQ value for a gene in pseudoautosomal region. I didn't realize until now that it could be a problem during reads count extraction.

I guess I will stick to Salmon too. :-)

ADD REPLY • link 4.4 years ago by ReWeeda ▴ 120

score 2 · Answer 1 · 2021-03-04

2

Entering edit mode

4.4 years ago

i.sudbery 22k

Is it possible that all reads that map to CD99 are multimapping? Both featureCounts and htseq-count ignore multimapping reads, salmon uses EM to distribute their signal across all locations.

ADD COMMENT • link 4.4 years ago by i.sudbery 22k

1

Entering edit mode

That's the answer. I was aware that htseq-count and featurecounts ignore multimapping reads but I didn't think that genes in psudoautosomal regions are by definition multimapper in chromosome X and Y.

ADD REPLY • link 4.4 years ago by ReWeeda ▴ 120

0

Entering edit mode

Any time a gene has a big number on the end you have to think about why. Are there 98 other genes with sequence similarity? I would want to keep multi-mapping to some extent. Or get fancy like Salmon or Cufflinks tries to disambiguate, and dont even look at "raw counts". Edit this will also be an issue when you try to qPCR or Sanger validate some findings, beware there could be 98 more genes with sequence similarity.

ADD REPLY • link 4.4 years ago by karl.stamm 4.1k

0

Entering edit mode

I would not blame CD99 sequence similarity alone. While it may have similar regions, it also has unique regions. You can get many counts (CPM > 10) for CD99 based on unique mappers alone.

If you are using Salmon, you should strongly consider using decoys in your index. Otherwise, low-quality samples may end up with highly inflated counts.

ADD REPLY • link 4.4 years ago by igor 13k