Good morning fellow bioinformaticians,
It's been a full week since I started struggling with a problem related to reads count extraction from RNA-seq data and I ran out of ideas.
I mainly work with RNA-seq data from FFPE samples (I know... FFPE samples sucks but this is another story) and I developed a dedicated workflow for data analysis.
After a pre-processing phase (quality check, quality filtering and trimming), raw reads are aligned with STAR (v.2.7.3.a; with 2pass mode), then I perform reads count extraction using salmon (v.1.4.0).
Recently, I had to focus on a specific gene (CD99) that is expressed in my samples. I noticed that, while salmon returns me a reasonable value for that specific gene, STAR (option "--quantMode GeneCounts" provided in the alignment phase) gives me 0 counts for CD99.
Given this observation, I tried to produce reads count with different tools such as htseq-count (v.0.13.5) and featurecounts (v.2.4.0). Again, htseq-count and featurecount returned me 0 counts for CD99. Checking the bam file of the tested sample with IGV, CD99 is clearly covered by a good amount of aligned reads.
First, I thought I set wrong some flags related to the strandness of my data in htseq-count and featurecounts so I brutally tried all the possibile flags combination. Unfortunately, no solution popped up (still 0 counts for CD99)
Given that both htseq-counts and featurecounts take bam file in input, I thought about some problem related to the alignment phase so I regenerated the STAR genome index and then I realigned my sample. Again, STAR, htseq-count and featurecounts returned me 0 counts for CD99.
At the same time, uploading original fastq files to BaseSpace, I get counts for that gene as expected. Moreover, trying to produce counts in R with featurecounts (v.1.3.4) I get counts for CD99.
Right now, I've no clue about what's going on. Can anyone help me explaining such behaviour?
- Reference Genome is HG38
- GTF file is from Gencode v33
- Library is Paired-end; reverse strand
Code I use for STAR:
STAR --genomeDir $star_index_path --readFilesIn $R1 $R2 --runThreadN 8 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --readFilesCommand gunzip -c --sjdbGTFfile $gtfFile --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --outSAMstrandField intronMotif --outSAMtype None --outSAMmode None --outFileNamePrefix $output_prefix
STAR --runMode genomeGenerate --genomeDir $idx_path_1pass --genomeFastaFiles $reference_path --sjdbOverhang 100 --runThreadN 8 --sjdbFileChrStartEnd $output_prefix'SJ.out.tab'
STAR --genomeDir $idx_path_1pass --readFilesIn $R1 $R2 --runThreadN 8 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --limitBAMsortRAM 0 --readFilesCommand gunzip -c --sjdbGTFfile $gtfFile --quantMode GeneCounts --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentMin 10 --chimOutType Junctions WithinBAM HardClip --chimJunctionOverhangMin 10 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --chimSegmentReadGapMax 3 --chimMultimapNmax 0 --outSAMstrandField intronMotif --outFileNamePrefix $output_prefix --outSAMunmapped Within --outSAMtype BAM SortedByCoordinate
Code for Salmon:
salmon quant -i $salmon_transcriptome_index -l ISR -1 $R1 -2 $R2 --threads 8 --validateMappings --output $salmon_output_dir
Code for htseq-count:
htseq-count -q -f bam -s reverse -r pos -t exon -i gene_name -n 8 $bam $gtf > './htseq/'$idn'_htseq_name-sort_rw.txt'
Code for featurecounts :
featureCounts --primary -C -t exon -g gene_name -a $GTF --ignoreDup -s 2 -T 8 -o './featurecounts/'$idn'.quant.FC.name-sorted.rw.txt' $bam
Are the reads covering that gene of good mapping quality or multimappers (MAPQ=0, or at least low)?
(*) CD99 (chrX) has got its PAR on chrY
Edit for: copy-pasting error.
That is pretty low, guess that's the answer then. I would use salmon, in fact I do :)
Well, I was expecting such low MAPQ value for a gene in pseudoautosomal region. I didn't realize until now that it could be a problem during reads count extraction.
I guess I will stick to Salmon too. :-)