Hello,
I have been trying to run RNA-seq analyses for the purpose of chrRNA-seq. My first chrRNA-seq analysis used mapped reads from STAR (no annotation) counting reads into full gene coordinates using featureCounts [1]. The group then wanted an exon only analysis, so I used the same STAR reads and featureCounts using the GTF file using an exon wise analysis [2]. Lastly they wanted an intron only analysis. To do this I took the gene counts of the previous two analysis and used R to subtract the exon counts from the full gene counts. HOWEVER, when running the final intron only counts through DESeq2 I was informed that there are negative counts, see extract of the count files below [3]. The biggest mystery to me is why the full gene count for ENSG00000001631.17 has nearly 0 counts, but the exon has thousands.
I would welcome any advice to explain what i am doing wrong. Something to do with strandedness...? Any more reliable method? Sundry thoughts that may be important: it is a stranded analysis. Negative values appear in ~9000 genes of ~65K.
Thank you!
[1] Intron + Exon:
Extract gene coords:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("gencode.v41.annotation.gtf", format="gtf")
genes <- genes(txdb)
write.table(as.data.frame(genes)[,-4], file="gencode.v41.annotation_genes.txt", col.names=F, sep="\t", quote=F)
Cut out first four columns to create the SAF file for feature Counts:
cut -f1-5 gencode.v41.annotation_genes.txt > gencode.v41.annotation_genes.saf
Add required header:
GeneID Chr Start End Strand
Remove PAR_Y genes:
cat gencode.v41.annotation_genes.saf | sed '/PAR_Y/d' > gencode.v41.annotation_genes_minus_PAR_Y.saf
Run featureCounts on a sample
featureCounts -F SAF -a gencode.v41.annotation_genes_minus_PAR_Y.saf -s 2 -p -C -T 8 -o KWR1_Neg_1_hg38_STAR_277a_defaults_noanno_featureCounts_whole_gene.txt KWR1_Neg_1_hg38_STAR_277a_defaults_noannoAligned.out.bam
[2] Exon only ('-t exon' is default):
featureCounts -a gencode.v41.annotation.gtf -s 2 -p -C -T 8 -o KWR1_Neg_1_hg38_STAR_277a_defaults_noanno_featureCounts_exons.txt KWR1_Neg_1_hg38_STAR_277a_defaults_noannoAligned.out.bam
[3] Extracts of counts files
Intron + exon counts:
ENSG00000001631.17 2 16 13 6 4
ENSG00000002016.18 3132 5265 5222 3298 2828
ENSG00000002079.14 0 13 10 4 0
ENSG00000002330.14 588 1041 1064 782 766
Exon only counts:
ENSG00000001631.17 1611 2579 2875 2045 1582
ENSG00000002016.18 915 1377 1400 847 778
ENSG00000002079.14 0 0 0 3 0
ENSG00000002330.14 170 189 218 160 154
Intron only counts:
ENSG00000001631.17 -1609 -2563 -2862 -2039 -1578
ENSG00000002016.18 2217 3888 3822 2451 2050
ENSG00000002079.14 0 13 10 1 0
ENSG00000002330.14 418 852 846 622 612
Whenever I encounter something like this I move in R with GenomicAlignments and do a targeted analysis (in your case you could focus on the 4 genes in your list).I
My instinct says either the strandedness is causing an issue or that the exon/intron distinction is causing bad counting. A question that might be important is how are counts overlapping both exons and introns being counted (like for alternative transcript isoforms).
Also how are you getting the exon/intron counts with -t exon,intron ?
Thank you for your reply @benformatics! Like you I wonder whether all the different transcripts within the initial gene footprint could do odd things to counting. Both runs of featureCounts in [1] and [2] do not set '-t' so the default is 'exon'. Although in [1] a SAF file is used and so -t isn't used (I think...). I will also try an unstranded count to see the affect.