Greetings Biostars community and those familiar with Feature Counts,
We have one RNA sequencing, paired-end, 150bp file for which we estimate gene and junction counts using Feature Counts.
Feature Counts version: 2.0.8
Feature Counts code:
$FCOUNTS -a GCF_000001405.40_GRCh38.p14_genomic.gtf -G GCF_000001405.40_GRCh38.p14_genomic.fna -T 8 -J -M -s 0 -p --countReadPairs --largestOverlap -o outfile infile.bam
We have documented the following instance in a junction counts output file (outfile.jcounts) :
Genes:
PNKD TMBIM1
Situation (see attached picture in IGV):
- PNKD gene overlaps TMBIM1 gene, and all transcripts.
- All TMBIM1 transcripts are in the PNKD intron.
- All TMBIM1 junction reads ~1000 are splitting between PNKD and TMBIM1 for multiple intron junctions.
- In outfile.jcounts PNKD is the primary gene, however has no exons located in this region.
- When we remove the PNKD gene from the GTF and recount with Feature Counts, the reads are assigned perfectly to TMBIM1, as the primary gene with NA as secondary gene, as it should.
- A check of gene counts using exon as feature correctly assigns counts to genes. This appears to be only a junction counting issue.
- The .gtf and .fna are Refseq 1405.40 files (as noted above) and are untouched.
- We use this as a simple verification of raw junction counts before analysing PSI.
- As you will also note, the reverse is also true with PNKD exon-exon junctions sharing read counts with a partially overlapped gene CATIP-AS2.
- We have encountered others and believe this may be a widespread event.
Here is the outfile.jcounts of that region:
PNKD NA NC_000002.12 218270602 + NC_000002.12 218271381 + 17
PNKD NA NC_000002.12 218270602 + NC_000002.12 218271424 + 4
PNKD NA NC_000002.12 218271549 + NC_000002.12 218272570 + 14
PNKD TMBIM1 NC_000002.12 218275621 - NC_000002.12 218276026 - 74
PNKD TMBIM1 NC_000002.12 218276079 - NC_000002.12 218277004 - 46
PNKD TMBIM1 NC_000002.12 218277099 - NC_000002.12 218277366 - 90
PNKD TMBIM1 NC_000002.12 218277453 - NC_000002.12 218277633 - 96
PNKD TMBIM1 NC_000002.12 218277670 - NC_000002.12 218277935 - 73
PNKD TMBIM1 NC_000002.12 218277974 - NC_000002.12 218278515 - 123
PNKD TMBIM1 NC_000002.12 218277992 - NC_000002.12 218278515 - 1
PNKD TMBIM1 NC_000002.12 218278118 - NC_000002.12 218278515 - 2
PNKD TMBIM1 NC_000002.12 218278565 - NC_000002.12 218279038 - 128
PNKD TMBIM1 NC_000002.12 218279091 - NC_000002.12 218279289 - 101
PNKD TMBIM1 NC_000002.12 218279091 - NC_000002.12 218280026 - 6
PNKD TMBIM1 NC_000002.12 218279353 - NC_000002.12 218280026 - 109
PNKD TMBIM1,MIR6513 NC_000002.12 218280126 - NC_000002.12 218281940 - 26
PNKD TMBIM1,MIR6513 NC_000002.12 218280126 NA NC_000002.12 218292466 NA 34
PNKD TMBIM1 NC_000002.12 218282181 - NC_000002.12 218284039 - 1
PNKD TMBIM1 NC_000002.12 218282181 - NC_000002.12 218285763 - 1
PNKD TMBIM1 NC_000002.12 218282181 - NC_000002.12 218287207 - 1
PNKD TMBIM1 NC_000002.12 218282181 NA NC_000002.12 218292466 NA 71
PNKD LOC105373881 NC_000002.12 218316364 + NC_000002.12 218385147 + 3
PNKD LOC105373880,LOC105373881 NC_000002.12 218316883 + NC_000002.12 218317955 + 2
PNKD CATIP-AS2 NC_000002.12 218323428 NA NC_000002.12 218341591 NA 5
PNKD CATIP-AS2 NC_000002.12 218323431 + NC_000002.12 218339783 + 52
PNKD CATIP-AS2 NC_000002.12 218339898 + NC_000002.12 218340029 + 27
PNKD CATIP-AS2 NC_000002.12 218340141 + NC_000002.12 218340728 + 8
PNKD CATIP-AS2 NC_000002.12 218340786 + NC_000002.12 218341534 + 5
PNKD CATIP-AS2 NC_000002.12 218341626 + NC_000002.12 218341981 + 12
PNKD CATIP-AS2 NC_000002.12 218342144 + NC_000002.12 218343500 + 12
PNKD CATIP-AS2 NC_000002.12 218343586 + NC_000002.12 218344455 + 15
PNKD CATIP-AS2 NC_000002.12 218344570 + NC_000002.12 218344808 + 28
Has anyone encountered this before and know where things have gone wrong? It would be great to hear from Prof Smyth on this as well, just in case we have missed something.
Thanks in advance, Chris.
These two genes are on opposite strands. Based on your command you appear to have unstranded data or are doing the counting with that setting (
-s 0
). If you have strand specific RNA libraries you should use the correct option.Thanks for the comment @GenoMax. We too note, those genes are on opposite strands. For context, we have an unstranded library. Cheers, Chris.
Tagging Gordon Smyth
Thank you GenoMax for tagging Gordon Smyth to this post. Looking forward to his feedback (as well as the biostars' community), on what might be occuring.
Regards, Chris
I am not the appropriate person to respond to a detailed question of this type. If you want to submit a (potential) bug report to the featureCounts authors, please post to the Bioconductor forum, as the Subread documentation advises you to do. Biostars is for getting opinions/experience from the Biostats community, not for communicating with the featureCounts authors.
Thanks for the note @gordonsmyth.
Will open this up as a potential issue on Bioconductor as advised. I was hesitant to post on Bioconductor because I still I have an outstanding query for Feature Counts : https://support.bioconductor.org/p/9156945/
Here is the Bioconductor link for the query if anyone interested in following the thread : https://support.bioconductor.org/p/9160807/
Chris
What do the alignments say for these read pairs?
Is it possible that you have read pairs that start on features assigned to PNKD then end on those that are assigned to TMBIM1?
My expectation was always that featureCounts tabulates what exists in the alignments, and if there are reads that map across incompatible features then it will report those as such,
Hi Istvan Albert
Thank you for the feedback.
Sorry, it is hard to see the span of reads on the IGV screenshot.
Here is an alternative shot; viewed as pairs, and sorted by start coordinate.
The paired end reads all start and end within the gene region of TMBIM1. The paired end reads over junctions are within the intronic region of 2 of PNKD's 5 transcripts. The 3 other transcripts are completely up/downstream of the TMBIM1 gene domain.
When we tested count calculation at gene level, (where the gene is meta, and exon is feature) the counts on genes look correct. There are ~1000 reads to count unfortunately.
When we remove the PNKD gene from the GTF and rerun Feature Counts, those junction reads are assigned perfectly to TMBIM1, as the primary gene, with NA as the secondary gene.
Have you encountered this before?
When we count the same file for junctions with Regtools, those junctions are given the correct counts, and PNKD does not have any junctions counted in that region (as expected).
As Gordon suggested, will move this to Bioconductor. Link is in the reply to Gordon if you are interested.
Cheers, Chris