Dear all,
this might be a naive question but my google-fu fails me. I count reads from a BAM, aligned by STAR against a custom hg19 genome, after running Picard MarkDuplicates, then count reads assigned to exons using a slightly customized version of the NCBI reference annotation GFF. The customization is mostly adaptation to the genome and propagation of some tags from gene level to exon level.
Then I count reads using featureCounts, where $bam are all the BAMs in the pipeline. There are quite a lot of transcript variants per gene, and multimapping is allowed on purpose; I wanted to account for both with --fraction:
featureCounts \
-p -f -T 4 \
-O -M --fraction \
-a hg19.gff \
-F GTF \
-t "exon" \
-g "ID" \
-s 2 \
--extraAttributes "toplevel_id,gene,transcript_id,GeneID,gbkey,gene_biotype,description,tag" \
-o out.tsv \
${bam}
When I sum up the counts in the count table, the totals differ from what the corresponding .summary file reports as assigned reads. Is this a known / expected side effect of fractional counting with multiple exons and multimapping?
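A minimal sketch of such a comparison in Python, for anyone who wants to reproduce it. It assumes the usual featureCounts layout: a leading "#" comment line in the count table, one count column per BAM named exactly as in the .summary header, and an "Assigned" row in the .summary file. The function names (read_summary_assigned, sum_count_columns, compare) are hypothetical helpers, not part of featureCounts.

```python
def read_summary_assigned(summary_text):
    """Return {sample: Assigned value} parsed from .summary file contents."""
    lines = [l for l in summary_text.splitlines() if l]
    samples = lines[0].split("\t")[1:]  # header: Status, then one column per BAM
    for line in lines[1:]:
        fields = line.split("\t")
        if fields[0] == "Assigned":
            return {s: float(v) for s, v in zip(samples, fields[1:])}
    raise ValueError("no 'Assigned' row found in summary")


def sum_count_columns(table_text, samples):
    """Sum the (possibly fractional) counts per sample in the count table."""
    # Skip the '#' comment line that records program version and command.
    lines = [l for l in table_text.splitlines() if l and not l.startswith("#")]
    header = lines[0].split("\t")
    # Assumption: count columns carry the same names as the summary header.
    index = {s: header.index(s) for s in samples}
    totals = {s: 0.0 for s in samples}
    for line in lines[1:]:
        fields = line.split("\t")
        for s in samples:
            totals[s] += float(fields[index[s]])
    return totals


def compare(table_text, summary_text):
    """Return {sample: (table_sum, assigned, table_sum - assigned)}."""
    assigned = read_summary_assigned(summary_text)
    totals = sum_count_columns(table_text, list(assigned))
    return {s: (totals[s], assigned[s], totals[s] - assigned[s]) for s in assigned}
```

With the command below, this would be called as compare(open("out.tsv").read(), open("out.tsv.summary").read()). Looking columns up by name rather than position keeps the --extraAttributes columns out of the sum.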
I'd say that this is a subtle and specialized use case that only the implementer knows for sure. It feels like one of those issues where the various definitions of terms can be reconciled in multiple ways, and the different reports use them differently.
Perhaps asking on the software's issue tracker would be more appropriate.
FWIW, I would sort of ignore the discrepancy as something "expected".
Thanks, sort of matches my gut feeling. Except the “write-the-developer-part”, obviously…