I have a set of Illumina short read RNA-seq files which I'm hoping to use to analyse exon usage patterns within a set of specific genes.
The workflow is fastqc -> quality trimming and adapter removal -> alignment with STAR -> sorting and indexing to produce BAM file -> count alignments to annotated exons using featureCounts in subread.
However, I'm running into issues as the GENCODE annotation (GTF) file I used for alignment and feature counting (Release 33 (GRCh38.p13) Comprehensive gene annotation) has:
instances where the same genomic region is annotated with different ENSEMBL exon IDs; and
instances where two different exon annotations overlap one another for the same gene (presumably varying 5' and 3' splice sites??)
In any event, in order to do feature counting properly I need a non-redundant and non-overlapping set of exon annotations. This could possibly be the set of exons in the inferred metatranscript of the gene, or the set of exons where each exon is the largest possible set of overlapping genomic coordinates that are annotated as exons. However, I'm not quite sure how to go about producing something like this, and at which point to incorporate such an annotation into my analyses.
If I attempt to reinterpret the featureCounts table with new coordinates after it's already been produced, it's possible I'll miss some reads that haven't been counted.
I'd really appreciate some help in terms of how to go about doing this. Thanks!
A standard way to do this is to collapse all exonic regions for each individual gene and use those as your 'transcriptome' reference. In this case you are assigning these reads to the exonic regions of genes and they are your "meta-features".
Obviously there are cases where you will have multiple genes with overlapping exonic regions. The standard for this is to discard reads in those regions - because you cannot be sure from which gene they originate.
This is done automatically and is clearly stated on the featureCounts "webpage".
Here is what is says:
Overlap between reads and features A read is said to overlap a feature
if at least one read base is found to overlap the feature. For
paired-end data, a fragment (or template) is said to overlap a feature
if any of the two reads from that fragment is found to overlap the
feature.
> By default, featureCounts does not count reads overlapping with more
than one feature (or more than one meta-feature when summarizing at
meta-feature level). Users can use the -O option to instruct
featureCounts to count such reads (they will be assigned to all their
overlapping features or meta-features).
Note that, when counting at the meta-feature level, reads that overlap
multiple features of the same meta-feature are always counted exactly
once for that meta-feature, provided there is no overlap with any
other meta-feature. For example, an exon-spanning read will be counted
only once for the corresponding gene even if it overlaps with more
than one exon.
One additional suggestion is that you could potentially remove genes that you think are not expressed at all - which would potentially allow you to recover some counts.
Thanks for this. What I'm having trouble with is how to do that first part: "collapse all exonic regions for each individual gene and use those as your 'transcriptome' reference." Does this mean editing the GTF file prior to alignment? And if so, how do I do this?
I believe as long as your GTF file has the right meta-data information this should be the default for the featureCounts function. The first example from their manual:
featureCounts -T 5 -a annotation.gtf -t exon -g gene id -o counts.txt mapping results SE.sam
Above the -t refers to what column to use as the counted features and then -g is what to group them by. In this case -g is your meta-features (i.e. genes) and the exons within each gene group are the regions where counts can come from.
from cgat-apps (installable via bioconda), but the meat of the code is:
def gene_to_blocks(gene):
'''Given a list of all exons in a gene, as a list of CGAT.GTF entries, create a seperate exon
for each unqiue part of a exon, as well as one for introns. '''
exons = [(e.start, e.end)
for e in gene if e.feature == "exon"]
exons = list(set(sum(exons, ())))
exons.sort()
entry = GTF.Entry()
entry = entry.copy(gene[0])
entry.transcript_id = "merged"
entry.feature = "exon"
entry.source = "merged"
for i in range(len(exons)-1):
entry.start = exons[i]
entry.end = exons[i+1]
entry.attributes["exon_id"] = str(i + 1)
yield entry
Thanks very much for this! However, I'm struggling to find documentation on the gtf2gtf tool in cgat, and in particular the flags that the tool accepts. --help gives an error:
$ cgat gtf2gtf --help
Traceback (most recent call last):
File "/home/oates_binf_1/software/miniconda3/bin/cgat", line 11, in <module>
sys.exit(main())
File "/home/oates_binf_1/software/miniconda3/lib/python3.7/site-packages/cgat/cgat.py", line 132, in main
module.main(sys.argv)
File "/home/oates_binf_1/software/miniconda3/lib/python3.7/site-packages/cgat/tools/gtf2gtf.py", line 362, in main
parser = E.ArgumentParser(description=__doc__)
AttributeError: module 'cgatcore.experiment' has no attribute 'ArgumentParser'
I've pinged the maintainers, but from a quick look at the code, it looks like you've got a mismatch between the version of cgatcore and cgat-apps. Try updating cgat-core with $ conda install -c conda-forge -c biconda cgatcore=0.6.5
Thanks for this. What I'm having trouble with is how to do that first part: "collapse all exonic regions for each individual gene and use those as your 'transcriptome' reference." Does this mean editing the GTF file prior to alignment? And if so, how do I do this?
I believe as long as your GTF file has the right meta-data information this should be the default for the featureCounts function. The first example from their manual:
Above the
-t
refers to what column to use as the counted features and then-g
is what to group them by. In this case-g
is your meta-features (i.e. genes) and the exons within each gene group are the regions where counts can come from.