Hi,
I am trying to use Deeptools' computeMatrix and plotHeatmap functions to create TSS-centered metaprofiles over all genes. Ultimately, I would like to apply some additional filters, e.g. protein-coding, non-overlapping TSSs, above a certain size. But for an initial trial, I decided to just filter Ensembl's GTF annotation file for genes:
awk 'BEGIN{FS=OFS="\t"} $3 == "gene"' GRCm38.99.gtf > test1.gtf
head -n 5 GRCm38.99.gtf | cat - test1.gtf > test.gtf # for re-adding the GTF "header"
I then use computeMatrix:
computeMatrix reference-point \
--referencePoint TSS \
--scoreFileName input.bw \
--regionsFileName test.gtf \
-out TSSmeta.gz \
--beforeRegionStartLength 2000 --afterRegionStartLength 2000 \
--binSize 20 \
--missingDataAsZero \
--sortRegions no
This resulted in the following error:
RuntimeError: None of the input BED/GTF files had valid regions
From what I understand, the problem stems from the absence of transcript features in the 3rd column of the GTF file. Hence, I tried running the above computeMatrix command, but including --metagene
, which resulted in the same error. I also tried setting --transcriptID gene --exonID gene
with no success.
I would really appreciate help on this!
Here are the first lines of the gtf file:
I was hoping to directly use the gtf file without having to convert to a bed file to avoid having to switch between 1- and 0-based coordinates.
I have now tried:
It ran through without errors and the result from plotHeatmap looks reasonable. But now I am doubting whether setting
--transcript ID gene
and--transcript_id_designator gene_id
is in in this specific case correct (trying to get the signal over entire genes instead of transcripts)The error most likely occurred because you didn't include any entry with "transcript" in the 3rd column in your test.gtf.
Why are you in doubt? Seems like you want to focus on genes rather than transcripts.
If you want genes then that will work fine. Just remember that genes are just groups of transcripts when looking at the results.