Hi everyone, I am trying to use deeptools computeMatrix for which regions files are custom made by subsetting the hg38 gtf file. If I use only first 2 lines from gtf to make a test run it works. For example the below gtf file it works. (temp.gtf).
head -n 2 temp.gtf
1 havana gene 182696 184174 . + . gene_id "ENSG00000279928"; gene_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene";
1 havana transcript 182696 184174 . + . gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
but if I subset according some conditions after grepping selected lines as below as final gtf file, it does not work. (temp_ce.gtf)
head -n 2 temp_ce.gtf
1 ensembl CDS 100877476 100877903 . - 1 gene_id "ENSG00000162694"; gene_version "14"; transcript_id "ENST00000535414"; transcript_version "5"; exon_number "3"; gene_name "EXTL2"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "EXTL2-207"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS72831"; protein_id "ENSP00000444385"; protein_version "2"; tag "basic"; transcript_support_level "5";
1 ensembl CDS 100877476 100877903 . - 1 gene_id "ENSG00000162694"; gene_version "14"; transcript_id "ENST00000535414"; transcript_version "5"; exon_number "3"; gene_name "EXTL2"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "EXTL2-207"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS72831"; protein_id "ENSP00000444385"; protein_version "2"; tag "basic"; transcript_support_level "5";
Here is the deeptools command that I am using:
computeMatrix scale-regions -R temp.gtf -S data/test_rnaseq_plus.bigWig -b 5000 -a 5000 --regionBodyLength 10000 -o Analysis/overlap_scaled.gz --outFileNameMatrix Analysis/overlap_scaled.tab --numberOfProcessors 20 --metagene
computeMatrix scale-regions -R temp_ce.gtf -S data/test_rnaseq_plus.bigWig -b 5000 -a 5000 --regionBodyLength 10000 -o Analysis/overlap_scaled.gz --outFileNameMatrix Analysis/overlap_scaled.tab --numberOfProcessors 20 --metagene
The first command works and the second throws the error:
Traceback (most recent call last):
File "/test/conda/miniconda3/envs/core/bin/computeMatrix", line 14, in <module>
main(args)
File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptools/computeMatrix.py", line 421, in main
hm.computeMatrix(scores_file_list, args.regionsFileName, parameters, blackListFileName=args.blackListFileName, verbose=args.verbose, allArgs=args)
File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptools/heatmapper.py", line 252, in computeMatrix
res, labels = mapReduce.mapReduce([score_file_list, parameters],
File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptools/mapReduce.py", line 85, in mapReduce
bed_interval_tree = GTF(bedFile, defaultGroup=defaultGroup, transcriptID=transcriptID, exonID=exonID, transcript_id_designator=transcript_id_designator, keepExons=keepExons)
File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptoolsintervals/parse.py", line 593, in __init__
self.parseGTF(fp, line)
File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptoolsintervals/parse.py", line 522, in parseGTF
self.parseGTFexon(cols)
File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptoolsintervals/parse.py", line 444, in parseGTFexon
if name not in self.exons[self.labelIdx]:
IndexError: list index out of range
I tried to find differences between the two gtf file format but I could not. Hence I would appreciate if I can know what am I doing wrong when creating gtf file.
Is it because your subsetted version doesn't have the parent "gene" and "transcript" lines?
Here I just showed only the two lines but if I am not wrong I do have parent gene and transcript lines as well. Is it need to be sorted in some way that the parent gene and transcript lines should be coming first?
If
gene
andtranscript
exist for each entry, then it might be an issue with the--metagene
flag if all of your "child" entries are labeled asCDS
instead ofexon
. It might help to share more of your subsetted GTF file.