deeptools computeMatrix gtf file : if name not in self.exons[self.labelIdx]: IndexError: IndexError: list index out of range
0
0
Entering edit mode
11 months ago
Iam • 0

Hi everyone, I am trying to use deeptools computeMatrix for which regions files are custom made by subsetting the hg38 gtf file. If I use only first 2 lines from gtf to make a test run it works. For example the below gtf file it works. (temp.gtf).

head -n 2 temp.gtf
        1       havana  gene    182696  184174  .       +       .       gene_id "ENSG00000279928"; gene_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene";
        1       havana  transcript      182696  184174  .       +       .       gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";

but if I subset according some conditions after grepping selected lines as below as final gtf file, it does not work. (temp_ce.gtf)

head -n 2 temp_ce.gtf
1       ensembl CDS     100877476       100877903       .       -       1       gene_id "ENSG00000162694"; gene_version "14"; transcript_id "ENST00000535414"; transcript_version "5"; exon_number "3"; gene_name "EXTL2"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "EXTL2-207"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS72831"; protein_id "ENSP00000444385"; protein_version "2"; tag "basic"; transcript_support_level "5";
1       ensembl CDS     100877476       100877903       .       -       1       gene_id "ENSG00000162694"; gene_version "14"; transcript_id "ENST00000535414"; transcript_version "5"; exon_number "3"; gene_name "EXTL2"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "EXTL2-207"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS72831"; protein_id "ENSP00000444385"; protein_version "2"; tag "basic"; transcript_support_level "5";

Here is the deeptools command that I am using:

computeMatrix  scale-regions -R temp.gtf -S data/test_rnaseq_plus.bigWig -b 5000 -a 5000   --regionBodyLength 10000  -o Analysis/overlap_scaled.gz  --outFileNameMatrix Analysis/overlap_scaled.tab  --numberOfProcessors 20 --metagene

computeMatrix  scale-regions -R temp_ce.gtf -S data/test_rnaseq_plus.bigWig -b 5000 -a 5000   --regionBodyLength 10000  -o Analysis/overlap_scaled.gz  --outFileNameMatrix Analysis/overlap_scaled.tab  --numberOfProcessors 20 --metagene

The first command works and the second throws the error:

Traceback (most recent call last):
  File "/test/conda/miniconda3/envs/core/bin/computeMatrix", line 14, in <module>
    main(args)
  File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptools/computeMatrix.py", line 421, in main
    hm.computeMatrix(scores_file_list, args.regionsFileName, parameters, blackListFileName=args.blackListFileName, verbose=args.verbose, allArgs=args)
  File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptools/heatmapper.py", line 252, in computeMatrix
    res, labels = mapReduce.mapReduce([score_file_list, parameters],
  File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptools/mapReduce.py", line 85, in mapReduce
    bed_interval_tree = GTF(bedFile, defaultGroup=defaultGroup, transcriptID=transcriptID, exonID=exonID, transcript_id_designator=transcript_id_designator, keepExons=keepExons)
  File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptoolsintervals/parse.py", line 593, in __init__
    self.parseGTF(fp, line)
  File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptoolsintervals/parse.py", line 522, in parseGTF
    self.parseGTFexon(cols)
  File "/test/conda/miniconda3/envs/core/lib/python3.10/site-packages/deeptoolsintervals/parse.py", line 444, in parseGTFexon
    if name not in self.exons[self.labelIdx]:
IndexError: list index out of range

I tried to find differences between the two gtf file format but I could not. Hence I would appreciate if I can know what am I doing wrong when creating gtf file.

metagene gtf computeMatrix deeptools • 851 views
ADD COMMENT
0
Entering edit mode

Is it because your subsetted version doesn't have the parent "gene" and "transcript" lines?

ADD REPLY
0
Entering edit mode

Here I just showed only the two lines but if I am not wrong I do have parent gene and transcript lines as well. Is it need to be sorted in some way that the parent gene and transcript lines should be coming first?

ADD REPLY
0
Entering edit mode

If gene and transcript exist for each entry, then it might be an issue with the --metagene flag if all of your "child" entries are labeled as CDS instead of exon. It might help to share more of your subsetted GTF file.

ADD REPLY

Login before adding your answer.

Traffic: 2336 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6