Text version below

Question

rnaseqc.v2.3 - Doesn't detect exons in GTF

0

Entering edit mode

5.3 years ago

Jon17 ▴ 20

RNASeqc doesn't like any of my GTF files. I have no problem running the example dataset. Any idea why? When I run this:

rnaseqc.v2.3.4.linux gencode.v29.compressed-noGene.gtf NuGen_mRNA_10_ERCC2_2_md.bam  OUTPUT

I get this result:

There were either no genes or no exons in the GTF
0 genes parsed
334017 exons parsed

I have a freshly downloaded gencode gtf file which was treated with gtx-pipeline

python3 gtex-pipeline/gene_model/collapse_annotation.py Homo_sapiens.GRCh38.97.gtf Homo_sapiens.GRCh38.97_compressed.gtf

and to remove genes and just keep exons:

awk -F'\t' '($3 != "gene")' gencode.v29.compressed.gtf > gencode.v29.compressed-noGene.gtf

And 334017 exons were confirmed:

cat gencode.v29.compressed-noGene.gtf | awk -F'\t' '{print $3}' | grep -c exon
334017

So why can't it run?

The bam file processes fine with the older version of RNA seq:

java -jar RNA-SeQC_v1.1.8.jar -n 10000 -s "NuGen_mRNA_10_ERCC1_2|R3e_MAP_READS/NuGen_mRNA_10_ERCC1_2_md.bam|Desc" -t Homo_sapiens.GRCh38.83_filtered_transcript_id.gtf -r Homo_sapiens.GRCh38.dna.primary_assembly.fa -o OUT

But I can't get either the Ensemble GTF or gencode GTF to work with the newer version of RNASeqc. So I'm using gencode GTF and now running into this issue. At least the software doesn't crash. Anyone with RNASeqc experience have advice on how I can get this to run?

When I try to compress an Ensemble format:

python3 gtex-pipeline/gene_model/collapse_annotation.py Homo_sapiens.GRCh38.83_filtered_transcript_id.gtf Homo_sapiens.GRCh38.83_collapsed.gtf

I get this error:

Traceback (most recent call last):   File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 249, in <module>
    annotation = Annotation(args.transcript_gtf)   File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 76, in __init__
    t = Transcript(attributes.pop('transcript_id'), attributes.pop('transcript_name'), attributes.pop('transcript_type'), g, start_pos, end_pos) KeyError: 'transcript_type'

RNA-Seq rna-seq qc mit gencode • 2.1k views

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 5.3 years ago by Jon17 ▴ 20

0

Entering edit mode

Are the chromosomal identifiers matching in all files that you are using?

ADD REPLY • link 5.3 years ago by GenoMax 147k

0

Entering edit mode

I think so. I've ran the bam file against both the original gencode gtf file and a modified version where I converted the "chr2" to just "2". Basically just removing all of the "chr" from the 1st column. I still have issues.

Legend for the text below:
Column 1 = Bam file (samtools view | awk -F'\t' '{print $3}')
Column 2 = modified GTF file ( awk -F'\t' '{print $1}')
Column 3 = original GTF file ( awk -F'\t' '{print $1}')

=====================================================
Easy to read screenshot

enter image description here

Text version below

 *  1   chr1
1   10  chr10
10  11  chr11
11  12  chr12
12  13  chr13
13  14  chr14
14  15  chr15
15  16  chr16
16  17  chr17
17  18  chr18
18  19  chr19
19  2   chr2
2   20  chr20
20  21  chr21
21  22  chr22
22  3   chr3
3   4   chr4
4   5   chr5
5   6   chr6
6   7   chr7
7   8   chr8
8   9   chr9
9   M   chrM
GL000008.2  X   chrX 
GL000009.2  Y   chrY
GL000194.1       
GL000195.1       
GL000205.2       
GL000208.1       
GL000213.1      
GL000214.1      
GL000216.2      
GL000218.1      
GL000219.1      
GL000220.1      
GL000221.1      
GL000224.1      
GL000225.1

ADD REPLY • link 5.3 years ago by Jon17 ▴ 20

0

Entering edit mode

I think I solved the situation.

Still investigating but at least I'm getting output now. Will post an update later.