Question

Differential expression analysis of lncRNAs using GENCODE reference files?

0

2.8 years ago

mathavanbioinfo ▴ 80

Hello, I am working on a differential lncRNA expression analysis. I have downloaded the reference FASTA and GTF files from the GENCODE database. Generally, for DEG analysis, the gene identifier in the FASTA header should match the gene identifier in the 9th column of the GTF file. However, the GENCODE lncRNA GTF and the lncRNA reference FASTA do not match in format. Both file formats are listed here.

reference lncRNA.fa

>ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTCT
GCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCCATCGCTGCCTTCCCCTCAAG

How should the identifiers be changed in order to run featureCounts?
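For illustration, the FASTA header shown above packs several identifiers into one pipe-delimited string, with the transcript ID first and the gene ID second. Here is a minimal Python sketch that reduces each header to a single identifier. It assumes every record follows the same field order as the example record, and the function names `simplify_header` and `rewrite_fasta` are hypothetical, made up for this sketch.

```python
def simplify_header(header, field=0):
    """Return one pipe-delimited field from a GENCODE-style FASTA header.

    Assumption: field 0 is the transcript ID and field 1 is the gene ID,
    as in the example record from the question.
    """
    return header.lstrip(">").strip().split("|")[field]


def rewrite_fasta(lines, field=0):
    """Yield FASTA lines with each header reduced to a single identifier."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + simplify_header(line, field)
        else:
            # Sequence lines pass through unchanged.
            yield line


# Example record taken from the question.
example = [
    ">ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|"
    "OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|",
    "GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTCT",
]

for out in rewrite_fasta(example, field=0):
    print(out)
```

Passing `field=1` keeps the gene ID instead of the transcript ID. Note that featureCounts assigns reads by comparing alignment coordinates with the GTF, so it typically expects reads aligned to the genome rather than to this transcript FASTA; renaming headers matters mainly when reads are quantified against the transcript sequences themselves.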

Reference GTF

##description: evidence-based annotation of the human genome (GRCh38), version 41 (Ensembl 107) - long non-coding RNAs
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-05-12

chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level 2; hgnc_id "HGNC:52482"; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30564 30667 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
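As a quick check on the two formats, the identifiers themselves do appear in both files: the first two pipe-delimited fields of the FASTA header correspond to `transcript_id` and `gene_id` in the 9th GTF column. The Python sketch below parses one transcript record and compares the two. It assumes the GTF line is tab-delimited (as in the downloaded file, although the excerpt here shows spaces) and that no attribute value contains a semicolon; `parse_attributes` is a hypothetical helper written for this example, and the attribute list is shortened.

```python
def parse_attributes(attr_field):
    """Parse the 9th GTF column into {key: [values]}.

    Values are kept in lists because keys such as `tag` can repeat.
    Assumption: no value contains a ';' character.
    """
    attrs = {}
    for item in attr_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs.setdefault(key, []).append(value.strip('"'))
    return attrs


# Transcript record from the question, with tabs between the nine columns
# and a shortened attribute list.
gtf_line = (
    "chr1\tHAVANA\ttranscript\t29554\t31097\t.\t+\t.\t"
    'gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; '
    'gene_type "lncRNA"; gene_name "MIR1302-2HG"; level 2; '
    'tag "basic"; tag "Ensembl_canonical";'
)
fasta_header = (
    ">ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|"
    "OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|"
)

attrs = parse_attributes(gtf_line.split("\t")[8])
fasta_transcript_id, fasta_gene_id = fasta_header.lstrip(">").split("|")[:2]

same_transcript = attrs["transcript_id"][0] == fasta_transcript_id
same_gene = attrs["gene_id"][0] == fasta_gene_id
print("transcript IDs match:", same_transcript)
print("gene IDs match:", same_gene)
```

So the mismatch is one of packaging rather than content: the FASTA header carries extra fields after the two identifiers that the GTF stores as separate attributes.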

Reference files:
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.long_noncoding_RNAs.gtf.gz
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.lncRNA_transcripts.fa.gz

DEG lncRNA • 819 views

updated 2.8 years ago by mark.ziemann ★ 2.0k • written 2.8 years ago by mathavanbioinfo ▴ 80

1

I'm assuming you're doing RNA-seq. Typically we use something like kallisto to quantify the reads against the whole transcriptome and perform differential expression analysis with all genes. Afterwards, we might want to separate the results by gene_type, such as lncRNA or mRNA (protein_coding in GENCODE), into separate R objects for further investigation. To do this, you will need a two-column table that maps each gene identifier to its gene_type.
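A rough sketch of that two-column lookup, written in Python for illustration (the same idea carries over to R). It assumes a tab-delimited GENCODE-style GTF in which `gene` rows carry `gene_id` and `gene_type` attributes. The helper names, the second GTF row, and the results rows are all hypothetical placeholders, not real annotation or real output.

```python
def first_attributes(attr_field):
    """Parse the 9th GTF column into {key: first value seen}."""
    attrs = {}
    for item in attr_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs.setdefault(key, value.strip('"'))
    return attrs


def gene_type_table(gtf_lines):
    """Return {gene_id: gene_type} built from the 'gene' rows of a GTF."""
    table = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = first_attributes(fields[8])
        table[attrs["gene_id"]] = attrs.get("gene_type", "unknown")
    return table


def split_by_gene_type(results, table):
    """Group result rows (dicts with a 'gene_id' key) by gene_type."""
    groups = {}
    for row in results:
        gene_type = table.get(row["gene_id"], "unknown")
        groups.setdefault(gene_type, []).append(row)
    return groups


gtf_lines = [
    "##format: gtf",
    # Gene row from the question (attributes shortened).
    "chr1\tHAVANA\tgene\t29554\t31109\t.\t+\t.\t"
    'gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG";',
    # Hypothetical placeholder row, not a real gene.
    "chrHYP\tHAVANA\tgene\t1\t100\t.\t+\t.\t"
    'gene_id "ENSG_HYPOTHETICAL.1"; gene_type "protein_coding";',
]

# Placeholder differential expression rows; the numbers are made up.
results = [
    {"gene_id": "ENSG00000243485.5", "log2FoldChange": 1.0},
    {"gene_id": "ENSG_HYPOTHETICAL.1", "log2FoldChange": -1.0},
]

table = gene_type_table(gtf_lines)
groups = split_by_gene_type(results, table)
print(sorted(groups))
```

Building the table from the comprehensive annotation rather than the lncRNA-only GTF would be needed if the analysis includes all genes, since the lncRNA subset lists only lncRNA entries.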

2.8 years ago by mark.ziemann ★ 2.0k