Hello, I am working on a differential lncRNA expression analysis. I have downloaded the reference FASTA and GTF files from the GENCODE database. Generally, for DEG analysis, the gene identifier in the FASTA file should match the gene identifier in the 9th column of the GTF file. The GENCODE lncRNA GTF and the reference lncRNA FASTA do not match in their format. The file formats are listed below.
Reference lncRNA.fa
>ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTCT
GCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCCATCGCTGCCTTCCCCTCAAG
How do we have to change the identifiers to run featureCounts?
Reference GTF
description: evidence-based annotation of the human genome (GRCh38), version 41 (Ensembl 107) - long non-coding RNAs
provider: GENCODE
contact: gencode-help@ebi.ac.uk
format: gtf
date: 2022-05-12
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level 2; hgnc_id "HGNC:52482"; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30564 30667 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
Reference files:
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.long_noncoding_RNAs.gtf.gz
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.lncRNA_transcripts.fa.gz
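To make the relationship between the two files concrete, here is a minimal Python sketch that splits the pipe-delimited FASTA header shown above into fields. The field names are labels chosen here by matching each value against the GTF attributes in the example (transcript_id, gene_id, havana_gene and so on); the last field is assumed to be the transcript length. The function name parse_gencode_header is hypothetical.

```python
def parse_gencode_header(header):
    """Split a GENCODE transcript FASTA header into named fields.

    The field names are assumptions, inferred by comparing the header
    values with the attributes in the GTF example.
    """
    names = [
        "transcript_id",
        "gene_id",
        "havana_gene",
        "havana_transcript",
        "transcript_name",
        "gene_name",
        "length",  # assumed to be the transcript length
    ]
    # Drop the leading ">" if present; the trailing "|" yields an empty
    # final element, which zip() ignores because names is shorter.
    fields = header.lstrip(">").strip().split("|")
    return dict(zip(names, fields))


header = (
    ">ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|"
    "OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|"
)
record = parse_gencode_header(header)
print(record["transcript_id"], record["gene_id"])
```

The second field carries the same value as gene_id in the GTF, and the first the same value as transcript_id, so the identifiers themselves agree and only the header layout differs.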
I'm assuming you're doing RNA-seq. Typically we use something like kallisto to map the reads to the whole transcriptome and perform differential expression analysis with all genes. Afterwards, we might want to sort the results by gene_type, such as lncRNA or mRNA, into separate R objects for further investigation. To do this, you will need a two-column object that contains the gene identifier and the gene_type to perform that filtering.
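One possible way to build that two-column lookup is sketched below in Python, assuming a standard tab-separated GENCODE GTF with gene_id and gene_type in the 9th column. The function name gene_types_from_gtf and the small in-memory example are illustrative only, not part of any tool mentioned here.

```python
import re

# Matches quoted key/value pairs such as: gene_id "ENSG00000243485.5"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_types_from_gtf(lines):
    """Return a {gene_id: gene_type} dict from the 'gene' rows of a GTF."""
    table = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep only gene-level records
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "gene_id" in attrs:
            table[attrs["gene_id"]] = attrs.get("gene_type", "NA")
    return table


# Tiny in-memory example based on the gene record in the question.
example = [
    "##provider: GENCODE\n",
    "chr1\tHAVANA\tgene\t29554\t31109\t.\t+\t.\t"
    'gene_id "ENSG00000243485.5"; gene_type "lncRNA"; '
    'gene_name "MIR1302-2HG"; level 2;\n',
]
table = gene_types_from_gtf(example)
for gene_id, gene_type in table.items():
    print(gene_id, gene_type, sep="\t")
```

For a real annotation, the same function could be fed the lines of the downloaded file, for example via gzip.open(path, "rt"), and the printed table written out as a TSV to read into R.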