3.2 years ago
dimitrischat
Hi all,
I'm new to bioinformatics, so bear with me. I am trying to find long non-coding RNAs in RNA-seq data. When I checked the human GTF file, I found 2 different spellings for long non-coding RNA, "lnc_RNA" and "lncRNA", like so:
NC_000001.11 Gnomon transcript 29926 31295 . + . gene_id "MIR1302-2HG"; transcript_id "XR_001737835.1"; db_xref "GeneID:107985730"; gbkey "ncRNA"; gene "MIR1302-2HG"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 8 samples with support for all annotated introns"; product "MIR1302-2 host gene, transcript variant X2"; transcript_biotype "lnc_RNA";
NC_000001.11 BestRefSeq gene 34611 36081 . - . gene_id "FAM138A"; transcript_id ""; db_xref "GeneID:645520"; db_xref "HGNC:HGNC:32334"; description "family with sequence similarity 138 member A"; gbkey "Gene"; gene "FAM138A"; gene_biotype "lncRNA"; gene_synonym "F379"; gene_synonym "FAM138F";
"lnc_RNA" is on the "transcript" line, and "lncRNA" is on the "gene" line. My first question is: should I choose "lncRNA"?
And most importantly, how do I get only the "gene_id" names of the ones that have "lncRNA"?
Edit: for the 2nd question I did: grep 'lncRNA' GRCh38.p13_genomic.gtf > GRCh38.p13_genomic_lnc.gtf and proceeded as usual.
But is my choice of "lncRNA" correct?
In the example you posted above, one is a gene_biotype and the other is a transcript_biotype. Biotypes should be applicable to both genes and transcripts. I am not sure why there is an extra _ in your example for the transcript. Is that convention followed for all transcripts? If you are doing analysis at the gene level then you should only select the gene entries (the ones with gene_biotype "lncRNA").