Question

How to analyze unannotated lncRNA using RNA-seq data?


8.3 years ago

xiaoyonf ▴ 60

Hi all,

I have a ~2 kb sequence of an unannotated lncRNA taken from the published literature. Since it is unannotated, I cannot search for it by name in any genome browser or data portal (e.g. TCGA, UCSC) to check its expression in RNA-seq datasets.

How can I analyze such an unannotated lncRNA using RNA-seq data, e.g. its expression across different subtypes of BC in the TCGA dataset?

Thanks, Xiaoyong

RNA-Seq • 2.4k views

updated 8.2 years ago by tiago211287 ★ 1.5k • written 8.3 years ago by xiaoyonf ▴ 60

score 3 · Answer 1 · 2016-10-10

If this feature is not annotated, the counting and quantification programs will not 'see' it. I would first check for expression visually by looking at the coordinates of this unannotated lncRNA in IGV or any other genome viewer. (If you only have the sequence, align it to the reference genome first, for example with BLAT, to get the coordinates.) If no reads map to this position, there is nothing more you can do, because it is not expressed in your dataset.
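As a command-line complement to looking in IGV, the number of alignments overlapping the region can be counted with samtools. This is only a sketch: the BAM file name and the region are hypothetical placeholders, and the BAM is assumed to be coordinate-sorted and indexed.

```shell
# Count alignments overlapping a genomic region.
# $1 = coordinate-sorted, indexed BAM file
# $2 = region string such as chr1:100000-102000 (hypothetical coordinates)
count_reads_in_region() {
    samtools view -c "$1" "$2"
}

# Example usage (uncomment once the files exist):
# count_reads_in_region sample1.bam chr1:100000-102000
```

A count of zero across all samples suggests the locus is not expressed in that dataset.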

If it is expressed, you can add a 'fake' entry to the annotation file (GTF) using the coordinates of this lncRNA, so that HTSeq or Kallisto can see it.
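A minimal sketch of such a 'fake' entry, assuming a single-exon transcript. The chromosome, coordinates, strand, IDs and file names below are all made-up placeholders; a spliced lncRNA would need one exon line per exon.

```shell
# Hypothetical values: replace with the real location of your lncRNA.
CHROM="chr1"
START=100000          # GTF coordinates are 1-based and inclusive
END=102000
STRAND="+"
GENE_ID="my_lncRNA"   # made-up identifier
TX_ID="my_lncRNA.1"   # made-up identifier

# GTF has 9 tab-separated columns:
# seqname source feature start end score strand frame attributes
make_gtf_entry() {
    attrs="gene_id \"$GENE_ID\"; transcript_id \"$TX_ID\";"
    for feature in transcript exon; do
        printf '%s\t%s\t%s\t%s\t%s\t.\t%s\t.\t%s\n' \
            "$CHROM" "custom" "$feature" "$START" "$END" "$STRAND" "$attrs"
    done
}

# Write the new entry to its own file first
make_gtf_entry > my_lncRNA.gtf

# Then combine it with a copy of the original annotation, e.g.:
# cat annotation.gtf my_lncRNA.gtf > annotation.with_lncRNA.gtf
```

The exon line matters because htseq-count counts reads over `exon` features and groups them by `gene_id` by default.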

Afterwards, you can use a differential expression package (DESeq2, edgeR) to test whether it is over- or under-expressed.

For Kallisto, you can transform your modified GTF to a transcriptome fasta file using gffread from the cufflinks package like this:

gffread -w transcriptome.fa -g Reference.genome.fa annotation.gtf

Afterwards, you can build an index from this transcriptome.fa with kallisto index and do the quantification with kallisto quant.
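The two kallisto steps might look roughly like this for paired-end reads. The index, output directory and FASTQ file names are hypothetical, and the wrapper functions are only there to name the arguments.

```shell
# Build a kallisto index from the transcriptome FASTA produced by gffread.
# $1 = transcriptome FASTA, $2 = index file to create
build_index() {
    kallisto index -i "$2" "$1"
}

# Quantify one paired-end sample against that index.
# $1 = index file, $2 = output directory, $3 and $4 = paired FASTQ files
quantify_paired() {
    kallisto quant -i "$1" -o "$2" "$3" "$4"
}

# Example usage (uncomment once the files exist):
# build_index transcriptome.fa transcriptome.idx
# quantify_paired transcriptome.idx sample1_out sample1_R1.fastq.gz sample1_R2.fastq.gz
```

Each output directory should then contain an abundance.tsv with one row per transcript, including the newly added lncRNA.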

PS: Kallisto gives you both normalized values (TPM) and estimated raw counts. If you are going to use DESeq2, keep in mind that you must give it only raw counts as input, never normalized data.
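To illustrate which column is which, here is a small sketch that pulls the estimated count for one transcript out of abundance.tsv and rounds it to an integer. The transcript ID is a made-up placeholder, and it is assumed to match the `target_id` column exactly. For a full DESeq2 analysis, importing with the tximport R package is the more usual route than rounding by hand.

```shell
# kallisto's abundance.tsv columns are:
# target_id  length  eff_length  est_counts  tpm
# Use est_counts (column 4) for DESeq2, not tpm (column 5).

# $1 = path to abundance.tsv, $2 = transcript ID to look up
# Prints est_counts rounded to the nearest integer.
get_est_count() {
    awk -F'\t' -v id="$2" 'NR > 1 && $1 == id { printf "%d\n", $4 + 0.5 }' "$1"
}

# Example usage (hypothetical paths and ID):
# get_est_count sample1_out/abundance.tsv my_lncRNA.1
```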