Question

Full GTF file vs Subset GTF file

0

4.3 years ago

Arindam Ghosh ▴ 530

I have aligned raw RNA-seq reads to the Ensembl reference genome. I want to quantify expression with featureCounts for only one class of genes, say lincRNAs. Which is the better approach: use the full GTF file containing all RNA types, or create a GTF containing only lincRNAs and use that as input for featureCounts?

I tried both approaches. For protein coding and lncRNA, the results were similar, but there was a huge difference in the case of miRNA.

featurecounts RNA-Seq miRNA-seq Ensembl • 3.1k views

updated 4.3 years ago by Shalu Jhanwar ▴ 540 • written 4.3 years ago by Arindam Ghosh ▴ 530

1

the results were similar but a huge difference in case of miRNA.

What do you mean by that? miRNAs, being small, are likely to multi-map. If you have miRNA data, you should be using a pipeline designed specifically for miRNA. Standard mRNA protocols will generally not capture miRNAs.

4.3 years ago by GenoMax 147k

0

Actually, I tried this with miRNA-seq data: I aligned the reads to the reference genome and then ran featureCounts twice, once with the full GTF (containing protein coding, lncRNA, etc.) and once with a miRNA-only GTF.

The miRNA GTF was created using:

grep -E '#|gene_biotype "miRNA"' Homo_sapiens.GRCh38.84.gtf > Homo_sapiens.GRCh38.84.miRNA.gtf

I also suspect the difference might be due to multi-mapping.
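One way to check where the reads went is to compare the .summary files that featureCounts writes next to each count table. The sketch below prints the status categories of two runs side by side. The function name compare_summaries is made up for illustration, and it assumes each run used a single BAM file, so each summary has exactly two tab-separated columns.

```shell
# Hypothetical helper: print each featureCounts status category side by side
# for two runs (e.g. full GTF vs miRNA-only GTF).
# Assumption: one BAM per run, i.e. a two-column, tab-separated .summary file.
compare_summaries() {
    awk -F '\t' '
        FNR == 1 { next }                                    # skip the "Status <bam>" header row
        NR == FNR { first[$1] = $2; order[++n] = $1; next }  # first file: remember counts and row order
        { second[$1] = $2 }                                  # second file: counts keyed by status
        END {
            printf "%s\t%s\t%s\n", "Status", "run1", "run2"
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%d\n", order[i], first[order[i]], second[order[i]]
        }
    ' "$1" "$2"
}
```

Usage would look like: compare_summaries counts_full.txt.summary counts_miRNA.txt.summary (both file names are placeholders). By default featureCounts leaves multi-mapping reads uncounted in both runs (Unassigned_MultiMapping), so if multi-mapping were the only cause, the two columns should look much alike. A large drop in Unassigned_Ambiguity in the subset run would instead suggest that reads overlapping a miRNA and another annotated gene are being discarded with the full GTF.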

Most papers I came across use the miRBase reference and annotation for miRNA-seq analysis. However, I wanted to use the Ensembl GTF file because it contains the miRBase annotations for miRNA.

4.3 years ago by Arindam Ghosh ▴ 530

0

aligned them to the reference genome

Which program did you use? miRNAs need ungapped alignments.

4.3 years ago by GenoMax 147k

0

Bowtie2 with vsl (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4931105/)

4.3 years ago by Arindam Ghosh ▴ 530

score 1 · Answer 1 · 2020-07-21

1

4.3 years ago

Shalu Jhanwar ▴ 540

I think the difference between the full and the subset GTF results depends on how the subset was extracted from the full GTF, not on the featureCounts step. Did you filter on biotype to get a GTF of all miRNAs? More information on biotypes is at http://www.ensembl.org/info/genome/genebuild/biotypes.html.

4.3 years ago by Shalu Jhanwar ▴ 540

0

I created the subset of the Ensembl GTF file based on gene biotype:

grep -E '#|gene_biotype "miRNA"' Homo_sapiens.GRCh38.84.gtf > Homo_sapiens.GRCh38.84.miRNA.gtf

4.3 years ago by Arindam Ghosh ▴ 530

0

I'd recommend extracting the miRNA records by filtering on the specific column (e.g. using awk) instead of using grep on entire lines. For example, for a Gencode GTF, extract the miRNA GTF as:

zcat gencode.v19.long_noncoding_RNAs.gtf.gz | awk '{if ($20 ~ "miRNA") print $0}' | sort | uniq > miRNA.gtf
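For an Ensembl GTF, where the attribute is called gene_biotype and its position within the attribute string can vary, a column-aware filter might look like the sketch below. The function name subset_gtf_by_biotype is hypothetical. It assumes an uncompressed, tab-separated GTF with the attributes in column 9, and unlike a grep for '#' it treats a line as a header only when it starts with '#'.

```shell
# Hypothetical helper: keep header lines plus records whose attribute column
# (field 9, tab-separated) carries exactly the requested gene_biotype.
# Assumption: Ensembl-style attributes such as gene_biotype "miRNA";
subset_gtf_by_biotype() {
    biotype="$1"
    gtf="$2"
    awk -F '\t' -v want="gene_biotype \"$biotype\"" '
        /^#/ { print; next }            # header lines only, not any line that merely contains "#"
        index($9, want) > 0 { print }   # literal match inside column 9, closing quote included
    ' "$gtf"
}
```

Usage would look like: subset_gtf_by_biotype miRNA Homo_sapiens.GRCh38.84.gtf > Homo_sapiens.GRCh38.84.miRNA.gtf. Because the closing quote is part of the match, a biotype that merely starts with the same letters is not picked up.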

4.3 years ago by Shalu Jhanwar ▴ 540

0

In any case, is this a logical approach? Should this miRNA.gtf be used for read quantification?
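For comparison, the other option from the original question (count against the full GTF, then keep only the biotype of interest) can be sketched as a filter on the finished count table. The function name filter_counts_by_biotype is hypothetical. The sketch assumes the default featureCounts table layout (a "#" comment line, a header row starting with Geneid, gene IDs in column 1) and an Ensembl GTF that contains "gene" rows with gene_id and gene_biotype attributes.

```shell
# Hypothetical helper: keep only rows of a featureCounts table whose Geneid
# has the requested gene_biotype in the GTF that was used for counting.
filter_counts_by_biotype() {
    biotype="$1"; gtf="$2"; counts="$3"
    awk -F '\t' -v want="gene_biotype \"$biotype\"" '
        NR == FNR {                                    # first pass: read the GTF
            if ($3 == "gene" && index($9, want) > 0 &&
                match($9, /gene_id "[^"]+"/))
                keep[substr($9, RSTART + 9, RLENGTH - 10)] = 1   # strip gene_id " and the closing quote
            next
        }
        /^#/ || $1 == "Geneid" { print; next }         # second pass: pass the table header through
        $1 in keep { print }                           # keep rows for genes of the wanted biotype
    ' "$gtf" "$counts"
}
```

Usage would look like: filter_counts_by_biotype miRNA Homo_sapiens.GRCh38.84.gtf counts_full.txt > counts_full.miRNA.txt (the count file names are placeholders). The trade-off is that, with the full GTF and default settings, reads overlapping more than one gene are left unassigned, so these counts can be lower than those from a miRNA-only GTF.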

4.3 years ago by Arindam Ghosh ▴ 530