How to separate an htseq-count table into coding and non-coding RNAs
9.2 years ago
Naresh D J ▴ 110

Hi,

I have generated raw read counts for genes from RNA-seq data using htseq-count. Now I want to separate this table into coding RNAs and non-coding RNAs.

I am new to NGS data analysis.

Can anyone help me or suggest how to do this?

Thank you,
Naresh

RNA-Seq htseq-count • 4.9k views

What do you mean by coding and non-coding RNAs? Do you mean separating counts for coding and non-coding transcripts? Or do you mean separating counts for coding (exonic) and non-coding (intronic, UTR) regions of a given transcript?


@Ashutosh Pandey, yes, I want to separate the counts for coding and non-coding transcripts.

For separating coding and non-coding regions there is a tool, RSeQC.


Well, RSeQC will give you the numbers or fractions of reads aligned to different genic features, but it won't separate them. Anyway, what you need is an annotation of the transcripts (genes) by biotype. If these are Ensembl genes or gene IDs, you can use BioMart (http://www.ensembl.org/biomart) to download the "Biotype" for each gene and then annotate the Ensembl genes in the count file as protein-coding, rRNA, tRNA, snoRNA, miRNA, etc.
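The annotation step described above might look roughly like the following in base R. This is only a sketch under assumptions: the gene IDs are toy placeholders, the counts are made up for illustration, and the BioMart export is assumed to be a tab-separated table with columns named `gene_id` and `gene_biotype` (the real export headers will differ depending on what you select).

```r
# Toy htseq-count output: two tab-separated columns (gene ID, count), no header.
# The rows starting with "__" are the summary counters htseq-count appends.
counts_txt <- "GENE_A\t120\nGENE_B\t35\nGENE_C\t0\n__no_feature\t500\n__ambiguous\t12\n"
counts <- read.delim(text = counts_txt, header = FALSE,
                     col.names = c("gene_id", "count"),
                     stringsAsFactors = FALSE)

# Drop the summary counters so only real genes remain.
counts <- counts[!startsWith(counts$gene_id, "__"), ]

# Toy stand-in for a BioMart export (column names are an assumption).
biotypes_txt <- "gene_id\tgene_biotype\nGENE_A\tprotein_coding\nGENE_B\tlncRNA\nGENE_C\tmiRNA\n"
biotypes <- read.delim(text = biotypes_txt, stringsAsFactors = FALSE)

# Look up the biotype of each counted gene; genes missing from the table get NA.
counts$gene_biotype <- biotypes$gene_biotype[match(counts$gene_id, biotypes$gene_id)]

print(counts)
```

With real data you would replace the `text =` arguments with the paths to your count file and your downloaded biotype table.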


Thank you. I will try your suggestion and let you know.

9.2 years ago
tiago211287 ★ 1.5k

What is your model organism?

If you are using a genome from Ensembl and supplied its GTF annotation file to htseq-count, you can import all the count tables (txt files) into a data.frame in R.

With Bioconductor, install biomaRt:

BiocManager::install("biomaRt")  # on older Bioconductor releases: biocLite("biomaRt")

With this package you can retrieve a dataset from Ensembl based on several filters, for example the biotype (whether a gene is coding or non-coding).

Then you can simply merge the two tables on the Ensembl IDs and separate them according to your criteria. If you do not want to use R, Ensembl has a graphical web interface at http://www.ensembl.org/biomart, although I recommend R because it will be easier later to create better graphics and statistics.
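As a rough illustration of the merge-and-separate step, here is a minimal base R sketch. The gene IDs, sample names, and counts are toy values, and the `biotypes` table only stands in for what biomaRt would return (for instance from `getBM()` with the `ensembl_gene_id` and `gene_biotype` attributes). Treating everything that is not `protein_coding` as non-coding is a simplifying assumption; you may want to handle pseudogenes and other biotypes separately.

```r
# Toy count table: one row per gene, one column per sample (hypothetical values).
counts <- data.frame(gene_id = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                     sample1 = c(120L, 35L, 0L, 8L),
                     sample2 = c(98L, 41L, 3L, 0L),
                     stringsAsFactors = FALSE)

# Toy biotype table standing in for the biomaRt result.
# GENE_D is deliberately missing to show what happens to unannotated genes.
biotypes <- data.frame(gene_id = c("GENE_A", "GENE_B", "GENE_C"),
                       gene_biotype = c("protein_coding", "lncRNA", "miRNA"),
                       stringsAsFactors = FALSE)

# Left join: keep every counted gene, attach its biotype where one is known.
annotated <- merge(counts, biotypes, by = "gene_id", all.x = TRUE)

# Split by biotype (assumption: "protein_coding" is the only coding class).
has_biotype <- !is.na(annotated$gene_biotype)
is_coding   <- has_biotype & annotated$gene_biotype == "protein_coding"

coding      <- annotated[is_coding, ]
noncoding   <- annotated[has_biotype & !is_coding, ]
unannotated <- annotated[!has_biotype, ]
```

Each of the three data frames can then be written out with `write.table()` or passed on to downstream analysis.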

P.S.: biomaRt can also handle the UniProt and HapMap databases.


Thank you. I will try your suggestion and let you know.

9.2 years ago

To do so, you need a file that lists which sequences (ranges of bases) are coding and which are non-coding. Mapping reads to the reference genome or transcriptome does not by itself carry this information.
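One such file is the GTF annotation itself. Ensembl GTF files carry a `gene_biotype` attribute on each record (GENCODE files call it `gene_type`), so a gene-to-biotype table can be pulled from the same file that was given to htseq-count. The sketch below uses base R and a few toy GTF lines; the gene IDs and coordinates are made up, and `get_attr` is a hypothetical helper written for this example.

```r
# Three toy GTF records (9 tab-separated columns; the last holds the attributes).
gtf_txt <- paste(
  '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_name "A"; gene_biotype "protein_coding";',
  '1\ttoy\tgene\t2000\t2600\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "lncRNA";',
  '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "A-1"; gene_biotype "protein_coding";',
  sep = "\n")

# quote = "" keeps the double quotes inside the attribute column intact;
# comment.char = "#" skips the header lines found in real GTF files.
gtf <- read.delim(text = gtf_txt, header = FALSE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE)

# Keep one record per gene.
genes <- gtf[gtf$V3 == "gene", ]

# Hypothetical helper: extract the quoted value of one attribute key, or NA.
get_attr <- function(attrs, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]+)".*')
  ifelse(grepl(pattern, attrs, perl = TRUE),
         sub(pattern, "\\1", attrs, perl = TRUE),
         NA_character_)
}

gene_biotype <- data.frame(gene_id = get_attr(genes$V9, "gene_id"),
                           gene_biotype = get_attr(genes$V9, "gene_biotype"),
                           stringsAsFactors = FALSE)
```

The resulting table has one row per gene and can be joined to the count table by `gene_id`.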


@Antonio R, Franco, could you kindly elaborate on your thoughts?


What other information would you want?
