Gene level RNA seq expression measure
1
2
Entering edit mode
8.6 years ago
Bnf83 ▴ 150

Hi guys, I'm a newbie in the RNA Seq world. I'm analyzing some RNA seq data (HiSeq) and doing this, I'm following the best practice guideline available online form different sources. Briefly I performed the alignment to the reference genome, the merge, I removed duplicates, then I performed the sorting, indexing and finally using bedtools multicov I ended up having the reads count per RefSeq. My point is that I would like to have a gene level expression measure but after the counts I have reads counts per transcript variants. in other words I have multiple transcripts per gene with corresponding reads count but I would like to have one reads counts per gene. How do you deal with this issue?

Thanks in advance

B.

RNA-Seq gene expression • 2.5k views
ADD COMMENT
0
Entering edit mode

What do you want to use the counts for? Normally you'd use featureCounts for that (and not remove duplicates, which is generally a bad idea in RNAseq).

ADD REPLY
0
Entering edit mode

I only would like to know if the genes are differentially expressed. I don't want to know if a specific splicing variant is differentially expressed. Only the gene.

ADD REPLY
0
Entering edit mode

As Devon wrote, use featureCounts (or htseq-count or ...), which you can use to count per gene.

ADD REPLY
4
Entering edit mode
8.6 years ago

Firstly I would agree that removing duplicates is generally only something that should done if there are reasons, and not as a default.

It sounds like you used a reference set of refseq transcripts when counting with multicov. As far as I can see multicov performs a naive count of the reads in each interval given, which means the counts you observe on transcript level are not really meaningful as transcripts have many overlaps.

To get what you want you simply need to count features on a gene level. Both featureCounts as mentioned before and for example htseq-count can be used for this, as they can count reads on a gene metalevel which will count reads from all possible exons in each gene and summarize the results. I am unsure about the annotation quality of other species, but in human I have usually used the lastest version of gencode annotation in gtf or gff format (which includes gene and exon information) together with htseq-count/featureCounts. Note that your annotation version and reference genome used in alignment must match. You just have to make sure the meta feature you are counting is gene, and the feature is exons.

ADD COMMENT
0
Entering edit mode

Hi Victor, I also use the gtf file from GENECODE, but I was thinking of counting only the reads mapping to exons (features) which have the biotype "protein coding". Would this be a mistake? Should I consider all exons? I also use featureCounts.

ADD REPLY
0
Entering edit mode

Having looked recently at some human RiboSeq data, I'm not sure how accurate the biotypes tend to be. I've seen a good number of non-protein coding getting translated.

ADD REPLY
0
Entering edit mode

Oh, I see. Thanks Devon! However, I have mouse data not human.

ADD REPLY
0
Entering edit mode

The organism won't matter.

ADD REPLY

Login before adding your answer.

Traffic: 1546 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6