Question

Few genes with many reads aligned

8.6 years ago

RNAseq • 0

I have RNA-seq data from cell lines measured at 5 different timepoints, with a biological replicate for each timepoint. The libraries were sequenced strand-specifically. I aligned them with HISAT against Ensembl release 75 using --rna-strandness R. I then obtained raw counts with HTSeq using -m union and -s yes, and normalized the counts to TPM.
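For readers unfamiliar with the TPM step mentioned above, here is a minimal sketch of the usual calculation (counts divided by gene length, then scaled to sum to one million). The gene names, counts, and lengths below are made up for illustration and are not taken from this dataset; the function name `tpm` is likewise just a placeholder.

```python
def tpm(counts, lengths_bp):
    """Convert raw counts to TPM given gene lengths in base pairs."""
    # Length-normalized rate: reads per base for each gene.
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    # Rescale so that the values across all genes sum to one million.
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Toy input: names, counts, and lengths are all hypothetical.
counts = {"short_ncRNA": 1000, "long_mRNA": 1000}
lengths_bp = {"short_ncRNA": 300, "long_mRNA": 3000}

result = tpm(counts, lengths_bp)
for gene, value in result.items():
    print(f"{gene}\t{value:.2f}")
```

With equal raw counts, the 300 bp gene ends up with ten times the TPM of the 3000 bp gene, so a short and abundant transcript can easily dominate a TPM plot.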

[Image not available]

As you can see, there are 3 big outliers. These three are found in all of the samples, and all are misc non-coding RNAs:

ID               counts  TPM        gene
ENSG00000258486  153195  227942.04  RN7SL1
ENSG00000265150  100826  151541.60  RN7SL2
ENSG00000202198  112308  151407.64  RN7SK

Am I doing something wrong with my alignment or counting?

RNA-Seq alignment • 2.4k views

updated 8.6 years ago by Devon Ryan 105k • written 8.6 years ago by RNAseq • 0

Hi, I don't know what biological condition you exposed your cell lines to, but the ncRNAs you mention share sequence homology with a highly repetitive element of the human genome: the Alu element. I don't remember exactly, but I believe all of these 7SL and 7SK RNA family members share that homology. The other piece of information is that there are biological scenarios in which cells increase RNA pol III transcription, such as UV treatment, viral infection, and some cytotoxic treatments. RNA pol III transcribes Alu elements. So it is possible that what you are seeing is biological.

8.6 years ago by Amitm ★ 2.3k

On second thought, you should also check the transcriptome coverage, i.e. how much of the exons are covered. You can do this with either bedtools coverage or RNA-SeQC. Even if those ncRNAs are highly expressed for some reason, the rest of the transcriptome should still have sufficient coverage (in line with the total number of reads you started with).

Also, log-transforming the y-axis of the graph above would make it more informative. RNA-seq data generally show a huge peak of transcripts with near-zero expression followed by a long tail; a density plot of the log-transformed data should show that too. In any case, diagnostic plots from RNA-SeQC and positive-control genes (those you know should be considerably expressed) would be more informative in helping you judge whether the data are right.
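A rough sketch of the log-transformed view suggested above, using only the Python standard library and a text histogram instead of a real density plot. The pseudocount of 1, the bin width, the function names, and all but the last TPM value are assumptions chosen for illustration; only 227942.04 comes from the question.

```python
import math
from collections import Counter


def log2_histogram(values, bin_width=1.0):
    """Bin log2(value + 1) and return {bin_start: count}, sorted by bin."""
    bins = Counter()
    for value in values:
        x = math.log2(value + 1)  # pseudocount of 1 keeps zeros finite
        bins[math.floor(x / bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


def print_histogram(hist):
    """Print one row of '#' marks per occupied bin."""
    for start, n in hist.items():
        print(f"{start:5.1f} | {'#' * n}")


# Hypothetical TPM values: a pile near zero, a tail, and one extreme outlier.
tpm_values = [0, 0, 0, 0.5, 1, 3, 7, 20, 150, 227942.04]
hist = log2_histogram(tpm_values)
print_histogram(hist)
```

On the log scale the outlier is still visible as an isolated bin far to the right, but it no longer flattens the rest of the distribution.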

ADD REPLY • link 8.6 years ago by Amitm ★ 2.3k

score 0 · Answer 1 · 2016-05-25

8.6 years ago

Devon Ryan 105k

Your samples weren't prepared by polyA enrichment, but rather by ribodepletion. In such cases you'll get a bunch of rRNA-related genes (and tRNAs) with absurdly high expression.
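One way to gauge how much these genes matter is to compute the share of all counted reads they take up, and to look at the counts with them set aside. The sketch below is only an illustration: the three outlier counts are from the question, but the `other_genes` total is invented, and the helper names are placeholders. Whether to exclude such genes before normalization is a judgment call that this sketch does not settle.

```python
def share_of_counts(counts, gene_ids):
    """Fraction of all counted reads assigned to the given genes."""
    total = sum(counts.values())
    return sum(counts[gene] for gene in gene_ids) / total


def drop_genes(counts, gene_ids):
    """Return a copy of the count table without the given genes."""
    excluded = set(gene_ids)
    return {gene: n for gene, n in counts.items() if gene not in excluded}


outliers = ["ENSG00000258486", "ENSG00000265150", "ENSG00000202198"]
counts = {
    "ENSG00000258486": 153195,  # RN7SL1, from the question
    "ENSG00000265150": 100826,  # RN7SL2, from the question
    "ENSG00000202198": 112308,  # RN7SK, from the question
    "other_genes": 5_000_000,   # hypothetical total for everything else
}

share = share_of_counts(counts, outliers)
filtered = drop_genes(counts, outliers)
print(f"outlier share of counted reads: {share:.1%}")
```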
