Hi all I have some doubts on rna-seq analysis. I have rna-seq data (paired end using Truseq) from patients - 4 tumor and one normal (not matched normal). I am trying to find differentially expressed genes. So far so good - aligned with tophat2, estimated expression with cufflinks, got counts with ht-seq count, did differential expression using DESeq2. But here are my concerns -
While counting with htseq-count I am getting very low counts assigned to features. On an average only 30% of the fragments are assigned to features (refseq gtf). Maximum of the remaining 70% fragments are ambiguous or no feature. Is it common to find such low percentage of counts? I read one could get around 60%. I tried with all modes in htseq (reverse union, stranded union, unstranded union and with intersection nonempty) and the best result was with unstranded union mode.
For some of the features ht-seq count assigned zero counts. For example between region chr20:3776385-3786768 htseq gives -
3919 SAM alignment pairs processed. NM_001287516 0 NM_001287517 0 NM_001287518 0 NM_001287519 0 NM_001287520 0 NM_001287522 0 NM_001287524 5 NM_004358 0 NM_021872 0 NM_021873 0 __no_feature 82 __ambiguous 3683 __too_low_aQual 0 __not_aligned 0 __alignment_not_unique 149
Of ~3900 pairs 3683 are ambiguous and only 5 are assigned to one of the feature. I get this, since there are so many overlapping genes within 10kb, there is no way to tell which gene a fragment originated from, hence so many ambiguous. But when I used cufflinks for the same (some of) genes I get high FPKM values -
NM_001287516 1.69492e-71 NM_001287517 9.62116e-75 NM_001287518 1.05537e-103 NM_001287519 0.000182958 NM_001287520 31.2244 NM_001287522 6.82458 NM_001287524 1.72693 NM_004358 1.47296 NM_021872 0.017841 NM_021873 17.0096
Does this mean cufflinks does not account for ambiguous reads while calculating FPKM?
Some of the threads I saw, say that for trueseq libraries we should use
fr-firststrand
as library type during tophat2 alignment. But here it is metioned to usefr-unstranded
(I guess it should also be considered while using htseq). Which is the right way?Since I do not have replicates/samples for normal tissue, using DESeq2 is advisable or should I be better off with fold changes from FPKM?
Thanks
Hi Michael,
Thank you for the reply.
I am using htseq-count in default mode (gene level). I checked annotation for above transcripts (NM_001287516 to 24) they all map to gene CDC25B but they are all duplicates. ucsc refflat gtf has same
gene_id
(CD25B) for all above transcripts but ucsc_refseq gtf has different gene_id (as above) for all transcripts. This is sort of confusing and results different outputs. What is the best gtf to use in this case? Ensembl also has differenttx_id
andgene_id
similar to refseq.EDIT: Sorry, Ensembl gtf works fine!
EDIT2: Using Ensembl gtf increased percentage counts to 65%. I guess it using proper gtf matters a lot !
This makes sense.
infer_experiment.py
from the suggested tool, I guess my library is unstranded ( I got 0.49 0.49 for paired end). Thanks again.Stick with Ensembl, the UCSC annotations are a mess.
Yup! Found it hard way.
Could you also comment on using DESeq in this case ? I have only one sample for normal and 4 for tumor. I tried both DESeq as well cuffdiff.. Both produces different results. Here it's mentioned cuffdiff seems to be a better option if one has no replicates.
Depends a bit on the exact comparison you want to do, though there's generally no good way to handle unreplicated data.