Question

Human Transcriptome Mapping

2


14.1 years ago

Stevelor ▴ 310

Hi Guys,

I just wanna ask about your experiences with mapping paired-end reads to the human genome and quantifying them. How many reads do you have? Which mapping parameters do you use (allowed mismatches, intron length, etc.)? Do you use TopHat or other tools? And how many unique and repeat hits do you normally get? Which human assembly do you take as reference and why? Is there some kind of standard?

What is your typical workflow for analysing these reads and finding differential gene expression? I'll get 40M 2x100bp Illumina reads next week and have no experience with paired-end RNA-Seq of human transcripts.

Waiting for some interesting comments :D

Cheers!

mapping human gene paired • 4.6k views

ADD COMMENT • link updated 14.1 years ago by Karl ▴ 350 • written 14.1 years ago by Stevelor ▴ 310

score 4 · Answer 1 · 2011-07-26

For alignment, TopHat is the only tool I know of that is intended for spliced reads (transcriptome). You'll get poor results with a plain genomic aligner, since reads spanning exon-exon junctions do not align contiguously to the genome. Nothing is perfect, though: even with TopHat, I have found several islands of reads that mapped equally well in several places, and TopHat was forced to choose one probabilistically, leaving me with some reads that belong to a gene just floating in the middle of nowhere.

For differential expression, the team that made TopHat has a follow-on tool called Cufflinks that constructs alternative isoforms (multiple transcript variants) and assigns probabilistic expression levels to them quite well. Here you have a choice between supplying it an exon annotation and letting it assemble blindly. I let it construct novel transcripts because that's what we really wanted to see, but found this to be a noisy process (many novel forms that look like mistakes). Then, when every gene has three to six transcripts, comparing expression levels across samples is tricky indeed. Cufflinks will do it all for you, but I'm suspicious of just handing the p-values over to my bosses, given that they come from such tenuous associations.

Another tool for differential expression is DESeq, an R/Bioconductor package that requires a set of read counts per region and just does the math on that. You supply it a table of counts by region (I used the RefSeq gene list from UCSC); coverageBed from BEDTools is a tool that counts how many reads in a huge BAM file fall onto those locations. DESeq doesn't try to construct new transcripts and doesn't handle overlapping genes so well (that's up to the human preparing the read count table), but the math is sound and sensible. Simpler tools feel more trustworthy.
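To make the "table of regions by counts" idea concrete, here is a minimal Python sketch of the counting step. It is not coverageBed's implementation: the gene names, coordinates and reads are made up, and the rule of setting aside reads that hit more than one gene is just one possible choice for the overlapping-gene problem mentioned above (as far as I understand, coverageBed itself counts each feature independently).

```python
from collections import Counter

# Hypothetical annotation: (chrom, start, end, gene), 0-based half-open like BED.
GENES = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 150, 300, "geneB"),  # overlaps geneA between 150 and 200
    ("chr2", 0, 50, "geneC"),
]

# Hypothetical aligned reads: (chrom, start, end).
READS = [
    ("chr1", 110, 140),  # geneA only
    ("chr1", 160, 190),  # geneA and geneB -> ambiguous
    ("chr1", 250, 280),  # geneB only
    ("chr2", 10, 40),    # geneC only
    ("chr3", 5, 35),     # outside every annotated gene, ignored
]


def count_reads(reads, genes):
    """Count reads per gene; reads overlapping several genes are set aside."""
    counts = Counter({name: 0 for (_, _, _, name) in genes})
    ambiguous = 0
    for chrom, start, end in reads:
        # Two half-open intervals overlap when each starts before the other ends.
        hits = [name for (c, s, e, name) in genes
                if c == chrom and start < e and end > s]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1
    return dict(counts), ambiguous


counts, ambiguous = count_reads(READS, GENES)
print(counts, "ambiguous:", ambiguous)
```

One such column of counts per sample, with one row per gene, is the kind of table DESeq works from. The linear scan over genes is only for readability; a real BAM file needs an indexed or sorted approach.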

With any fixed number of reads, you'll see some transcripts quite clearly and some only roughly. More reads are better, but you're never going to have enough to see the lowest-expressed transcripts. Cufflinks sort of hides this from you, but DESeq is clear about the uncertainty in its estimates.
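A rough way to see why depth never quite suffices is to treat read counts as Poisson samples, so a gene expected to get n reads has a relative standard error of about 1/sqrt(n). This is a simplifying assumption for illustration only (DESeq models counts with a negative binomial, which adds extra dispersion on top), and the expression fractions below are made up.

```python
import math


def relative_error(expected_reads):
    """Relative standard error of a Poisson count with the given mean."""
    return 1.0 / math.sqrt(expected_reads)


# Hypothetical fractions of all reads that come from a given transcript.
FRACTIONS = {"highly expressed": 1e-4, "lowly expressed": 1e-7}

for depth_millions in (10, 40, 160):
    for label, fraction in FRACTIONS.items():
        expected = depth_millions * 1e6 * fraction
        print(f"{depth_millions}M reads, {label}: "
              f"~{expected:g} reads, relative error ~{relative_error(expected):.0%}")
```

Under this assumption, quadrupling the depth only halves the relative error, so the lowly expressed transcripts stay noisy long after the abundant ones look clean.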

I have found that in regions of poor coverage depth, Cufflinks can construct incorrect transcripts (a best guess). It shouldn't be expected to be accurate at low-coverage genes, where data is lacking (and that is most of them!). Since DESeq starts with an annotated set of genes and simple read counts, the results are more robust in low-coverage areas, but I repeat: it will not construct novel isoforms or do anything with reads falling outside of known annotated regions!

Ram · Answer 2 · 2011-07-25

2


14.1 years ago

Marina Manrique ★ 1.3k

Hi Steve,

I found this SEQanswers thread pretty interesting. It's a bit basic, but I think it explains RNA-seq pipelines with TopHat/Cufflinks and associated tools very clearly.

HTH

Marina

Edit: I forgot to add this link to this Biostar question about the pros and cons of using hg18 or hg19 as the reference genome.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.1 years ago by Marina Manrique ★ 1.3k

score 0 · Answer 3 · 2011-07-25

0


14.1 years ago

Stevelor ▴ 310

Thanks Marina. Do you also know if there is a comparison of different quantification tools? How do the Cufflinks values differ from those of edgeR, ERANGE and all those other tools? We also developed a quantification method; the values are very similar, but I want to know the differences without spending time on using all these tools :D

ADD COMMENT • link 14.1 years ago by Stevelor ▴ 310

1


Don't add another answer - this is not a forum. The answers may change order based on their votes. Edit your original question to contain more information or use the comments under each answer to ask for more details.

ADD REPLY • link 14.1 years ago by Istvan Albert 103k