Question

Reference genome for mapping RNA-seq spike-in dataset

11.2 years ago

sarahmanderni ▴ 130

Using spike-in controls is a common way to evaluate statistical methods for finding differentially expressed genes. Given FASTQ files that contain ERCC control reads and the corresponding GTF file for the ERCCs, how does one do the alignment step with TopHat? For instance, if the samples are from human, we have the FASTQ files, the hg19 reference and the ERCC.gtf file. How can one use TopHat to align the FASTQ files to the reference genome when they include ERCC reads? Should we combine the hg19 reference genome with the ERCC.gtf file? The following article is an example of this situation:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166838/

How should we add the ERCC control information to the reference genome used in TopHat?

Thanks for the help.

RNA-Seq alignment • 9.4k views

updated 3.9 years ago by Ram 45k • written 11.2 years ago by sarahmanderni ▴ 130

Ram · Accepted Answer · 2014-06-25

You're on the right track. The ERCC sequences should be available as FASTA, which you can append to your reference genome as additional chromosomes. TopHat/Bowtie will then place the reads that belong to them on those chromosomes. If you're using a GTF, go ahead and append the ERCC entries there too, keeping in mind that they are unspliced, single-exon transcripts.

Be aware that some reads from the human genome will also map to some of the ERCC chromosomes. It's not many, but it's not zero.
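
To make the appending step concrete, here is a minimal Python sketch, not a definitive pipeline. The helper names (`concatenate`, `fasta_names`, `gtf_seqnames`) and any file names used with them are hypothetical. The sketch assumes plain-text (uncompressed) FASTA and GTF files, and that the first column of the ERCC GTF uses the same sequence names as the ERCC FASTA headers, which is worth checking before aligning. The aligner index would then need to be rebuilt from the combined FASTA.

```python
def concatenate(parts, combined):
    """Write the files in `parts` one after another into `combined`.

    A newline is added after any part that lacks a trailing one, so a
    FASTA header or GTF record never ends up glued to the previous line.
    """
    with open(combined, "wb") as out:
        for part in parts:
            last = b"\n"
            with open(part, "rb") as src:
                # Copy in 1 MiB chunks so a whole genome is never held in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")


def fasta_names(path):
    """Return the sequence names (first word of each '>' header) in a FASTA file."""
    names = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def gtf_seqnames(path):
    """Return the set of sequence names used in the first column of a GTF file."""
    names = set()
    with open(path) as fh:
        for line in fh:
            if line.strip() and not line.startswith("#"):
                names.add(line.split("\t", 1)[0])
    return names
```

With hypothetical file names, one would call `concatenate(["hg19.fa", "ERCC.fa"], "hg19_ERCC.fa")` and `concatenate(["genes.gtf", "ERCC.gtf"], "genes_ERCC.gtf")`, then confirm that `gtf_seqnames("genes_ERCC.gtf")` is a subset of `set(fasta_names("hg19_ERCC.fa"))` before building the index and running TopHat with the combined annotation.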