Hi All:
I have a general question about testing differential exon usage using DEXSeq. Suppose I have a sample called SLR (from a cell line), and it is sequenced on 5 lanes, so I have BAM files like L1_SLR, L2_SLR, L3_SLR, L5_SLR and L7_SLR.bam. Here, the letter "L" denotes "lanes". In a typical RNA-Seq experiment, we should expect that reads from those 5 lanes have little variations.
Now, similarly, I have three technical replicates of the cell line SLR, let's call them TEC1, TEC2 and TEC3. For each technical replicate (say TEC1), it is also sequenced using 5 lanes. For example, I have L1_TEC1, L3_TEC1, L4_TEC1, L5_TEC1 and L7_TEC1. Similarly for TEC2 and TEC3.
Since TEC1, TEC2 and TEC3 are technical replicates of cell line SLR, ideally there should not be any differential exon usage detected. Our goal is to focus on these technical replicates and see if we'll obtain larger than expected amounts of differential exon usage. The tool we consider is DEXSeq. I looked at this manual here: http://bioconductor.org/packages/devel/data/experiment/vignettes/pasilla/inst/doc/create_objects.pdf. In Table 1, I think the author "concatenated" all the lanes (BAM files) into a single one for each replicate (treat1fb, etc), so we see "total" number of reads and exon counts. In my situation, can I know how to obtain
(1) the combined BAM file for SLR? Shall I cat the 5 lanes of BAM files? (after combination, I can get SAM file using samtools, and then use dexseq_count.py GFF_file \ SLR_sorted.sam SLR.txt)
(2) the GFF file is converted from GTF file using dexseq_prepare_annotation.py; can I know if the GTF file is just downloaded from Ensembl website (choose Homo Sapiens if my cell line is from human)?
(3) since my ultimate objective is to compare (maybe pairwise?) those technical replicates: SLR, TEC1, TEC2, TEC3, instead of the traditional situation of comparing biological sample 1 (with several tech reps) with biological sample 2 (with several tech reps). Would there be any problem? I can see that maybe comparing SLR vs. TEC1 is impossible (also for other pairwise comparisons) as there is no "replicates" of the replicate.
Thank you so much!
Thank you for the reply! I think you're right -- the lanes need to be combined so that a single sample has a BAM file. One more question: for preparation of GFF file, if I want to use Ensembl annotation, shall I download the GTF file and then pass the uncompressed gtf file to dexseq_prepare_annotation.py to get the GFF file? The GFF file should apply to that species (regardless of the BAM files for particular samples), am I right? Sorry this may be a naive question... Just wanna clarify ;-)
You might not even want to change the format to .GFF. You can still use prepare_annotation.py and retain it as .gtf and get the ECS. Yes, the ensembl .gtf file should be for respective species. Hope it helps. It might looks little tricky in the beginning but once you finish every step you will get the big picture of it.