My Sample has been sequenced in several lanes and casava has split the fastqs.
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R1_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R1_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L007_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L006_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L008_R1_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R1_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R2_002.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R2_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R1_003.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R2_001.fastq.gz
./Sample_BruceBanner/BruceBanner_AGTCAA_L005_R1_001.fastq.gz
I'm learning RNASeq: for an exome sequencing I would align and sort each pair (R1/R2) of fastq on our cluster and merge the BAMs later with picard.
In RNASeq , for the tophat2 step, can I parallelize the same way (and how should I merge the result) ? or should I run tophat with all the fastqs, like this ? (I'm not even sure how to write the Fastq R1 and R2...)
tophat2 -p $(NUM_THREADS) -o tophat_out \
ref \
BruceBanner_AGTCAA_L005_R1_001.fastq.gz,BruceBanner_AGTCAA_L008_R1_001.fastq.gz,(...) \
BruceBanner_AGTCAA_L005_R2_001.fastq.gz,BruceBanner_AGTCAA_L008_R2_001.fastq.gz,(...)
Note: @BioMickWatson said on twitter ( https://twitter.com/BioMickWatson/status/400385456207974401 )
"depends if you are using coverage in the calc of splice junctions"
do you have any pointers to this ?
thank you. As a beginner I will stick to tophat for now :-)
In consequence, I shouldn't use tophat options like
--rg-platform-unit (lane)
isn't it ? Won't it break downstream analysis (e.g: picard or gatk ) ?Right, I would just AddOrReplaceReadGroups after you merge the results. Are you trying to call SNPs from RNAseq data? Otherwise, there's no reason to use gatk or even add the read groups (unless you need them to keep track of samples).
no I'm just interested in differential expression for now. Thanks.
Apologies for reviving this old thread, but I am currently in the same boat as the OP and would like to probe your answer a bit further. From what I understand, evidence for novel splice junctions comes from two sources: "split reads" (two segments from the same read mapped apart nearby on the same chromosome) and "coverage islands" (distinct regions of piled up reads in the alignment) which typically indicate exons. When reads are >75bp, tophat by default disables "coverage-search" (usage of the latter source of evidence), because of the additional time and memory requirements, and also because the reads are long enough that split reads becomes a sufficient source of evidence on its own. This has been discussed here and here with the consensus that coverage-based search is generally unnecessary.
With this said, would you still argue it is better to concatenate the fastq's prior to alignment? Wouldn't it be better to process separately to perform lane-specific QC on the alignments as well to check the results of downstream analysis?
There's rarely any lane-specific QC needed in RNAseq.
This is a matter of how much sensitivity is needed and how much sequencing is available. Using a coverage-based search will maximize sensitivity, which may or may not be that useful. While the documentation states that this isn't needed with longer reads, that's really an over simplification. This is needed whenever you have low coverage of an isoform and, thereby, may have poor spanning of a junction. With longer reads and higher coverage this becomes less of an issue. However, for the most lowly expressed novel isoforms this may still be needed.
Appreciate you taking the time to respond. Cheers.