I have a simple two group experiment. Treatment Vs. Control, 4 biological replicates. The design is 150bp. Read depth is 100 million reads per sample, paired end. Each sample has a ERCC spike-in for post processing quality assessment. Pipeline is as follows->
- Run FastQC on each fastq file.
- Quality looks excellent, but Illumina adapters are present.
- Paired-End trimmomatic run to remove adapter sequences.
- FastQC is run again to verify adapter removal.
- Lanes are merged using bash cat.
- Tophat2 is run in paired end mode to align reads.
Command used:
subprocess.call(['tophat2','-p','64','-r','10','--mate-std-dev','51','--mate-inner-dist','-101','--b2-very-sensitive','--no-novel-juncs','--microexon-search', '--coverage-search','--library-type','fr-secondstrand','--output-dir',output_folder,'--GTF',gtf_file,bowtie2_builds,leftfileString, rightfileString])
Summary files look okay with ~70-75% concordant alignment.
HTseq is then used to count reads. So now I construct a read count table and the data are highly variable both within and between groups.
Does anyone have an idea about what could cause this? Thanks!
For context, here are the box plots from a different experiment I ran a few months back:
See: How to add images to a Biostars post
You need the direct link to the image, not the link to the page that hosts the image.
What do the box-plots show? You didn't tell us.
As you are finding the results strange, I would advise to not concatenate the lanes, but rather map and count them separately. Have all samples been sequenced to approximately the same depth? Besides lane, are there other batch effects?
On the other hand, this may be true biological variance, some experiments have higher variances and lower fold-changes than others.
Sorry. The box plots show the distribution of read counts per replicate. The first four are the vehicle treated replicates and the last four are drug treated. The desired read depth was 100M but what we received varied between 65M and 120M per replicate. There doesn't appear to be any correlation between read depth and variation.
Just a side comment: are yo sure that your data is forward-stranded? option
--library-type fr-secondstrand
. Most popular RNAseq lib. prep. protocols today are reverse-stranded.I was originally using QoRTs for read counting and it complained that the strandedness might be different than specified so I used Chipster to verify that it is fr-secondstranded. The runs I had done in the past were firststranded, but I think the sequencing facility has upgraded their platform.
What is the y-axis? Please provide some more details.
Y axis is (RLE = log-ratio of read count to median read count across sample). These were generated using RUVseq which is software that uses the ERCC spike-ins to nomalize the data. The data in the figure is unnormalized.