I've stepped into a little bit of a mess and I'm not sure how to fix it.
An RNA sequencing run was performed on 19 samples which yielded approx 20m reads per sample, it showed promising results but lacked statistical significance so I suggested re-sequencing the library to a better depth.
The samples were indeed rerun with the following issues:
- The samples were re-pooled without two of the samples (which were questionable) so now there are 17 samples
- The length of the reads changed from 36 bp to 75 bp
- The sequencing was sent to another location and not performed on the same machine
- The number of reads per sample was approximately the same
I was hoping that either a) it would be an identical run and I could combine the reads or b) the library would be sequenced to a greater depth so it would supersede the previous run. My main concern is that these sets of fastq files are too different to comfortably combine as I thought in (a).
How might I proceed?
Thanks in advance!
Resequencing is generally a neglectable source of batch effect. Trim the new ones down to 36bp to compensate for read length prior to alignment. Then do PCA, coloring by resequencing or not, then see whether there is any evidence for batch effect. If not then just combine.
While Illumina says that the sequence should be identical no matter which sequencer/chemistry, you may want to include that as a variable in your PCA and see if there is a difference (especially if a different chemistry was used).
Thanks for the reply.
I've pseudo-addressed your comment from the count tables which were processed independently albeit without trimming as you have suggested. This is what set my initial alarm bells off about the validity of combining these runs. I'll have a go trimming the reads down and see if it changes the picture.
Your mappings are going to be better at 76 bp. With 36 bp reads you may end up with more multimappers. Can you drop the short reads if you have enough data from long ones. I am assuming the libraries are amenable to longer reads i.e. there are no short inserts etc. and you are not losing data to adapter read-through.
Indeed, I'm getting the reads re-aligned tonight and will find out in the morning what the PCA looks like! In the meantime, here's the alignment output of one of the samples for flavour.
Thanks for your input!
Are these
bowtie2
stats? You ought to be using a splice-aware aligner for RNAseq data (like STAR, bbmap, unless you are working with bacteria). While these numbers look fine and you do not appear to have a lot of multimappers.This looks like hisat2 output to be, although it's been a very long time since I've mapped something myself! I have to confess to being spoilt and have people who do the primary alignments and processing for me.
I can also see a STAR folder, which puts run 2 at 9.36% multi-mappers and run 1 at 13.89%.
HISAT2 is great. Developed by same group that does bowtie. Longer reads make for lower multi-mappers in STAR so that makes sense.