I have 13 samples that were initially sequenced and expected to yield around 30 million reads per sample; however, we only managed to get about half of that. We therefore re-pooled the samples and ran them again, getting more reads this time. As a result I have two files per sample: run 1 (low reads) and run 2 (high reads). I tried merging these files (using the cat command) to get even greater depth, but my QC analysis shows a lot of duplication. Is there a better way of merging these files whilst avoiding high levels of duplication? Thanks!
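For reference, a cat-based merge of paired-end gzipped FASTQ files usually looks something like the sketch below; the directory layout and sample names here are placeholders, not the actual paths:

```bash
# Concatenate run 1 and run 2 per sample, keeping R1 and R2 separate.
# gzip streams can be concatenated directly, so no decompression is needed.
mkdir -p merged
for s in sample01 sample02 sample03; do   # ...through sample13
    cat run1/${s}_R1.fastq.gz run2/${s}_R1.fastq.gz > merged/${s}_R1.fastq.gz
    cat run1/${s}_R2.fastq.gz run2/${s}_R2.fastq.gz > merged/${s}_R2.fastq.gz
done
```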
Thanks. After running MultiQC on my HTML files, the Sequence Counts plot shows a lot of duplication per sample. This makes me think MultiQC is treating the sequences from run 2 as 'duplicates' of run 1, which would mean I haven't merged the files properly. Is this normal?
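One quick way to rule out a botched merge is to confirm that the merged read count is exactly run 1 plus run 2; a sketch below, with placeholder paths:

```bash
# A FASTQ record is 4 lines, so reads = lines / 4.
# For every sample, merged should equal run1 + run2 exactly.
for d in run1 run2 merged; do
    echo "$d: $(zcat $d/sample01_R1.fastq.gz | wc -l | awk '{print $1/4}') reads"
done
```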
cat'ing the files simply tacks the contents of the second file onto the end of the first, so as far as FastQC/MultiQC is concerned that is just one set of data. That in itself is going to have no effect on duplicates per se. Your sample is going to contain that duplication (if present), and at this point you can't fix that part (if the sample was overamplified, for example). Hopefully all samples in this dataset underwent an identical treatment, so there will be no experimental bias.

Thank you!