This is from FASTQC analysis of paired end data. R1 has twice as many reads as R2. I'm not sure why, but my FASTQC duplication looks like this for R1. Does anyone know why?
R1 has twice as many reads as R2
That's very wrong. Paired-end files should have exactly one R2 read for every R1 read, so the counts have to match. Alert the people who made the fastqs, that is not right at all.
It might be that something went wrong with the fastq generation. Like someone somehow generated only R1, and realized their mistake, and then regenerated both files, but the new R1 was appended to the existing one, instead of overwriting it.
Grab the first read name of the fastq, and see if it turns up twice. If it does, tell whoever made the fastq to start from scratch.
The command to check the first line would be
zcat my_file.fastq.gz | head -n 1
Then, once you have the read name from that first line (use zgrep, since plain grep can't search inside a gzipped file):
zgrep readname my_file.fastq.gz
And see how many lines it returns.
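If you'd rather get numbers than eyeball the output, here is a rough sketch of the same check as two small shell functions. The function names are made up for illustration, and it assumes a standard 4-line-per-record gzipped FASTQ where the read name is the first whitespace-separated field of the header line.

```shell
# Count reads in a gzipped FASTQ, assuming 4 lines per record.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Count how many header lines carry the same name as the very first read.
# Only every 4th line (1, 5, 9, ...) is checked, so sequence or quality
# lines cannot produce false matches.
count_first_read() {
    gzip -dc "$1" | awk '
        NR == 1                    { name = $1 }   # remember the first read name
        NR % 4 == 1 && $1 == name  { c++ }         # count headers with that name
        END                        { print c + 0 }
    '
}
```

Running count_reads on R1 and on R2 should print the same number for healthy paired-end data, and count_first_read my_file.fastq.gz should print 1. If it prints 2, the append theory looks likely.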
You are going to have to poke around to find a command to clean up the fastq if that is the problem; I'm not sure how to do it off the top of my head.
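One possible cleanup, offered as an untested-on-real-data sketch rather than a recommended fix: keep only the first record seen for each read name. The function name is hypothetical, it assumes 4-line records, and it holds every read name in memory, so it may struggle on very large files. Regenerating the fastqs from scratch is still the safer route.

```shell
# Keep the first record for each read name, drop later repeats.
# $1 = input .fastq.gz, $2 = output .fastq.gz
dedup_fastq_by_name() {
    gzip -dc "$1" \
      | awk '
          # On each header line, decide whether this record is new.
          NR % 4 == 1 { keep = !seen[$1]++ }
          # Print all four lines of a record only if its header was new.
          keep
        ' \
      | gzip -c > "$2"
}
```

For example, dedup_fastq_by_name my_file.fastq.gz my_file.dedup.fastq.gz, then recount the reads and compare against R2 before using the result.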
This blog post by the authors of FastQC would be of interest: https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
Since this is RNAseq data, some duplication is expected, because there are likely to be many copies of the same RNA.
You could try running the repair.sh tool from the BBMap suite. Guide here: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/repair-guide/
Thank you! repair.sh worked great!