I've got sequence data back from Illumina. It's paired-end 300 bp reads with ~50 bp overlap. I originally merged the read pairs as my second step (the first being FastQC), and it worked a treat. But we then decided to filter the fastqs first.
I've made another post here where I ask how to filter the fastqs based on sequence similarity to another fastq, and why I did it, in case you're interested.
This is what I've done:
1. FastQC to check for adapters.
2. Uclust search to find reads with 100% sequence similarity across R1s and across R2s (separately; we sequenced the same individual twice from different extractions) for each individual.
3. Filter fastqs based on uclust 'hits' using seqtk - you lose around 75% of reads for the ones I've checked.
4. Pair-end merging (I've tried both PEAR and EA-Utils). But I get errors saying the number of sequences is different, with PEAR saying there are no reads in any of the R2 files.
I've re-run the uclust and filtering steps in case it was truncating files. Nothing. I've used EA-Utils fastq-stats to get a summary of the R2 files, and there are reads there. I've even tried pairing the filtered and the unfiltered reads, and the same error comes up.
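One way to check where R1 and R2 diverge is to compare record counts and read IDs directly. The following is a minimal sketch, assuming strict 4-line FASTQ records and Illumina-style headers where the read ID is the first whitespace-delimited field (optionally ending in /1 or /2). The function names are made up for illustration and are not part of any of the tools mentioned.

```shell
# Count FASTQ records, assuming exactly 4 lines per record.
# A non-integer result suggests a truncated or malformed file.
fastq_count() {
    awk 'END { print NR / 4 }' "$1"
}

# Print read IDs: first header field, without the leading "@"
# and without a trailing /1 or /2 (assumed header layout).
fastq_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Report whether two files contain the same read IDs in the same order,
# which is what paired-end mergers generally expect.
fastq_in_sync() {
    r1_ids=$(mktemp); r2_ids=$(mktemp)
    fastq_ids "$1" > "$r1_ids"
    fastq_ids "$2" > "$r2_ids"
    if cmp -s "$r1_ids" "$r2_ids"; then
        echo "in sync"
    else
        echo "out of sync"
    fi
    rm -f "$r1_ids" "$r2_ids"
}
```

Running `fastq_count` on each filtered file and `fastq_in_sync` on each R1/R2 pair should show whether filtering R1 and R2 separately left different sets of reads in the two files.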
I may be missing something obvious, but I'm genuinely at a loss. Any help would be greatly appreciated. If anything is unclear, please let me know.
EDIT: The line executed was:
$TOOL/pear -f $DATA_DIR/$forward_read -r $reverse_read -o $OUT_DIR/$output_name
after I removed all the filtering options when it didn't work the first time. It equates to:
pear -f /data//01_Fastq_Filtering/BVG080--BVC393-2_S97_L001_R1_001_filtered.fastq \
-r /data//01_Fastq_Filtering/BVG080--BVC393-2_S97_L001_R2_001_filtered.fastq \
> /data/02_Paired/BVG080-2_paired.fastq
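In the templated command above, the `-f` argument is prefixed with `$DATA_DIR` while the `-r` argument is not; whether that matters depends on what `$reverse_read` holds. A small pre-flight check can rule out a missing or empty input before the merger runs. This is only a sketch: `check_inputs` is a hypothetical helper, not part of PEAR.

```shell
# Return non-zero if any given path is missing or empty,
# printing the offending path to stderr.
check_inputs() {
    status=0
    for path in "$@"; do
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            status=1
        fi
    done
    return "$status"
}
```

It could be called with the exact strings passed to `-f` and `-r`, for example `check_inputs "$DATA_DIR/$forward_read" "$reverse_read"`, so that the merge only runs when both resolve to non-empty files.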
Please show the full command-lines you've used.
Ah, apologies, I've updated the post now.
Sounds like you lost information about the R1/R2 read headers in step 2/3. If you are sure the information is there you can try to "re-pair" the reads using repair.sh from the BBMap suite.
Thanks, I'll give that a go. I was looking for a way to 'validate' fastqs to see where I was losing information, but couldn't find anything. This seems like a potential workaround.
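To illustrate what re-pairing involves, here is a rough sketch using only standard tools: it keeps the reads whose IDs appear in both files and drops the rest. It is not how repair.sh is implemented. It assumes 4-line FASTQ records, IDs in the first header field, and that both files are still in their original relative order (so the surviving reads line up). All function names are hypothetical.

```shell
# Print read IDs: first header field without the leading "@" or a trailing /1 or /2.
read_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Print only the records of FASTQ file $2 whose ID is listed in file $1.
keep_listed() {
    awk -v list="$1" '
        BEGIN { while ((getline line < list) > 0) keep[line] = 1 }
        NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); wanted = (id in keep) }
        wanted
    ' "$2"
}

# Write the reads shared by R1 ($1) and R2 ($2) to $3 and $4 respectively.
resync_pairs() {
    ids1=$(mktemp); ids2=$(mktemp); shared=$(mktemp)
    read_ids "$1" | sort > "$ids1"
    read_ids "$2" | sort > "$ids2"
    comm -12 "$ids1" "$ids2" > "$shared"   # IDs present in both files
    keep_listed "$shared" "$1" > "$3"
    keep_listed "$shared" "$2" > "$4"
    rm -f "$ids1" "$ids2" "$shared"
}
```

With BBMap installed, the equivalent step would be along the lines of `repair.sh in1=R1.fastq in2=R2.fastq out1=fixed_R1.fastq out2=fixed_R2.fastq outs=singletons.fastq` (file names here are placeholders), which also handles files that are out of order.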
Can you show us example fastq headers for R1 and R2?
If you need to overlap R1/R2 then the recommended order with BBTools is to run bbmerge first and then do any other operations you need to do on them.
Did you design your experiment so reads will overlap (i.e. choose an insert that was smaller than normal)? Reads will merge only when the read length is more than half the length of the insert. If this was a WGS experiment then that would be a curious thing.
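The half-insert rule is simple arithmetic: with read length L and insert length I, the two reads together cover 2L bases, so the expected overlap is 2L - I, which is positive only when L > I/2. A small sketch (the function name is made up; real mergers also require some minimum overlap, which is not modelled here):

```shell
# Expected overlap in bp between two reads of length $1 from an insert of length $2.
# A value of zero or less means the reads do not reach each other.
expected_overlap() {
    read_len=$1
    insert_len=$2
    echo $(( 2 * read_len - insert_len ))
}
```

For example, `expected_overlap 300 550` gives 50, which matches the ~50 bp overlap described in the question for 300 bp reads, while `expected_overlap 150 500` gives -200, meaning no overlap.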