Hi everyone!
I have trimmed the adapters from my Illumina PE reads with Trimmomatic. This was the output (as expected):
sample.R1.trimmed.fastq
sample.R2.trimmed.fastq
sample.R1.unpaired.fastq
sample.R2.unpaired.fastq
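For reference, my Trimmomatic call looked roughly like this (input names and the adapter file path are placeholders); PE mode writes paired and unpaired survivors separately for each mate:

# PE mode: inputs, then paired/unpaired outputs for R1, then for R2
java -jar trimmomatic.jar PE sample.R1.fastq sample.R2.fastq \
    sample.R1.trimmed.fastq sample.R1.unpaired.fastq \
    sample.R2.trimmed.fastq sample.R2.unpaired.fastq \
    ILLUMINACLIP:adapters.fa:2:30:10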
Then I aligned the trimmed FASTQ pair with BWA just fine, but when I tried to align the unpaired reads I got this:
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (4, 1, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[mem_sam_pe] paired reads have different names: "HWI-1KL178:67:HAE0RADXX:1:1101:2363:2000", "HWI-1KL178:67:HAE0RADXX:1:1101:11567:2000"
This is the command line:
bwa/bin/bwa mem -aM -t 6 ${REF_BWA_INDEX}/genome.fa ${SAMPLE}.R1.unpaired.fastq ${SAMPLE}.R2.unpaired.fastq > ${i}.sam
My goal is to align the trimmed and unpaired files separately, because BWA does not support them together.
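Edit: I think the error appears because the two unpaired files are passed to bwa mem as if they were a pair; with two FASTQ arguments, bwa pairs the i-th read of one file with the i-th read of the other, and the unpaired files do not line up. Aligning each file on its own in single-end mode avoids this; a sketch with the same variables:

# single-end mode: one FASTQ argument per run, no pairing step
bwa/bin/bwa mem -aM -t 6 ${REF_BWA_INDEX}/genome.fa ${SAMPLE}.R1.unpaired.fastq > ${SAMPLE}.R1.unpaired.sam
bwa/bin/bwa mem -aM -t 6 ${REF_BWA_INDEX}/genome.fa ${SAMPLE}.R2.unpaired.fastq > ${SAMPLE}.R2.unpaired.sam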
Thanks in advance!
Monica
Just a note that the latest bwa-mem supports this:
i.e., you can merge paired and unpaired reads in one stream, as long as paired reads are next to each other.
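For example, a minimal sketch, assuming a recent bwa (where -p enables smart pairing: adjacent reads with the same name are treated as a pair, everything else as single-end) and seqtk for interleaving; file names below are placeholders:

# interleave the paired reads, then append the unpaired ones to the same file
seqtk mergepe sample.R1.trimmed.fastq sample.R2.trimmed.fastq > sample.merged.fastq
cat sample.R1.unpaired.fastq sample.R2.unpaired.fastq >> sample.merged.fastq
# -p reads the mixed stream; pairs are detected by identical adjacent names
bwa mem -p -t 6 genome.fa sample.merged.fastq > sample.merged.sam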
Thanks Istvan for your quick response!
I am kind of lost. My main goal here is to call variants; what do you suggest I do with these unpaired files once I have aligned them separately? I was going to merge them with the trimmed ones and then call the variants...
Do I have to take them into account, or should I only use the trimmed ones?
Thanks!
Monica
Check the documentation of your variant caller for information on whether it handles mixed content. We usually discard the unpaired reads to keep things simple; typically these are no more than a few percent of the data, so dropping them won't actually affect the results.
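If you do decide to keep them, a simple route (a sketch, assuming samtools and that the unpaired reads were aligned single-end) is to merge the sorted alignments into one BAM before calling:

# sort each alignment, merge into a single coordinate-sorted BAM, then index it
samtools sort -o sample.paired.sorted.bam sample.paired.sam
samtools sort -o sample.unpaired.sorted.bam sample.unpaired.sam
samtools merge sample.all.bam sample.paired.sorted.bam sample.unpaired.sorted.bam
samtools index sample.all.bam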
Hi Istvan,
Would you please put a rough number on "a few percent"? In my case, 8% of the reads ended up unpaired after filtering. Will this amount of data loss affect the downstream analysis?
Thank you!
8% is not all that much, but it all depends on how much data you have left. The general rule is that it is better to get rid of bad data than to try to salvage it. In my opinion, less but better data is more desirable than salvaged data.
That is because errors rarely come in isolation. We may think we fixed everything by trimming off the bad bases, but perhaps there were other factors driving those errors in some regions of the flow cell, and even the data that looks reliable may not be.