For pair-end read files, read1.fastq
, read2.fastq
, I use the following command cat read1.fastq | head -n 4000 > read1_first1000.fastq
(and cat read2.fastq | head -n 4000 > read2_first1000.fastq
) to extract first 1000 reads, and align with bwa mem hg19.fa read1_first1000.fastq read2_first1000.fastq
. I find that all four orientations, FF, FR, RF, RR are skipped because there are not enough pairs, and bwa will map these reads without pair-end. Therefore, the same read pair will be mapped to a different place as when I align with bwa mem hg19.fa read1.fastq read2.fastq
.
If I use the command cat read1.fastq | head -n 400000 > read1_first100000.fastq
(and cat read2.fastq | head -n 400000 > read2_first100000.fastq
) to extract first 100000 reads and align them again, bwa will detect the orientation FR and then the results will be the same as aligning with bwa mem hg19.fa read1.fastq read2.fastq
. (for the first 100000 reads).
Therefore, I am curious about the following questions:
- How does bwa compute unique pairs for (FF, FR, RF, RR)?
- How many candidate unique read pairs would be enough for detecting each orientation? Is this related to the parameter
-K
(fixed chunk size)? - I find that usually, for a batch of e.g. 100000 read pairs, only around half of them will become candidate unique pairs for the orientations. If e.g., FR has significantly more candidate read pairs than other orientations, will bwa assume all reads (rather than only candidate read pairs of FR) have the orientation FR and map them accordingly?
Thanks!