I have whole genome sequences from different sources. A subset of the sequenced individuals had short read (~50 bp) libraries, while other genomes were sequenced to generate libraries with longer reads (~250 bp). All were generated using the Illumina sequencing platform.
For most of the analyses I'm performing, these differences are irrelevant, but for others, the differences in read length may create sampling artifacts (such as when comparing differences in read depth across regions of the genome).
Consequently, I would like to "downsample" the long read libraries by only sampling the first (and last) 50 bp from each read pair, thus generating a short read library from my long read libraries. Is there any standard tool that would let me do this? The mechanics of the process are very much like read trimming, but in reverse so to speak, in that I'd be keeping the read ends rather than discarding them.
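To make the idea concrete, here is a minimal Python sketch of the "keep the ends" operation, assuming plain 4-line FASTQ records. The function name `truncate_reads` and the `keep` parameter are hypothetical, chosen only for illustration, and this is not a replacement for an established trimming tool.

```python
from typing import Iterable, Iterator


def truncate_reads(lines: Iterable[str], keep: int = 50) -> Iterator[str]:
    """Yield FASTQ records cut down to their first `keep` bases.

    Assumes well-formed 4-line records (header, sequence, '+', quality).
    Reads shorter than `keep` are passed through unchanged.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        # Sequence and quality must be cut to the same length.
        yield f"{header}\n{seq[:keep]}\n+\n{qual[:keep]}\n"


if __name__ == "__main__":
    example = ["@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIHHHHH\n"]
    for record in truncate_reads(example, keep=4):
        print(record, end="")
```

The function would be run separately on the R1 and R2 files. Because R2 is sequenced from the opposite end of the fragment, the first 50 bp of R2 already correspond to the "last" 50 bp of the pair, so the same truncation applies to both files. For gzip-compressed input, the lines could come from `gzip.open(path, "rt")`.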
I'm using bwa as my primary alignment/mapping tool, though this probably doesn't matter.
Thank you for the feedback.
I have a rather naive question, since I've always done alignments with paired-end reads and have never had to convert paired-end data into single-end data.
If I've created fastq files with the first 50 bp of the R1 and R2 reads, I will treat these as independent reads rather than paired-end reads. Should I merge fastq 1 and 2 into a single read file?
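For reference, this is roughly what merging would look like if that turns out to be the right approach. It is a sketch only, assuming 4-line FASTQ records and that R1 and R2 share read names; the function name `merge_as_single_end` is hypothetical. The `/1` and `/2` suffixes are added so that mates do not end up with identical names in the combined file.

```python
from typing import Iterable, Iterator


def merge_as_single_end(r1_lines: Iterable[str],
                        r2_lines: Iterable[str]) -> Iterator[str]:
    """Yield all R1 records, then all R2 records, with unique read names.

    Assumes well-formed 4-line FASTQ records. Anything after the first
    whitespace in the header is dropped, and '/1' or '/2' is appended
    unless the name already ends with that suffix.
    """
    for tag, lines in (("/1", r1_lines), ("/2", r2_lines)):
        stripped = (line.rstrip("\n") for line in lines)
        for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
            name = header.split()[0]
            if not name.endswith(tag):
                name += tag
            yield f"{name}\n{seq}\n+\n{qual}\n"
```

The output can be written to one file and passed to the aligner as a single-end library.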