Question

Sampling short reads from a library with long reads

0

Entering edit mode

2.7 years ago

shpak.max ▴ 60

I have whole genome sequences from different sources. A subset of the sequenced individuals had short read (~ 50 bp) libraries, while other genomes were sequence to generate libraries with longer reads (~250 bp). All were generated using the Illumina sequencing platform.

For most of the analyses I'm performing, these differences are irrelevant, but for others, the differences in read length may create sampling artifacts (such as when comparing differences in read depth across regions of the genome).

Consequently, I would like to "downsample" the long read libraries by only sampling the first (and last) 50 bp from each read pair, thus generating a short read library from my long read libraries. Is there any standard tool that would let me do this? The mechanics of the process are very much like read trimming, but in reverse so to speak, in that I'd be keeping the read ends rather than discarding them.

I'm using bwa as my primary alignment/mapping tool, though this probably doesn't matter.

alignments bwa • 1.1k views

ADD COMMENT • link 2.7 years ago by shpak.max ▴ 60

score 0 · Answer 1 · 2022-08-24

Pretty much any read trimming tool will do what you're after as you just want to keep the first 50bp from R1 and the first 50bp from R2. In trimmomatic this is CROP:50.

but in reverse so to speak, in that I'd be keeping the read ends rather than discarding them.

You want to keep the start of R1 and R2 not the ends. The first 50bp are the outermost 50bp of the sequenced fragment (since Illumina sequencing start from the bases closest to the adapters). If you samples the last 50bp then you'll have huge bioinformatic artifacts unless your fragment sizes are much greater than 500bp - if R1 and R2 overlap at all then sampling the last 50bp will give you reads that aren't even in FR orientation!

I'm using bwa as my primary alignment/mapping tool, though this probably doesn't matter.

The trimming/cropping happens on the fastq file (not the bam file) so you'll need to realign anyway.

the differences in read length may create sampling artifacts

It's not just read length that produces artifacts. You many also need to consider:

Cropping to 50bp discards 80% sequencing data. That 2x250bp sequenced to 100x now only has 20x coverage
Mappability of 250bp reads is much better than 50bp (I assume this is what you're trying to counteract by your trimming)
Mappability also depends on the library fragment size distribution. You'll still have coverage biases after cropping if your library fragment size distributions are different.

Have you considered using physical coverage as your coverage metric?