Question

Tools for Subsampling Paired-End Reads to 25X Coverage ?

0

Entering edit mode

13 days ago

Sony ▴ 20

Hi everyone,

I have sequencing data for 20 rice accessions (paired-end). When I map the reads to the reference genome and calculate the coverage using the following command:

samtools depth ERR3890928_sorted.bam | awk '{sum+=$3} END { print "Average = ",sum/NR}'

I get an average coverage of 34.6737. The coverage varies from 25X to 40X across samples.

For downstream analysis, I want to subsample reads from the paired-end FASTQ files to achieve 25X coverage (Only reads mapping in proper pairs will be retained).

Does anyone know any tools that can handle this task and how to perform it ?

Many thanks!

paired-end fastq depth sequencing coverage • 583 views

ADD COMMENT • link updated 13 days ago by Mensur Dlakic ★ 28k • written 13 days ago by Sony ▴ 20

0

Entering edit mode

Since subsampling programs would not know about "depth" I am not sure you are going to find anything that will do what you are asking for (though I could be wrong). What is this going to achieve by the way? There may be areas of genome where there may be no reads at all so depth is meaningless in some sense.

If you just wanted to down sample the data then reformat.sh from BBMap suite would be able to do this. The parameters of interest are

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

reformat.sh will likely work with BAM (though I have not tried it). You can then use the following to keep paired reads.

pairedonly=f            Toss reads that are not mapped as proper pairs.

seqtk sample and seqkit sample (LINK) would be alternate options.

ADD REPLY • link 13 days ago by GenoMax 148k

0

Entering edit mode

Thank you for your suggestion. Actually, I want to subsample the paired-end reads for the Gene presence and absence as describe in this paper: https://www.nature.com/articles/ncomms13390

"Gene presence/absence variation was characterized using the SGSGeneLoss package57. Reads from the 10 lines were mapped to the pangenome using Bowtie2 v2.2.5 (--end-to-end --sensitive -I 0 -X 1000). Reads from lines were subsampled to ~25X using Seqtk v1.1. Only reads mapping in proper pairs were retained. SGSGeneLoss utilizes a depth-of-coverage calculation across all exons of the gene and calls gene absence when the horizontal coverage across exons (total number of exon bases covered by reads) of the gene was 0.5%. Only genes that were annotated on contigs with a length > 1,000 bp were used in this analysis. A gene was considered core if it was present in all lines and variable if it was absent in at least one line."

I have check the manual of Seqtkenter link description here

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

  seqtk sample -s100 read1.fq 10000 > sub1.fq
  seqtk sample -s100 read2.fq 10000 > sub2.fq

I think Seqtk can be used for this task. But the problem is How can I estimate the number of reads is enough to get 20X ? I would like to hear you advice on this matter. Many Thanks.

ADD REPLY • link 13 days ago by Sony ▴ 20

0

Entering edit mode

But the problem is How can I estimate the number of reads is enough to get 20X ?

That 20x is theoretical coverage even with example you mention above. So you would want (rice genome bp x 20 bases) = 20x theoretical coverage. To get to that number of bases you would need (20x theoretical coverage / length of your reads) = total number of reads required

So in theory you could do something like

reformat.sh -Xms10g some.bam out1=reads_R1.fq.gz out2=reads_R2.fq.gz pairedonly=t samplereadstarget=number_of_reads_from above

ADD REPLY • link 13 days ago by GenoMax 148k

score 0 · Answer 1 · 2024-12-13

0

Entering edit mode

13 days ago

Pierre Lindenbaum 164k

for Capping coverage in bam file I wrote https://jvarkit.readthedocs.io/en/latest/Biostar154220/

ADD COMMENT • link 13 days ago by Pierre Lindenbaum 164k

0

Entering edit mode

Just to be clear this is "capping" coverage across the genome to some max X value. Coverage could be much lower or non-existant in places.

ADD REPLY • link 13 days ago by GenoMax 148k

score 0 · Answer 2 · 2024-12-13

0

Entering edit mode

13 days ago

Mensur Dlakic ★ 28k

As you were told already, the BBMap suite can downsample the reads to a desired coverage. However, to the best of my knowledge this is random downsampling, which I do not recommend at the relatively low level of coverage you have.

A more comprehensive downsampling can be done with khmer:

https://github.com/dib-lab/khmer

They have something called digital normalization:

https://khmer-protocols.readthedocs.io/en/latest/mrnaseq/2-diginorm.html

In a nutshell, reads with <25x coverage in your case would not be touched. Only reads with >25x coverage would be downsampled. There is no such selectiveness when reads are randomly downsampled.

ADD COMMENT • link 13 days ago by Mensur Dlakic ★ 28k

1

Entering edit mode

BBTools also includes bbnorm.sh.

$ bbnorm.sh

Written by Brian Bushnell
Last modified October 19, 2017

Description:  Normalizes read depth based on kmer counts.
Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram.
However, Tadpole has superior error-correction to BBNorm.
Please read bbmap/docs/guides/BBNormGuide.txt for more information.

ADD REPLY • link 13 days ago by GenoMax 148k

0

Entering edit mode

In my hands, bbnorm.sh has been faster but khmer downsampling has given better assemblies.

ADD REPLY • link 13 days ago by Mensur Dlakic ★ 28k