Hi everyone,
I have sequencing data for 20 rice accessions (paired-end). When I map the reads to the reference genome and calculate the coverage using the following command:
samtools depth ERR3890928_sorted.bam | awk '{sum+=$3} END { print "Average = ",sum/NR}'
I get an average coverage of 34.6737. The coverage varies from 25X to 40X across samples.
For downstream analysis, I want to subsample reads from the paired-end FASTQ files to achieve 25X coverage (Only reads mapping in proper pairs will be retained).
Does anyone know any tools that can handle this task and how to perform it ?
Many thanks!
Since subsampling programs would not know about "depth" I am not sure you are going to find anything that will do what you are asking for (though I could be wrong). What is this going to achieve by the way? There may be areas of genome where there may be no reads at all so
depth
is meaningless in some sense.If you just wanted to down sample the data then
reformat.sh
from BBMap suite would be able to do this. The parameters of interest arereformat.sh
will likely work with BAM (though I have not tried it). You can then use the following to keep paired reads.seqtk sample
andseqkit sample
(LINK) would be alternate options.Thank you for your suggestion. Actually, I want to subsample the paired-end reads for the Gene presence and absence as describe in this paper: https://www.nature.com/articles/ncomms13390
"Gene presence/absence variation was characterized using the SGSGeneLoss package57. Reads from the 10 lines were mapped to the pangenome using Bowtie2 v2.2.5 (--end-to-end --sensitive -I 0 -X 1000). Reads from lines were subsampled to ~25X using Seqtk v1.1. Only reads mapping in proper pairs were retained. SGSGeneLoss utilizes a depth-of-coverage calculation across all exons of the gene and calls gene absence when the horizontal coverage across exons (total number of exon bases covered by reads) of the gene was 0.5%. Only genes that were annotated on contigs with a length > 1,000 bp were used in this analysis. A gene was considered core if it was present in all lines and variable if it was absent in at least one line."
I have check the manual of Seqtkenter link description here
I think Seqtk can be used for this task. But the problem is How can I estimate the number of reads is enough to get 20X ? I would like to hear you advice on this matter. Many Thanks.
That 20x is theoretical coverage even with example you mention above. So you would want
(rice genome bp x 20 bases) = 20x theoretical coverage
. To get to that number of bases you would need(20x theoretical coverage / length of your reads) = total number of reads required
So in theory you could do something like