How to reduce reads in fastq, bam, bed files in a random manner?
2
1
Entering edit mode
10.3 years ago
trakhtenberg ▴ 160

Hello, what command could I use to extract every 3rd or every 4th pair of reads from two fastq files corresponding to pair-end reads (file read1 and file read2)? I want to use it for getting smaller raw data files, which would be representative of the larger files, since they would be reduced randomly.
Also, the same question regarding how to randomly reduce bam and bed files, such as accepted_hits.bam and junctions.bed generated by tophat?

Thank you,
Ephraim Trakhtenberg

RNA-Seq • 8.4k views
ADD COMMENT
1
Entering edit mode

The approach you mention is not technically random, but this task is a pretty common one. There has been plenty of good discussion of this site about sampling large files, for example, Selecting Random Pairs From Fastq? (and see the 'similar posts' section to the right of that page for more posts on the topic). Hope that helps.

ADD REPLY
1
Entering edit mode
10.3 years ago

To uniformly sample a third of your paired reads (without replacement) from a paired fastq file(*), via bash and sample:

$ fq_count=`calc "$(wc -l < all_reads.fastq) / 8"`
$ samples=`calc "floor($fq_count / 3)"`
$ sample --lines-per-offset=8 --sample-size=${samples} all_reads.fastq > random_sample.fastq

(* - A paired fastq file would be one interleaved file that stores paired reads consecutively, one pair after its mate.)

Similarly, to sample a BED file:

$ sample --sample-size=12345 foo.bed > random_sample.bed

Or if you have sufficient memory:

$ shuf foo.bed | head -n12345 > random_sample.bed
ADD COMMENT
1
Entering edit mode

What is calc? Is that a newer builtin, or something else? Google did not turn up anything so I'm curious.

ADD REPLY
1
Entering edit mode

The calc tool is not a builtin, but I am using bash notation for shell commands. Source is available here, though it is probably available in convenient package form with your favorite package manager.

ADD REPLY
1
Entering edit mode

A minor point, recent versions of shuf will do memory-efficient reservoir sampling with the -n argument, so you don't need to use head.

ADD REPLY
1
Entering edit mode

I think expr $fq_count / 3 will give you the floor by default, so you could replace all "calc" with "expr" in this case.

ADD REPLY
1
Entering edit mode
10.3 years ago
Cytosine ▴ 460

I believe a truly random approach is already implemented in samtools for bam file extraction.

The option -s enables you to subsample a bam file.

-s FLOAT  Fraction of templates/pairs to subsample; the inte‐
                           ger part is treated as the seed for the random num‐
                           ber generator [-1]
samtools view -s 0.25 -b mymapping.bam > random_25%_of_mymapping.bam
ADD COMMENT
1
Entering edit mode

I would change the 0 to something more interesting though, like

samtools view -s 23241.25 -b mymapping.bam > random_25%_of_mymapping.bam

For the RNG to subsample a bit more randomly.

ADD REPLY
0
Entering edit mode

I had 167115 reads in the input BAM, so 25% is 41779: using 0.25 gets 42134; using 23241.25 gets 41567 (which is closer to 41779 but still 212 off). I would appreciate a clarification on how could I produce output closer to the specified percent, and a brief explanation on why replacing 0 with 23241 made the output closer to 25%, and how could it increase randomness? Also, just to confirm, in both approaches if reads are paired, they are randomly reduced also in pairs (just to make sure the reads are not selected randomly regardless of their pairing). Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6