Question

Sample Sam File

9

Entering edit mode

12.6 years ago

Steffi ▴ 580

Hi,

I have a sam file which contain alignments of about 30 million reads. I just would like to randomly sample e.g. 1 million alignments out of my sam file. Whats the best way to do this? Anything already available in any kind of toolkit? Or bash scripting? Or Perl?

thanks a lot, Steffi

sam random • 19k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 12.6 years ago by Steffi ▴ 580

Ram · Answer 1 · 2013-07-02

13

Entering edit mode

11.4 years ago

Fabio Marroni ★ 3.0k

even easier and faster:

samtools view -b -s 0.1 infile.bam > ten_percent_of_bam_file.bam

See here

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 11.4 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

Excellent! I wish I had known this before!

ADD REPLY • link 11.1 years ago by Erik Garrison ★ 2.4k

0

Entering edit mode

I could not find the -s option on samtools. When I tried using, gave me an error 'invalid option'.

Would you know the reason. I am using samtools-0.1.16

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.9 years ago by sheetalshetty13 ▴ 60

0

Entering edit mode

Version is important. I just found out -s option is not available on version 0.1.16 Also option is cryptic: so I had to write it as -s -1.5 ( to get 50% of reads) -s -1.7 ( to get 70% of reads) 1 is the seed and .5/.7 the percent of reads for sub sampling

ADD REPLY • link 10.9 years ago by sheetalshetty13 ▴ 60

1

Entering edit mode

I think I found a problem with samtools -s

In version 0.1.19 if you extract two samples using the same random seed you always obtain the same reads.

So if you have several samples to extract, remember to change the random seed. As @sheetalshetty13 pointed out, the random seed is the part of the number before the point. So 0.5 will extract 50% with random seed "0" and 1.5 50% of the reads with random seed "1".

Interestingly, I noticed that if you use two digits random seeds you obtain different sample sizes (using the same probability).

ls -l tmp*
-rwxrwx--- 1 marroni Domain Users 124458 Apr 15 11:52 tmp.bam
-rwxrwx--- 1 marroni Domain Users 124458 Apr 15 11:52 tmp2.bam
-rwxrwx--- 1 marroni Domain Users 119221 Apr 15 11:54 tmp3.bam

If you run samtools view -s on a file on which samtools view -s was already used you have unpredictable results

samtools view -b -s 1.5 tmp.bam > small.bam
samtools view -b -s 2.5 tmp.bam > small2.bam

-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 12:18 small.bam
-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 12:20 small2.bam
-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 11:52 tmp.bam
-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 11:52 tmp2.bam
-rwxrwx--- 1 marroni Domain Users   119221 Apr 15 11:54 tmp3.bam

I am currently not using samtools view -s.

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.6 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

Hi Fabio,

I'm experiencing the same problem. You said you were not using samtools.

Do you use picard DownsampleSam or some custom script to downsample a sam/bam file?

Thank you!

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by biola ▴ 20

0

Entering edit mode

In that situation I used a completely different approach. Since the data were simulated I just simulated fastq files of different sizes (I had to align all of them, but better than having wrong results...).

DownsampleSam might be an option (I know that several people use it and they are happy with their results. I remember it being slow...)

The approaches suggested in the following answers are very good. I like the shuffle very much (you can remove the header and save it somewhere else and then add it back).

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.9 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

Hi! You said 1 is the seed, and what the seed means?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by 512788522 ▴ 20

Ram · Answer 2 · 2012-05-09

Hi, lets start using shell

###remove header and save the sam file head
sed 1,23d file.sam > file_noHeader.sam
head -n 23 file.sam > head

###randomly select one Million reads and save them (I took the one liner from here: http://www.unix.com/shell-programming-scripting/68686-random-lines-selection-form-file.html)
awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*9999999, $0; }' file_noHeader.sam | sort -n | head - 1000000| sed 's/^[0-9]* //' > randomReads.tmp

###join the header back to have randomly sampled million reads subset of original file
cat head randomReads.tmp > randomReads.sam

###remove the tmp files
rm  file_noHeader.sam randomReads.tmp

Ofcourse it can be more efficient and automated using pipes.

Also, you can save the script in a file, and replace the file name with $1, like

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*9999999, $0; }' $1 | sort -n | head - 1000000| sed 's/^[0-9]* //' > $1.tmp

and then call it like sh rand.sh file.sam, it will produce file.tmp

Cheers

Ram · Answer 3 · 2012-05-09

2

Entering edit mode

12.6 years ago

Wen.Huang ★ 1.2k

how about

shuf your.sam | head -n 1000000

ADD COMMENT • link 12.6 years ago by Wen.Huang ★ 1.2k

3

Entering edit mode

SAM files have a header section, as well as a body section, thus simply shuffling the file would mess up the overall structure. You could break the file into parts, as Sukhdeep suggests, and then shuf the body part.

ADD REPLY • link 12.6 years ago by seidel 11k

0

Entering edit mode

You can do like that from a BAM file:

samtools view -H your.bam > header.txt
samtools view your.bam > your.sam
shuf your.sam | head -n 1000000 > small.sam

(Because I don't know if you can pipe to shuf, maybe you can and we do not need to create the file small.sam)

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.9 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

actually, I just found you can use:

shuf your.sam -n 1000000 > small.sam

The option to limit the output is already given by shuf.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 8.2 years ago by Fabio Marroni ★ 3.0k

score 2 · Answer 4 · 2013-09-12

2

Entering edit mode

11.2 years ago

Matt Shirley 10k

Have you taken a look at Picard DownsampleSam? You just set PROBABILITY=n and you should get near n * 100 % of the original number of reads.

ADD COMMENT • link 11.2 years ago by Matt Shirley 10k

score 0 · Answer 5 · 2012-05-09

A simple workaround would be to sort your sam file by read names see the -n flag to samtools sort, then take the first, second etc 1 million to get a random sample without replacement.

In all honesty this approach most likely has some limitations since the read naming probably correlates to flow cell location that in turn may introduce some biases. I still think it could be useful and it is really easy to do.

score 0 · Answer 6 · 2013-09-12

One more possibility; assuming you have Illumina reads, use grep to grep out just the reads from a particular tile or two, preferably tiles not on the edge. That subset would have a slightly higher quality than the whole dataset, which includes reads on the edge of the flowcell, but that's probably a good thing.