Split BAM file into two subsamples
3
1
Entering edit mode
8.8 years ago
colin.kern ★ 1.1k

I want to create two random subsamples from a BAM file. I don't want to just do "samtools view -s 0.5" twice because I want to put half of the reads into one file and half in the other, without replacement. Is there an easy way to do this? I've considered using samtools to output the alignments in SAM format and pipe to the shuf and split commands, I think I'd need to do some extra things to keep the paired-end reads together. Anyone have any ideas?

alignment • 9.6k views
ADD COMMENT
0
Entering edit mode

"since reads are four lines " uh ?

ADD REPLY
0
Entering edit mode

Nevermind, I was thining of fastq

ADD REPLY
0
Entering edit mode

Hi, Colin,

I have the same question. I wonder if you already have a solution to solve this problem?

Regards,

Xiao

ADD REPLY
4
Entering edit mode
8.8 years ago

Assuming single-end reads, something like this should work:

samtools view -H foo.bam > f1.sam
cp f1.sam f2.sam
samtools view foo.bam | awk '{if(NR%2){print >> "f1.sam"} else {print >> "f2.sam"}}'

I'm sure there's a one line version of that, but this is simple enough.

For paired-end reads, just change NR%2 to NR%4<2. Again, there's likely a bug in there, but it's something to start with.

ADD COMMENT
0
Entering edit mode

Won't this work?

wc -l file.sam # line_number/4, say we've got 1000
sed -n 1,500p file.sam > first_half.sam
sed -n 501,1000p file.sam > second_half.sam
ADD REPLY
1
Entering edit mode

"file.sam" would contain a header, which I assume should be kept completely. Aside from that, sure, one could use sed instead. There are many ways to skin this proverbial cat. I'd personally use python with pysam, since then it'd be simple to handle singletons and such.

ADD REPLY
0
Entering edit mode

Yeah, I would've taken care of it if I thought of header. I just wanted to confirm what I was trying to solve this problem was correct. Thank you.

ADD REPLY
0
Entering edit mode

Does this do a random sample? If I'm understanding how it works, it just alternates putting reads into each file, which if the bam is sorted wouldn't really be random.

ADD REPLY
0
Entering edit mode

You would name sort the BAM file first, the results would then be random enough.

ADD REPLY
1
Entering edit mode
7.0 years ago
brian.l.hill ▴ 10

I have created a Python script to do this:

Script to split input BAM file randomly (without replacement) into two BAM files

Hopefully this is useful.

ADD COMMENT
0
Entering edit mode

Hi, Brian, Thanks for sharing the script. I tried it but I have the error as below: python split_bam.py -b 1-A5_1.bam -o1 1-A5_1_part1.bam -o2 1-A5_1_part2.bam

File "split_bam.py", line 18 print "ERROR: BAM file does not exist!" ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: BAM file does not exist!")?

I would appreciate it if you could give me some inputs. My bam file is from pair-end reads alignment.

Regards,

Xiao

ADD REPLY
0
Entering edit mode
8.8 years ago
Folder40g ▴ 190

Not the best way but a fast one could be grep all names of reads in sam file to a new file. With sort -R option shuffle them and then grep or intersect the n numbers names shuffled in this file created with the bam file.

You first need to convert the bam into sam.

Not sure if I explained myself well enough.

ADD COMMENT

Login before adding your answer.

Traffic: 1812 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6