Question

How To Split A Bam File By Chromosome

11

Entering edit mode

12.9 years ago

GPR ▴ 390

Hello,

I am having a hard time opening a very large bedgraph. I have been suggested to split my bam file by chromosome with chrom-bed.py but it didn't work.

Is there any other alternative?

Thanks,
GP

bam • 125k views

ADD COMMENT • link updated 15 months ago by Ram 45k • written 12.9 years ago by GPR ▴ 390

1

Entering edit mode

How are you having a hard time opening the file? What happens? How did your script to split the bam file "not work"? These details may help people answer your question.

ADD REPLY • link 12.9 years ago by Alex Paciorkowski 3.5k

0

Entering edit mode

I am pretty much having trouble with the indexing and header of my bam file. I have indexed in various ways, but keep having trouble. Have tried chrom-bed.py and samtools view chr1 > chr1.bam. I have indexed with samtools and bamtools, I even tried sorting the file before indexing. Any suggestions?

ADD REPLY • link 12.9 years ago by GPR ▴ 390

1

Entering edit mode

bam != bedgraph, fyi.

ADD REPLY • link 12.9 years ago by Madelaine Gogol 5.3k

0

Entering edit mode

I have indexed and when running samtools virew, I get the message fail to read the header from bam file. Any tips?

@ Madelaine: yes I have converted my bam files to bedgraph ones successfully. The problem is the size, that's why I want to split them by chromosome

ADD REPLY • link 12.9 years ago by GPR ▴ 390

0

Entering edit mode

Try samtools reheader to get the header back.

ADD REPLY • link 12.9 years ago by Michael 55k

0

Entering edit mode

@genaro Oh, sorry.

You can also use picard CreateSequenceDictionary on the genome fasta and add that to the top of the sam file if the sam file is missing a header.

ADD REPLY • link 12.9 years ago by Madelaine Gogol 5.3k

3

Entering edit mode

9.2 years ago

Shicheng Guo ★ 9.6k

You can use the following pipeline to extract chrY reads from the raw bam files and with the header

samtools sort A.bam -o A.sort.bam
samtools index A.sort.bam
samtools view -H A.sort.bam > output.extraction.sam
samtools view A.sort.bam chrY >> output.extraction.sam
samtools view -hb output.extraction.sam > output.extraction.bam
samtools view  -H output.extraction.bam

output.extraction.bam is the bam file which extracted chrY reads.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.2 years ago by Shicheng Guo ★ 9.6k

3

Entering edit mode

This isn't really an answer to the original question.

Plus if you want to extract chrY reads you only need a single command:

samtools view -b in.bam chrY > out.bam

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.2 years ago by Jorge Amigo 14k

2

Entering edit mode

This is true but still requires a sorted and indexed bam file. From the samtools manual:

Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).

samtools sort A.bam -o A.sort.bam
samtools index A.sort.bam
samtools view -b A.sort.bam chrY > output.extraction.bam

ADD REPLY • link 6.9 years ago by goodez ▴ 640

0

Entering edit mode

Using -b produces smaller bam than skipping it. Why so? For chrX, for example, I observed a bam of 2.1 GB with -b option but a 12 GB BAM. And they can be viewed with less or head command which is quite unusual.

ADD REPLY • link 4.5 years ago by rohitsatyam102 ▴ 930

1

Entering edit mode

7.2 years ago

hewm2008 ▴ 50

This soft can help all of you

https://github.com/BGI-shenzhen/BamSplit

ADD COMMENT • link 7.2 years ago by hewm2008 ▴ 50

0

Entering edit mode

well done. xiao-ming

ADD REPLY • link 5.2 years ago by Lhl ▴ 760

0

Entering edit mode

4.2 years ago

opplatek ▴ 300

There is also a nice blog post about this by Sam Nicholls here.

tl;dr (extracted from the blog post)

samtools view -H in.bam | grep -P '^@SQ' | cut -f 2 -d ':' | cut -f 1 | while read contig; do
    samtools faidx reference.fa $contig > my_contig.fa
    java -jar picard.jar CreateSequenceDictionary R=my_contig.fa O=my_contig.dict
    java -jar picard.jar ReorderSam INPUT=in.bam OUTPUT=out-${contig}.bam REFERENCE=my_contig.fa S=true VERBOSITY=WARNING
    rm my_contig.fa my_contig.dict
done

This will go through the input whole bam file (in.bam) and make separate bams for all the contigs (out-${contig}.bam) in the reference fasta (reference.fa). The output bam files are compatible with most of the common bioinformatics software. Of course, you can skip the for loop and just use the contig name of your choice instead.

ADD COMMENT • link 4.2 years ago by opplatek ▴ 300

0

Entering edit mode

3.7 years ago

etiennedanis ▴ 20

samtools view -b in.bam chr1 > out.bam

did not work for me. But

samtools view -b in.bam 1  > out.bam

worked.

ADD COMMENT • link 3.7 years ago by etiennedanis ▴ 20

0

Entering edit mode

Of course it depends on how your chromosomes are defined. If your bam file comes from a hg19 mapping you'll have to use "chr1", and if your bam file comes from a b37 mapping you'll have to use only "1".

ADD REPLY • link 3.7 years ago by Jorge Amigo 14k

0

Entering edit mode

Good to know! I used mm10.

ADD REPLY • link 3.7 years ago by etiennedanis ▴ 20

0

Entering edit mode

You can always check for that information in advance by inspecting the bam file's header:

samtools view -H file.bam | perl -lne '/SN:(\S+)/ and print $1'

ADD REPLY • link 3.7 years ago by Jorge Amigo 14k

0

Entering edit mode

Very nice! It does. I also found that

samtools idxstats file.bam

gives this info too. It is a good idea to check when using a new genome or a different version. Thank you!

ADD REPLY • link 3.7 years ago by etiennedanis ▴ 20

0

Entering edit mode

How should I get more than 1 chromosome? samtools view -b in.bam [chr1, chr2] > out.bam ? Thanks

ADD REPLY • link 3.1 years ago by Dan ▴ 180

0

Entering edit mode

I found the correct code is samtools view -b in.bam chr1 chr2 > out.bam

ADD REPLY • link 3.1 years ago by Dan ▴ 180

score 27 · Accepted Answer · 2013-04-24

27

Entering edit mode

12.0 years ago

Jorge Amigo 14k

in this other answer, Aaron Quinlan stated:

bamtools has a "split" command for exactly this purpose

I can only add that I've just tried it with this simple command

bamtools split -in file.bam -reference

and it works like a charm. the bam file gets split into different bam files, which are suffixed with .REF_xxx.bam by default, which is very convenient.

ADD COMMENT • link 12.0 years ago by Jorge Amigo 14k

1

Entering edit mode

Is it possible to specify output directory, and to mention which specify chromosome needs to be extracted (for eg., just chr5 and chr22)?

ADD REPLY • link 7.1 years ago by mg ▴ 290

0

Entering edit mode

Hi! Thank you Jorge Amigo for suggesting bamtools split. I successfully installed bamtools and am wondering, whether someone can answer if it is possible to not split the whole bam into all chromosomes/ contigs, but only extract those chromosomes/ contigs needed..?

ADD REPLY • link 18 months ago by ella • 0

0

Entering edit mode

bamtools split is intended to split the whole bam file. What you are describing is definititely not what it was designed for, and it would be then much easier to extract the derised chromomoses/contigs directly with samtools.

Either by an individual samtools command:

samtools view -bo out.chr21.bam in.bam chr21

Either by nested samtools commands:

for chr in chr20 chr21 chr22; do
 samtools view -bo out.$chr.bam in.bam $chr
done

ADD REPLY • link 18 months ago by Jorge Amigo 14k

0

Entering edit mode

This has the advantage of only generating a bamfile for the reference that you want.

ADD REPLY • link 7.5 years ago by russ.bainer • 0

Ram · Accepted Answer · 2012-06-05

19

Entering edit mode

12.9 years ago

Michael 55k

Try samtools: samtools view -?

A region should be presented in one of the following formats: `chr1',`chr2:1,000' and `chr3:1000-2,000'. When a region is specified, the input alignment file must be an indexed BAM file.

something like samtools view in.bam chr1 > chr1.bam should work

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 12.9 years ago by Michael 55k

12

Entering edit mode

You're missing the -b flag (bam output). So something like:

samtools view -b in.bam chr1 > in_chr1.bam

NOTE: The sequence dictionary (@SQ header lines) will still contain entries for everything. This can cause problems if the tools you're feeding those split bam files into use that header information. For example, picard CollectWgsMetrics will still assume that the bam is supposed to cover the whole genome, not just a single chrom. I'm certain lots of other tools will have similar problems.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 10.5 years ago by travcollier ▴ 210

1

Entering edit mode

Is there a way to prevent each bam file to have entries for everything in the sequence dictionary (@SQ)? Or is there a way to recreate/parse/filter the header for each specific bam?

ADD REPLY • link 6.7 years ago by uribe.convers ▴ 10

0

Entering edit mode

I would try the bamtools way (Jorge Amigo's answer) instead. It might be possible to parse and filter the entries going through text format, but it can also easily mess up everything.

ADD REPLY • link 5.9 years ago by Michael 55k

0

Entering edit mode

A solution would be converting bam to fastq file, then use the fastq to map to specific chromosome

ADD REPLY • link 8.9 years ago by Chen Sun ★ 1.1k

0

Entering edit mode

This is silly as it's slow and the file wouldn't be indexed.

ADD REPLY • link 8.8 years ago by scchess ▴ 640

0

Entering edit mode

need to specify the output format to be BAM with option -O BAM or with -b

ADD REPLY • link 20 months ago by ammar.husami • 0

Ram · Accepted Answer · 2013-07-24

16

Entering edit mode

11.7 years ago

SHI Quan ▴ 160

samtools view in.bam chr1 -b > out.bam

Use -b to output bam format

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 11.7 years ago by SHI Quan ▴ 160

Ram · Accepted Answer · 2013-04-24

6

Entering edit mode

12.0 years ago

Pierre Lindenbaum 166k

I wrote a java tool to split a BAM per chromosome see http://code.google.com/p/jvarkit/wiki/SplitBam

It also creates an empty BAM (filled with a pair of mock SAMRecords) for each chromosome in the Reference, if no SAMRecord was found for the chromosome.

ADD COMMENT • link updated 15 months ago by Ram 45k • written 12.0 years ago by Pierre Lindenbaum 166k