Hello,
I am having a hard time opening a very large bedgraph. I have been suggested to split my bam file by chromosome with chrom-bed.py but it didn't work.
Is there any other alternative?
Thanks,
GP
Hello,
I am having a hard time opening a very large bedgraph. I have been suggested to split my bam file by chromosome with chrom-bed.py but it didn't work.
Is there any other alternative?
Thanks,
GP
in this other answer, Aaron Quinlan stated:
bamtools has a "split" command for exactly this purpose
I can only add that I've just tried it with this simple command
bamtools split -in file.bam -reference
and it works like a charm. the bam file gets split into different bam files, which are suffixed with .REF_xxx.bam by default, which is very convenient.
Hi! Thank you Jorge Amigo for suggesting bamtools split. I successfully installed bamtools and am wondering, whether someone can answer if it is possible to not split the whole bam into all chromosomes/ contigs, but only extract those chromosomes/ contigs needed..?
bamtools split
is intended to split the whole bam file. What you are describing is definititely not what it was designed for, and it would be then much easier to extract the derised chromomoses/contigs directly with samtools.
Either by an individual samtools command:
samtools view -bo out.chr21.bam in.bam chr21
Either by nested samtools commands:
for chr in chr20 chr21 chr22; do
samtools view -bo out.$chr.bam in.bam $chr
done
Try samtools: samtools view -?
A region should be presented in one of the following formats: `chr1',`chr2:1,000' and `chr3:1000-2,000'. When a region is specified, the input alignment file must be an indexed BAM file.
something like samtools view in.bam chr1 > chr1.bam
should work
You're missing the -b
flag (bam output). So something like:
​samtools view -b in.bam chr1 > in_chr1.bam
NOTE: The sequence dictionary (@SQ header lines) will still contain entries for everything. This can cause problems if the tools you're feeding those split bam files into use that header information. For example, picard CollectWgsMetrics
will still assume that the bam is supposed to cover the whole genome, not just a single chrom. I'm certain lots of other tools will have similar problems.
samtools view in.bam chr1 -b > out.bam
Use -b
to output bam format
I wrote a java tool to split a BAM per chromosome see http://code.google.com/p/jvarkit/wiki/SplitBam
It also creates an empty BAM (filled with a pair of mock SAMRecords) for each chromosome in the Reference, if no SAMRecord was found for the chromosome.
You can use the following pipeline to extract chrY reads from the raw bam files and with the header
samtools sort A.bam -o A.sort.bam
samtools index A.sort.bam
samtools view -H A.sort.bam > output.extraction.sam
samtools view A.sort.bam chrY >> output.extraction.sam
samtools view -hb output.extraction.sam > output.extraction.bam
samtools view -H output.extraction.bam
output.extraction.bam
is the bam file which extracted chrY reads.
This is true but still requires a sorted and indexed bam file. From the samtools manual:
Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).
samtools sort A.bam -o A.sort.bam
samtools index A.sort.bam
samtools view -b A.sort.bam chrY > output.extraction.bam
This soft can help all of you
There is also a nice blog post about this by Sam Nicholls here.
tl;dr (extracted from the blog post)
samtools view -H in.bam | grep -P '^@SQ' | cut -f 2 -d ':' | cut -f 1 | while read contig; do
samtools faidx reference.fa $contig > my_contig.fa
java -jar picard.jar CreateSequenceDictionary R=my_contig.fa O=my_contig.dict
java -jar picard.jar ReorderSam INPUT=in.bam OUTPUT=out-${contig}.bam REFERENCE=my_contig.fa S=true VERBOSITY=WARNING
rm my_contig.fa my_contig.dict
done
This will go through the input whole bam file (in.bam
) and make separate bams for all the contigs (out-${contig}.bam
) in the reference fasta (reference.fa
). The output bam files are compatible with most of the common bioinformatics software. Of course, you can skip the for
loop and just use the contig name of your choice instead.
samtools view -b in.bam chr1 > out.bam
did not work for me. But
samtools view -b in.bam 1 > out.bam
worked.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
How are you having a hard time opening the file? What happens? How did your script to split the bam file "not work"? These details may help people answer your question.
I am pretty much having trouble with the indexing and header of my bam file. I have indexed in various ways, but keep having trouble. Have tried chrom-bed.py and samtools view chr1 > chr1.bam. I have indexed with samtools and bamtools, I even tried sorting the file before indexing. Any suggestions?
bam != bedgraph, fyi.
I have indexed and when running samtools virew, I get the message fail to read the header from bam file. Any tips?
@ Madelaine: yes I have converted my bam files to bedgraph ones successfully. The problem is the size, that's why I want to split them by chromosome
Try samtools reheader to get the header back.
@genaro Oh, sorry.
You can also use picard CreateSequenceDictionary on the genome fasta and add that to the top of the sam file if the sam file is missing a header.