Hello again,
this challenge was fun. But finaly I mastered it :)
The Task
Let me first summarize the task the OP give.
Run samtools depth
for multiple region on all bam
files listed in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree, except those haven .chrom20.
or .chrom10.
in the name.
The Problem(s)
Time
The first thought is, to run samtools depth
directly on the bam
file on the server, as samtools
support remote file access. Trying this one will recognize that this take ages to complete.
In approvement might be to first use samtools view
to download only the region of interest and running than samtools depth
locally. With standard parameters this is faster, but takes also a very long time.
Contig names
Another problem is, that some files are mapped against hg19 (UCSC) which prefixed the contig name with "chr"
, e.g. "chr1"
, "chr2"
, ...
and some against NCBI Build 37 where the contigs aren't prefixed, e.g. "1"
,"2"
, ...
So if you try to query on a region named 1:115258827-115258827
you won't have success in any case.
The Solution
Prerequisite
Beside standard unix commands we need:
- gnu parallel
- samtools (version >= 1.7!)
0. Step: Create bed file
Before starting we need to create a bed
file containing all regions of interest. To address the problem with the prefixed/non-prefixed contig we just have to add each region twice: one with and one without prefix. This will later produce a warning. But we can ignore it.
$ cat regions.bed
1 115258826 115258827
16 32297736 33456560
20 52479434 52483091
chr1 115258826 115258827
chr16 32297736 33456560
chr20 52479434 52483091
1. Step: Download region of interest
Since samtools
version 1.7 there is an option called -M
.
-M Use the multi-region iterator on the union of the BED file and command-line region arguments. This avoids re-reading the same regions of files so can sometimes be much faster. Note this also removes duplicate sequences. Without this a sequence that overlaps multiple regions specified on the command line will be reported multiple times.
And this will increase our speed dramaticly.
Here is how it's go:
$ wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i ".chrom20.|.chrom11."| parallel -j 4 'samtools view -b -M -L regions.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/{} -o {/}'
With the -j
parameter one can decide how many processes are started in parallel. You can set it of course higher, if you have mor CPUs and enough bandwith for downloading data.
2. Step: Reindex bam files
samtools view
downloads the index
files of the bam
files. As we now have just an excerpt of the original file, we need to reindex the locally stored ones.
$ parallel -j 4 'samtools index {}' ::: *.bam
3. Step: Calculate depth
The last step is to run samtools depth
on our files:
$ parallel -j 4 'samtools depth -b regions.bed {} > {.}.depth' ::: *.bam
Further improvments
samtools view
provide some more option that might be useful:
-1 Enable fast BAM compression (implies -b).
-@ INT Number of BAM compression threads to use in addition to main thread [0].
What I haven't checked, is whether there are multiple files with the same name in current.tree
. If so this will lead to the problem that an already existings file will be overwritten, as all bam
files are now stored in the same folder.
Have fun :)
fin swimmer
Are you saying that you only have files that end in
.bai
and no corresponding.bam
files without the.bai
on the end?this is one example of the file that I get:
it is a
bam
file withbai
in the end. How can I get bam files without bai in the end?Where did you find the code in original post? It does not seem to work for me at all.
maybe you should add your own path to samtools!
I sure did. No CIGAR (pun intended).