Hello again,
this challenge was fun. But finaly I mastered it :)
The Task
Let me first summarize the task the OP give.
Run samtools depth for multiple region on all bam files listed in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree, except those haven .chrom20. or .chrom10. in the name.
The Problem(s)
Time
The first thought is, to run samtools depth directly on the bam file on the server, as samtools support remote file access. Trying this one will recognize that this take ages to complete.
In approvement might be to first use samtools view to download only the region of interest and running than samtools depth locally. With standard parameters this is faster, but takes also a very long time.
Contig names
Another problem is, that some files are mapped against hg19 (UCSC) which prefixed the contig name with "chr", e.g. "chr1", "chr2", ... and some against NCBI Build 37 where the contigs aren't prefixed, e.g. "1","2", ...
So if you try to query on a region named 1:115258827-115258827 you won't have success in any case.
The Solution
Prerequisite
Beside standard unix commands we need:
- gnu parallel
- samtools (version >= 1.7!)
0. Step: Create bed file
Before starting we need to create a bed file containing all regions of interest. To address the problem with the prefixed/non-prefixed contig we just have to add each region twice: one with and one without prefix. This will later produce a warning. But we can ignore it.
$ cat regions.bed
1 115258826 115258827
16 32297736 33456560
20 52479434 52483091
chr1 115258826 115258827
chr16 32297736 33456560
chr20 52479434 52483091
1. Step: Download region of interest
Since samtools version 1.7 there is an option called -M.
-M Use the multi-region iterator on the union of the BED file and command-line region arguments. This avoids re-reading the same regions of files so can sometimes be much faster. Note this also removes duplicate sequences. Without this a sequence that overlaps multiple regions specified on the command line will be reported multiple times.
And this will increase our speed dramaticly.
Here is how it's go:
$ wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i ".chrom20.|.chrom11."| parallel -j 4 'samtools view -b -M -L regions.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/{} -o {/}'
With the -j parameter one can decide how many processes are started in parallel. You can set it of course higher, if you have mor CPUs and enough bandwith for downloading data.
2. Step: Reindex bam files
samtools view downloads the index files of the bam files. As we now have just an excerpt of the original file, we need to reindex the locally stored ones.
$ parallel -j 4 'samtools index {}' ::: *.bam
3. Step: Calculate depth
The last step is to run samtools depth on our files:
$ parallel -j 4 'samtools depth -b regions.bed {} > {.}.depth' ::: *.bam
Further improvments
samtools view provide some more option that might be useful:
-1 Enable fast BAM compression (implies -b).
-@ INT Number of BAM compression threads to use in addition to main thread [0].
What I haven't checked, is whether there are multiple files with the same name in current.tree. If so this will lead to the problem that an already existings file will be overwritten, as all bam files are now stored in the same folder.
Have fun :)
fin swimmer
Are you saying that you only have files that end in
.baiand no corresponding.bamfiles without the.baion the end?this is one example of the file that I get:
it is a
bamfile withbaiin the end. How can I get bam files without bai in the end?Where did you find the code in original post? It does not seem to work for me at all.
maybe you should add your own path to samtools!
I sure did. No CIGAR (pun intended).