Question

compute depth of coverage for certain regions of the genome using samtools depth

0

Entering edit mode

6.6 years ago

Ana ▴ 200

I have downloaded bam files of 1000 genomes project. I actually used samtools to download only certain regions of the genome, not the entire bam file by using this cmmand:

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | while read B; do echo -n "$B " && /data/programs/samtools-1.3.1/samtools  view -bu "http://ftp.1000genomes.ebi.ac.uk/vol1/$B" "1:115258827-115258827"  && rm *.chrom20.* && rm *.chrom11.* ; done

This gives me a Buch of files in bam format; fro example:

HG03270.mapped.ILLUMINA.bwa.ESN.low_coverage.20121211.bam.bai
HG03297.mapped.ILLUMINA.bwa.ESN.low_coverage.20130415.bam.bai

I want to compute the depth of coverage for each of bam.bai files. I used samtools depth command

/data/programs/samtools-1.3.1/samtools depth  NA21142.mapped.ILLUMINA.bwa.GIH.low_coverage.20130415.bam.bai

but I get an error message:

[E::hts_hopen] fail to open file 'NA21142.mapped.ILLUMINA.bwa.GIH.low_coverage.20130415.bam.bai' [E::hts_open_format] fail to open file 'NA21142.mapped.ILLUMINA.bwa.GIH.low_coverage.20130415.bam.bai' samtools depth: Could not open "NA21142.mapped.ILLUMINA.bwa.GIH.low_coverage.20130415.bam.bai": Success

Am I doing something wrong?

depth of coverage samtools • 6.5k views

ADD COMMENT • link updated 6.6 years ago by finswimmer 16k • written 6.6 years ago by Ana ▴ 200

1

Entering edit mode

Are you saying that you only have files that end in .bai and no corresponding .bam files without the .bai on the end?

ADD REPLY • link 6.6 years ago by GenoMax 147k

0

Entering edit mode

this is one example of the file that I get:

HG03270.mapped.ILLUMINA.bwa.ESN.low_coverage.20121211.bam.bai

it is a bam file with bai in the end. How can I get bam files without bai in the end?

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

Where did you find the code in original post? It does not seem to work for me at all.

ADD REPLY • link 6.6 years ago by GenoMax 147k

0

Entering edit mode

maybe you should add your own path to samtools!

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

I sure did. No CIGAR (pun intended).

ADD REPLY • link 6.6 years ago by GenoMax 147k

4

Entering edit mode

6.6 years ago

finswimmer 16k

Hello again,

this challenge was fun. But finaly I mastered it :)

The Task

Let me first summarize the task the OP give.

Run samtools depth for multiple region on all bam files listed in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree, except those haven .chrom20. or .chrom10. in the name.

The Problem(s)

Time

The first thought is, to run samtools depth directly on the bam file on the server, as samtools support remote file access. Trying this one will recognize that this take ages to complete.

In approvement might be to first use samtools view to download only the region of interest and running than samtools depth locally. With standard parameters this is faster, but takes also a very long time.

Contig names

Another problem is, that some files are mapped against hg19 (UCSC) which prefixed the contig name with "chr", e.g. "chr1", "chr2", ... and some against NCBI Build 37 where the contigs aren't prefixed, e.g. "1","2", ...

So if you try to query on a region named 1:115258827-115258827 you won't have success in any case.

The Solution

Prerequisite

Beside standard unix commands we need:

gnu parallel
samtools (version >= 1.7!)

0. Step: Create bed file

Before starting we need to create a bed file containing all regions of interest. To address the problem with the prefixed/non-prefixed contig we just have to add each region twice: one with and one without prefix. This will later produce a warning. But we can ignore it.

$ cat regions.bed
1   115258826   115258827
16  32297736    33456560
20  52479434    52483091
chr1    115258826   115258827
chr16   32297736    33456560
chr20   52479434    52483091

1. Step: Download region of interest

Since samtools version 1.7 there is an option called -M.

-M       Use the multi-region iterator on the union of the BED file and command-line region arguments. This avoids re-reading the same regions of files so can sometimes be much faster. Note this also removes duplicate sequences. Without this a sequence that overlaps multiple regions specified on the command line will be reported multiple times.

And this will increase our speed dramaticly.

Here is how it's go:

$ wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i ".chrom20.|.chrom11."| parallel -j 4 'samtools view -b -M -L regions.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/{} -o {/}'

With the -j parameter one can decide how many processes are started in parallel. You can set it of course higher, if you have mor CPUs and enough bandwith for downloading data.

2. Step: Reindex bam files

samtools view downloads the index files of the bam files. As we now have just an excerpt of the original file, we need to reindex the locally stored ones.

$ parallel -j 4 'samtools index {}' ::: *.bam

3. Step: Calculate depth

The last step is to run samtools depth on our files:

$ parallel -j 4 'samtools depth -b regions.bed  {} > {.}.depth' ::: *.bam

Further improvments

samtools view provide some more option that might be useful:

-1    Enable fast BAM compression (implies -b). 

-@ INT   Number of BAM compression threads to use in addition to main thread [0].

What I haven't checked, is whether there are multiple files with the same name in current.tree. If so this will lead to the problem that an already existings file will be overwritten, as all bam files are now stored in the same folder.

Have fun :)

fin swimmer

ADD COMMENT • link 6.6 years ago by finswimmer 16k

2

Entering edit mode

I guess one can remove the bam extension in the output (again user's choice though):

  $ parallel -j 4 'samtools depth -b regions.bed  {} > {.}.depth' ::: *.bam

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Good idea. I edited my post.

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

6.6 years ago

Ana ▴ 200

Hello everyone, thanks you all so much for your time and help. This finally worked.

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | while read B; do echo -n "$B " && /data/programs/samtools-1.3.1/samtools  view -b -o ${B##*/} "http://ftp.1000genomes.ebi.ac.uk/vol1/$B" "20:52479435-52483091"  && rm *.chrom20.* && rm *.chrom11.* ; done

I put the code in to a bash script and ran

nohup ./download_bam.sh &

ADD COMMENT • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

That is more or less the answer @cpad0112 gave. You should accept all answers as correct (you are able to do more than one).

ADD REPLY • link 6.6 years ago by GenoMax 147k

score 3 · Accepted Answer · 2018-04-26

3

Entering edit mode

6.6 years ago

finswimmer 16k

Hello,

bam.bai is the index file of your bam file. samtools depth should be run on the bam file and not on bam.bai.

fin swimmer

ADD COMMENT • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

What should I do now? you mean I cannot run the samtools on bam.bai files? Is there any way to convert bam to bam files?

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

Where should your bam.bai file come from?

I guess you have first have to adapt your samtools view command to save the output to a local bam file. You can do this by adding the -o parameter and specifieng the output name.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

I used the command above, samtools view bu -b produces files in bam format

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

No,

samtools view -bu just print the result in bam format to stdout (to your terminal). It doesn't save it into a file. This have to be done by -o or the unix way with > filename.bam

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

I cheated my code to this:

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | while read B; do echo -n "$B " && /data/programs/samtools-1.3.1/samtools  view -o  "http://ftp.1000genomes.ebi.ac.uk/vol1/$B" "1:115258827-115258827"  && rm *.chrom20.* && rm *.chrom11.* ; done

but I get and error message:

tp/phase3/data/HG03086/alignment/HG03086.mapped.ILLUMINA.bwa.MSL.low_coverage.20130415.bam [E::hts_open_format] fail to open file '1:115258827-115258827' samtools view: failed to open "1:115258827-115258827" for reading: Protocol not supported

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

I looked at tree and tree has both bam and bai files. Your code requires subsetting of bam files for which samtools downloads the remote index files. However, you have not provided output file format and output file name. Hence only bai files remain on your local machine (which are not .bam files for obvious reasons). In addition, you are removing chrom20 and chrom 11 in the loop. They should be grepped out before loop..

Till this point, your code is okay:

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | while read B; do echo -n "$B " ; done

Try working with a single file first. For eg. Download tree file and run the code on a single bam file first (as there are around 64455 bam files in the loop).

I hope you are aware that code downloads both mapped and unmapped files. Eg below:

ftp/pilot_data/data/NA18701/alignment/NA18701.SLX.maq.SRP000033.2009_09.bam; ftp/pilot_data/data/NA18701/alignment/NA18701.SLX.maq.SRP000033.2009_09.unmapped.bam

ADD REPLY • link 6.6 years ago by cpad0112 21k

1

Entering edit mode

This worked for me:

cut -f 1 test.tree | grep '.bam$' | while read B; do echo  "samtools view -b -o ${B##*/} http://ftp.1000genomes.ebi.ac.uk/vol1/$B 1:115258827-115258827"; done

I echoed function and independently ran the echoed URL and it working. Output file name is bam file. You are not furnishing the output file name and output format. In the above code, I supplied sample name. bam as output file and bam as output. I guess you are aware that you are extracting coverage for one base in your function. Keep this part out of the loop : rm .chrom20. && rm .chrom11..

test.tree (in your case current.tree) has only one bam with two other files:

$ cat test.tree 
ftp/pilot_data/data/NA18507/alignment/NA18507.SLX.maq.SRP000031.2009_08.bam file    7787005514  Sat Aug 15 01:28:49 2009    c755e38e66bb5267604f664a17a71670
ftp/pilot_data/data/NA18507/alignment/NA18507.SLX.maq.SRP000031.2009_08.bam.bas file    3147    Wed Sep 30 11:26:59 2009    78685fc1611af637d246877bc6fd2c08
ftp/pilot_data/data/NA18507/alignment/NA18507.SLX.maq.SRP000031.2009_08.bam.bai file    8028904 Sat Aug 15 01:28:51 2009    4676e6f3d87016b2654a888405be74d8

===================

your code should be some thing like this ( I added echo. Remove echo and " before you proceed. I aasumed that you wanted to remove any bam with chrom 11 and chrom20 names in them):

$ wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i "*.chrom20.*|*.chrom11.*"  | while read B; do echo  "samtools view -b -o ${B##*/} http://ftp.1000genomes.ebi.ac.uk/vol1/$B 1:115258827-115258827"; done

Few echoed URLs from loop are:

samtools view -b -o NA20822.unmapped.ILLUMINA.bwa.TSI.low_coverage.20130422.bam http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130513_p3l_alignment_data/NA20822/alignment/NA20822.unmapped.ILLUMINA.bwa.TSI.low_coverage.20130422.bam 1:115258827-115258827
samtools view -b -o HG02545.mapped.ILLUMINA.bwa.ACB.low_coverage.20121211.bam http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130513_p3l_alignment_data/HG02545/alignment/HG02545.mapped.ILLUMINA.bwa.ACB.low_coverage.20121211.bam 1:115258827-115258827
samtools view -b -o HG02545.unmapped.ILLUMINA.bwa.ACB.low_coverage.20121211.bam http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130513_p3l_alignment_data/HG02545/alignment/HG02545.unmapped.ILLUMINA.bwa.ACB.low_coverage.20121211.bam 1:115258827-115258827
samtools view -b -o HG02628.unmapped.ILLUMINA.bwa.GWD.low_coverage.20121211.bam http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130513_p3l_alignment_data/HG02628/alignment/HG02628.unmapped.ILLUMINA.bwa.GWD.low_coverage.20121211.bam 1:115258827-115258827
samtools view -b -o HG02628.mapped.ILLUMINA.bwa.GWD.low_coverage.20121211.bam http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130513_p3l_alignment_data/HG02628/alignment/HG02628.mapped.ILLUMINA.bwa.GWD.low_coverage.20121211.bam 1:115258827-115258827

You can download the bams with parallel, in addition to bash script above: ( I added dry-run to generate calls. Remove dry-run options to execute the command):

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i "*.chrom20.*|*.chrom11.*"  | parallel --dry-run samtools view -b -o {/} {} 1:115258827-115258827"

Seems there is some truncation with error/warning: [bam_header_read] EOF marker is absent. The input is probably truncated.. Try for few bams or one bam first for coordinates of interest. Samtools version i used is: 0.1.19-96b5f2294a (available in xenial/mint repos)

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

I wonder why you used an ancient version of samtools which mars an excellent effort. Can you test with a more current version of samtools or did it not work?

ADD REPLY • link 6.6 years ago by GenoMax 147k

0

Entering edit mode

I upgraded to recent samtools version 1.8 using conda/bioconda. No more such errors, but output is same. Btw, I use mint linux and mint linux always uses Ubuntu LTS (which is xenial now). Following are the query and result:

query and error/warning:

 $ samtools view -b -o NA18507.SLX.maq.SRP000031.2009_08.bam http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/NA18507/alignment/NA18507.SLX.maq.SRP000031.2009_08.bam 1:115258827-115258827
[E::knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged

output:

$ samtools view NA18507.SLX.maq.SRP000031.2009_08.bam
IL5_582:3:78:654:184    147 1   115258812   0   36M =   115258706   -142    TCCTTGTTGACATCTTTCTGTTGGACAGTCCACTTT    4192:3;;;8<::<<<;<<;<=;<;=><<9=;<<<>    RG:Z:ERR002344  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:2  H1:i:0
IL5_582:1:151:562:846   147 1   115258813   28  36M =   115258722   -127    CCTTGTTGACATCTTTCTGTTGGACAGTCCACTTTT    ,21:12;7+3;:;</5;<;==<=<=><<8><<<<<=    RG:Z:ERR002342  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:1  H1:i:1
IL10_578:8:256:348:128  163 1   115258824   30  36M =   115258938   150 TCTTTCTGTTGGACAGTCCACTTTTGCTTTCTAGGC    @=???>@@?@@5@=<??;>;=@@77@=??3;4:614    RG:Z:ERR002355  MF:i:18 Aq:i:30 NM:i:0  UQ:i:0  H0:i:1  H1:i:1

$ samtools depth NA18507.SLX.maq.SRP000031.2009_08.bam -r 1:115258827-115258827
1   115258827   3

@OP: Once you downloaded all the bams, reindex your bam files if you are further processing them for coverage.

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

/data/programs/samtools-1.3.1/samtools view -o "http://ftp.1000genomes.ebi.ac.uk/vol1/$B" "1:115258827-115258827"

This will not work, as -o expect the output filename. It should look like this:

/data/programs/samtools-1.3.1/samtools  view "http://ftp.1000genomes.ebi.ac.uk/vol1/$B" "1:115258827-115258827" -o outputfile.bam

As you are looping over many file outputfile.bam have to be replaced dynamicly based on the input file. I guess there is a way to read it out of your $B. But I'm not familiar with bash scripts.

Another problem I see in my test is the naming convention of the chromosom. The first file I tried doesn't have a chromosom "1" but "chr1". Don't know whether this is in all file in this way.

Another question:

What are the rm commands for?

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

@ admin: I deleted my post as I was trying to consolidate two posts into post. It seems along with my post and other posts also got deleted (sub posts).

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

I undeleted your post.

I have moved your final tested comment to a new answer.

ADD REPLY • link 6.6 years ago by GenoMax 147k

0

Entering edit mode

Thanks @ genomax

ADD REPLY • link 6.6 years ago by cpad0112 21k

score 1 · Accepted Answer · 2018-04-27

1

Entering edit mode

6.6 years ago

cpad0112 21k

@ OP:

Your code doesn't store the output and also doesn't mention output's format. Here is the working code:

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i "*.chrom20.*|*.chrom11.*"  | while read B; do echo  "samtools view -b -o ${B##*/} http://ftp.1000genomes.ebi.ac.uk/vol1/$B 1:115258827-115258827"; done

I added echo for dry run of commands. Remove echo for code execution. I hope you are aware of following points:

you are subsetting bam for one base
bam list contains mapped and unmapped for each sample

Note:

Reindex the bam files once bam files are downloaded.
Same can be done using parallel:

wget -q -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree" | cut -f 1 | grep '.bam$' | grep -Ev -i ".chrom20.|.chrom11." | parallel --dry-run samtools view -b -o {/} {} 1:115258827-115258827"

ADD COMMENT • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

thanks @cpad0112 for your help. yes, I have 22 intervals and this is just the first one that I am subsetting only for one base. why should I re-index the bam files? Have you ever downlodaded data from 1000 genomes project? it is just a bot strange, as I wrote below I put the code into a bash script and ran it in the background. But, I cannot download it for all individuals. The jobs terminate after nearly 10 hours without getting bam files for all individuals.Do have any idea what might be wrong?

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

To my knowledge, Index files (bai) that were downloaded are for entire sample. Using those index files for the probing downloaded bam files will throw an error. For eg. without reindexing, if you try samtools depth NA18507.SLX.maq.SRP000031.2009_08.bam -r 1:115258827-115258827, it would throw an error.
I didn't download the data. Probable cause is NCBI servers do not like get hammered by requests (probably as safety precaution). Some times, you may have to wait till you make a next request. Add some thing like "sleep 5s" in script loop. I guess you have enough storage. Some times, scripts run out due to lack of storage. Make sure system has enough memory to run the program and also your script is not killed by any process. I hope you know how to use screen or nohup.

Btw, if you are just looking for coverage, not interested in bam files, you can use samtools depth command direct. Some thing like this:

$ samtools depth -r 1:115258827-115258827  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/NA18507/alignment/NA18507.SLX.maq.SRP000031.2009_08.bam 
[E::knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged
1   115258827   3

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

The jobs terminate after nearly 10 hours without getting bam files for all individuals.Do have any idea what might be wrong?

You mean some are missing? Or did you get no data at all?

If only some are missing: Maybe you have overlook my post yesterday.

Another problem I see in my test is the naming convention of the chromosom. The first file I tried doesn't have a chromosom "1" but "chr1". Don't know whether this is in all files in this way.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

Did you try downloading one or two files manually with your script from CLI?

ADD REPLY • link 6.6 years ago by cpad0112 21k