Samtools And Region List
6
7
Entering edit mode
12.5 years ago
win ▴ 990

Hi all, Hope someone can help.

We can pass a region to samtools view such as 1:1000-1010.

Is it possible to pass a list of regions to samtools view and have the output into a single sam file i.e. (something like)

$ samtools view -h .bam regionlist.txt > mysam.sam

what would the syntax for this be?

also, in what format does the region list file should be in?

thanks, a

samtools list • 35k views
ADD COMMENT
0
Entering edit mode

Hello! Thanks everyone for help with a similar problem! I am now only struggling to keep the name of the region with the fragment (I had a bam file of fragments and chose the ones which overlap with the regions of interest (bed file with positions and names of regions) but I need the final file to remember which fragment belongs to which region name). Is it somehow possible?

Thank you! Lucie

ADD REPLY
20
Entering edit mode
10.9 years ago

Samtools have the -L option

samtools view -b -L ROI.bed  file.bam > ROI_file.bam

BUT -L does not use the samtools index so the search is slooow depending on how large is your bam file. In my benchmarks querying for 10, 100 or 1000 sequences take exactly the same time so seem that bed size does not matter too much. You need to do a bit of benchmarking if you prefer to do a -L or to do a loop (for a bam with 23 million reads take the same time the -L that writting all the stuff for looping ;-)) YMMV

$ time samtools view -b -L 100_ROI.bed file.bam > ROI.bam
real    0m38.831s
user    0m37.666s
sys    0m0.556s


$ time (samtools view -H file.bam > roi_xargs.sam; \
      cat 100_ROI.bed | perl -lane 'print "$F[0]:$F[1]-$F[2]"' | xargs -n1 -t -I{} samtools view file.bam {} >> roi_xargs.sam; \
      samtools view -bSh roi_xargs.sam > roi_xargs.bam \
  )
real   0m7.188s
user   0m5.080s
sys    0m1.304s

And if you feel adventurous and your disks are using isilon or lustre you can use gnu-parallel and forget about same file concurrency and do it in parallel in a breeze. The only issue is that you would need to do the bam merge afterwards.

ADD COMMENT
4
Entering edit mode

BUT -L does not use the samtools index so the search is slooow depending on how large is your bam file.

update 2022 it is now much faster with option -M

  -M, --use-index            Use index and multi-region iterator for regions
ADD REPLY
0
Entering edit mode

Thank you!!! This tips it is really useful improve the velocity.

ADD REPLY
0
Entering edit mode

Hi Pablo,

I want to extract reads from bam file using Samtools view from multiple regions.The regions I have are present in the VCF file,which I want to use.So there are not a small number of regions.

So,I can input VCF file instead of BED file or I have to do some conversions?

ADD REPLY
4
Entering edit mode
12.5 years ago

From the samtools documentation:

 -L FILE  output alignments overlapping the input BED FILE [null]

samtools view -L regionlist.bed -h in.bam > out.sam

The region list file would be in BED format.

ADD COMMENT
1
Entering edit mode

The -L option is much slower than directly specifying the region in the command line. I am wondering if anyone find the way to run -L faster.

ADD REPLY
0
Entering edit mode

I also noticed -L is really slow with samtools 0.1.19-44428cd, considering a bed file with only, say, 100 entries. Any ideas why?

ADD REPLY
1
Entering edit mode

That is because the -L does not use the index so scan the full file.

ADD REPLY
0
Entering edit mode

I'm having trouble find that in the documentation:

http://samtools.sourceforge.net/samtools.shtml

it might be renamed as -l (lowercase L)

ADD REPLY
0
Entering edit mode

The documentation I posted was taken from: Version: 0.1.18 (r982:295)

ADD REPLY
0
Entering edit mode

I see it now. I wish I could delete my answer.

ADD REPLY
0
Entering edit mode

I think you can, but it's still a nice answer.

ADD REPLY
0
Entering edit mode

The online documentation for samtools does not seem to be completely up to date. But if you have it installed you can run it without options and get an accurate list of current parameters and options

ADD REPLY
2
Entering edit mode
12.5 years ago

You can also use tabix, which is an index and search tool developed by the Samtools author. Here is an example to fit your needs:

Your SAM file should be coordinate sorted. If it is not, then use gnu sort.

sort -k 3,3 -k 4,4n mysam.sam > mysam.sorted.sam

Then, tabix needs a bgzip compressed file.

bgzip mysam.sorted.sam

Then you need tabix to index your file. This will create a file ending in .tbi

tabix -p sam mysam.sorted.sam.gz

Lastly, you may retrieve your regions in this format:

tabix mysam.sorted.sam.gz chr1:1000-1010 chr2:2000-2020 ... > mysam.sorted.regions.sam
ADD COMMENT
1
Entering edit mode
12.5 years ago
cl.parallel ▴ 140

You could use a for loop in a shell script to do this.

# ouput header only, once
samtools view -H aln.bam > mysam.sam
while read line;do
    samtools view aln.bam $line >> mysam.sam
done < regions.txt

where regions.txt is line-based, with each line being:

ref1:start-end

or xargs:

samtools view -H aln.bam > mysam.sam
xargs -a regions.txt -I {} samtools view aln.bam {} >> mysam.sam
ADD COMMENT
0
Entering edit mode

thanks, this is great. need one more slight detail.

when you say ref1:start-end, i just want to make sure ref1 refers to chromosome number, right?

ADD REPLY
0
Entering edit mode

yep, chrName:startIndex-endIndex

ADD REPLY
0
Entering edit mode

Also, you may want to use the -b flag to keep it in binary format for speed and space:

samtools view -bH aln.bam >> mysam.bam
...
samtools view -b aln.bam $line >> mysam.bam
ADD REPLY
0
Entering edit mode

This is a nice, simple solution, but I would urge you to consider tabix if you plan to grab many regions since it should give you much better performance than a shell loop.

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode
8.9 years ago
Tom Koch ▴ 110

Sorry for the bump (it's for future Google visitors), I used cl.parallel's answer as slight guidance but I wanted to get the same region across a set of reads, so I wrote a script. Uses a BED file as input.

for i in $( ls *.bam ); do
    samtools view $i -H > OUTPUTPREFIX${i%.bam}.sam 
    while read -r -a myArray; do 
        samtools view $i ${myArray[0]}:${myArray[1]}-${myArray[2]} >> OUTPUTPREFIX${i%.bam}.sam
     done < INPUTFILE.bed 
done
ADD COMMENT
0
Entering edit mode
4.9 years ago
wangyu.bgi • 0

This is my solution so far and hope it helps. The key to the question is: samtools view accept multiple position. This is useful for small files.

samtools view -O BAM -o OUTPUT.bam INPUT.bam `perl -ane '{print "$F[0]:$F[1]-$F[2] "}' INPUT.bed `

where INPUT.bam is your BAM file

OUTPUT.bam is your output.

INPUT.bed is a file with the format of chr, start and end.

ADD COMMENT

Login before adding your answer.

Traffic: 2161 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6