Hello! Thanks everyone for help with a similar problem!
I am now only struggling to keep the name of the region with the fragment (I had a bam file of fragments and chose the ones which overlap with the regions of interest (bed file with positions and names of regions) but I need the final file to remember which fragment belongs to which region name). Is it somehow possible?
BUT -L does not use the samtools index so the search is slooow depending on how large is your bam file. In my benchmarks querying for 10, 100 or 1000 sequences take exactly the same time so seem that bed size does not matter too much. You need to do a bit of benchmarking if you prefer to do a -L or to do a loop (for a bam with 23 million reads take the same time the -L that writting all the stuff for looping ;-)) YMMV
$ time samtools view -b -L 100_ROI.bed file.bam > ROI.bam
real 0m38.831s
user 0m37.666s
sys 0m0.556s
$ time (samtools view -H file.bam > roi_xargs.sam; \
cat 100_ROI.bed | perl -lane 'print "$F[0]:$F[1]-$F[2]"' | xargs -n1 -t -I{} samtools view file.bam {} >> roi_xargs.sam; \
samtools view -bSh roi_xargs.sam > roi_xargs.bam \
)
real 0m7.188s
user 0m5.080s
sys 0m1.304s
And if you feel adventurous and your disks are using isilon or lustre you can use gnu-parallel and forget about same file concurrency and do it in parallel in a breeze. The only issue is that you would need to do the bam merge afterwards.
I want to extract reads from bam file using Samtools view from multiple regions.The regions I have are present in the VCF file,which I want to use.So there are not a small number of regions.
So,I can input VCF file instead of BED file or I have to do some conversions?
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 10.8 years ago by
Ron
★
1.2k
The online documentation for samtools does not seem to be completely up to date. But if you have it installed you can run it without options and get an accurate list of current parameters and options
This is a nice, simple solution, but I would urge you to consider tabix if you plan to grab many regions since it should give you much better performance than a shell loop.
Sorry for the bump (it's for future Google visitors), I used cl.parallel's answer as slight guidance but I wanted to get the same region across a set of reads, so I wrote a script. Uses a BED file as input.
for i in $( ls *.bam ); do
samtools view $i -H > OUTPUTPREFIX${i%.bam}.sam
while read -r -a myArray; do
samtools view $i ${myArray[0]}:${myArray[1]}-${myArray[2]} >> OUTPUTPREFIX${i%.bam}.sam
done < INPUTFILE.bed
done
Hello! Thanks everyone for help with a similar problem! I am now only struggling to keep the name of the region with the fragment (I had a bam file of fragments and chose the ones which overlap with the regions of interest (bed file with positions and names of regions) but I need the final file to remember which fragment belongs to which region name). Is it somehow possible?
Thank you! Lucie