How to Split 3000 WGS CRAM files into 1Mbp length chunks
1
0
Entering edit mode
18 months ago
Sd • 0

Hello, I have 3000 WGS CRAM files and I want to split them into 1Mbp chunks. I want to split with exact genomic coordinate locations, e.g. starting from 1 to 1000000bp, 1000001bp to 2000000bp, 2000001bp to 3000000 etc. for all chromosomes. Therefore, each chunks have the similar corresponding region in each sample. Is there any way that I can do this?

CRAM • 945 views
ADD COMMENT
0
Entering edit mode

in addition to the good answers provided, might be useful/helpful to look at tabix. Basically, for a one time cost, tabix will index a file, making subsequent searches for a range of data ultra fast.

ADD REPLY
1
Entering edit mode
18 months ago
cmdcolin ★ 4.0k

this is untested, but a possible quick shell script approach

# make one megabase windows and convert to 'locstrings' that samtools can use
bedtools makewindows -g hg19.chrom.sizes -w 1000000 | awk -F '\t' '{printf("%s:%d-%s\n",$1,int($2)+1,$3);}' > out.txt

for i in *.cram; do
    mkdir $i_folder # make subfolders, because otherwise you may exceed the amount of files allowed in a single directory on linux
    while read p; do
        samtools view -T ref.fa $i $p -o $i_folder/$i_$p.cram
    done < out.txt;
done;

i would question whether this is truly necessary, but it's your call

ADD COMMENT
2
Entering edit mode

a really minor point : bed is 0 based but samtools use 1-based coordinates. So it's should be something like:

bedtools makewindows -g hg19.chrom.sizes -w 1000000 | awk -F '\t' '{printf("%s:%d-%s\n",$1,int($2)+1,$3);}'
ADD REPLY
2
Entering edit mode

updated answer :+1:

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6