Question

Split fai (fasta index) or VCF into N nearly equal total length chunks

0

6.5 years ago

FatihSarigol ▴ 260

This is easy to do manually for a single genome, but I need to write code that splits a FASTA index into sets of scaffolds, keeping the original order, so that the sets have nearly equal total length.

An example would be (only using the first 2 columns of an fai file):

Scaffold1 100

Scaffold2 50

Scaffold3 200

Scaffold4 500

If I want to split it into 2, my code should put Scaffold1, Scaffold2 and Scaffold3 (total length 350) in the first file and Scaffold4 (length 500) in the second file. If I want to split it into 3, it should put Scaffold1 and Scaffold2 (total 150), Scaffold3 (length 200), and Scaffold4 (length 500) into 3 separate files.
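One way to get the splits described in this example is a greedy pass over the index. The sketch below is only an illustration under stated assumptions, not an existing tool: `read_fai` and `split_contiguous` are hypothetical names, the input is assumed to be the first two whitespace-separated columns of an fai file, and the rule used (keep adding scaffolds while that brings the chunk closer to the remaining total divided by the remaining chunks, always leaving at least one scaffold for each later chunk) is a heuristic that is not guaranteed to be optimal.

```python
def read_fai(lines):
    """Parse (name, length) pairs from the first two columns of fai lines."""
    scaffolds = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            scaffolds.append((fields[0], int(fields[1])))
    return scaffolds


def split_contiguous(scaffolds, n_chunks):
    """Greedily split scaffolds, in order, into n_chunks contiguous groups.

    Heuristic, not an optimal partition. If there are fewer scaffolds than
    requested chunks, one chunk per scaffold is returned.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    n_chunks = min(n_chunks, len(scaffolds))
    chunks = []
    remaining_total = sum(length for _, length in scaffolds)
    i = 0
    for k in range(n_chunks):
        chunks_left = n_chunks - k
        if chunks_left == 1:
            # The last chunk takes everything that is left.
            chunks.append(scaffolds[i:])
            break
        target = remaining_total / chunks_left
        chunk = [scaffolds[i]]
        size = scaffolds[i][1]
        i += 1
        # Only take another scaffold if each later chunk can still get one.
        while len(scaffolds) - i > chunks_left - 1:
            next_length = scaffolds[i][1]
            # Stop when adding the next scaffold moves us away from the target.
            if abs(size + next_length - target) > abs(size - target):
                break
            chunk.append(scaffolds[i])
            size += next_length
            i += 1
        chunks.append(chunk)
        remaining_total -= size
    return chunks


example = read_fai(["Scaffold1 100", "Scaffold2 50", "Scaffold3 200", "Scaffold4 500"])
for n in (2, 3):
    names = [[name for name, _ in chunk] for chunk in split_contiguous(example, n)]
    print(n, names)
```

On the four-scaffold example, this reproduces the groupings described for 2 and 3 chunks. Real genomes with one very large scaffold can still give uneven chunks, since a scaffold is never divided.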

I need this for genomes with over 30,000 scaffolds, so that I can split a set of jobs and run them simultaneously on multiple sets of VCF regions. Is there a program that does this, or does anyone have simple code or a suggestion for how to write it?

Update: If any program can do this on the VCF directly, splitting it into sets of scaffolds with nearly equal total length, that would be even better!

Note: It is also easy to split a VCF by lines, but I can't use that, because the same scaffold must not end up in more than one file. Each scaffold should exist in exactly one file after splitting (a file can of course contain multiple scaffolds).
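For working from the VCF directly, one possible starting point is the VCF header: many VCFs list their scaffolds in `##contig=<ID=...,length=...>` lines, which carry the same name and length information as the first two fai columns. The sketch below assumes those lines are present and include a `length` field (this is not guaranteed for every VCF); `contig_lengths_from_vcf_header` is a hypothetical helper name.

```python
import re

# Patterns for the ID and length fields inside a ##contig=<...> header line.
ID_RE = re.compile(r"[<,]ID=([^,>]+)")
LENGTH_RE = re.compile(r"[<,]length=(\d+)")


def contig_lengths_from_vcf_header(lines):
    """Collect (name, length) pairs from ##contig header lines, in file order.

    Assumes the header declares contigs with a length; contig lines that
    lack a length are skipped.
    """
    scaffolds = []
    for line in lines:
        if not line.startswith("#"):
            break  # header is over, the variant records start here
        if not line.startswith("##contig="):
            continue
        id_match = ID_RE.search(line)
        length_match = LENGTH_RE.search(line)
        if id_match and length_match:
            scaffolds.append((id_match.group(1), int(length_match.group(1))))
    return scaffolds


header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=Scaffold1,length=100>",
    "##contig=<ID=Scaffold2,length=50>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Scaffold1\t10\t.\tA\tG\t.\t.\t.",
]
print(contig_lengths_from_vcf_header(header))
```

The resulting list can then be grouped by total length in the same way as an fai file, so each scaffold still ends up in exactly one group.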

Update 2: It would also work to split into chunks of roughly 50 million bases each, for example: when the summed lengths of consecutive scaffolds reach 50 million, the code outputs those scaffolds and starts summing the next ones until they reach 50 million again.
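The threshold variant in Update 2 is simpler than splitting into a fixed number of chunks, because it needs only a running sum. A minimal sketch, assuming the scaffolds are already available as (name, length) pairs; `split_by_target` is a hypothetical name, and the last chunk may be smaller than the target:

```python
def split_by_target(scaffolds, target_bases):
    """Group consecutive scaffolds until their summed length reaches target_bases."""
    chunks = []
    current = []
    size = 0
    for name, length in scaffolds:
        current.append((name, length))
        size += length
        if size >= target_bases:
            # Threshold reached: emit this group and start a new one.
            chunks.append(current)
            current = []
            size = 0
    if current:
        # Leftover scaffolds that never reached the threshold.
        chunks.append(current)
    return chunks


example = [("Scaffold1", 100), ("Scaffold2", 50), ("Scaffold3", 200), ("Scaffold4", 500)]
for chunk in split_by_target(example, 300):
    print([name for name, _ in chunk], sum(length for _, length in chunk))
```

A single scaffold longer than the target still forms its own chunk, since scaffolds are never divided.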

Thanks

genome faidx • 2.7k views

ADD COMMENT • link 6.5 years ago by FatihSarigol ▴ 260

2

The index file should be relatively small. You could read it from a central location. So are you sure you will need to do this?

ADD REPLY • link 6.5 years ago by GenoMax 153k

0

Thanks for your comment. Yes, the fai itself is small, but the jobs I want to run on the whole VCF take a very long time, so I need this to split the VCF into nearly equal chunks. I used to do this by eyeballing the FASTA index and splitting it manually, but I now need a program or script to do that step for new genomes to come.

ADD REPLY • link 6.5 years ago by FatihSarigol ▴ 260

1

Is this what you want: Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps ?

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 166k

0

Thanks, that is very similar, but I can't split a chromosome into chunks: each scaffold must exist in only 1 file, and I am only interested in the total size of each scaffold. I am trying to do it on an fai file using awk now; I will post my code here if I manage.

ADD REPLY • link 6.5 years ago by FatihSarigol ▴ 260

1

You don't have to split the chromosomes. You can use each whole chromosome as a single BED record.

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 166k

score 0 · Answer 1 · 2019-03-22

I wrote a script that does exactly what I want. If anyone else needs something similar, you can find it here

Run it as

./FASTAindexSPLITTERinEQUALsize.sh samtoolsExecutable fastaFile numberOfDivisions

to divide your FASTA index into subsets of nearly equal total length, based on how many subsets you want, keeping each scaffold in only 1 subset. The output subset index files are in BED format (each scaffold running from 0 to its end), so you can use them to extract these regions and run analyses that would normally take a long time on each subset separately.
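The BED output described here (one record per scaffold, from 0 to the scaffold length) is straightforward to produce from any grouping of (name, length) pairs. The sketch below is not the linked script; `chunk_to_bed` and `write_bed_files` are hypothetical names, and the `subset_N.bed` file naming is an assumption chosen only for illustration.

```python
def chunk_to_bed(chunk):
    """Format one group of (name, length) pairs as whole-scaffold BED records."""
    # BED is 0-based and half-open, so 0..length covers the whole scaffold.
    return "".join(f"{name}\t0\t{length}\n" for name, length in chunk)


def write_bed_files(chunks, prefix="subset"):
    """Write each group to <prefix>_<index>.bed and return the file names."""
    paths = []
    for index, chunk in enumerate(chunks, start=1):
        path = f"{prefix}_{index}.bed"
        with open(path, "w") as handle:
            handle.write(chunk_to_bed(chunk))
        paths.append(path)
    return paths


groups = [
    [("Scaffold1", 100), ("Scaffold2", 50), ("Scaffold3", 200)],
    [("Scaffold4", 500)],
]
for group in groups:
    print(chunk_to_bed(group), end="")
```

Each resulting file can then be handed to a tool that accepts a BED regions file, so that every job processes only its own set of whole scaffolds.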