2.9 years ago
selplat21
- I currently have a series of vcf files for 500 individuals.
- I have a separate vcf file for each chromosome in each individual.
- The software I am using can only run on 2 Mb intervals and takes the argument -range [int_min] [int_max].
Is there a loop I could write that would generate a separate job for each 2 Mb range, based on chromosome sizes? This would mean one job for each 2 Mb interval across each chromosome for each individual.
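As a rough sketch of the loop being asked about, the Python below splits each chromosome into 2 Mb windows and builds one command per window, per chromosome, per individual. Several things here are assumptions and not from the post: the tool name `mytool`, the file naming pattern `<sample>.<chrom>.vcf.gz`, the output file name, and the idea that -range takes 1-based inclusive coordinates. Chromosome sizes would normally come from a .fai index or the VCF header; here they are passed in as a plain dictionary.

```python
def intervals(chrom_length, step=2_000_000):
    """Yield (start, end) windows of at most `step` bases covering a chromosome.

    Coordinates are 1-based and inclusive (an assumption about the tool).
    """
    for start in range(1, chrom_length + 1, step):
        yield start, min(start + step - 1, chrom_length)


def job_commands(samples, chrom_sizes, tool="mytool"):
    """Build one command string per sample, chromosome and 2 Mb window.

    `tool`, the VCF naming pattern and the output name are hypothetical
    placeholders; adapt them to the real software and file layout.
    """
    cmds = []
    for sample in samples:
        for chrom, length in chrom_sizes.items():
            vcf = f"{sample}.{chrom}.vcf.gz"
            for start, end in intervals(length):
                out = f"{sample}.{chrom}.{start}-{end}.out"
                cmds.append(f"{tool} {vcf} -range {start} {end} > {out}")
    return cmds


if __name__ == "__main__":
    # Toy example: one individual, one 5 Mb chromosome -> three windows.
    for cmd in job_commands(["ind001"], {"chr1": 5_000_000}):
        print(cmd)
```

Each printed line could then be written to its own job script or fed to a scheduler array; putting the range in the output name keeps the per-window results from overwriting one another.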
Hi Pierre, I will test this out and familiarize myself with the syntax and get back to you. Very much appreciated.
Okay, the following is what I was able to trim it down to, but I have some follow-up questions, which are commented out:
I assume we keep quotes here?
This would be a list of my gzipped vcf files (the input for my software of interest). I'm a little confused about how to structure this list. Should it be a list of vcf.gz files per sample, or all vcf.gz files together? If it is per sample, I will need to run this whole script as a loop and keep params.vcf as a variable.
This is essentially the structure of the command. Notice it does not include a chromosome argument, but I think -range will apply only to the chromosome being analyzed. In terms of the output, I'm not sure if this should be different for each vcf.gz file.
Lastly, I'm unsure about how this will parallelize. The software can only handle 2Mb at a time because of limits on memory.
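On the parallelization question, one generic way to keep memory use bounded is to cap how many 2 Mb jobs run at the same time. The sketch below does this on a single machine with Python's standard library; it is not how any particular workflow manager handles it, and the `max_workers` value, the function names, and the command format (a list of arguments per job) are all assumptions. On a cluster the scheduler's own job limits would play the same role.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd):
    """Run one command (a list of arguments) and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def run_all(cmds, max_workers=4, runner=run_one):
    """Run every command, with at most `max_workers` running at once.

    Each job still covers a single 2 Mb window, so peak memory is roughly
    max_workers times the memory of one job (an assumption about the tool).
    Exit codes are returned in the same order as `cmds`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, cmds))
```

Calling `run_all(cmds, max_workers=2)` would then run the per-window commands two at a time; any non-zero value in the returned list marks a window that needs to be rerun.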