I have a VCF file for chromosome 1 of chicken and I want to divide this chromosome into five equal parts, because the file is too large for some analyses such as LDhat. I only know that I can use vcftools to do this (options --from-bp and --to-bp), but I do not know how to choose the values.
$ SnpSift split -l <number of lines> chr1.test.vcf
Do you mean by "equal parts" that there is an equal number of variants in each file? If so, one way is to use split. Before we can use it, we have to calculate how many lines should go into each file. After splitting, the header of the VCF file also has to be prepended to each file.
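The calculation and the split can be sketched as follows. This is a minimal sketch, assuming GNU coreutils `split` (for the `-d` option); `toy.vcf`, its records, and the `toy-split` prefix are made-up placeholders so the example runs on its own.

```shell
# Build a tiny toy VCF (hypothetical data) so the example is self-contained.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for pos in 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300; do
    printf '1\t%s\t.\tA\tG\t.\tPASS\t.\n' "$pos"
  done
} > toy.vcf

parts=5

# Count variant records, i.e. lines that do not start with '#'.
variants=$(grep -vc '^#' toy.vcf)

# Ceiling division, so that at most $parts files are produced.
lines_per_file=$(( (variants + parts - 1) / parts ))

# Split only the records; -d gives numeric suffixes (toy-split00, toy-split01, ...).
grep -v '^#' toy.vcf | split -l "$lines_per_file" -d - toy-split
```

With the file from this thread, the same steps would read from chr1-all.vcf and use the prefix chr1-split. When the record count is not divisible by the number of parts, the last file is shorter than the others.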
This will create files named chr1-split00, chr1-split01, etc., each containing an equal number of variants. As we discarded the header at the beginning, we now have to prepend it to each file.
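Prepending the header can be sketched with a plain shell loop, which avoids the dependency on GNU parallel used elsewhere in this thread. The file names (`demo.vcf`, `demo-split`, `demo-header.txt`) are placeholders, and `split -d` again assumes GNU coreutils.

```shell
# Toy input (hypothetical data): two header lines and four records.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for pos in 100 200 300 400; do
    printf '1\t%s\t.\tA\tG\t.\tPASS\t.\n' "$pos"
  done
} > demo.vcf

# Split the records into headerless chunks of two lines each.
grep -v '^#' demo.vcf | split -l 2 -d - demo-split

# Save the header once, then write header + chunk into a new .vcf file.
grep '^#' demo.vcf > demo-header.txt
for chunk in demo-split[0-9][0-9]; do
  cat demo-header.txt "$chunk" > "$chunk-final.vcf"
done
```

The glob `demo-split[0-9][0-9]` matches only the raw chunks, so files that already end in -final.vcf are not processed again if the loop is rerun.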
Yes, I just want to reduce the size of the VCF file without changing its contents. As I said, the VCF file is too large for LDhat analysis, so it should be divided into at least five parts.
Try SnpSift:
To split the VCF:
$ SnpSift split -l <number of lines> chr1.test.vcf
In your case, count the number of lines in chr1.test.vcf, then divide by the number of parts you would like to break the VCF into. That many VCF files will then be created in the current directory.
To know the number of lines in a vcf file:
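A minimal sketch of the counting step, using a made-up file `count-demo.vcf`. It shows both the total line count and the number of variant records, since the header lines make the two differ.

```shell
# Toy VCF (hypothetical data): two header lines and two records.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf '1\t200\t.\tC\tT\t.\tPASS\t.\n'
} > count-demo.vcf

# All lines, header included.
total_lines=$(wc -l < count-demo.vcf)

# Variant records only: lines that do not start with '#'.
variant_lines=$(grep -vc '^#' count-demo.vcf)

echo "total lines: $total_lines, variant records: $variant_lines"
```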
How do I specify the number of parts in the command line?
Hello,
You cannot specify the number of parts, only the number of lines. As cpad0112 said, you can calculate how many lines have to be in every single file.
What you can do is write this after -l, replacing n with the number of chunks you want:
E.g. for 5 parts:
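The arithmetic for the value passed to -l can be sketched with shell integer arithmetic. The record count of 1003 is a made-up placeholder, and the command is only printed, not executed. Whether SnpSift counts header lines toward -l has not been checked here, so treat the result as approximate.

```shell
n=5            # number of parts wanted
records=1003   # hypothetical count; use your own result of: grep -vc '^#' file.vcf

# Ceiling division: the smallest line count that yields at most n files.
lines_per_file=$(( (records + n - 1) / n ))

# Print the command that would be run with this value.
echo "SnpSift split -l $lines_per_file chr1.test.vcf"
```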
fin swimmer
OK, my VCF file has 3678290 lines and, as I said, I want to split it into five VCF files.
Please tell me what the command line is.
Just copy and paste the command above after installing SnpSift. If you don't want to install it, copy and paste the commands from my first post.
OK, I ran the first command (grep -v "#" chr1-all.vcf|split - -l $((($(grep -v "#" chr1-all.vcf|wc -l)+(5+1))/5)) -d chr1-split) and it split the original VCF file into five non-VCF files.
But the second command line does not work and this error appears:
[sqanbar@abrii1 Ind_biAll_Filtered.BRA_Population.Chr1]$ parallel 'grep "#" Ind_biAll_Filtered.BRA_Population.Chr1.vcf|cat - {} > {.}-final.vcf' ::: chr1-split*
-bash: parallel: command not found
parallel needs to be installed. Use the package manager of your distribution. For example, on Ubuntu:
$ sudo apt install parallel
I was able to do it.
But one question: along with the VCF files, a number of other files were created that are exactly the same size as the VCF files. What are these? Please see the attached photo:
The files without the .vcf extension are the files produced by the splitting step. You can remove them once the final files contain the VCF header.
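A cautious way to do that cleanup is to delete each intermediate chunk only after checking that its final file exists and begins with a VCF header line. This is a sketch with made-up file names (`tmp-split00` and `tmp-split00-final.vcf`) that follow the naming pattern used in this thread.

```shell
# Toy setup (hypothetical data): one raw chunk and its final file with header.
printf '1\t100\t.\tA\tG\t.\tPASS\t.\n' > tmp-split00
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  cat tmp-split00
} > tmp-split00-final.vcf

for chunk in tmp-split[0-9][0-9]; do
  final="$chunk-final.vcf"
  # Remove the raw chunk only if the final file is non-empty and starts with a header.
  if [ -s "$final" ] && head -n 1 "$final" | grep -q '^##fileformat'; then
    rm -- "$chunk"
  else
    echo "keeping $chunk: $final is missing or has no header" >&2
  fi
done
```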