Break the chromosome using vcftools
1
0
Entering edit mode
6.6 years ago

Hi all,

I have a VCF file for chromosome 1 of chicken and I want to divide this chromosome into five equal parts, because it is too large for some analyses such as LDhat. I only know that I can use vcftools to do this (options --from-bp and --to-bp), but I do not know how to choose the positions.

What is the best idea?

Best regards

Mostafa

genome SNP • 2.4k views
ADD COMMENT

Try SnpSift:

To split VCF:

$ SnpSift split -l <number of lines> chr1.test.vcf

In your case, count the number of variant lines in chr1.test.vcf, divide that by the number of parts you would like to break the VCF into, and pass the result to -l. You will then get that number of VCF files in the current directory.

To count the non-header lines in a VCF file:

 $ grep -v "^#" chr1.test.vcf | wc -l
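
A small sketch of that calculation, in case it helps. It builds a tiny made-up VCF so it runs anywhere; the variable names (`vcf`, `parts`, `records`, `lines_per_part`) and the rounding-up step are my own assumptions, not something SnpSift requires.

```shell
# Build a tiny stand-in VCF (hypothetical data) so the sketch is self-contained.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for pos in 100 200 300 400 500 600 700; do
    printf '1\t%s\t.\tA\tG\t.\tPASS\t.\n' "$pos"
  done
} > "$vcf"

parts=5
# grep -c counts lines; -v '^#' keeps only lines that do not start with '#'.
records=$(grep -vc '^#' "$vcf")
# Round up so that splitting never yields more than $parts files.
lines_per_part=$(( (records + parts - 1) / parts ))
echo "records=$records lines_per_part=$lines_per_part"
```

The value of `lines_per_part` is what would go after -l. With very small files like this toy one, rounding up can give fewer than the requested number of parts.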
ADD REPLY

How do I specify the number of parts in the command line?

ADD REPLY

Hello,

You cannot specify the number of parts, only the number of lines. As cpad0112 said, you can calculate how many lines have to go into each file.

What you can do is write the following after -l, replacing n with the number of chunks you want:

$((($(grep -v "^#" chr1.test.vcf|wc -l)+(n+1))/n))

E.g. for 5 parts:

$ java -jar SnpSift.jar split -l $((($(grep -v "^#" chr1.test.vcf|wc -l)+(5+1))/5)) chr1.test.vcf

fin swimmer

ADD REPLY

OK, my VCF file has 3678290 lines and, as I said, I want to split it into five VCF files.

Could you please tell me the exact command line?

ADD REPLY

Just copy and paste the command above after installing SnpSift. If you don't want to install it, copy and paste the commands from my first post.
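
For the 3678290 lines mentioned in the question, the arithmetic inside the command can be checked by hand. This is only a sketch: it assumes that figure is the number of variant (non-header) lines, and the variable names are made up for illustration.

```shell
total=3678290   # line count reported above, assumed to exclude header lines
parts=5
# Same expression as in the command above, with n replaced by 5.
lines_per_part=$(( (total + (parts + 1)) / parts ))
# All files except the last one are full; the last one gets the remainder.
full_parts=$(( parts - 1 ))
last_part=$(( total - full_parts * lines_per_part ))
echo "-l $lines_per_part: $full_parts files of $lines_per_part lines, one of $last_part"
```

So -l ends up as 735659, which gives four files of 735659 lines and a fifth with 735654.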

ADD REPLY

OK, I ran the first command (grep -v "#" chr1-all.vcf|split - -l $((($(grep -v "#" chr1-all.vcf|wc -l)+(5+1))/5)) -d chr1-split) and it split the original VCF file into five non-VCF files.

But the second command does not work and this error appears:

[sqanbar@abrii1 Ind_biAll_Filtered.BRA_Population.Chr1]$ parallel 'grep "#" Ind_biAll_Filtered.BRA_Population.Chr1.vcf|cat - {} > {.}-final.vcf' ::: chr1-split*

-bash: parallel: command not found

ADD REPLY

parallel needs to be installed. Use the package manager of your distribution. For example, on Ubuntu:

sudo apt-get install parallel
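
If installing parallel is not possible (for example on a cluster without sudo rights), a plain shell loop should do the same job, just one file at a time. A minimal sketch with made-up toy files; `workdir` and `piece` are names I chose, and the file names follow the chr1-split prefix used in this thread.

```shell
# Toy setup (hypothetical data): an original VCF plus two headerless pieces.
workdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$workdir/chr1-all.vcf"
printf '1\t100\t.\tA\tG\t.\tPASS\t.\n' > "$workdir/chr1-split00"
printf '1\t200\t.\tC\tT\t.\tPASS\t.\n' > "$workdir/chr1-split01"
cat "$workdir/chr1-split00" "$workdir/chr1-split01" >> "$workdir/chr1-all.vcf"

# Prepend the header lines of the original VCF to every piece.
# The [0-9][0-9] glob skips any *-final.vcf files from an earlier run.
for piece in "$workdir"/chr1-split[0-9][0-9]; do
  grep '^#' "$workdir/chr1-all.vcf" | cat - "$piece" > "$piece-final.vcf"
done
ls "$workdir"
```

This writes chr1-split00-final.vcf, chr1-split01-final.vcf and so on, the same names the parallel command produces.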
ADD REPLY

I managed to do it.

But one question: along with the VCF files, a number of other files were created that are exactly the same size as the VCF files. What are these? Please see the attached photo:

enter image description here

ADD REPLY

The files without the .vcf extension are the files produced by the splitting step. You can remove them once you have checked that the final files contain the VCF header.
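
One possible way to check before deleting, sketched with toy files: count the variant lines in all final files and only remove the intermediate pieces if the sum matches the original. The names `workdir`, `total` and `sum` are assumptions for illustration.

```shell
# Toy setup (hypothetical data): original VCF, split pieces, and final files.
workdir=$(mktemp -d)
hdr='##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
printf '1\t100\t.\tA\tG\t.\tPASS\t.\n1\t150\t.\tG\tA\t.\tPASS\t.\n' > "$workdir/chr1-split00"
printf '1\t200\t.\tC\tT\t.\tPASS\t.\n' > "$workdir/chr1-split01"
{ printf "$hdr"; cat "$workdir/chr1-split00" "$workdir/chr1-split01"; } > "$workdir/chr1-all.vcf"
for p in 00 01; do
  { printf "$hdr"; cat "$workdir/chr1-split$p"; } > "$workdir/chr1-split$p-final.vcf"
done

# Count variant lines in the original and across all final files.
total=$(grep -vc '^#' "$workdir/chr1-all.vcf")
sum=0
for f in "$workdir"/chr1-split*-final.vcf; do
  n=$(grep -vc '^#' "$f")
  sum=$(( sum + n ))
done

# Remove the headerless intermediate pieces only if nothing was lost.
if [ "$sum" -eq "$total" ]; then
  rm -f "$workdir"/chr1-split[0-9][0-9]
  echo "OK: $sum of $total variants present, intermediates removed"
else
  echo "Mismatch: $sum vs $total, keeping intermediates" >&2
fi
```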

ADD REPLY
6.6 years ago

Hello,

By "equal parts", do you mean an equal number of variants in each file? If so, one way is to use split. Before we can use it, we have to calculate how many lines should go into each file. After splitting, the header of the VCF file also has to be prepended to each file.

grep -v "^#" chr1-all.vcf|split - -l $((($(grep -v "^#" chr1-all.vcf|wc -l)+(5+1))/5)) -d chr1-split

This will create files named chr1-split00, chr1-split01, etc., each containing the same number of variants (the last one may hold slightly fewer). As we discarded the header at the start, we now have to prepend it to each file.

parallel 'grep "^#" chr1-all.vcf|cat - {} > {.}-final.vcf' ::: chr1-split*
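
If "equal parts" instead means windows of equal length in base pairs, the --from-bp/--to-bp values from the question could be derived from the first and last variant position. The sketch below only prints the vcftools command lines, it does not run them. The toy positions, the chromosome name 1, the input name chr1-all.vcf and the output prefix chr1-part are assumptions; --chr is included because vcftools needs it together with --from-bp/--to-bp.

```shell
# Toy VCF (hypothetical positions) so the boundary calculation can be run as is.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for pos in 1000 2500 4000 7000 10999; do
    printf '1\t%s\t.\tA\tG\t.\tPASS\t.\n' "$pos"
  done
} > "$vcf"

parts=5
# Smallest and largest POS (column 2) among the non-header lines.
first=$(awk '!/^#/ { p = $2 + 0; if (min == "" || p < min) min = p } END { print min }' "$vcf")
last=$(awk '!/^#/ { p = $2 + 0; if (p > max) max = p } END { print max }' "$vcf")

# Window width, rounded up so that $parts windows cover the whole range.
width=$(( (last - first + parts) / parts ))

i=1
start=$first
while [ "$i" -le "$parts" ]; do
  end=$(( start + width - 1 ))
  if [ "$end" -gt "$last" ]; then end=$last; fi
  printf 'vcftools --vcf chr1-all.vcf --chr 1 --from-bp %s --to-bp %s --recode --recode-INFO-all --out chr1-part%s\n' \
    "$start" "$end" "$i"
  start=$(( end + 1 ))
  i=$(( i + 1 ))
done
```

Note that equal-width windows will generally not contain equal numbers of variants, so the resulting files can differ a lot in size.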

fin swimmer

ADD COMMENT

Many thanks for your reply.

Yes, I just want to reduce the size of the VCF file without changing its contents. As I said, the VCF file is too large for LDhat analysis, so it should be divided into at least five parts.

ADD REPLY
