Split VCF to individual scaffolds
2
1
Entering edit mode
9.4 years ago
krp0001 ▴ 40

Dear Users,

I couldn't find any script that could split my vcf file that has scaffolds (10322) but not chromes number, i would like to split each scaffold into individual vcf file or a text file.

Thank you

snp genome sequence Assembly vcf • 5.9k views
ADD COMMENT
2
Entering edit mode
9.4 years ago

First get a list of uniq scaffolds:

bgzip merge-test.vcf
tabix merge-test.vcf.gz
gzcat merge-test.vcf.gz | grep -v "^#" | cut -f1 | sort | uniq > scaff_names​

Then Split by individual scaffolds:

parallel -a scaff_names "bcftools view merge-test.vcf.gz {} > {.}.vcf"​

or by for loop:

for scaff in `cat scaff_names`; do bcftools view merge-test.vcf.gz $scaff > $scaff.vcf; done​
ADD COMMENT
0
Entering edit mode

Hai Goutham,

thank you for your quick response, I have tried your script I don't what parallel -a is, I am running it on grid not in an local machine, I get an error -bash: parallel: command not found,

thank you

ADD REPLY
1
Entering edit mode

For time being, you could use the 'for' loop option. You need to install GNU-Parallel. Its a wonderful tool.

ADD REPLY
0
Entering edit mode

hai again,

here is the few lines of output, missing headers in the vcf file.

##source_20150622.1=vcf-annotate(r953) -f Q=30
##contig=<ID=scaffold1>
##contig=<ID=scaffold2>
##contig=<ID=scaffold3>
##contig=<ID=scaffold4>
##contig=<ID=scaffold5>

I need something which has all the headers for each scaffold some thing like this: please see below

scaffold1       382     .       T       A       298.864 PASS    .
scaffold2       460     .       C       T       2781.98 PASS    .
scaffold2       534     .       C       T       3079.93 PASS    .
scaffold3       828     .       C       T       3168.21 PASS    .
scaffold3       918     .       G       A       2002.13 PASS    .
scaffold4       929     .       C       T       1802.47 PASS    .
scaffold4       943     .       G       C       1848.41 PASS    .
scaffold5       995     .       G       A       897.007 PASS    .      

that has to be split into individual vcf files that has only one scaffold with variant information in each file. sorry if I haven't made it clear.

ADD REPLY
0
Entering edit mode

Run it on this file and let me know the problem you face. I could not understand from your data.

ADD REPLY
0
Entering edit mode

Hi Goutham,

Again looking for your help, I finally managed to split my large vcf file, with below script (from forums)

cut -f1 assm_1kb.fasta.fai | xargs -i mkfifo {}
cut -f1 assm_1kb.fasta.fai | xargs -i echo cat {} \| gzip \> {}.gz \&|sh
zcat fil-Q25-d10-QUAL30-PASS2.vcf.gz | awk '!/^#/{print>$1}'

The problem now is missing header for each scaffold, I managed to split the header with

head -n 60 fil-Q25-d10-QUAL30-PASS2.vcf | grep "^#" >header

Could you please help me with how to add the header information to each scaffold. please let me know if I am not clear.

Thank you

ADD REPLY
0
Entering edit mode

If you are looking for very simplistic approach, I would do something like this:

grep -v "^#" input.vcf | cut -f1 | sort | uniq > scaff_names​
grep "^#" input.vcf > vcf_header

while read line
do
grep -v "^#" input.vcf  | awk -v scaff=$line '{ if ( $1==scaff ) print }' | cat vcf_header - > $line.vcf
done < scaff_names

This should also split VCF by chromosome/scaffold with header intact. Hope this helps.

ADD REPLY

Login before adding your answer.

Traffic: 1496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6