Question

Split VCF to individual scaffolds

1

Entering edit mode

9.4 years ago

krp0001 ▴ 40

Dear Users,

I couldn't find any script that could split my vcf file that has scaffolds (10322) but not chromes number, i would like to split each scaffold into individual vcf file or a text file.

Thank you

snp genome sequence Assembly vcf • 5.9k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.4 years ago by krp0001 ▴ 40

Ram · Accepted Answer · 2015-08-17

2

Entering edit mode

9.4 years ago

GouthamAtla 12k

First get a list of uniq scaffolds:

bgzip merge-test.vcf
tabix merge-test.vcf.gz
gzcat merge-test.vcf.gz | grep -v "^#" | cut -f1 | sort | uniq > scaff_names

Then Split by individual scaffolds:

parallel -a scaff_names "bcftools view merge-test.vcf.gz {} > {.}.vcf"

or by for loop:

for scaff in `cat scaff_names`; do bcftools view merge-test.vcf.gz $scaff > $scaff.vcf; done

ADD COMMENT • link 9.4 years ago by GouthamAtla 12k

0

Entering edit mode

Hai Goutham,

thank you for your quick response, I have tried your script I don't what parallel -a is, I am running it on grid not in an local machine, I get an error -bash: parallel: command not found,

thank you

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.4 years ago by krp0001 ▴ 40

1

Entering edit mode

For time being, you could use the 'for' loop option. You need to install GNU-Parallel. Its a wonderful tool.

ADD REPLY • link 9.4 years ago by GouthamAtla 12k

0

Entering edit mode

hai again,

here is the few lines of output, missing headers in the vcf file.

##source_20150622.1=vcf-annotate(r953) -f Q=30
##contig=<ID=scaffold1>
##contig=<ID=scaffold2>
##contig=<ID=scaffold3>
##contig=<ID=scaffold4>
##contig=<ID=scaffold5>

I need something which has all the headers for each scaffold some thing like this: please see below

scaffold1       382     .       T       A       298.864 PASS    .
scaffold2       460     .       C       T       2781.98 PASS    .
scaffold2       534     .       C       T       3079.93 PASS    .
scaffold3       828     .       C       T       3168.21 PASS    .
scaffold3       918     .       G       A       2002.13 PASS    .
scaffold4       929     .       C       T       1802.47 PASS    .
scaffold4       943     .       G       C       1848.41 PASS    .
scaffold5       995     .       G       A       897.007 PASS    .

that has to be split into individual vcf files that has only one scaffold with variant information in each file. sorry if I haven't made it clear.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.4 years ago by krp0001 ▴ 40

0

Entering edit mode

Run it on this file and let me know the problem you face. I could not understand from your data.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.4 years ago by GouthamAtla 12k

0

Entering edit mode

Hi Goutham,

Again looking for your help, I finally managed to split my large vcf file, with below script (from forums)

cut -f1 assm_1kb.fasta.fai | xargs -i mkfifo {}
cut -f1 assm_1kb.fasta.fai | xargs -i echo cat {} \| gzip \> {}.gz \&|sh
zcat fil-Q25-d10-QUAL30-PASS2.vcf.gz | awk '!/^#/{print>$1}'

The problem now is missing header for each scaffold, I managed to split the header with

head -n 60 fil-Q25-d10-QUAL30-PASS2.vcf | grep "^#" >header

Could you please help me with how to add the header information to each scaffold. please let me know if I am not clear.

Thank you

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.3 years ago by krp0001 ▴ 40

0

Entering edit mode

If you are looking for very simplistic approach, I would do something like this:

grep -v "^#" input.vcf | cut -f1 | sort | uniq > scaff_names
grep "^#" input.vcf > vcf_header

while read line
do
grep -v "^#" input.vcf  | awk -v scaff=$line '{ if ( $1==scaff ) print }' | cat vcf_header - > $line.vcf
done < scaff_names

This should also split VCF by chromosome/scaffold with header intact. Hope this helps.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.3 years ago by GouthamAtla 12k