Question

Sort VCF File by Position?

2

6.9 years ago

Niell ▴ 20

Previously, I split out a vcf file by chromosome, and for my project, I have combined the X and XY vcf files into a single one. After changing the "XY" chromosome designation to "X" via:

awk 'BEGIN {FS = OFS = "\t"} $1 == "XY" {$1 = "X"} {print}' Genome_newX.vcf > Genome_newX2.vcf

I'm running into the issue of sorting this new "Genome_newX2.vcf" by position. The idea is that I'll subsequently run the vcf through GenotypeHarmonizer.

Are there any suggestions on how to do this easily? I'm brand new to this style of work, and I'd love some direction on where to read up on it as well. Thank you!

chromosome vcf • 36k views

ADD COMMENT • link updated 22 months ago by ATpoint 86k • written 6.9 years ago by Niell ▴ 20

score 11 · Accepted Answer · 2018-02-19

11

6.9 years ago

ATpoint 86k

Edit: 02/23

Just use bcftools sort https://samtools.github.io/bcftools/bcftools.html#sort

Original answer with awk-fu:

cat in.vcf | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' > out_sorted.vcf

It takes a VCF and prints the sorted file including the header.
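
To make the behaviour of the one-liner concrete, here is a minimal sketch on a made-up VCF. The file contents and the scratch directory are hypothetical and only serve the illustration; the awk command itself is the one from the answer, reading the file directly instead of through cat.

```shell
# Work in a scratch directory so nothing real gets overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical, deliberately unsorted VCF: two header lines, three records.
printf '##fileformat=VCFv4.2\n' > in.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> in.vcf
printf 'X\t300\t.\tA\tG\t.\t.\t.\n' >> in.vcf
printf 'X\t20\t.\tC\tT\t.\t.\t.\n' >> in.vcf
printf 'X\t100\t.\tG\tA\t.\t.\t.\n' >> in.vcf

# Header lines (starting with "#") are printed as they arrive; records are
# piped into sort, which emits them once awk closes the pipe at exit.
awk '$1 ~ /^#/ {print $0; next} {print $0 | "sort -k1,1 -k2,2n"}' in.vcf > out_sorted.vcf

cat out_sorted.vcf
```

The header should stay on top, followed by the records at positions 20, 100 and 300. Note that this only sorts the text; unlike bcftools sort, it does not validate the file.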

ADD COMMENT • link 22 months ago by ATpoint 86k

0

Excellent, this solved the problem. I really appreciate it!

ADD REPLY • link 6.9 years ago by Niell ▴ 20

1

sort -k1,1 -k2,2n

This works well in your case, as you seem to have just one chromosome. For sorting a VCF file, I prefer this:

sort -k1,1V -k2,2n my.vcf

This makes sure that your chromosomes are sorted correctly. Without the 'V', "2" comes after "19", for example.

fin swimmer

ADD REPLY • link 6.9 years ago by finswimmer 16k
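
The difference between the two sort orders discussed in this comment can be seen on a handful of chromosome names. This is a minimal sketch: the list of names is made up, and -V is a GNU coreutils extension, so it is assumed that GNU sort is installed.

```shell
# Hypothetical chromosome names, one per line.
chroms='19
2
10
1
X'

# Plain lexicographic sort compares character by character, so "2" lands after "19".
lexical=$(printf '%s\n' "$chroms" | LC_ALL=C sort | tr '\n' ' ')

# Version sort (-V) compares embedded numbers numerically.
natural=$(printf '%s\n' "$chroms" | LC_ALL=C sort -V | tr '\n' ' ')

echo "lexicographic: $lexical"
echo "version sort:  $natural"
```

Lexicographic order gives 1 10 19 2 X, while version sort gives 1 2 10 19 X.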

0

I do not recommend using natural sorting on genomic data. Most other tools, e.g. samtools (for sorting BAM files), do not support this by default. If you ever run operations such as intersections with bedtools on two or more files that must be sorted (e.g. bedtools intersect with the -sorted option), the differing sort orders can cause conflicts.
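
One way to spot such a mismatch before it causes trouble is to check a file against the order a downstream tool expects. The sketch below does not run bedtools; it uses sort -c as a stand-in check on a made-up BED-like file, and assumes GNU sort for the -V key.

```shell
tmpdir=$(mktemp -d)

# Hypothetical BED-like records in "natural" chromosome order: chr2 before chr10.
printf 'chr1\t100\t200\nchr2\t50\t80\nchr10\t10\t30\n' > "$tmpdir/natural.bed"

# sort -c checks order without rewriting the file; exit status 0 means "already sorted".
if LC_ALL=C sort -c -k1,1V -k2,2n "$tmpdir/natural.bed" 2>/dev/null; then
    natural_ok=yes
else
    natural_ok=no
fi

# The same file checked against plain lexicographic chromosome order.
if LC_ALL=C sort -c -k1,1 -k2,2n "$tmpdir/natural.bed" 2>/dev/null; then
    lexical_ok=yes
else
    lexical_ok=no
fi

echo "sorted in natural order:       $natural_ok"
echo "sorted in lexicographic order: $lexical_ok"
```

The file passes the natural-order check but fails the lexicographic one, which is the kind of disagreement that matters when two files sorted by different conventions are combined.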

ADD REPLY • link 6.9 years ago by ATpoint 86k

1

Too bad vcf-sort is garbage and the -c flag doesn't work even with the newest version.

ADD REPLY • link 5.8 years ago by jon.klonowski ▴ 210

0

Hello ATpoint,

Funny. This is exactly the same reason why I use natural sorting. :) The data I've worked with (human) was always sorted this way, and I ran into problems if a part of the analysis pipeline wasn't.

fin swimmer

ADD REPLY • link 6.9 years ago by finswimmer 16k

score 9 · Accepted Answer · 2020-02-10

9

4.9 years ago

beausoleilmo ▴ 600

Would there be an equivalent for a BCF? bcftools view | [...] code? Or why not use bcftools sort -Oz output.bcf -o output_sort.vcf.gz?

ADD COMMENT • link 4.9 years ago by beausoleilmo ▴ 600

3

bcftools sort is absolutely the right way and the way I would go today :)

ADD REPLY • link 4.9 years ago by finswimmer 16k

1

Would also recommend SortVcf

ADD REPLY • link 2.4 years ago by DavidStreid ▴ 90

1

OK, just to double-check here because I may be a chop, but doesn't SortVcf from GATK also just use the header formatting as well?

ADD REPLY • link 2.4 years ago by frd.graeme ▴ 20

0

Thank you! I have struck out my comment about not needing the seq-dict dependency (not sure what I was doing to make me think that). Appreciate it!