Previously, I split out a vcf file by chromosome, and for my project, I have combined the X and XY vcf files into a single one. After changing the "XY" chromosome designation to "X" via:
awk '{gsub(/"XY"/, "X"); print;}' Genome_newX.vcf > Genome_newX2.vcf
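One thing worth checking here: inside an awk regex, /"XY"/ only matches XY wrapped in literal double quotes, which the CHROM column of a VCF normally does not contain, so the command above may leave the file unchanged. A minimal sketch of an alternative that renames only the first column and leaves header lines untouched is below. The helper name rename_chrom and the demo file are made up for illustration, and the sketch assumes a tab-delimited, uncompressed VCF.

```shell
# rename_chrom IN OUT FROM TO  (hypothetical helper, not part of any tool)
# Rewrites the CHROM column only when it is exactly FROM; header lines
# (starting with '#') are copied through unchanged.
rename_chrom() {
  awk -v from="$3" -v to="$4" '
    BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }
    $1 == from { $1 = to }
    { print }
  ' "$1" > "$2"
}

# Tiny made-up input so the sketch can be run as-is.
tmp=$(mktemp -d)
printf '#CHROM\tPOS\tID\nXY\t100\trsXY1\nX\t50\t.\n' > "$tmp/demo.vcf"

rename_chrom "$tmp/demo.vcf" "$tmp/demo.renamed.vcf" XY X
cat "$tmp/demo.renamed.vcf"
```

Matching on the first field rather than running gsub over the whole line also avoids touching an ID or INFO value that happens to contain "XY".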
I'm running into the issue of sorting this new "Genome_newX2.vcf" by position. The idea is that I'll subsequently run the vcf through GenotypeHarmonizer.
Are there any suggestions on how to do this easily? I'm brand new to this style of work, and I'd love some direction on where to read up on it as well. Thank you!
Excellent, this solved the problem. I really appreciate it!
This works well in your case, as you seem to have just one chromosome. For sorting a vcf file I prefer this:
This makes sure that your chromosomes are sorted correctly. Without the 'V', "2" comes after "19", for example.
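The command this comment refers to did not survive in the thread. As a rough sketch of the idea described (keep the header, then sort records by chromosome with a version sort and by position numerically), something like the following could work. It assumes GNU sort (for the V key modifier) and an uncompressed VCF; the helper name sort_vcf and the demo file are hypothetical.

```shell
# sort_vcf IN OUT  (hypothetical helper)
# Header lines (starting with '#') are written first in their original order;
# records are then sorted by column 1 (version sort) and column 2 (numeric).
sort_vcf() {
  grep '^#' "$1" > "$2"
  grep -v '^#' "$1" \
    | LC_ALL=C sort -t "$(printf '\t')" -k1,1V -k2,2n >> "$2"
}

# Tiny made-up input so the sketch can be run as-is.
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n19\t500\t.\tA\tG\n2\t300\t.\tC\tT\n2\t50\t.\tG\tA\nX\t10\t.\tT\tC\n' > "$tmp/demo.vcf"

sort_vcf "$tmp/demo.vcf" "$tmp/demo.sorted.vcf"
cat "$tmp/demo.sorted.vcf"
```

Dropping the V from the first key falls back to plain byte-wise comparison, which is where "19" ends up ahead of "2".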
fin swimmer
I do not recommend using natural sorting on genomic data. Most other tools, e.g. samtools (for sorting BAM files), do not support it by default. If you ever run operations like intersections with bedtools on two or more files that are required to be sorted, the different sort orders would/could cause conflicts, e.g. bedtools intersect with the -sorted option.
Too bad vcf-sort is garbage and the -c flag doesn't work even with the newest version.
Hello ATPoint,
Funny, this is exactly the same reason why I use natural sorting. :) The data I've worked with (human) was always sorted this way, and I ran into problems if a part of the analysis pipeline wasn't.
fin swimmer
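To make the two orderings debated above concrete, here is a small sketch that sorts the same made-up chromosome names both ways. It assumes GNU sort for the -V option; the variable names are illustrative only.

```shell
# A handful of made-up chromosome names, deliberately out of order.
chroms=$(printf '%s\n' 2 19 X 1 10)

# Plain byte-wise (lexicographic) order: "10" and "19" sort before "2".
lexical=$(printf '%s\n' "$chroms" | LC_ALL=C sort | paste -sd ' ' -)

# Natural (version) order: numbers compare by value, letters come after.
natural=$(printf '%s\n' "$chroms" | LC_ALL=C sort -V | paste -sd ' ' -)

echo "lexical: $lexical"   # lexical: 1 10 19 2 X
echo "natural: $natural"   # natural: 1 2 10 19 X
```

Whichever order is chosen, the practical point from both sides of the discussion is the same: every file fed into one pipeline step should use the same ordering.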