Vcf Sort According To Order Of The Reference File
4
1
Entering edit mode
11.1 years ago
Tonyzeng ▴ 310

Hi, When I run GATK with Error and showed that the variant file with VCF fomat file has not the same order with reference file or not compatible. I used VCFsorter.pl to sort my vcf file according to the genome file using

$ perl vcfsorter.pl genome.dict New.vcf > New1.vcf 2>STDERR

It generated a new file called New1.vcf. However, New1.vcf produced a file with only header in the file but removing all the data line information ( original VCF or New.vcf file include both header and data line information), besides, ##contig+<id=number is="" totally="" the="" same="" as="" the="" new.vcf="" with="" no="" any="" changes.<="" p="">

The chromosome order in my reference file (genome.dict or genome.fa) is as 10,11,12,13,14,15,16,17,18,19,1,2,3,4,5,6,7,8,9,X,Y from header of New.vcf file, it has chromosome order as 1,10,11,12,13,14,15,16,17,18,19,2,3,4,5,6,7,8,9,X,Y New1.vcf file (sorted vcf file using vcfsorter.pl program above) has the same chromosome order as New.vcf or 1,10,11,12,13,14,15,16,17,18,19,2,3,4,5,6,7,8,9,X,Y

I am pretty confused about follows:

1) in-compatiblity between VCF file and Reference file means JUST chromosome order like above? since my reference is .fa format, VCF file header requires the same ##contig=number order with the Reference files?

2) why VCFsorter.pl generate the VCF file with just leaving header but ignore the data line information?

3) when i compare the reference .dict file (genome.dict) with sorted VCF file (New1.vcf), they seems quite different in the header content, is this normal?

[rzeng@prism reference]$ more genome.dict

@HD VN:1.4 SO:unsorted @SQ SN:chr10 LN:130694993 UR:file:/raid1/rzeng/reference/genome.fa M5:7831ecda5dd6bcf838e2452ea0139eac @SQ SN:chr11 LN:122082543 UR:file:/raid1/rzeng/reference/genome.fa M5:e168c7a3194813f597181f26bb1bd13f @SQ SN:chr12 LN:120129022 UR:file:/raid1/rzeng/reference/genome.fa M5:671f85bb54a6e097d631e2e2dd813ad4 @SQ SN:chr13 LN:120421639 UR:file:/raid1/rzeng/reference/genome.fa M5:7f9b9fa3fbd9a38634107dfdc7fd8dc8 @SQ SN:chr14 LN:124902244 UR:file:/raid1/rzeng/reference/genome.fa M5:bf4e1efa25a8fd23b41c91f9bcb86388 @SQ SN:chr15 LN:104043685 UR:file:/raid1/rzeng/reference/genome.fa M5:106358dace00e5825ae337c1f1821b5e @SQ SN:chr16 LN:98207768 UR:file:/raid1/rzeng/reference/genome.fa M5:5482110a6cedd3558272325eee4c5a17 @SQ SN:chr17 LN:94987271 UR:file:/raid1/rzeng/reference/genome.fa M5:0d21e8edbfcd8410523b2b94e6dae892 @SQ SN:chr18 LN:90702639 UR:file:/raid1/rzeng/reference/genome.fa M5:46fda2f7e6dbf91bff91d6703e004afb @SQ SN:chr19 LN:61431566 UR:file:/raid1/rzeng/reference/genome.fa M5:7d363594531514ce41dfacfd97bc995d @SQ SN:chr1 LN:195471971 UR:file:/raid1/rzeng/reference/genome.fa M5:c4ec915e7348d42648eefc1534b71c99 @SQ SN:chr2 LN:182113224 UR:file:/raid1/rzeng/reference/genome.fa M5:fe020a692e23f8468b376e14e304a10f @SQ SN:chr3 LN:160039680 UR:file:/raid1/rzeng/reference/genome.fa M5:50f9385167e70825931231ddf1181b80 @SQ SN:chr4 LN:156508116 UR:file:/raid1/rzeng/reference/genome.fa M5:e7bdfb3ce7f54d2720b0718ed2ea063c @SQ SN:chr5 LN:151834684 UR:file:/raid1/rzeng/reference/genome.fa M5:095f3d4ebe1f0bafff057cc9b130186d @SQ SN:chr6 LN:149736546 UR:file:/raid1/rzeng/reference/genome.fa M5:62628d042ea5e01adff5b481d23def67 @SQ SN:chr7 LN:145441459 UR:file:/raid1/rzeng/reference/genome.fa M5:65da9ab01a76dcbcaef6f32a753585c1 @SQ SN:chr8 LN:129401213 UR:file:/raid1/rzeng/reference/genome.fa M5:dd2d079a37c02e8a3f95abff9e37ac69 @SQ SN:chr9 LN:124595110 UR:file:/raid1/rzeng/reference/genome.fa M5:ef8a85e56b750c10568656361fac7990 @SQ SN:chrM LN:16299 UR:file:/raid1/rzeng/reference/genome.fa M5:11c8af2a2528b25f2c080ab7da42edda @SQ SN:chrX LN:171031299 UR:file:/raid1/rzeng/reference/genome.fa M5:b3db6d6da78d5268688ee395c2c8cb4a @SQ SN:chrY LN:91744698 UR:file:/raid1/rzeng/reference/genome.fa M5:837a35bcca18643d030d4eec5e5b9c64

[rzeng@prism reference]$ more New1.vcf

##fileformat=VCFv4.1 ##samtoolsVersion=0.1.18-r572 ##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa ##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ) ##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus) ##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2) ##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5) ##contig=<id=1,length=195471971> ##contig=<id=10,length=130694993> ##contig=<id=11,length=122082543> ##contig=<id=12,length=120129022> ##contig=<id=13,length=120421639> ##contig=<id=14,length=124902244> ##contig=<id=15,length=104043685> ##contig=<id=16,length=98207768> ##contig=<id=17,length=94987271> ##contig=<id=18,length=90702639> ##contig=<id=19,length=61431566> ##contig=<id=2,length=182113224> ##contig=<id=3,length=160039680> ##contig=<id=4,length=156508116> ##contig=<id=5,length=151834684> ##contig=<id=6,length=149736546> ##contig=<id=7,length=145441459> ##contig=<id=8,length=129401213> ##contig=<id=9,length=124595110> ##contig=<id=x,length=171031299> ##FILTER=<id=basequalbias,description="min p-value="" for="" baseq="" bias="" (info="" pv4)="" [0]"=""> ##FILTER=<id=enddistbias,description="min p-value="" for="" end="" distance="" bias="" (info="" pv4)="" [0.0001]"=""> ##FILTER=<id=gapwin,description="window size="" for="" filtering="" adjacent="" gaps="" [3]"=""> ##FILTER=<id=het,description="genotype call="" is="" heterozygous="" (low="" quality)="" []"=""> ##FILTER=<id=mapqualbias,description="min p-value="" for="" mapq="" bias="" (info="" pv4)="" [0]"=""> ##FILTER=<id=maxdp,description="maximum read="" depth="" (info="" dp="" or="" info="" dp4)="" [200]"=""> ##FILTER=<id=minab,description="minimum number="" of="" alternate="" bases="" (info="" dp4)="" [5]"=""> ##FILTER=<id=mindp,description="minimum read="" depth="" (info="" dp="" or="" info="" dp4)="" [5]"=""> ##FILTER=<id=minmq,description="minimum rms="" mapping="" quality="" for="" snps="" (info="" mq)="" [20]"=""> ##FILTER=<id=qual,description="minimum value="" of="" the="" qual="" field="" [10]"=""> ##FILTER=<id=refn,description="reference base="" is="" n="" []"=""> ##FILTER=<id=snpgap,description="snp within="" int="" bp="" around="" a="" gap="" to="" be="" filtered="" [2]"=""> ##FILTER=<id=strandbias,description="min p-value="" for="" strand="" bias="" (info="" pv4)="" [0.0001]"=""> ##FILTER=<id=vdb,description="minimum variant="" distance="" bias="" (info="" vdb)="" [0]"=""> ##FORMAT=<id=dp,number=1,type=integer,description="# high-quality="" bases"=""> ##FORMAT=<id=gl,number=3,type=float,description="likelihoods for="" rr,ra,aa="" genotypes="" (r="ref,A=alt)""> ##FORMAT=<id=gq,number=1,type=integer,description="genotype quality"=""> ##FORMAT=<id=gt,number=1,type=string,description="genotype"> ##FORMAT=<id=pl,number=g,type=integer,description="list of="" phred-scaled="" genotype="" likelihoods"=""> ##FORMAT=<id=sp,number=1,type=integer,description="phred-scaled strand="" bias="" p-value"=""> ##FORMAT=<id=fi,number=1,type=integer,description="pass(1) or="" fail="" (0)="" filter"=""> ##INFO=<id=ac1,number=1,type=float,description="max-likelihood estimate="" of="" the="" first="" alt="" allele="" count="" (no="" hwe="" assumption)"=""> ##INFO=<id=af1,number=1,type=float,description="max-likelihood estimate="" of="" the="" first="" alt="" allele="" frequency="" (assuming="" hwe)"=""> ##INFO=<id=dp,number=1,type=integer,description="raw read="" depth"=""> ##INFO=<id=dp4,number=4,type=integer,description="# high-quality="" ref-forward="" bases,="" ref-reverse,="" alt-forward="" and="" alt-reverse="" bases"=""> ##INFO=<id=indel,number=0,type=flag,description="indicates that="" the="" variant="" is="" an="" indel."=""> ##INFO=<id=mdv,number=1,type=integer,description="maximum number="" of="" high-quality="" non-reference="" bases"=""> ##INFO=<id=mq,number=1,type=float,description="rms mapping="" quality"=""> ##INFO=<id=msd,number=1,type=float,description="maximum depth="" across="" non-ref="" genotypes"=""> ##INFO=<id=pv0,number=1,type=float,description="p-value for="" strand="" bias"=""> ##INFO=<id=pv1,number=1,type=float,description="p-value for="" baseq="" bias"=""> ##INFO=<id=pv2,number=1,type=float,description="p-value for="" mapq="" bias"=""> ##INFO=<id=pv3,number=1,type=float,description="p-value for="" tail="" distance="" bias"=""> ##INFO=<id=pv4,number=4,type=float,description="p-values for="" strand="" bias,="" baseq="" bias,="" mapq="" bias="" and="" tail="" distance="" bias"=""> ##INFO=<id=qd,number=1,type=float,description="quality by="" depth"=""> ##INFO=<id=sb,number=1,type=float,description="strand bias"=""> ##INFO=<id=vdb,number=1,type=float,description="variant distance="" bias"=""> ##INFO=<id=ac,number=.,type=integer,description="allele count="" in="" genotypes"=""> ##INFO=<id=an,number=1,type=integer,description="total number="" of="" alleles="" in="" called="" genotypes"=""> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2

gatk • 22k views
ADD COMMENT
1
Entering edit mode

please, edit and format your question.

ADD REPLY
5
Entering edit mode
11.1 years ago

(update: ) Use picard sortvcf

I wrote a tool named SortVcfOnRef to sort a VCF using an indexed reference.

Example:

cat input.vcf |\
   java -jar dist/sortvcfonref.jar  REF=ref.fa |\
   bgzip -c > result.vcf.gz && \
   tabix -p vcf -f result.vcf.gz

ADD COMMENT
0
Entering edit mode

Pierre, thank you for the information!

I opened the linker of SortVcfOnRef but could not find where to download sortvcfonref.jar script, did I miss something here?

Thanks again

ADD REPLY
2
Entering edit mode
ADD REPLY
3
Entering edit mode
11.1 years ago
lh3 33k
bgzip old.vcf; tabix -pvcf old.vcf.gz
cat chr_list.txt | xargs tabix -h old.vcf.gz > new.vcf
ADD COMMENT
0
Entering edit mode

Really, helpful! Kewl trick!

ADD REPLY
0
Entering edit mode
11.1 years ago
Tonyzeng ▴ 310

Thank you Pirre, when I run the SortVcfOnRef, it produced an empty file named result.vcf.gz. I think I missed something there. Now I copy most of the commands and process details, thank you!

Here is build.properties file I have edited as

bigwig.dir=/raid1/rzeng/softwares/bigwig
picard.version=1.101
picard.dir=/raid1/rzeng/softwares/picard-tools-${picard.version}
picard.jar=${picard.dir}/picard-${picard.version}.jar
sam.jar=${picard.dir}/sam-${picard.version}.jar`
variant.jar=${picard.dir}/variant-${picard.version}.jar
tribble.jar=${picard.dir}/tribble-${picard.version}.jar
berkeleydb.jar=/raid1/rzeng/softwares/je-5.0.34/lib/je-5.0.34.jar

Then I did

$ cat build.properties

then,

$ ant sortvcfonref

after it showed "successful", I run

$ cat New.vcf | java -jar /raid1/rzeng/jvarkit/dist/sortvcfonref.jar REF=genome.fa | bgzip -c > result.vcf.gz && tabix -p vcf -f result.vcf.gz &

It showed like this

-bash: line 37: bgzip: command not found
ADD COMMENT
1
Entering edit mode

bgzip is not in your PATH , install tabix or you can simply run

java -jar dist/sortvcfonref.jar  REF=ref.fa < input.vcf > output.vcf
ADD REPLY
1
Entering edit mode

This should be a comment, not an answer to your question

ADD REPLY
0
Entering edit mode

thank you, Pierre, I run java -jar dist/sortvcfonref.jar REF=ref.fa < input.vcf > output.vcf and generate a 8.0k file output.vcf (removing data line information). My original input.vcf is 1.6G with data line information. also the order of ##contig= still not change according to reference seq.

Did i miss something? thank you very much, Pierre

ADD REPLY
1
Entering edit mode

Run

 java -jar dist/sortvcfonref.jar REF=ref.fa < input.vcf > output.vcf  2> err.txt

what 's in err.txt ?

ADD REPLY
0
Entering edit mode

I run it as $ java -jar /raid1/rzeng/jvarkit/dist/sortvcfonref.jar REF=genome.fa < New.vcf > output.vcf 2> err.txt

then err.txt say

Mon Oct 28 15:35:29 CDT 2013] com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef REF=genome.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVE L=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false



[Mon Oct 28 15:35:29 CDT 2013] Executing as rzeng@prism.cluster on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00; Picard version: null

[Mon Oct 28 15:35:29 CDT 2013] Executing as rzeng@prism.cluster on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00; Picard version: null

java.lang.RuntimeException: unknown chromosome 1 in 1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV

1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244 GT:GQ:DP:SP:PL:FI 1/1:99:17:0:216,51,0:1

at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.ref(SortVcfOnRef.java:83)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.compare(SortVcfOnRef.java:95)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.compare(SortVcfOnRef.java:78)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:203)
at java.util.Arrays.sort(Arrays.java:727)
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:203)
at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef.doWork(SortVcfOnRef.java:203)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef.main(SortVcfOnRef.java:231)

[Mon Oct 28 15:35:30 CDT 2013] com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=2025979904

ADD REPLY
1
Entering edit mode

and there is a chromosome "1" in "genome.fa " ???

ADD REPLY
0
Entering edit mode

I grep all the words with chr* in genome.fa, it showed as following

chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY

in my variants VCF, it shows #chrom 1, 10 , ....

Maybe, I should change #chrom to #chr as the same as genome.fa?

ADD REPLY
0
Entering edit mode

yes, you should

ADD REPLY
0
Entering edit mode

thank you Pirrer

ADD REPLY
0
Entering edit mode

I have changed "chrom" of input.vcf to "chr" as follows

#CHR    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    129P2
1    3000126    .    G    T    51.33    Qual;MinAB;EndDistBias;MinDP    
AC1=1;AC=34;AF1=1;AN=36;DP4=4,0,77,11;DP=237;MDV=99;MQ=42;MSD=6;PV0=1;PV1=1;PV2=
1;PV3=0.094;PV4=1,1,1,0.094;QD=0.0061;SB=0.3000;VDB=0.0038    GT:GQ:DP:SP:PL:F
I    1/1:99:6:0:82,3,0:1
1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;
DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49

Then, I rerun sortvcfonref.jar using input.vcf that has been changed

java -jar dist/sortvcfonref.jar  REF=ref.fa < input.vcf > output.vcf

It still showed the same information as,

java.lang.RuntimeException: unknown chromosome 1 in 1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244    GT:GQ:DP:SP:PL:FI    1/1:99:17:0:216,51,0:1

Then, I go back to grep the line which has unknown chromosome 1 and it showed as follows

1       3000185 .       G       T       234.33  PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244       GT:GQ:DP:SP:PL:FI       1/1:99:17:0:216,51,0:1

Do not find anything special here in the line of input.vcf and feel confused now.

ADD REPLY
0
Entering edit mode
11.1 years ago

I found this tool, for resorting a vcf according to a reference .dict file, simple to use: http://code.google.com/p/vcfsorter/

ADD COMMENT

Login before adding your answer.

Traffic: 1830 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6