Question

modifying the header of VCF file

0

Entering edit mode

8.7 years ago

zengtony743 ▴ 80

Thank you for whoever can help me!!!

1) When I ran

$ `java -jar /hpf/tools/centos6/gatk/3.6.0/GenomeAnalysisTK.jar -R genome.fa -T SelectVariants --variant Newsnp.vcf -o testSNP.vcf -sn
AKRJ -sn AJ &

`

2) it shows

ERROR variant contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, X]
ERROR sequence contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
ERROR ------------------------------------------------------------------------------------------

3) the VCF header of Newsnp.vcf is

contig=<ID=1,length=195471971> 
contig=<ID=10,length=130694993> 
contig=<ID=11,length=122082543> 
contig=<ID=12,length=120129022> 
contig=<ID=13,length=120421639> 
contig=<ID=14,length=124902244> 
contig=<ID=15,length=104043685> 
contig=<ID=16,length=98207768> 
contig=<ID=17,length=94987271> 
contig=<ID=18,length=90702639> 
contig=<ID=19,length=61431566> 
contig=<ID=2,length=182113224> 
contig=<ID=3,length=160039680> 
contig=<ID=4,length=156508116> 
contig=<ID=5,length=151834684> 
contig=<ID=6,length=149736546> 
contig=<ID=7,length=145441459> 
contig=<ID=8,length=129401213> 
contig=<ID=9,length=124595110> 
contig=<ID=X,length=171031299>

However, the sequence contigs are

chr10,
chr11, 
chr12, 
chr13, 
chr14, 
chr15, 
chr16, 
chr17, 
chr18, 
chr19, 
chr1, 
chr2, 
chr3, 
chr4, 
chr5, 
chr6, 
chr7, 
chr8, 
chr9, 
chrM, 
chrX, 
chrY

How can I change the VCF header from above to the followings to make it compatible ?

This is the header of VCF should be like after modifying

contig=<ID=chr10,length=130694993>
contig=<ID=chr11,length=122082543>
contig=<ID=chr12,length=120129022>
contig=<ID=chr13,length=120421639>
contig=<ID=chr14,length=124902244>
contig=<ID=chr15,length=104043685>
contig=<ID=chr16,length=98207768>
contig=<ID=chr17,length=94987271>
contig=<ID=chr18,length=90702639>
contig=<ID=chr19,length=61431566>
contig=<ID=chr1,length=195471971>
contig=<ID=chr2,length=182113224>
contig=<ID=chr3,length=160039680>
contig=<ID=chr4,length=156508116>
contig=<ID=chr5,length=151834684>
contig=<ID=chr6,length=149736546>
contig=<ID=chr7,length=145441459>
contig=<ID=chr8,length=129401213>
contig=<ID=chr9,length=124595110>
contig=<ID=chrX,length=171031299>

header order vcf • 6.7k views

ADD COMMENT • link updated 6.3 years ago by zx8754 12k • written 8.7 years ago by zengtony743 ▴ 80

2

Entering edit mode

It would be safer to just use the reference genome which was used for the mapping of the reads.

ADD REPLY • link 8.7 years ago by WouterDeCoster 47k

1

Entering edit mode

If the vcf file is not so big I would do it manually with a text editor. If a text editor doesn't work I would try with "sed"

    sed 's/contig=<ID=1/contig=<ID=chr1/g' old.vcf > new1.vcf
    sed 's/contig=<ID=2/contig=<ID=chr2/g' new1.vcf > new2.vcf
    sed 's/contig=<ID=3/contig=<ID=chr3/g' new3.vcf > new3.vcf

But you have to do it 23 times and maybe there are some problems with characters like "=" or "<"

ADD REPLY • link 8.7 years ago by Pol ▴ 70

2

Entering edit mode

Not only the header needs modification. Every variant would need modification.

ADD REPLY • link 8.7 years ago by WouterDeCoster 47k

score 2 · Answer 1 · 2016-08-05

2

Entering edit mode

8.7 years ago

GouthamAtla 12k

the complaint is more about the order and representation of chr in your VCF file. Just do something like:

awk '{ if ($0 ~ /^#/) { gsub("ID=","ID=chr",$0); print } else { print "chr"$0} }'  tmp.vcf > new.vcf

and then

use SortVcf to sort your vcf providing the sequence dictionary.
delete the vcf index file. The picard indexes are sometimes not compatible with GATK. GATK creates its own vcf index.

As mentioned by WouterDeCoster, try to have only the chromosomes that are present in genome fasta.

Check here for some info. http://gatkforums.broadinstitute.org/gatk/discussion/1328/errors-about-contigs-in-bam-or-vcf-files-not-being-properly-ordered-or-sorted

ADD COMMENT • link 8.7 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks very much, Goutham Atla! I have changed the "ID=" to "ID=chr" perfectly using awk that you posted.

However, SortVcf does not work in this case because the order of "chr" in header lines and data lines are BOTH inconsistent with the order of reference.dict. According to the dictionary it says that SortVcf sorts the records in VCF files according to the order of the contigs in the header/sequence dictionary and then by coordinate. in my case, the "chr" order in the header of VCF is not the same as reference. unless the "chr" order of header is the same as reference, it will sort the data lines in the way i need.

in the example of "SortVcf"

java -jar picard.jar SortVcf \
      I=vcf_1.vcf \
      I=vcf_2.vcf \
      O=sorted.vcf

there is no reference file called "reference.dict". it does not work in my case.

I tried another script called vcfsorter.pl.

perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR &

you can see "genome.dict" is used as a reference to sort the chr order of both header lines and data lines.

Unfortunately, my supercomputer shows

+  Killed                  perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR

not sure why, but seems like this script has used a lot of computer resources even my account has 70GB of vmem quota that i can use to run program

any idead?

ADD REPLY • link 8.7 years ago by zengtony743 ▴ 80

score 0 · Answer 2 · 2016-08-06

Here is the problem when I ran SorterVcf

java -jar /hpf/tools/centos6/picard-tools/2.5.0/picard.jar SortVcf SD=genome.dict I=new.vcf o=Newdbsnp.vcf

it shows:

[rzeng@qlogin4 reference]$ [Sat Aug 06 18:02:21 EDT 2016] picard.vcf.SortVcf INPUT=[new.vcf] OUTPUT=Newdbsnp.vcf SEQUENCE_DICTIONARY=genome.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sat Aug 06 18:02:21 EDT 2016] Executing as rzeng@qlogin4 on Linux 2.6.32-642.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
[Sat Aug 06 18:02:21 EDT 2016] picard.vcf.SortVcf done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2027945984
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
at picard.vcf.SortVcf.doWork(SortVcf.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:166)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
... 4 more

[1]+ Exit 1 java -jar /hpf/tools/centos6/picard-tools/2.5.0/picard.jar SortVcf SD=genome.dict I=new.vcf o=Newdbsnp.vcf

My computer has 70GB vmem capacity. It should not be killed because of computer. Did i miss something?