modifying the header of VCF file
2
0
Entering edit mode
8.3 years ago
zengtony743 ▴ 80

Thank you for whoever can help me!!!

1) When I ran

$ `java -jar /hpf/tools/centos6/gatk/3.6.0/GenomeAnalysisTK.jar -R genome.fa -T SelectVariants --variant Newsnp.vcf -o testSNP.vcf -sn
AKRJ -sn AJ &

`

2) it shows

ERROR variant contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, X]
ERROR sequence contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
ERROR ------------------------------------------------------------------------------------------

3) the VCF header of Newsnp.vcf is

contig=<ID=1,length=195471971> 
contig=<ID=10,length=130694993> 
contig=<ID=11,length=122082543> 
contig=<ID=12,length=120129022> 
contig=<ID=13,length=120421639> 
contig=<ID=14,length=124902244> 
contig=<ID=15,length=104043685> 
contig=<ID=16,length=98207768> 
contig=<ID=17,length=94987271> 
contig=<ID=18,length=90702639> 
contig=<ID=19,length=61431566> 
contig=<ID=2,length=182113224> 
contig=<ID=3,length=160039680> 
contig=<ID=4,length=156508116> 
contig=<ID=5,length=151834684> 
contig=<ID=6,length=149736546> 
contig=<ID=7,length=145441459> 
contig=<ID=8,length=129401213> 
contig=<ID=9,length=124595110> 
contig=<ID=X,length=171031299>

However, the sequence contigs are

chr10,
chr11, 
chr12, 
chr13, 
chr14, 
chr15, 
chr16, 
chr17, 
chr18, 
chr19, 
chr1, 
chr2, 
chr3, 
chr4, 
chr5, 
chr6, 
chr7, 
chr8, 
chr9, 
chrM, 
chrX, 
chrY

How can I change the VCF header from above to the followings to make it compatible ?

This is the header of VCF should be like after modifying

contig=<ID=chr10,length=130694993>
contig=<ID=chr11,length=122082543>
contig=<ID=chr12,length=120129022>
contig=<ID=chr13,length=120421639>
contig=<ID=chr14,length=124902244>
contig=<ID=chr15,length=104043685>
contig=<ID=chr16,length=98207768>
contig=<ID=chr17,length=94987271>
contig=<ID=chr18,length=90702639>
contig=<ID=chr19,length=61431566>
contig=<ID=chr1,length=195471971>
contig=<ID=chr2,length=182113224>
contig=<ID=chr3,length=160039680>
contig=<ID=chr4,length=156508116>
contig=<ID=chr5,length=151834684>
contig=<ID=chr6,length=149736546>
contig=<ID=chr7,length=145441459>
contig=<ID=chr8,length=129401213>
contig=<ID=chr9,length=124595110>
contig=<ID=chrX,length=171031299>
header order vcf • 6.5k views
ADD COMMENT
2
Entering edit mode

It would be safer to just use the reference genome which was used for the mapping of the reads.

ADD REPLY
1
Entering edit mode

If the vcf file is not so big I would do it manually with a text editor. If a text editor doesn't work I would try with "sed"

    sed 's/contig=<ID=1/contig=<ID=chr1/g' old.vcf > new1.vcf
    sed 's/contig=<ID=2/contig=<ID=chr2/g' new1.vcf > new2.vcf
    sed 's/contig=<ID=3/contig=<ID=chr3/g' new3.vcf > new3.vcf

But you have to do it 23 times and maybe there are some problems with characters like "=" or "<"

ADD REPLY
2
Entering edit mode

Not only the header needs modification. Every variant would need modification.

ADD REPLY
2
Entering edit mode
8.3 years ago

the complaint is more about the order and representation of chr in your VCF file. Just do something like:

awk '{ if ($0 ~ /^#/) { gsub("ID=","ID=chr",$0); print } else { print "chr"$0} }'  tmp.vcf > new.vcf

and then

  1. use SortVcf to sort your vcf providing the sequence dictionary.

  2. delete the vcf index file. The picard indexes are sometimes not compatible with GATK. GATK creates its own vcf index.

As mentioned by WouterDeCoster, try to have only the chromosomes that are present in genome fasta.

Check here for some info. http://gatkforums.broadinstitute.org/gatk/discussion/1328/errors-about-contigs-in-bam-or-vcf-files-not-being-properly-ordered-or-sorted

ADD COMMENT
0
Entering edit mode

Thanks very much, Goutham Atla! I have changed the "ID=" to "ID=chr" perfectly using awk that you posted.

However, SortVcf does not work in this case because the order of "chr" in header lines and data lines are BOTH inconsistent with the order of reference.dict. According to the dictionary it says that SortVcf sorts the records in VCF files according to the order of the contigs in the header/sequence dictionary and then by coordinate. in my case, the "chr" order in the header of VCF is not the same as reference. unless the "chr" order of header is the same as reference, it will sort the data lines in the way i need.

in the example of "SortVcf"

java -jar picard.jar SortVcf \
      I=vcf_1.vcf \
      I=vcf_2.vcf \
      O=sorted.vcf

there is no reference file called "reference.dict". it does not work in my case.

I tried another script called vcfsorter.pl.

perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR &

you can see "genome.dict" is used as a reference to sort the chr order of both header lines and data lines.

Unfortunately, my supercomputer shows

+  Killed                  perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR

not sure why, but seems like this script has used a lot of computer resources even my account has 70GB of vmem quota that i can use to run program

any idead?

ADD REPLY
0
Entering edit mode
8.3 years ago
zengtony743 ▴ 80

Here is the problem when I ran SorterVcf

java -jar /hpf/tools/centos6/picard-tools/2.5.0/picard.jar SortVcf SD=genome.dict I=new.vcf o=Newdbsnp.vcf

it shows:

[rzeng@qlogin4 reference]$ [Sat Aug 06 18:02:21 EDT 2016] picard.vcf.SortVcf INPUT=[new.vcf] OUTPUT=Newdbsnp.vcf SEQUENCE_DICTIONARY=genome.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sat Aug 06 18:02:21 EDT 2016] Executing as rzeng@qlogin4 on Linux 2.6.32-642.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
[Sat Aug 06 18:02:21 EDT 2016] picard.vcf.SortVcf done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2027945984
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
at picard.vcf.SortVcf.doWork(SortVcf.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:166)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
... 4 more

[1]+ Exit 1 java -jar /hpf/tools/centos6/picard-tools/2.5.0/picard.jar SortVcf SD=genome.dict I=new.vcf o=Newdbsnp.vcf

My computer has 70GB vmem capacity. It should not be killed because of computer. Did i miss something?

ADD COMMENT
1
Entering edit mode

the error message is really clear:

SAM dictionaries are not the same: 

SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null)

was found when 

SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
ADD REPLY
0
Entering edit mode

Thanks, Pierre. Do you have idea why

perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR

Was killed? Thanks

ADD REPLY

Login before adding your answer.

Traffic: 2172 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6