Hi all,
I have a problem with a VCF file: some lines have (-) in one or two columns.
I have been trying to remove or replace them but couldn't do it. Could anyone please help with that?
Thanks in advance,
Ahmed
Those offending lines are all "in-dels", and they are formatted in an old style where the sequence context is not given. To reformat them correctly, you need to look at the corresponding position in the genome and find out which bases are in context. Alternatively, you can just filter out all the indels if they are not critical for your analysis.
zgrep -v in-del vcf_chr_33.vcf.gz > chr33.snp.vcf
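If matching on the `VC=in-del` INFO tag feels too indirect, the same filtering can be done on the allele columns themselves. This is only a minimal sketch, assuming a gzipped, tab-delimited VCF with REF in column 4 and ALT in column 5; the function name `drop_dash_alleles` and the output file name in the example are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of any toolkit): drop records whose
# REF (column 4) or ALT (column 5) holds a bare "-" allele.
# Header lines (starting with "#") are passed through unchanged.
drop_dash_alleles() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { print; next }                 # keep all header lines
        {
            if ($4 == "-") next              # old-style placeholder in REF
            n = split($5, alt, ",")          # ALT may list several alleles
            for (i = 1; i <= n; i++)
                if (alt[i] == "-") next      # old-style placeholder in ALT
            print
        }'
}

# Example call (output file name is an assumption):
# drop_dash_alleles vcf_chr_33.vcf.gz > chr33.nodash.vcf
```

Unlike the `zgrep` line, this keeps indels that already carry proper REF/ALT bases and removes only the records a strict parser would reject.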
Thanks Santosh. When I ran your script the result was really weird. Check the error below:
90.1, AADN04023940.1, AADN04024429.1, AADN04024431.1, AADN04024444.1, AADN04024144.1, AADN04024001.1, AADN04014231.1, AADN04024498.1, AADN04024158.1, AADN04007914.1, AADN04024197.1, AADN04024254.1, AADN04024353.1, AADN04024314.1, AADN04024215.1, AADN04012257.1, AADN04024035.1, AADN04002975.1, AADN04024046.1, AADN04005144.1, AADN04023967.1, AADN04024299.1, AADN04024287.1, AADN04024354.1, AADN04024189.1, AADN04024268.1, AADN04005826.1, AADN04014001.1, AADN04023962.1, AADN04024296.1, AADN04005569.1, AADN04024134.1, AADN04023992.1, AADN04024274.1, AADN04005217.1, AADN04024275.1, AADN04007607.1, AADN04023976.1, AADN04024086.1, AADN04023974.1, AADN04024084.1, AADN04024045.1, AADN04024121.1, AADN04023941.1, AADN04023935.1, AADN04008379.1, AADN04024312.1, AADN04023978.1, AADN04024034.1, AADN04024375.1, AADN04024080.1, AADN04024118.1, AADN04024186.1, AADN04024070.1, AADN04024185.1, AADN04018283.1, AADN04023936.1, AADN04024255.1, AADN04024071.1, AADN04024100.1, AADN04024105.1, AADN04017094.1, AADN04020914.1, AADN04023942.1, AADN04024023.1, AADN04010121.1, AADN04024345.1, AADN04024305.1, AADN04024310.1, AADN04024358.1, AADN04024369.1, AADN04006246.1, AADN04023943.1, AADN04009947.1, AADN04024079.1, AADN04024313.1, AADN04024130.1, AADN04024309.1, AADN04023972.1, AADN04024089.1, AADN04024213.1, AADN04024292.1, AADN04024125.1, AADN04017325.1, AADN04024346.1, AADN04024441.1, AADN04024005.1, AADN04024020.1, AADN04024077.1, AADN04024009.1, AADN04024032.1, AADN04024192.1, AADN04024328.1, AADN04024038.1, AADN04023969.1, AADN04024326.1, AADN04024056.1, AADN04013227.1, AADN04024224.1, AADN04024243.1, AADN04024206.1, AADN04024246.1, AADN04024176.1, AADN04016551.1, AADN04024014.1, AADN04024129.1, AADN04024198.1, AADN04023983.1, AADN04024164.1, AADN04024167.1, AADN04023944.1, AADN04024374.1, AADN04020824.1, AADN04024306.1, AADN04024106.1, AADN04024145.1, AADN04024281.1, AADN04024351.1, AADN04020241.1, AADN04024322.1, AADN04024109.1, AADN04024141.1, AADN04024156.1, AADN04024360.1, 
AADN04023958.1, AADN04023959.1, AADN04023960.1, AADN04023961.1, AADN04023985.1, AADN04023986.1, AADN04023989.1, AADN04023990.1, AADN04024101.1, AADN04024102.1, AADN04024103.1, AADN04024104.1, AADN04024110.1, AADN04024112.1, AADN04024139.1, AADN04024140.1, AADN04024272.1, AADN04024273.1, AADN04024290.1, AADN04024291.1, AADN04024300.1, AADN04024301.1, AADN04024303.1, AADN04024304.1, AADN04024338.1, AADN04024339.1, AADN04024343.1, AADN04024344.1, AADN04024352.1, AADN04024355.1, AADN04024356.1, AADN04024361.1, AADN04024364.1, AADN04024365.1, AADN04024376.1, AADN04024377.1, AADN04024350.1, AADN04024181.1, AADN04024207.1, AADN04017424.1, AADN04024052.1, AADN04024147.1, AADN04024124.1, AADN04024237.1, AADN04023953.1, AADN04024044.1, AADN04023979.1, AADN04024219.1, AADN04024252.1, AADN04024119.1, AADN04024030.1, AADN04024049.1, AADN04024230.1, AADN04021209.1, AADN04024085.1, AADN04024262.1, AADN04024278.1, AADN04024000.1, AADN04024163.1, AADN04024263.1, AADN04024383.1, AADN04024228.1, AADN04024279.1, AADN04004651.1, AADN04024036.1, AADN04024209.1, AADN04024241.1, AADN04024212.1, AADN04024126.1, AADN04024155.1, AADN04023973.1, AADN04023981.1, AADN04024136.1, AADN04024217.1, AADN04024083.1, AADN04024072.1, AADN04024349.1, AADN04024076.1, AADN04024216.1, AADN04024251.1, AADN04024067.1, AADN04024316.1, AADN04016302.1, AADN04023971.1, AADN04024053.1, AADN04024233.1, AADN04024245.1, AADN04024261.1, AADN04024266.1, AADN04024152.1, AADN04024203.1, AADN04024229.1, AADN04024244.1, AADN04024123.1, AADN04024063.1, AADN04024327.1, AADN04010267.1, AADN04024091.1, AADN04024253.1, AADN04024039.1, AADN04024295.1, AADN04023987.1, AADN04024027.1, AADN04024293.1, AADN04024297.1, AADN04024081.1, AADN04024061.1, AADN04024068.1, AADN04023977.1, AADN04024367.1, AADN04023963.1, AADN04024107.1, AADN04024235.1, AADN04024382.1, AADN04023957.1, AADN04024127.1, AADN04024308.1, AADN04024173.1, AADN04023954.1, AADN04024078.1, AADN04024220.1, AADN04009592.1, AADN04024239.1, AADN04024298.1, AADN04024004.1, 
AADN04024006.1, AADN04024122.1, AADN04024099.1, AADN04024264.1, AADN04024318.1, AADN04024319.1, AADN04024116.1, AADN04024214.1, AADN04014749.1, AADN04005924.1, AADN04024307.1, AADN04010376.1, AADN04023956.1, AADN04024050.1, AADN04000887.1, AADN04024111.1, AADN04024280.1, AADN04024117.1, AADN04024221.1, AADN04024222.1, AADN04024342.1, AADN04024128.1, AADN04024146.1]
##### ERROR --------------------------------------------------------------------------------
That makes no sense. Your output for vcf_chr_33.vcf.gz, as noted in @Santosh's answer (if that is what you used),
should have looked something like this, basically minus all lines that had VC=in-del:
33 1172178 rs3136901 C T . . RSPOS=1172178;GENEINFO=107055416:COPZ1;dbSNPBuildID=104;SAO=0;VC=snp;VP=050100000305000000000100
33 1188596 rs3137350 T C . . RSPOS=1188596;GENEINFO=107055417:LOC107055417;dbSNPBuildID=104;SAO=0;VC=snp;VP=050100000305000000000100
33 1358968 rs3137351 C A . . RSPOS=1358968;RV;dbSNPBuildID=104;SAO=0;VC=snp;VP=050100000005000000000100
33 135849 rs3137550 G A . . RSPOS=135849;RV;GENEINFO=426871:METTL7A|426872:TMPRSS12;dbSNPBuildID=104;SAO=0;VC=snp;VLD;VP=050100420005000000000100
wget ftp://ftp.ncbi.nih.gov/snp/organisms/chicken_9031/VCF/00-All.vcf.gz
gunzip 00-All.vcf.gz
cat 00-All.vcf |sed -r 's|SERPINB10 CPOX|SERPINB10_CPOX|; s|SET domain containing 5|SETD5|;' >check_all.vcf
sed -e 's/SET domain containing /SETdomaincontaining/g' check_all.vcf > test252.vcf
bgzip test252.vcf
tabix -p vcf test252.vcf.gz
vcf-sort Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.dict test252.vcf.gz > dbsnp_sorted.vcf.gz
java -d64 -Xmx48g -jar /home/mbxao2/R-drive/tools/GATK/GenomeAnalysisTK.jar -T ValidateVariants -R Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.fa -V dbsnp_sorted.vcf.gz --validationTypeToExclude ALL
At the final stage the error appears every time.
You should have added this information against @Santosh's post above. Adding this as an answer
throws off the logical flow of this thread. If you can move the content and delete this post that would be great.
It is a bit hard to tell but is this still related/in continuation of the original question you had asked?
Then you should edit the original post and add/organize the information in such a way that the question and this entire thread makes sense to someone who will come by this in future. As things stand now, I have lost track of what is happening and others may have the same problem in future.
OK, now I see the complete picture. Unless you say which tool is generating the error, it's difficult to understand what is happening! Looking at your command line, I can see at least one error: you are using chromosome 1 alone as the reference (-R Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.fa), whereas your VCF contains all of the chromosomes. Your reference should contain at least all the chromosomes/contigs that the VCF file has. There might be other errors, but first see if this resolves the issue. If not, paste the complete GATK error output again. I'm quite sure that you have missed some part of the GATK error log.
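The contig mismatch described here can be checked before running GATK. This is a rough sketch, assuming the reference FASTA has a `.fai` index (as produced by `samtools faidx`) whose first column holds the sequence names; the function name `missing_contigs` and the file names in the example call are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list contigs that appear in a gzipped VCF but
# not in a reference .fai index. Empty output means every contig used
# by the VCF is present in the reference.
missing_contigs() {
    vcf_gz=$1
    fai=$2
    tmp_vcf=$(mktemp) || return 1
    tmp_ref=$(mktemp) || return 1
    # Column 1 of every non-header VCF line is the contig name.
    gzip -dc "$vcf_gz" | awk -F'\t' '!/^#/ { print $1 }' | sort -u > "$tmp_vcf"
    # Column 1 of the .fai index is the sequence name.
    cut -f1 "$fai" | sort -u > "$tmp_ref"
    # comm -23 prints lines that occur only in the first (VCF) list.
    comm -23 "$tmp_vcf" "$tmp_ref"
    rm -f "$tmp_vcf" "$tmp_ref"
}

# Example call (the .fai name is an assumption):
# missing_contigs test252.vcf.gz Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.fa.fai
```

With a chromosome-1-only reference and an all-chromosome VCF, this would be expected to print every other chromosome and scaffold name.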
VCF is a highly structured format. Can you please provide information on:
which columns are affected
how you generated the VCF (or where you downloaded it from)
how you are trying to remove them
Can you please add more details to your question? What is the error message? What are you trying to do?
I am trying to fix a dbSNP file that was downloaded from NCBI. The problem is in columns 4 and 5: towards the end of the file I have (-) in REF or ALT. When I try to validate the variants I get this error: ##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 13866253: unparsable vcf record with allele -
When I checked that line, I found the (-) in column 5, and some other lines have the same in column 4.
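To see how many records are affected and where, the allele columns can be scanned directly. This is a small sketch, assuming a gzipped, tab-delimited VCF with REF in column 4 and ALT in column 5; `report_dash_alleles` is a name invented here, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: report every record with a bare "-" allele.
# Output columns: line number in the file (headers included, so it can
# be compared with the line number in the GATK message), CHROM, POS, ID,
# and which allele column(s) hold the "-".
report_dash_alleles() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { next }                        # skip headers; NR still counts them
        {
            bad = ""
            if ($4 == "-") bad = "REF"
            n = split($5, alt, ",")          # ALT may list several alleles
            for (i = 1; i <= n; i++)
                if (alt[i] == "-") {
                    bad = (bad == "" ? "ALT" : bad ",ALT")
                    break
                }
            if (bad != "") print NR "\t" $1 "\t" $2 "\t" $3 "\t" bad
        }'
}

# Example call (file name is an assumption); pipe to wc -l for a count:
# report_dash_alleles 00-All.vcf.gz | head
```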
I have tried the below, but it didn't work:
I sense a disturbance in the force, precisely in the section "breakends" of the VCF manual:
https://samtools.github.io/hts-specs/VCFv4.2.pdf
I am not sure if this will help you, but I would give it a look; maybe you will find what you are looking for there. I hope so, at least!
P.S. Generally I am not a fan of editing heavily formatted files with one-liners, as if they were normal text files. Especially with the SAM and VCF formats, you'll never know everything about them. Every time a new discovery!
Where and how did you download your dbSNP file? I am unable to find even a single reference for 'rs15996913' with a Google search!
Please check the link ftp://ftp.ncbi.nih.gov/snp/organisms/chicken_9031/VCF/