22 months ago
Eliza
Hi,
I have a TSV file of SNPs with their CADD scores, downloaded from the CADD website. The file looks like this:
##CADD GRCh38-v1.6 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health 2013-2020. All rights reserved.
#Chrom Pos Ref Alt RawScore PHRED
1 13116 T G -0.184119 0.553
1 13118 A G 0.249405 3.697
1 16682 G A 1.498900 15.73
1 900161 C G 0.250372 3.709
1 902288 G A 0.154766 2.625
1 980460 G A 0.378618 5.188
1 1362903 G C -0.072717 0.945
1 1414714 A G 0.595507 7.469
1 1420704 C T 0.533685 6.852
1 1560103 C T 0.003631 1.358
1 1600156 C G 1.424234 15.24
1 1608229 C T -0.138069 0.691
1 1648140 G C -0.003037 1.316
1 1650001 C T -0.366057 0.226
1 1650007 C T 0.049118 1.673
1 1666342 A G 0.351431 4.881
1 1670036 C A -0.453113 0.149
1 1846582 A G 0.237802 3.561
1 1848109 G C 0.045210 1.644
1 1854321 A G 0.213451 3.278
1 1870210 G A 1.248068 14.00
1 1888369 A G 0.445696 5.927
1 1902466 A C 0.261213 3.836
1 1902566 G A -0.009076 1.280
I would like to convert it to a VCF file. I tried this command on Ubuntu:
awk -F "\t" '{print "CHROM"$1"\t"$POS"\t"$REF"\t"$ALT"\t"$RawScore"\t"$PHRED"}' GRCh38-v1.6_1e1bfdf83583b30a108d7c9b6ad51134.tsv > df_1_50k.
But it didn't produce a file in the correct format. I would be happy to know where my mistake is and how to fix it. Thank you :)
Please do not paste screenshots of plain text content, it is counterproductive. You can copy paste the content directly here (using the code formatting option shown below), or use a GitHub Gist if the content volume exceeds allowed length here.
Look into bcftools convert --tsv2vcf. You'll need to explore, do a bunch of trials and tweaks but you should be able to do better than awk. It also looks like you don't understand the VCF format, please read the VCF spec.
I know that there should be other columns such as ID, QUAL, and INFO, but this is the file that the CADD website returned. Unfortunately, bcftools doesn't work on my PC either :(
Find out why and fix it; get it working. bcftools is very well tested and will address failure scenarios that you can't think of.
If you need these fields with legitimate values downstream, you cannot generate a VCF file. If you just need those fields to be present, you can create fake values. ID can be "." all over, you can skip QUAL (I think), and INFO can have a single entry (called COMMENT, say, with some text that adds information on why it exists). It might be possible to skip INFO altogether. Explore more.
I needed to convert a TSV to a VCF recently and made a post about it. I also used AWK instead of bcftools convert for the conversion, but confirmed the output VCF was correctly formatted with bcftools, like Ram suggests.
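For reference, here is a minimal sketch of what the AWK route could look like for the CADD file in the question. Two things go wrong in the original command: awk addresses fields by position ($1, $2, ...), so $POS refers to an unset variable POS and ends up meaning $0 (the whole line), and the trailing " after $PHRED opens a string that is never closed. The sketch below is only an illustration under some assumptions: the file names cadd_sample.tsv and cadd_sample.vcf and the INFO tag names CADD_RAW and CADD_PHRED are made up, ID, QUAL and FILTER are filled with the missing-value placeholder ".", and no ##contig header lines are written.

```shell
# Small sample in the CADD layout from the question (tab-separated).
printf '##CADD GRCh38-v1.6\n#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED\n1\t13116\tT\tG\t-0.184119\t0.553\n1\t13118\tA\tG\t0.249405\t3.697\n1\t16682\tG\tA\t1.498900\t15.73\n' > cadd_sample.tsv

awk -F '\t' 'BEGIN {
    OFS = "\t"
    # Minimal VCF header. CADD_RAW and CADD_PHRED are hypothetical tag names.
    print "##fileformat=VCFv4.2"
    print "##INFO=<ID=CADD_RAW,Number=1,Type=Float,Description=\"CADD raw score\">"
    print "##INFO=<ID=CADD_PHRED,Number=1,Type=Float,Description=\"CADD PHRED-scaled score\">"
    print "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"
}
/^#/ { next }   # skip the CADD copyright line and the column-name line
{
    # Columns are referenced by position: 1=Chrom 2=Pos 3=Ref 4=Alt 5=RawScore 6=PHRED.
    # ID, QUAL and FILTER are set to "." (missing); the scores go into INFO.
    print $1, $2, ".", $3, $4, ".", ".", "CADD_RAW=" $5 ";CADD_PHRED=" $6
}' cadd_sample.tsv > cadd_sample.vcf

cat cadd_sample.vcf
```

Once bcftools is working, reading the result back with something like bcftools view cadd_sample.vcf is a reasonable sanity check on the format. It may still warn about the missing contig definitions, and whether downstream tools expect "1" or "chr1" as the chromosome name is something to check separately.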