Table of Contents
- Some background
- Taking a look at the data
- VCF metalines
- Multiallelic variants
- Converting to VCF file
Some background
I have a patient cohort I want to annotate. I like Ensembl VEP so that’s what I am going to use but this is not about VEP. I also want to add a custom annotation. There is this paper https://www.pnas.org/doi/10.1073/pnas.2026076118 with following dataset https://figshare.com/articles/dataset/The_genetic_structure_of_the_Turkish_population_reveals_high_levels_of_variation_and_admixture/15147642. Authors shared their data as TSV file. The fields I am intrested in this dataset are columns like AF, AC, AN. In order to use this in VEP, I need to convert this to a VCF file. We are just going to use $awk$ to extract the data but first lets take a look at the data.
Taking a look at the data
zcat TurkishVariome.tsv.gz | head | column -ts$'\t'
CHROM GRCh37Pos GRCh38Pos ID IDrs REF ALT AF AC AN Hom Het SequencingMethod Filter QUAL DP GeneName GeneID Feature FeatureID Effect LoF HGVS_C HGVS_P GERP_RS CADD_phred SIFT_pred Polyphen2_HVAR_pred AF_gnomAD_WES AF_gnomAD_WGS GME_AF 1000GP_AF ESP_AF
1 100000012 99534456 1_100000012_G_T 1_100000012_G_T;rs10875231 G T 0.240621 372 1546 48 276 WGS PASS 75630.0 23720 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-441C>A . . . . . . 0.2900 . 0.301118 0.0
1 10000006 9939948 1_10000006_G_A 1_10000006_G_A;rs186077422 G A 0.000646831 1 1546 0 1 WGS PASS 159.0 15959 RP11-84A14.4 ENSG00000228150 transcript ENST00000445884 upstream_gene_variant . n.-2975G>A . . . . . . 0.0030 . 0.00259585 0.0
1 100000072 99534516 1_100000072_T_C 1_100000072_T_C T C 0.00129366 2 1546 0 2 WGS PASS 372.0 21070 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-501A>G . . . . . . . . 0.0 0.0
1 100000186 99534630 1_100000186_G_A 1_100000186_G_A;rs778254717 G A 0.000646831 1 1546 0 1 WGS PASS 195.0 20771 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-615C>T . . . . . . 3.184e-05 . 0.0 0.0
1 100000378 99534822 1_100000378_T_G 1_100000378_T_G T G 0.000646831 1 1546 0 1 WGS PASS 253.0 20798 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-807A>C . . . . . . . . 0.0 0.0
1 100000634 99535078 1_100000634_T_C 1_100000634_T_C T C 0.000646831 1 1546 0 1 WGS PASS 117.0 21162 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-1063A>G . . . . . . . . 0.0 0.0
1 100000787 99535231 1_100000787_T_C 1_100000787_T_C T C 0.000646831 1 1546 0 1 WGS PASS 189.0 21139 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-1216A>G . . . . . . . . 0.0 0.0
1 100000827 99535271 1_100000827_C_T 1_100000827_C_T;rs6678176 C T 0.293661 454 1546 72 310 WGS PASS 90471.0 23386 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-1256G>A . . . . . . 0.3891 . 0.392772 0.0
1 100000843 99535287 1_100000843_T_C 1_100000843_T_C;rs78286437 T C 0.0523933 81 1546 0 81 WGS PASS 12146.0 21326 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-1272A>G . . . . . . 0.0984 . 0.091853 0.0
Another thing I want to get beforehand is which column is in which index because we are going to use this indices when referencing the column we want to extract with $awk$.
zcat TurkishVariome.tsv.gz | sed 1q | sed 's/\t/\n/g' | nl
1 CHROM
2 GRCh37Pos
3 GRCh38Pos
4 ID
5 IDrs
6 REF
7 ALT
8 AF
9 AC
10 AN
11 Hom
12 Het
13 SequencingMethod
14 Filter
15 QUAL
16 DP
17 GeneName
18 GeneID
19 Feature
20 FeatureID
21 Effect
22 LoF
23 HGVS_C
24 HGVS_P
25 GERP_RS
26 CADD_phred
27 SIFT_pred
28 Polyphen2_HVAR_pred
29 AF_gnomAD_WES
30 AF_gnomAD_WGS
31 GME_AF
32 1000GP_AF
33 ESP_AF
These are the columns AF, AC, AN, Hom, Het
I want to extract and I think theses, SequencingMethod, Filter, QUAL, DP
would be helpful too. We need to put these into the INFO column in our VCF and add meta lines for these column. Meta lines have key and value pairs that describe fields in the INFO columns.
VCF metalines
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
These keys are:
- ID: name of the key in the info column.
- Number: Number of values.
- Type: Type of the values.
- Description: Description of the value.
One thing I want to make sure is how many fields are in the columns. One slow way of doing that is by checking the characters in that column. If there are separators in like comma or semicolon that would mean that there are multiple values.
zcat ../TurkishVariome.tsv.gz | sed 1d | awk '{print $8}' | sed 's/./\0\n/g' | sort -u
-
.
0
1
2
3
4
5
6
7
8
9
E
There is nothing out of the ordinary. “.” is for the decimal point; “E” and “-” is for the scientific notation e.g. 2E-6. I assume there is only one value here.
I checked the others by just looking at them and they all seem to have just one value.
zcat ../TurkishVariome.tsv.gz| awk '{print $9}' | sort -u | less
zcat ../TurkishVariome.tsv.gz| awk '{print $10}' | less
zcat ../TurkishVariome.tsv.gz| awk '{print $11}' | less
Multiallelic variants
Columns don’t have multiple values which raises the question: Where are the multiallelic variants? We can check by finding the duplicates in the position column.
zcat TurkishVariome.tsv.gz| awk '{print $1" "$2}' | sed 100q | sort -k1,1V -k2,2n | uniq -d
1 100001671
1 100001698
1 100004209
We can than grep for these positions.
zcat TurkishVariome.tsv.gz| grep -m 2 '100001671' | column -ts$'\t' | less -S
1 100001671 99536115 1_100001671_CT_C 1_100001671_CT_C;rs1212539010 CT C 0.00323415 5 1546 0 5 WGS PASS 1875.0 20603 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-2101delA . . . . . . 0.0035 . 0.0 0.0
1 100001671 99536115 1_100001671_C_CTT 1_100001671_C_CTT;rs571399445 C CTT 0.0535248 82 1532 2 78 WGS PASS 205865.0 21435 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-2101_-2100insAA . . . . . . 0.0960 . 0.0 0.0
zcat TurkishVariome.tsv.gz| grep -m 7 '100004209' | column -ts$'\t' | less -S
1 100004209 99538653 1_100004209_G_GTTTTTTTT 1_100004209_G_GTTTTTTTT;rs10680837 G GTTTTTTTT 0.00981675 15 1528 0 15 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insAAAAAAAA . . . . . . 0.0016 . 0.0 0.0
1 100004209 99538653 1_100004209_G_GTTTTTT 1_100004209_G_GTTTTTT;rs10680837 G GTTTTTT 0.0013089 2 1528 0 2 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insAAAAAA . . . . . . 0.0030 . 0.0 0.0
1 100004209 99538653 1_100004209_G_GTTTTTTTTT 1_100004209_G_GTTTTTTTTT;rs10680837 G GTTTTTTTTT 0.0013089 2 1528 0 2 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insAAAAAAAAA . . . . . . . . 0.0 0.0
1 100004209 99538653 1_100004209_G_GTTTTTTT 1_100004209_G_GTTTTTTT;rs10680837 G GTTTTTTT 0.0327225 50 1528 0 50 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insAAAAAAA . . . . . . 0.0609 . 0.0 0.0
1 100004209 99538653 1_100004209_G_GTTTTTGT 1_100004209_G_GTTTTTGT;rs10680837 G GTTTTTGT 0.0039267 6 1528 0 6 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insACAAAAA . . . . . . 0.0033 . 0.0 0.0
1 100004209 99538653 1_100004209_G_GTTTTT 1_100004209_G_GTTTTT;rs10680837 G GTTTTT 0.181937 278 1528 28 222 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insAAAAA . . . . . . 0.1504 . 0.186302 0.0
1 100004209 99538653 1_100004209_G_GTTTT 1_100004209_G_GTTTT;rs10680837 G GTTTT 0.0503927 77 1528 4 69 WGS PASS 307984.0 23790 RP11-413P11.1 ENSG00000224445 transcript ENST00000438829 upstream_gene_variant . n.-4639_-4638insAAAA . . . . . . 0.1620 . 0.13139 0.0
INDELs and SNPs are all splitted. However INDELs are not left aligned which we might need to do so after we create the VCF file.
Converting to VCF file
Our command is make out of few parts:
- write the header We are writing VCF version, assembly and source. Assembly and source lines are optional but are nice to have. In the INFO meta lines we are defining the types and number of the annotation values we are placing in the INFO column.
- unzip the tsv
convert the tsv to VCF
- Another thing I noticed is that some of the GRCh38Pos data are missing, so we are skipping where they’re “.” with awk.
We are getting the columns:
- CHROM
- POS (GRCh38)
- ID
- REF
- ALT
- QUAL
- FILTER
- Values for our INFO columns, AF, AC, AN, nHom, nHet, DP, SeqMethod
- sort by position
- compress it back
- write to a new file
(echo -e '
##fileformat=VCFv4.3
##reference=GRCh38
##source=TurkishVariome
##ID=<Description="Turkish Variome variantion ID">
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Alternative Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Number">
##INFO=<ID=nHom,Number=1,Type=Integer,Description="Number of Homozygous Individuals">
##INFO=<ID=nHet,Number=1,Type=Integer,Description="Number of Heterozygous Individuals">
##INFO=<ID=SeqMethod,Number=1,Type=String,Description="Sequencing Method, WGS or WES">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
' | sed '/^$/d';
zcat TurkishVariome.tsv.gz | sed 1d | awk -F"\t" '{
if ($3!=".") {
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\tAF=%s;AC=%s;AN=%s;nHom=%s;nHet=%s;SeqMethod=%s;DP=%s\n",
$1, $3, $4, $6, $7, $15,$14, $8, $9, $10, $11, $12, $13, $16}}' | sort -k1,1V -k2,2n --parallel=24 ) | bgzip -c > TurkishVariome.GRCh38.vcf.gz
We can now index the file with tabix -p vcf TurkishVariome.GRCh38.vcf.gz
and use this VCF with VEP command --custom TurkishVariome.GRCh38.vcf.gz,TV,vcf,exact,0,AF,AC,AN
.
you could avoid that first
echo
and the sub-shell by adding aBEGIN {}
statement in the awk script and piping into "bcftools sort"