Generating Positional List from VCF
3.5 years ago
Ared445 ▴ 60

I'd like to generate a 1-based position list from a VCF file for all variants. I believe that, by VCF convention, the POS column gives the position of the substituted base itself for a single-nucleotide substitution, but the position of the preceding (anchor) base for both insertions and deletions.

So, I thought a script could take the position N given in the VCF and convert it to a start and end for each variant as follows:

Insertion = N to N+1
SNP = N to N
Deletion = N+1 to N+length(REF)-1

So for the following sample:

CHROM   POS             REF     ALT
11      66091886        T       TTTC
11      66108375        T       G
11      67180763        GTATT   G

It becomes:

CHROM   START           END 
11      66091886        66091887
11      66108375        66108375
11      67180764        67180767

Just wondering whether I have gone about this correctly, and whether this method would in fact specify where in my alignment the variant itself occurs?
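For reference, the conversion described above can be written as a short Python sketch. It assumes simple biallelic records in which indels keep the single anchor base at POS; the function name `variant_span` is hypothetical and is not part of any library.

```python
def variant_span(pos, ref, alt):
    """Return (start, end), 1-based and inclusive, for a simple VCF record.

    Assumes one ALT allele and, for indels, a single shared anchor base at POS.
    """
    if len(ref) == 1 and len(alt) == 1:
        # SNP: POS is the substituted base itself.
        return pos, pos
    if len(ref) == 1 and len(alt) > 1 and alt.startswith(ref):
        # Insertion: report the two reference bases flanking the inserted sequence.
        return pos, pos + 1
    if len(alt) == 1 and len(ref) > 1 and ref.startswith(alt):
        # Deletion: skip the anchor base and cover only the deleted bases.
        return pos + 1, pos + len(ref) - 1
    raise ValueError(f"not a simple SNP/insertion/deletion: {ref}>{alt}")


# The three records from the sample above (all on chromosome 11).
records = [
    (66091886, "T", "TTTC"),
    (66108375, "T", "G"),
    (67180763, "GTATT", "G"),
]
spans = [variant_span(pos, ref, alt) for pos, ref, alt in records]
for (pos, ref, alt), (start, end) in zip(records, spans):
    print(11, start, end, sep="\t")
```

Anything that does not match one of the three patterns raises an error rather than producing a guessed interval.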

3.5 years ago
sbstevenlee ▴ 480

I think one of the ways to achieve your goal is by converting your VCF file to a MAF (Mutation Annotation Format) file. To this end, you may want to check out the fuc package I wrote:

Python API solution (the pymaf.MafFrame.from_vcf method):

>>> from fuc import pyvcf, pymaf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 200, 300],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'C', 'TTC'],
...     'ALT': ['A', 'CAG', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/1', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID  REF  ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .    G    A    .      .    .     GT    0/1
1  chr1  200  .    C  CAG    .      .    .     GT    0/1
2  chr1  300  .  TTC    T    .      .    .     GT    0/1
>>> mf = pymaf.MafFrame.from_vcf(vf)
>>> # mf = pymaf.MafFrame.from_vcf('your_file.vcf') # Above is just an example, you can directly import your VCF file
>>> mf.df
  Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome  Start_Position  End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Protein_Change Tumor_Sample_Barcode
0           .              .      .          .       chr1             100           100      .                      .          SNP                G                 A                 A              .               Steven
1           .              .      .          .       chr1             200           201      .                      .          INS                -                AG                AG              .               Steven
2           .              .      .          .       chr1             301           302      .                      .          DEL               TC                 -                 -              .               Steven
>>> # mf.to_file('your_file.maf')

CLI solution (the maf-vcf2maf command):

$ fuc maf-vcf2maf -h
usage: fuc maf-vcf2maf [-h] vcf

This command will convert an annotated VCF file to a MAF file.

Usage examples:
  $ fuc maf-vcf2maf in.vcf > out.maf

Positional arguments:
  vcf         VCF file.

Optional arguments:
  -h, --help  Show this help message and exit.

Thanks Steve, haven't looked into MAF files - been using various combinations of awk and grep to get the job done and produce the list itself. Was more wondering if my position adjustment flow was correct. It seems that my initial strictness in narrowing position may be a bit off and it may be beneficial to include buffer bases on either end. Curious to hear your thoughts.


If you're just talking about "model" SNVs and Indels, then the logic and examples in OP look fine to me. In fact, if you compare your examples to mine (MAF), there is virtually no difference. The nice thing about converting to MAF is that you don't have to worry about complex variants where multiple REF bases are changed to a variable number of ALT bases (e.g. CAG > TGGC), if that makes sense.
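If you stay with a hand-rolled script, one way to avoid silently mis-positioning complex variants is to label each record first and handle only the "model" ones. The sketch below is only an illustration: the function name `classify` and the label "COMPLEX" are hypothetical, and this is not how fuc classifies variants internally.

```python
def classify(ref, alt):
    """Label a biallelic REF/ALT pair as SNP, INS, DEL, or COMPLEX."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == 1 and len(alt) > 1 and alt.startswith(ref):
        # One anchor base followed by inserted sequence.
        return "INS"
    if len(alt) == 1 and len(ref) > 1 and ref.startswith(alt):
        # One anchor base followed by deleted sequence.
        return "DEL"
    # Everything else, e.g. several REF bases replaced by a different number of ALT bases.
    return "COMPLEX"


pairs = [("G", "A"), ("C", "CAG"), ("TTC", "T"), ("CAG", "TGGC")]
labels = [classify(ref, alt) for ref, alt in pairs]
for (ref, alt), label in zip(pairs, labels):
    print(f"{ref}>{alt}\t{label}")
```

Records labelled COMPLEX are the ones where the simple N-based rules in the original post do not apply directly.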


sbstevenlee, the fuc tool seems very useful. Is it possible to collect other information, such as the tumour read count for each variant, using maf-vcf2maf?
