I'd like to generate a list of 1-based positions for all variants in a VCF file.
I believe that, by VCF convention, the POS column gives the substituted base itself for a single-nucleotide substitution, but the base preceding the event for both insertions and deletions.
So, to specify the position of each variant as start - end with a script, I thought I could take the position N provided by the VCF and convert it as follows:
Insertion = N - N+1
SNP = N - N
Deletion = N+1 - N+length(REF)-1
So for the following sample:
CHROM POS REF ALT
11 66091886 T TTTC
11 66108375 T G
11 67180763 GTATT G
It becomes:
CHROM START END
11 66091886 66091887
11 66108375 66108375
11 67180764 67180767
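For anyone who wants to check the arithmetic, here is a minimal Python sketch of the three rules above, run on the same sample. The function name variant_interval and the hard-coded records are my own placeholders for illustration, not part of any tool. It assumes simple, left-anchored indels where REF and ALT share the first base, and it deliberately raises an error on anything else rather than guessing.

```python
def variant_interval(pos, ref, alt):
    """Return (start, end), 1-based inclusive, using the rules from the post.

    Hypothetical helper: only handles SNPs and simple indels whose REF and ALT
    share the leading (anchor) base, as in the sample above.
    """
    if len(ref) == 1 and len(alt) == 1:
        return pos, pos                      # SNP: N - N
    if len(ref) == 1 and len(alt) > 1 and alt[0] == ref:
        return pos, pos + 1                  # insertion: flanking bases N - N+1
    if len(alt) == 1 and len(ref) > 1 and ref[0] == alt:
        return pos + 1, pos + len(ref) - 1   # deletion: N+1 - N+length(REF)-1
    raise ValueError(f"not a simple SNP/indel: {ref} > {alt}")


# The three sample records: (CHROM, POS, REF, ALT)
records = [
    ("11", 66091886, "T", "TTTC"),
    ("11", 66108375, "T", "G"),
    ("11", 67180763, "GTATT", "G"),
]

intervals = [
    (chrom, *variant_interval(pos, ref, alt))
    for chrom, pos, ref, alt in records
]

print("CHROM START END")
for chrom, start, end in intervals:
    print(chrom, start, end)
```

Multi-allelic records (comma-separated ALT) would need to be split before calling this, which the sketch does not attempt.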
Just wondering if I have gone about this correctly, and whether this method would in fact specify where in my alignment the variant itself occurs?
Thanks Steve, I haven't looked into MAF files; I've been using various combinations of awk and grep to get the job done and produce the list itself. I was more wondering if my position adjustment flow was correct. It seems that my initial strictness in narrowing the position may be a bit off, and it may be beneficial to include buffer bases on either end. Curious to hear your thoughts.
If you're just talking about "model" SNVs and indels, then the logic and examples in the OP look fine to me. In fact, if you compare your examples to mine (MAF), there is virtually no difference. The nice thing about converting to MAF is that you don't have to worry about complex variants where multiple REF bases are changed to a variable number of ALT bases (e.g. CAG > TGGC), if that makes sense.
sbstevenlee, the fuc tool seems very useful. Is it possible to collect other information, such as "tumour read count", for each variant using maf-vcf2maf?