Count and filter sites with degenerate bases in VCF files
4 months ago
ja569116 • 0

Hi,

I genotyped samples from bisulfite sequencing (methylation) reads. I was surprised that many of the alternate alleles are degenerate (IUPAC ambiguity) bases: R or Y.

V00001.vcf

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
NW_022882922.1  28895   .       C       T       0       PASS    NS=1:DP=52      GT:GQ:DP        0/1:0:52
NW_022882922.1  36586   .       C       T,Y     0       PASS    NS=1:DP=23:GU=T/C       GT:GQ:DP        1/2:0:23
NW_022882922.1  36640   .       G       A       0       PASS    NS=1:DP=40      GT:GQ:DP        1/1:0:40
NW_022882922.1  39071   .       A       G       0       PASS    NS=1:DP=43      GT:GQ:DP        1/1:0:43

V0021

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
NW_022882922.1  25160   .       G       Y       0       PASS    NS=1:DP=34:GU=T/C       GT:GQ:DP        0/1:0:34
NW_022882922.1  25676   .       T       C       0       PASS    NS=1:DP=41      GT:GQ:DP        0/1:0:41
NW_022882922.1  28342   .       G       A,R     0       PASS    NS=1:DP=35:GU=A/G       GT:GQ:DP        1/2:0:35
NW_022882922.1  29887   .       C       A       0       PASS    NS=1:DP=48      GT:GQ:DP        0/1:0:48

One sample had far more degenerate bases:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001
NW_022882922.1  8082    .   G   A   0   PASS    NS=1:DP=6   GT:GQ:DP    0/1:0:6
NW_022882922.1  11106   .   T   G   0   PASS    NS=1:DP=19  GT:GQ:DP    0/1:0:19
NW_022882922.1  17828   .   C   G   0   PASS    NS=1:DP=27  GT:GQ:DP    0/1:0:27
NW_022882922.1  25160   .   G   Y   0   PASS    NS=1:DP=37:GU=T/C   GT:GQ:DP    0/1:0:37
NW_022882922.1  27396   .   G   A,R 0   PASS    NS=1:DP=33:GU=A/G   GT:GQ:DP    1/2:0:33
NW_022882922.1  28342   .   G   A,R 0   PASS    NS=1:DP=27:GU=A/G   GT:GQ:DP    1/2:0:27
NW_022882922.1  28895   .   C   T   0   PASS    NS=1:DP=32  GT:GQ:DP    0/1:0:32
NW_022882922.1  29887   .   C   A   0   PASS    NS=1:DP=35  GT:GQ:DP    0/1:0:35
NW_022882922.1  40905   .   T   C,Y 0   PASS    NS=1:DP=17:GU=T/C   GT:GQ:DP    1/2:0:17
NW_022882922.1  43671   .   A   C   0   PASS    NS=1:DP=11  GT:GQ:DP    0/1:0:11
NW_022882922.1  43859   .   A   T   0   PASS    NS=1:DP=18  GT:GQ:DP    0/1:0:18
NW_022882922.1  46336   .   G   A,R 0   PASS    NS=1:DP=26:GU=A/G   GT:GQ:DP    1/2:0:26

When I tried to combine the VCF files with GATK, I got an error because of these degenerate alleles.

I have preprocessed my samples in two different ways. My goals are:

  • Count and estimate the percentage of degenerate sites (with R/Y). I can count how many total sites there are with bcftools but I don't know how to count degenerate sites.
  • After knowing which preprocessing is better, I would like to filter those degenerate bases/sites to finally make my dataset.
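Both goals above could be approached with a short script along the following lines. This is only a minimal sketch, not a tested pipeline: the names `is_degenerate_site` and `count_and_filter` are made up for illustration, the input is assumed to be an uncompressed single-sample VCF, and a site is treated as degenerate when any ALT allele contains an IUPAC ambiguity code (R, Y, and so on; N is included here as an assumption). Whole sites are dropped rather than single alleles, because removing only the R/Y allele would also require renumbering the genotypes (e.g. `1/2`).

```python
import sys

# Standard bases and IUPAC ambiguity codes. Treating N as degenerate is an assumption.
STANDARD = set("ACGT")
DEGENERATE = set("RYSWKMBDHVN")


def is_degenerate_site(alt_field):
    """Return True if any ALT allele contains an IUPAC ambiguity code."""
    for allele in alt_field.split(","):
        allele = allele.upper()
        # Only look at plain base strings, so symbolic alleles such as
        # <DEL> or * are never flagged by accident.
        if allele and all(ch in STANDARD | DEGENERATE for ch in allele):
            if any(ch in DEGENERATE for ch in allele):
                return True
    return False


def count_and_filter(lines):
    """Return (total_sites, degenerate_sites, kept_lines).

    kept_lines holds all header lines plus every record without a
    degenerate ALT allele.
    """
    total = 0
    degenerate = 0
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # keep header lines untouched
            continue
        # VCF is tab-delimited; fall back to whitespace for pasted snippets.
        fields = line.split("\t") if "\t" in line else line.split()
        total += 1
        if is_degenerate_site(fields[4]):  # column 5 is ALT
            degenerate += 1
        else:
            kept.append(line)
    return total, degenerate, kept


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        total, degenerate, kept = count_and_filter(handle)
    pct = 100.0 * degenerate / total if total else 0.0
    # Statistics go to stderr so stdout contains only the filtered VCF.
    print(f"{degenerate}/{total} degenerate sites ({pct:.2f}%)", file=sys.stderr)
    print("\n".join(kept))
```

Saved under a hypothetical name such as `degenerate_sites.py`, it could be run as `python degenerate_sites.py V00001.vcf > V00001.filtered.vcf` once per sample and per preprocessing, and the percentages printed to stderr compared before merging the filtered files.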

Thanks.

VCF degenerate-bases bisulfite-sequencing • 285 views

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code, or fenced code blocks for multi-line code; fenced code blocks also enable syntax highlighting. If your code has a long line with a single command, break it into multiple lines with proper escape sequences so it is easier to read and still runs when copy-pasted. I've done it for you this time.


What was the error that GATK returned?
