8.2 years ago
Jautis
Hi, I have a vcf file and I would like to get a site-by-individual matrix of read depths (the DP label) and a second matrix of just the GQ scores.
What is the easiest way to do this? Thanks in advance!
Ex input:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samp1 samp2
chr1 100 . C T 3106.72 SnpCluster . GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,42 0/0:3,0:3:9:0,9,132
chr1 120 . C G 3106.72 SnpCluster . GT:AD:DP:GQ:PL 0/1:3,1:4:30:30,0,123 1/1:0,1:1:3:45,3,0
Ex output for DP:
1 3
4 1
If you need the stats for just one sample (column),
grep -v '^#' test.vcf | cut -f10 | awk -F ':' '{print $3"\t"$4}'
should do. For statistics over multiple samples, I would write a script to parse out the details, which should be pretty straightforward.
Hi, I want to know what the "SnpCluster" displayed in the "info" column of your VCF file means.
Weird that you would necropost a 6-year-old topic for this, but SnpCluster is a default filter in FreeBayes that filters out variants within a certain distance of one another. Typically a mis-modeled indel will show up as multiple mismatches within the same window. This is largely obviated by more modern local-assembly approaches and local realignment, as well as rank-sum annotations for mapping quality or strand direction, which also tend to correlate with "clustered" variants.
Thank you for your reply! I have been doing RNA-Seq-related research recently. Do you think FreeBayes can use transcriptome data for SNV calling?
It can - I refer you to https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1863-4 for a detailed discussion.
I see, thank you so much!
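Coming back to the original question: the one-sample command above can be generalized to all samples with a slightly longer awk script. This is only a minimal sketch. The function name `vcf_tag_matrix` is made up for illustration, and it assumes a tab-delimited, uncompressed VCF with sample columns starting at field 10. It looks up the position of the requested tag in the FORMAT column on every line rather than hard-coding `$3` or `$4`, since the FORMAT layout can differ between sites.

```shell
# Hypothetical helper (not part of any tool): print a site-by-sample
# matrix for one FORMAT tag, e.g. DP or GQ.
# Usage: vcf_tag_matrix DP < test.vcf
vcf_tag_matrix() {
    awk -F '\t' -v tag="$1" '
        /^#/ { next }                        # skip all header lines
        {
            # find where the tag sits in the FORMAT column (field 9)
            n = split($9, keys, ":")
            idx = 0
            for (i = 1; i <= n; i++) {
                if (keys[i] == tag) idx = i
            }
            row = ""
            for (s = 10; s <= NF; s++) {     # one entry per sample column
                m = split($s, vals, ":")
                # print "." when the tag is absent or the entry is truncated
                v = (idx > 0 && idx <= m) ? vals[idx] : "."
                sep = (s > 10) ? "\t" : ""
                row = row sep v
            }
            print row
        }
    '
}
```

Running `vcf_tag_matrix DP < test.vcf > dp.tsv` and `vcf_tag_matrix GQ < test.vcf > gq.tsv` would then give the two matrices. If bcftools is installed, `bcftools query -f '[%DP\t]\n' test.vcf` should produce a similar matrix (with a trailing tab on each line) without any custom script.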