I am brand new to sequence data and am not sure if this is even the correct place to ask this question. I am simply trying to understand what is in my data set. I have prefiltered sequence snp data file saved as a vcf.gz file with its corresponding vcf.gz.tbi file on linux based server. I can run tabix -h <my file>
and get headers which are the standard headers below followed by my samples.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
When I run
tabix <my file> chr1 0-1
I get a ton repeating units/lines which look like this:
chr1 36560865 . T G 2379.81 PASS AC=4;AF=0.024;AN=166;BaseQRankSum=-1.216e+00;ClippingRankSum=-3.170e-01;DP=2210;FS=0.633;InbreedingCoeff=-0.0255;MLEAC=4;MLEAF=0.024;MQ=60.00;MQ0=0;MQRankSum=0.011;PG=0,13,31;QD=17.25;ReadPosRankSum=0.810;SOR=0.607;VQSLOD=1.82;culprit=FS GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP 0/0:15,0:15:49:.:.:.:.:0,36,521:0,49,552 0/0:21,0:21:73:.:.:.:.:0,60,748:0,73,779 0/0:22,0:22:73:.:.:.:.:0,60,900:0,73,931 ...
I have as many repeats of the section
0/0:21,....:0,73,779
as I do samples. I tried running tabix <my file> chr1 36560865-36560865
thinking it would only print the line containing position 36560865 however it seems to print every line. Any fix for this?
Secondly is there a way to only print select columns and rows of a vcf.gz file? I tried uses the tabix options -b 1 -e 1
, but still appears to print every column of every line. I would like to be able to look values for only columns I to j and rows r to s. For example, looking at the first 10 lines of the first two samples.
I found that specification document, however I have outputs such as FS, MLEAC, MLEAF, PG, etc. which are not specified in that document. I don't know if these will be important to my data or not as I am mainly after the allele.
I initially had the ":", however when I add it, I get no output at all, Even if i change it to include a range of 1000 positions. I did get the last suggestion to work. I assume
grep -v '#'
is so it skips over the header line? I will probably bi-pass the ":" problem with sed and position column. I just started using linux about a month ago so I understand that is a big part of understanding this all.I am also getting an error after it runs of
I assume that this means the file is corrupt somewhere along the line?
Thanks again for help.
For
tabix
input you should useThen do whatever you want on your VCF using
tabix
.The FS, MLEAC, MLEAF, etc are data associated with the sequencing, variant, or extra annotations added by software to your sequencing results (variant caller, GATK annotatevariants, other variant annotation software). These are all part of the INFO string, which is defined in the specification. There are some reserved fields (DP for instance), but otherwise they are specific to whatever software ran on or created the VCF and put them there. They are defined in the header of the VCF field itself as to what they are.
thank you very much. after months of working with this I am starting to unstand these files better. Thanks for the help.