Understanding tabix output
1
0
Entering edit mode
8.8 years ago
bjlemmer ▴ 20

I am brand new to sequence data and am not sure if this is even the correct place to ask this question. I am simply trying to understand what is in my data set. I have prefiltered sequence snp data file saved as a vcf.gz file with its corresponding vcf.gz.tbi file on linux based server. I can run tabix -h <my file> and get headers which are the standard headers below followed by my samples.

#CHROM  POS ID REF ALT QUAL  FILTER INFO FORMAT

When I run

tabix <my file> chr1 0-1

I get a ton repeating units/lines which look like this:

chr1    36560865        .       T       G       2379.81 PASS    AC=4;AF=0.024;AN=166;BaseQRankSum=-1.216e+00;ClippingRankSum=-3.170e-01;DP=2210;FS=0.633;InbreedingCoeff=-0.0255;MLEAC=4;MLEAF=0.024;MQ=60.00;MQ0=0;MQRankSum=0.011;PG=0,13,31;QD=17.25;ReadPosRankSum=0.810;SOR=0.607;VQSLOD=1.82;culprit=FS  GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP 0/0:15,0:15:49:.:.:.:.:0,36,521:0,49,552     0/0:21,0:21:73:.:.:.:.:0,60,748:0,73,779        0/0:22,0:22:73:.:.:.:.:0,60,900:0,73,931   ...

I have as many repeats of the section

0/0:21,....:0,73,779

as I do samples. I tried running tabix <my file> chr1 36560865-36560865 thinking it would only print the line containing position 36560865 however it seems to print every line. Any fix for this?

Secondly is there a way to only print select columns and rows of a vcf.gz file? I tried uses the tabix options -b 1 -e 1, but still appears to print every column of every line. I would like to be able to look values for only columns I to j and rows r to s. For example, looking at the first 10 lines of the first two samples.

vcf tabix • 7.3k views
ADD COMMENT
4
Entering edit mode
8.8 years ago

I get a ton repeating units/lines which look like this

and there is a specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf

I tried running tabix

not

chr1 36560865-36560865

but (colon)

chr1:36560865-36560865

Secondly is there a way to only print select columns and rows of a vfc.gz file?

use cut (unix tool) or tools like bcftools to select a given sample

For example, looking at the first 10 lines of the first two samples.

gunzip -c file.vcf.gz | grep -v "#" | head -n 10 | cut -f1-11

learning linux command line is the key.

ADD COMMENT
0
Entering edit mode

I found that specification document, however I have outputs such as FS, MLEAC, MLEAF, PG, etc. which are not specified in that document. I don't know if these will be important to my data or not as I am mainly after the allele.

I initially had the ":", however when I add it, I get no output at all, Even if i change it to include a range of 1000 positions. I did get the last suggestion to work. I assume grep -v '#' is so it skips over the header line? I will probably bi-pass the ":" problem with sed and position column. I just started using linux about a month ago so I understand that is a big part of understanding this all.

I am also getting an error after it runs of

gzip: file.vcf.gz: unexpected end of file

I assume that this means the file is corrupt somewhere along the line?

Thanks again for help.

ADD REPLY
1
Entering edit mode

For tabix input you should use

bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz #Creates tabix index file

Then do whatever you want on your VCF using tabix.

ADD REPLY
1
Entering edit mode

The FS, MLEAC, MLEAF, etc are data associated with the sequencing, variant, or extra annotations added by software to your sequencing results (variant caller, GATK annotatevariants, other variant annotation software). These are all part of the INFO string, which is defined in the specification. There are some reserved fields (DP for instance), but otherwise they are specific to whatever software ran on or created the VCF and put them there. They are defined in the header of the VCF field itself as to what they are.

ADD REPLY
0
Entering edit mode

thank you very much. after months of working with this I am starting to unstand these files better. Thanks for the help.

ADD REPLY

Login before adding your answer.

Traffic: 2019 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6