Select Certain Column From Vcf File
4
2
Entering edit mode
12.6 years ago
michealsmith ▴ 800

I have a zebrafish SNP vcf file at hand; in the genotype field, there are information for six strains. However, I'm only interested in TA and TUB strain. Then how can I select these two columns only?

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    AB    TLF    TUB    TUG    WKB    WKG
chr10    587    .    G    A,C    106    PASS    AF1=0.9999;CI95=0.5833,1;DP=9;DP4=0,0,1,5;FQ=-29.7;MQ=60    GT:PL:GQ    1/1:35,3,0,35,3,35:39    1/1:0,0,0,0,0,0:36    1/1:44,6,0,44,6,44:42    1/1:45,11,8,37,0,34:39    1/1:0,0,0,0,0,0:36    1/1:26,3,0,26,3,26:39

I tried to used vcf-subset ; but doesnt' work producing "multiple read header error"

Or can I simply use some bash/python to manipulate?

Also, if I wanna select against the genotype, say, to discard the SNP where genotype for both TA and TUB are "0/0". What shall I do? thx

EDIT: THank you guys. The problem is actually quite silly. The vcf files I got contains multiple header. Seems my collaborator simply concatenate several vcf files using cat instead of merging them. Then I got errors in the processing using both vcftools-subset and python.

Both python and vcftools can easily solve these problem.

So take my lesson to double-check the original vcf file before any processing

vcf • 17k views
ADD COMMENT
2
Entering edit mode

Probably just going to be fastest to script it up. You can also use awk '{print $your.column}' file.vcf

ADD REPLY
1
Entering edit mode
awk '{OFS="\t";print $your.column1_number, $your.column2_number}'  >
   your_2_columns_file

 awk '{OFS="\t"; print $your.column1_number,$your.column2_number}' |egrep -v '0/0\s+0/0'  >
   your_2_columns_file_without_0/0_genotypes

Since you working with large files, you should consider to take a look at unix tutorial.

Another approach is to use Galaxy - take a look at tutorials for text parsing

ADD REPLY
0
Entering edit mode

thx...but seems egrep -v '0/0s+0/0' doesn't work.....actually I don't know how to grep such complex pattern

ADD REPLY
1
Entering edit mode

It would probably be worth figuring out where the error is coming from with vcf-subset as that suggests you have a poorly formatted vcf file

ADD REPLY
5
Entering edit mode
12.6 years ago

You can try using the 'cut' unix command:

cut yourFile.vcf -f 12,13

That would return column 12 and 13. 'f' parameter specifies columns you want to return, 'd' specifies delimiter to cut.

I recently wrote a brief entry about using unix commands to work with tab delimited files: http://blog.nextgenetics.net/?e=28

If you want to use a script solution, you can try this in python:

import sys
for line in open(sys.argv[1],'r'):
  columns = line.strip().split('\t')
  print columns[1], columns[2]

Save as yourName.py and use by: python yourName.py yourFile.vcf

ADD COMMENT
3
Entering edit mode
12.6 years ago

PyVCF (code, documentation) has already done all the hard work of figuring out how to parse a *.vcf file, and presents it to you in a way similar to the csv module in the Python stdlib. The code below should allow you to get started:

import vcf

vcf_reader = vcf.Reader(open(r'path\to\zebrafish.vcf', 'rb'))
for record in vcf_reader:
    for call in record.samples:
        if call.sample=='TUB' or call.sample=='TA':
            if call.gt_nums != '0/0':
                print call.gt_nums
ADD COMMENT
2
Entering edit mode
10.0 years ago
arronslacey ▴ 320

I just get all lines that don't contain #

sed -n '/#/!p' g1.vcf

and then pipe into the cut command:

sed -n '/#/!p' g1.vcf | cut -f 1,8,11

hope that helps.

ADD COMMENT
1
Entering edit mode
12.6 years ago

A command-line one liner might be cooler, but this should work and it's not mysterious. Replace idx_first and idx_second with the index of your genotypes of interest, remembering that python arrays are 0-based.

python
f = open('foo.vcf')
fo = open('foo_stripped.vcf', 'w')
idx_first = 11
idx_first = 12
for line in f:
    if line[0]=='#':
        fo.write(line)
        continue
    a = line.rstrip('\r\n').split('\t')
    out = a[0:9]
    if not "0/0" in a[idx_first] and not "0/0" in a[idx_second]:
        out.append( a[idx_first] )  # first column with genotype
        out.append( a[idx_second] ) # second column with genotype
        fo.write( '\t'.join(out) + '\n')

fo.close()
f.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6