Extract the SNP list of a VCF file
3
2
Entering edit mode
7.2 years ago
janhuang.cn ▴ 230

Is it possible to extract the list of all SNP (save as a .txt) of a VCF file? I do not want to convert it into .bed, and check the .bim, because the VCFs I have are large, and I have many of them.

VCF SNP • 16k views
ADD COMMENT
1
Entering edit mode

Hello,

what exactly do you like to extract? Just the ID of a variant if there is one? Just the position and the REF/ALT?

fin swimmer

ADD REPLY
0
Entering edit mode

I want to the chr, pos, and rsid.

ADD REPLY
0
Entering edit mode
$ sed  '/##/d' example.vcf | cut -f1,3,2

ouptut:

#CHROM  POS ID
29  11  Chr29:11
ADD REPLY
0
Entering edit mode

Try vcflib vcf2tsv function to convert vcf to tab separate file and you can extract any information you want.

ADD REPLY
9
Entering edit mode
7.2 years ago

I want to the chr, pos, and rsid

grep -v "^##" input.vcf | cut -f1-3

EDIT 2021: bcftools query -f '%CHROM\t%POS\t%ID\n' input.vcf

ADD COMMENT
0
Entering edit mode

Thanks. This is very helpful.

Is it possible to apply to vcf.gz? This command return the filename of VCF, if I apply it to vcf.gz file.

ADD REPLY
2
Entering edit mode

Replace grep with zgrep.

ADD REPLY
0
Entering edit mode

Thank you very much.

ADD REPLY
0
Entering edit mode
7.2 years ago

The vcf2bed binary uses Unix streams, so it will be about as fast as any extraction gets:

$ vcf2bed < snps.vcf | cut -f4 > ids.txt

There are --insertions, --deletions and --snvs options to get subsets of the input VCF. See vcf2bed --help for more detail.

ADD COMMENT
0
Entering edit mode

Combine with the command suggested by ATPoint, I used the below command to extract the list, extracting chr, pos and rsid.

vcf2bed < foo.vcf | awk '{if (length($4) == 1 && length($5) == 1 || $1 ~ /^#/) print $0}'| cut -f1,3,4 > ids.txt

But when I use my VCFs, I got an error as below. I tried it with a few VCF I have, it is always line 164, but the numbers of Segmentation fault are different

 line 164: 46368 Segmentation fault      ${cmd} ${options} - 0<&0
ADD REPLY
0
Entering edit mode

The output of vcf2bed is BED, so whatever you have downstream of that would need to be processed as a BED file, with fields in that ordering.

ADD REPLY
0
Entering edit mode
7.2 years ago
ATpoint 86k

As characteristic for a SNP, one nucleotide is exchanged by another. Therefore, unlike InDel where gain or lost takes place, the length of the REF and ALT column must be 1. Using that to discriminate from Indel with awk:

# || $1 ~ /^#/ makes sure that the header and column names are printed:   
awk '{if (length($4) == 1 && length($5) == 1 || $1 ~ /^#/) print $0}' foo.vcf

Instead of printing $0, you can specify any information you want from the current line and append (>>) it to a new file.

ADD COMMENT
0
Entering edit mode

using length($5) is not enough, for example with multiple ALT:

REF =A
ALT=T,C

is a SNP

ADD REPLY
1
Entering edit mode

Ok, got me on this one. I never had files with entries of this type.

ADD REPLY

Login before adding your answer.

Traffic: 1515 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6