Filter several VCF files for fixed SNPs
9.5 years ago

Hi all,

I have 5 VCF files from 5 different sheep breeds. Each file is the result of variant calling on a DNA pool of thirty individuals. For each breed, I want to find the SNPs that are fixed in that breed and absent from the other breeds. However, the pool is not indexed, so the software processes it as a single individual. Any suggestions would be much appreciated.

Thanks

SNP sequencing • 4.7k views
9.5 years ago
biocyberman ▴ 870

My first thought about this: pooling 30 individuals into one sample is surely an economical solution, but also a bad idea for data analysis. What good is saving 80% of the money by pooling if you waste the remaining 20% because the data are not reliable?

Well, let's hope that does not happen.

I think the first question you have to answer is how to decide whether a variant is really present in each pool. Is it a real variant (i.e. SNP/indel) or an artifact? What does it mean if a variant has coverage < 30 (the number of individuals in the pool)?

Some primary quality checks may even need to be done before variant calling to see what the raw data look like. Did you do this?
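
To put rough numbers on the coverage question: a pool of 30 diploid sheep carries 60 chromosomes, and if reads are drawn more or less evenly from them (an assumption that real pools only approximate), the chance that DP reads all miss an allele present at pool frequency f is (1 - f)^DP. A minimal Python sketch, with a hypothetical function name:

```python
def prob_allele_missed(freq: float, depth: int) -> float:
    """Probability that `depth` reads all miss an allele at pool frequency `freq`.

    Assumes each read is an independent draw from the pool and ignores
    sequencing error; both are simplifications.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return (1.0 - freq) ** depth


POOL_CHROMOSOMES = 2 * 30          # 30 diploid individuals per pool
singleton = 1 / POOL_CHROMOSOMES   # allele carried on a single chromosome

for depth in (10, 30, 100):
    print(depth, round(prob_allele_missed(singleton, depth), 3))
```

Under these assumptions, an allele carried by a single chromosome is missed about 60% of the time at a depth of 30 and a little under 20% of the time at a depth of 100, so a site that looks fixed at low coverage may still be segregating in the breed.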

Now suppose you have the best variants you can get in VCF format. You can use this tool to answer your question:

http://vcftools.sourceforge.net/man_latest.html. Or maybe a better fit in your case: http://vcftools.sourceforge.net/perl_module.html#vcf-compare

COMPARISON OPTIONS

These options are used to compare the original variant file to another variant file and output the results. All of the diff functions require both files to contain the same chromosomes and that the files be sorted in the same order. If one of the files contains chromosomes that the other file does not, use the --not-chr filter to remove them from the analysis.

DIFF VCF FILE

--diff <filename>
--gzdiff <filename>
--diff-bcf <filename>

These options compare the original input file to this specified VCF, gzipped VCF, or BCF file. These options must be specified with one additional option described below in order to specify what type of comparison is to be performed. See the examples section for typical usage.

DIFF OPTIONS

--diff-site

Outputs the sites that are common / unique to each file. The output file has the suffix ".diff.sites_in_files".

--diff-indv

Outputs the individuals that are common / unique to each file. The output file has the suffix ".diff.indv_in_files".

--diff-site-discordance

This option calculates discordance on a site by site basis. The resulting output file has the suffix ".diff.sites".

--diff-indv-discordance

This option calculates discordance on a per-individual basis. The resulting output file has the suffix ".diff.indv".

--diff-indv-map <filename>

This option allows the user to specify a mapping of individual IDs in the second file to those in the first file. The program expects the file to contain a tab-delimited line containing an individual’s name in file one followed by that same individual’s name in file two with one mapping per line.

--diff-discordance-matrix

This option calculates a discordance matrix. This option only works with bi-allelic loci with matching alleles that are present in both files. The resulting output file has the suffix ".diff.discordance.matrix".

--diff-switch-error

This option calculates phasing errors (specifically "switch errors"). This option creates an output file describing switch errors found between sites, with suffix ".diff.switch".


Thank you for the answer. I know that by using pools of individuals we lose a lot of information, but as you pointed out, it was for economic reasons. I tried applying a very strong filter (Q >= 100 and DP >= 100), as we are interested in a very small number of fixed SNPs (private alleles) in each breed for traceability purposes. After filtering, I used the vcf-isec command to find SNPs that are in A.vcf but not in B.vcf or C.vcf:

vcf-isec A.vcf B.vcf C.vcf > out.vcf

Doing it this way, can I be sure that I reach my target?

Thank you in advance

Marco
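
For readers who want to see exactly what such a filter keeps, the thresholds mentioned above (Q >= 100 and DP >= 100) can be sketched in plain Python. This is an illustration, not what vcftools does internally. It assumes QUAL is the sixth column, that depth is stored as a `DP=` entry in INFO, and, because each pool is called as a single sample, that a "fixed" SNP shows up as a `1/1` genotype in the first sample column. The function names and defaults are hypothetical.

```python
BASES = {"A", "C", "G", "T"}


def passes_filter(vcf_line: str, min_qual: float = 100.0, min_depth: int = 100) -> bool:
    """Return True for a high-confidence, homozygous-alternate SNP record."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # need FORMAT plus at least one sample column
    ref, alt, qual, info = fields[3], fields[4], fields[5], fields[7]

    # Keep simple bi-allelic SNPs only (this also drops "1/1" calls at indels).
    if ref.upper() not in BASES or alt.upper() not in BASES:
        return False
    if qual == "." or float(qual) < min_qual:
        return False

    # Read total depth from INFO (assumed to carry DP=<int>).
    depth = None
    for entry in info.split(";"):
        if entry.startswith("DP="):
            depth = int(entry[3:])
    if depth is None or depth < min_depth:
        return False

    # Locate GT through the FORMAT column; the pool is the only sample.
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    if "GT" not in format_keys:
        return False
    genotype = sample_values[format_keys.index("GT")]
    return genotype in ("1/1", "1|1")


def filter_vcf_lines(lines):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("#") or passes_filter(line):
            yield line
```

One caveat with this approach: a site that is missing from another breed's filtered file is not necessarily absent from that breed. It may simply have failed the depth or quality threshold there.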

9.4 years ago
biocyberman ▴ 870

You need to add the -c parameter to the command, like this:

vcf-isec -c A.vcf B.vcf C.vcf > out.vcf
# Also beware that this command compares by position only, not by genotype.
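
To make the position-only caveat concrete, here is a small Python sketch (not vcf-isec itself) that computes the same kind of complement, sites in the first file but in none of the others, once keyed by position and once keyed by position plus alleles. The helper names are hypothetical and the input is assumed to be plain-text VCF lines.

```python
def site_keys(vcf_lines, with_alleles=False):
    """Collect one key per record: (CHROM, POS) or (CHROM, POS, REF, ALT)."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        key = (chrom, int(pos))
        if with_alleles:
            key += (ref, alt)
        keys.add(key)
    return keys


def complement(first, others, with_alleles=False):
    """Keys present in `first` and absent from every file in `others`."""
    private = site_keys(first, with_alleles)
    for other in others:
        private -= site_keys(other, with_alleles)
    return private


breed_a = [
    "1\t500\t.\tA\tG\t200\tPASS\tDP=150",
    "1\t900\t.\tC\tT\t180\tPASS\tDP=140",
]
breed_b = ["1\t500\t.\tA\tT\t190\tPASS\tDP=160"]  # same position, different allele

print(complement(breed_a, [breed_b]))                     # keyed by position
print(complement(breed_a, [breed_b], with_alleles=True))  # keyed by position and alleles
```

In this toy example the position-only complement drops the site at 1:500 because breed B also has a record there, while the allele-aware version keeps it because the alternate alleles differ.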

A more flexible way would be to use vcf-contrast in combination with vcf-query, but you need to merge all VCF files into one file first. Run vcf-contrast -h for some help on usage.
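
The merge-then-contrast idea can be illustrated without the vcftools scripts. The sketch below is not vcf-contrast. It assumes the merged data have already been reduced to a dictionary mapping each site to one genotype string per breed (for example, parsed from vcf-query output), with `./.` marking a pool that has no call at that site. All names are hypothetical.

```python
HOM_REF = {"0/0", "0|0"}
HOM_ALT = {"1/1", "1|1"}


def private_fixed_sites(merged, target):
    """Sites where `target` is homozygous alternate and every other breed
    is called homozygous reference; a missing call disqualifies the site."""
    hits = []
    for site, genotypes in merged.items():
        if genotypes.get(target) not in HOM_ALT:
            continue
        others = [gt for breed, gt in genotypes.items() if breed != target]
        if others and all(gt in HOM_REF for gt in others):
            hits.append(site)
    return sorted(hits)


merged = {
    ("1", 500): {"A": "1/1", "B": "0/0", "C": "0/0"},  # private to A
    ("1", 900): {"A": "1/1", "B": "./.", "C": "0/0"},  # B uncalled: undecided
    ("2", 120): {"A": "1/1", "B": "0/1", "C": "0/0"},  # segregating in B
    ("2", 300): {"A": "0/0", "B": "1/1", "C": "0/0"},  # private to B
}

print(private_fixed_sites(merged, "A"))  # [('1', 500)]
print(private_fixed_sites(merged, "B"))  # [('2', 300)]
```

Treating a missing call as "undecided" rather than "absent" is a deliberate choice here: after merging single-sample files, a pool with no record at a site typically appears as missing, which is not evidence that the breed carries the reference allele.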


Thank you very much. I tried with -c and the results did not change. I will try to create a single .vcf file and use vcf-contrast.
