How To Efficiently Remove Snps That Are Present In All Samples?
5
7
Entering edit mode
13.9 years ago
Ian 6.1k

I have identified SNPs in 32 resequenced samples relative to the same reference genome. The output is in VCF format, using the mpileup method. I would like to efficiently remove SNPs that are present in all 32 of the samples as they are likely to be present due to differences between the reference and the resequenced samples.

I can work out a slow and probably inefficient method in PERL, but i was wondering whether anyone has tackled this sort of question or even whether such a task could be accomplished within samtools?

CLARIFICATION: The 32 samples are separate strains, i.e. may be expected to contain novel mutations relative to the reference. They are not 32 samples of the same strain, therefore when mpileup is run with all samples together i end up with very few SNPs. This is why i ran mpileup 1 sample at a time as i want to find novel SNPs between strains

Thanks.

samtools snp vcf • 13k views
ADD COMMENT
0
Entering edit mode

just out of curiosity, you have got 1 vcf file containing all the results, right? or do you have one file per sample? I think I didn't understand correctly how your resequencing experiment worked.

ADD REPLY
0
Entering edit mode

There were 32 separate VCF files. I have now then all together, as suggested by lh3

ADD REPLY
7
Entering edit mode
13.9 years ago
lh3 33k
perl -ne 'print if /AF1=([^;s]+)/&&$1<0.99'
ADD COMMENT
2
Entering edit mode

If the observed non-reference allele frequency is >= 0.99, it is clearly present as a homozygote in each of your samples. This approach only works if the AF1 tag is populated in your VCF. Otherwise, you will need to compute the AF directly from the genotypes.

ADD REPLY
0
Entering edit mode

Can you explain this please.

ADD REPLY
0
Entering edit mode

I expect this is the best official way to do what i want, but i do not understand this solution. I'll need to look more closely at the VCF format.

ADD REPLY
0
Entering edit mode

I used AF1 because Ian said that this is the samtools output.

ADD REPLY
6
Entering edit mode
13.8 years ago

You can use vcftools to achieve this:

  1. Start by finding those positions that occur in all the files (replace '3' by the actual number of files)

    vcf-isec -o -n =3 A.vcf.gz B.vcf.gz C.vcf.gz (...) | bgzip -c > intersect.vcf.gz
    
  2. Exclude from A positions which appear in the intersection

    vcf-isec -c A.vcf.gz intersect.vcf.gz > newA.vcf
    
ADD COMMENT
0
Entering edit mode

Any thoughts on why i am getting this error?

Warning: The column names do not match (e.g. sample1.bam):
sample1.bam
sample2.bam
 at /home/idonaldson/sources/vcftools_0.1.4a/bin/vcf-isec line 22

I run bgzip on my .vcf files, then tabix on the .vcf.gz files.

ADD REPLY
0
Entering edit mode

I believe you have one column on your vcf files that changes from file to file. My guess is that the last column (look for a line starting with #) can be changed to something like 'sample1'. That should fix it.

ADD REPLY
4
Entering edit mode
13.9 years ago

I would extract the mutations present in all your VCFs (here, say 5 samples)

cut -d '   ' -f 1-5 *.vcf |\ #chrom,pos,id,ref,alt
egrep -v "^#" |\ #remove the header
sort |\
uniq -c |\ # -c is for 'count'
egrep "^      5 " |\ #keep the mutations present 5 times
cut -c 9- > snp.txt #remove the count

You can the use this snp.txt to filter out your VCF with

grep -f snp.txt -v sample1.vcf > sample1.filtered.vcf

(it could be slow for a large number of snps) or by using unix join -v (faster , but you'll need to create a extra column in your VCFs to create a uniq key(chrom/position/ref/alt)

ADD COMMENT
0
Entering edit mode

I like the simplicity of the first part, but i would actually like to remove the contents of 'snp.txt' from each sample. Is there an equally simple way to do that?

ADD REPLY
0
Entering edit mode

Great! I have also learnt a few new functions for 'cut' and 'grep'. Just one thing - the vcf file is tab delimited, so -d is not required in the first line.

ADD REPLY
0
Entering edit mode

-1. Sorry I have to downvote this answer as it does not do what Ian has intended.

ADD REPLY
0
Entering edit mode

There was an error: I updated my script 5 minutes ago and I changed cut '1,2,4,5' to '1-5' . Does it work now ?

ADD REPLY
0
Entering edit mode

Ian would want to find SNPs that are homozygous in all samples, so you have to look at the genotype field. However, if the coverage is shallow, the genotype is not accurate. A better way is to check AF1 which is estimated differently. In addition, given 32 samples, it would be preferred to call SNPs jointly thus we have one output VCF.

ADD REPLY
0
Entering edit mode

OK my lack of understanding of this is not helping. The samples are of bacteria (haploid), so wouldn't i just expect to see a single base variant. E.g. chr 10000 . C A

ADD REPLY
0
Entering edit mode

Pierre it seemed to work fine as -f 1,2,4,5 as the ID column is just .

ADD REPLY
0
Entering edit mode

Even for bacteria, you would also want to kill homozygous sites only. Heterozygotes reported by samtools usually means something of biological interest and you may want to keep them.

ADD REPLY
0
Entering edit mode

I'll try running all 32 sample together - thanks everyone!

ADD REPLY
2
Entering edit mode
13.9 years ago
Michael 55k

There is Rsamtools, an implementation of Samtools in R and Bioconductor.

I have never used it nor worked with VCF files but it looks so straight forward that I am quite positive you can easily accomplish what you want to do. I believe it would even pay off to try it if you never used or intended to use R before or have to install R for that purpose.

I could help you if you need further assistance because the result of parsing is a standard GRanges object which is easy to deal with.

Here is a short Howto, that should be sufficient to read in the files

See also ?readPileup and example(readPileup)

The algorithm how to do this is:

  1. Read all your VCF into R yielding N = # samples GRanges objects
  2. Combine all GRanges objects into a single object
  3. Find the interval regions with coverage = N, that's your SNPs occurring in all samples.

Btw, does 'present in all samples' refer to a certain allele?

Put a comment here if you want to try this method, or post some code if you get stuck.

ADD COMMENT
0
Entering edit mode

Thanks Michael i will certainly give this method a look. The samples are bacterial so haploid. I guess i should say i want to identify and eliminate all mutations (relative to the reference) that are present in all resequenced genomes.

ADD REPLY
0
Entering edit mode

If they are all haploid, can't you just grep for "0/0"? That, just retrieve lines that have at least one sample whose haploid genome is the same as the reference? samtools assumes diploid, so I think the genotypes will still be reported this way.

grep "0/0" [VCF]
ADD REPLY
0
Entering edit mode
13.2 years ago
Deniz ▴ 210

Hi Ian,

Could you tell me which parameters did you use for mpileup to compare 32 samples with reference genome?

Did you default parameters?

samtools mpileup -ugf *.fa *.bam *.bam *.bam *.bam *.bam *.bam *.bam *.bam
bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | bcftools/vcfutils.pl varFilter -D100 > var.flt.vcf

Could you write your parameters?

Thanks

ADD COMMENT
0
Entering edit mode

In the end i ran each sample separately. But the example you give looks about right to me.

ADD REPLY
0
Entering edit mode

Why we are using u and g together -ugf?

ADD REPLY

Login before adding your answer.

Traffic: 3365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6