Tabix For Retrieving High Quality SNPs From Sanger Mouse Project
11.3 years ago
Sakti ▴ 530

Dear Biostars,

I'm trying to get all the high quality SNPs of the strains 129S1 and 129S5 from the Sanger Mouse sequencing project. I have previously used tabix for this, specifying chromosomal ranges of interest and then filtering by strain and call quality (i.e. high-confidence SNPs). I usually use the following command:

tabix -h ftp://ftp-mouse.sanger.ac.uk/REL-1111-SNPs/mouse-snps-all.annots.vcf.gz 7:123000000-124000000 > 7_123Mb_124Mb_129S1_5_SNPs_Sanger.txt

However, doing this always gives me gigantic (>1 GB) files, which I then have to process to find my 129S1 and 129S5 SNPs relative to the C57BL/6 reference.

This time I want to do the whole genome and get all high-confidence SNPs using tabix, but I'm confused about the syntax and how to tell the program to retrieve only the 129S1 and 129S5 high-confidence SNPs relative to the reference.

Anyone done something similar???

Thanks!!

Sakti

tabix snp sanger • 3.3k views

For the whole genome, don't limit the query to a specific region as you did above; it's better to download the .vcf.gz file and its .vcf.gz.tbi index and run tabix on them locally.

11.3 years ago

For call quality, you can use the records marked "PASS" in the FILTER column; the QC has essentially been done for you. To get the SNP calls for a particular strain, use the per-sample genotype fields that are reported for every strain they sequenced (in the VCF FORMAT column, GT is the genotype call and GQ is its quality). Make sure the calls are homozygous. You can read more about the VCF format at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41.
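
As a rough sketch of the two checks described above (FILTER equal to PASS, homozygous call in the strains of interest), an awk filter over a VCF stream could look like the following. The function name `pass_hom_alt` is made up for illustration, and the sample column names `129S1` and `129S5` are an assumption; check the `#CHROM` header line of the actual file. It also assumes GT is the first FORMAT subfield, which the VCF specification requires whenever GT is present.

```shell
# Keep header lines plus PASS records in which every named sample is
# homozygous for a non-reference allele (e.g. 1/1).
# Hypothetical helper; usage: pass_hom_alt "129S1,129S5" < input.vcf
pass_hom_alt() {
  awk -F'\t' -v want="$1" '
    BEGIN { OFS = "\t"; n = split(want, names, ",") }
    /^##/ { print; next }                       # meta-information lines
    /^#CHROM/ {                                 # column header: locate the samples
      for (i = 10; i <= NF; i++) col[$i] = i
      for (j = 1; j <= n; j++)
        if (!(names[j] in col)) {
          print "missing sample: " names[j] > "/dev/stderr"
          exit 1
        }
      print; next
    }
    $7 != "PASS" { next }                       # column 7 is FILTER
    {
      for (j = 1; j <= n; j++) {
        split($(col[names[j]]), f, ":")         # GT is the first subfield
        m = split(f[1], a, "[/|]")              # unphased (/) or phased (|)
        if (m != 2 || a[1] != a[2] || a[1] == "0" || a[1] == ".") next
      }
      print
    }'
}
```

The output of a `tabix -h` query can be piped straight into this function, so the large intermediate file never has to be written to disk.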


Thanks! Yes, but my question is how to indicate the PASS on the tabix command for filtering, plus the strains, that's where I get stuck :(

11.3 years ago

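tabix writes its output as a VCF stream; it only selects records by region and does no filtering of its own. You can use a Unix pipe ("|") into grep "PASS" to keep the SNPs that passed QC (note that a plain grep also drops any header lines that do not contain "PASS"). You can then pipe the result into vcftools (http://vcftools.sourceforge.net/) and extract the strains you want.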

Some command like

tabix -h ftp://ftp-trace.ncbi.nih.gov/18mousestrainsgenomes/ftp/release/someXYZ.genotypes.vcf.gz 1:1000000-1001000 | perl vcf-subset -c Sample1 | bgzip -c > sample1.vcf.gz

This is not the exact command, but something similar should work. Check the vcftools documentation to see which command suits you best.
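
If vcftools is not at hand, the column-selection step can be approximated with plain awk. The sketch below is a dependency-free stand-in for that one step, not a replacement for `vcf-subset`: it only keeps the eight fixed VCF columns, FORMAT, and the requested sample columns. The function name `subset_samples` is hypothetical, and the sample names passed to it are assumed to match the `#CHROM` header of the file.

```shell
# Keep columns 1-9 plus the named sample columns, in the order requested.
# Hypothetical helper; usage: subset_samples "129S1,129S5" < input.vcf
subset_samples() {
  awk -F'\t' -v want="$1" '
    BEGIN { OFS = "\t"; n = split(want, names, ",") }
    /^##/ { print; next }                       # meta-information lines pass through
    /^#CHROM/ {                                 # map sample names to column numbers
      for (i = 10; i <= NF; i++) col[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in col)) {
          print "missing sample: " names[j] > "/dev/stderr"
          exit 1
        }
        keep[j] = col[names[j]]
      }
    }
    {                                           # header line and records alike
      line = $1
      for (i = 2; i <= 9; i++) line = line OFS $i
      for (j = 1; j <= n; j++) line = line OFS $(keep[j])
      print line
    }'
}
```

It would sit in the same place as `vcf-subset` in the pipeline above, between the `tabix -h` query and `bgzip -c`.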


Thanks! Will try this one and see the output :)
