I am doing mouse exome sequencing SNP calling using BaseRecalibrator from GATK. It requires input of mouse snp database with VCF format, who has experience on this? What I have is snp137.txt downloaded from Illumina resource. Thank you,
I am doing mouse exome sequencing SNP calling using BaseRecalibrator from GATK. It requires input of mouse snp database with VCF format, who has experience on this? What I have is snp137.txt downloaded from Illumina resource. Thank you,
You can use vcf file for SNPs from Mouse Genome Project P (ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/). They also have mm9 version. Look carefully on their page in case you need mm9. You can use the whole file or extract the SNP calls for strain of your interest and use it. I am sure the SNP vcf file from MGP should be as comprehensive as dbSNP in terms of number of SNPs. I work on a particular mouse strain so i don't use the full file but SNPs between that particular strain and reference strain B6.
Hi Tony,
In case you are using VCF file from MGP, you still have to add "chr" string in front of each line except the header information. Also, make sure that the order of chromosome in VCF file is same as order of chromosome in BAM file (both chromosome order in header and chromosome order in sequences). I am sure it is not. To get the SNPs for your particular strain you can use the below code which is not the best one but will still work. The usage is python code.py VCFfile_MGP number_of_column_for_that_strain(129P2 is 9, 129S1 is 10) New_vcf_file
import re,sys,fileinput
Argument = []
Argument = sys.argv[1:]
Filepath = Argument[0]
Strain_column = int(Argument[1])
Outpath = Argument[2]
newfile = open(str(Outpath),"w")
for line in fileinput.input([Filepath]):
if line.startswith("#"):
if line.startswith("##"):
newfile.write(str(line))
continue
else:
header = ""
header = ("\t".join(line.split("\t")[0:9]))+"\t"+line.split("\t")[Strain_column]+"\n"
newfile.write(str(header))
rowlist = []
rowlist = line.split("\t")
genotype = []
genotype = rowlist[Strain_column].split(":")
if genotype[-1] == "1":
if genotype[0] == "1/1" or genotype[0] == "2/2" or genotype[0] == "3/3" or genotype[0] == "4/4" or genotype[0] == "5/5" or genotype[0] == "6/6" or genotype[0] == "7/7":
newline = ""
newline = "chr"+("\t".join(rowlist[0:9]))+"\t"+rowlist[Strain_column]+"\n"
newfile.write(str(newline))
newfile.close()
ashutoshmits, I am trying make the VCF file "compatible" with the genome.fa ( indexed reference file downloaded from Illumina based on mouse mm10). HOwever, it did not work easily due to my limited experience on this, could you suggest me how to make it? what kind of tools you use to like add "chr" to each line and order the chromosome which is corresponding to the genome.fa chromosome order, et ac. Or if possible, would u like to share with me for variant VCF file (C57BL/6J and 129S6/SvEvTac background) which are ready for using?
thank you so much!
The script above is the script you need to use. Just copy and paste it in a file and save the file as script.py . Then put the script in a system that has python and use command: python script.py MGP_all_strains.vcf_file (name of the big vcf file) column_number_strain New_output_file
Ashutosh,
I ran the script as indicated and it seems to have appended the contig as indicated, and also selected the strain of interested (e.g. 16 = C57BL/6NJ. However, I noticed that the size of the new .vcf file is much diminished as compared to the original, and when looking at the file itself the number of lines has decreased by multiple orders of magnitude. When writing the new file does your script remove entries/rows that were present in the original. If so what criteria are used to determine when an entry/row is to be eliminated?
Thanks!
Regards,
John
Dear John,
The big vcf file contains a comprehensive list of all variants that have been identified in different mouse strains. A large number of these variants are only present in a few strains and absent from the others. Some of them are even unique (private variants). For example, those variants that are unique to the reference genome (C57BL/6J) will appear in all the other strains because sequence data from all the other strains was compared against the reference genome. My script will fetch all the variants that have been identified for a strain irrespective of if they are unique to that particular strain or also present in other strains. Now coming back to your question, C57BL/6NJ is a substrain of C57BL/6J (reference) so they share a large fraction of IBD (identical by descent) regions. In other words, these strains have been recently separated and that's why don't show lots of differences when compared to each other. Both strains had a fairly recent common ancestor. Researchers started maintaining them (C57BL/6J and C57BL/6NJ) separately at Jackson and NIH and during this time both of these strains acquired spontaneous mutations. Nevertheless they shouldn't differ much and if I remember correctly they have ~20,000 variants (a fairly small file compared to the big file) between them. Actually, the real number should be even less than 20,000 if you remove the low quality variants.
Thank you, Rm, I did use the mouse snp data with VCF format from the link that you provided. However, when I run GATK, there is error as follows,
ERROR MESSAGE: Input files /raid1/rzeng/reference/mousesnp.vcf and reference have incompatible contigs: No overlapping contigs found
Seems like the reference sequence I downloaded from Illumina and VCF format mouse snp database are not compatible. I may consider to convert mouse snp137.txt to VCF format with whatever softwares. Any one has experience on this?
I will keep updated on my progress on this, thanks!
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/VCF/00--README.txt