Hi, I am new to next-generation sequencing data analysis (exome sequencing) in mouse. I would very much appreciate any suggestions from anyone who has done, or is currently doing, mouse sequencing work.
Here is what I have done so far.
Quality control (no problem)
Short-read alignment with BWA. I downloaded the indexed mouse reference data (build mm10) from Illumina, which included genome.fa and genome.fa.amb/ann/bwt/fai/pac. Alignment ran without any problems using this indexed reference. (I used the same version of BWA as was used for the Illumina data.)
Marking PCR duplicates. After sorting the SAM files and converting them to BAM files, I used Picard to mark PCR duplicates.
Creating the indel target table with GATK RealignerTargetCreator. (To use GATK, I generated the genome.dict and genome.fai files using Picard.)
Realigning reads around indels. I used GATK IndelRealigner, and it has run well so far.
Base quality score recalibration. I used GATK BaseRecalibrator to recalibrate base qualities. For this step I downloaded mouse variant data (snp137.vcf) from Sanger. I am stuck right here because the VCF data and the reference data (genome.fa from Illumina) have incompatible contigs.
I need suggestions on the following:
Do I need to index the mouse mm10 reference myself with BWA and give up on the pre-indexed data from Illumina from the very beginning? Or is it better to use data downloaded from a single source only? As you can see, my pipeline uses indexed reference data and SNP data from two different sources (Illumina and Sanger).
Where can I download compatible, ready-to-use mouse reference data and SNP data in VCF format (build mm10)? The SNP data I have collected so far are the mouse snp137.txt file (from Illumina), dbsnp137.vcf (from Sanger) and SC_MOUSE_GENOMES.genotype.vcf (from NCBI). As for reference data, I only use the genome.fa downloaded from Illumina.
I really need to be consistent about which reference data (NCBI, Sanger or Illumina) and which dbSNP database (NCBI, Sanger or Illumina) are used in the analysis pipeline; that would make my analysis more straightforward.
Thank you, Ashutoshmits. My VCF file and reference file differ in both the chromosome names (Chr1 in the reference and 1 in the VCF) and the order of the chromosomes (10, 11, 12, ..., 19, 1, 2, ..., Y in the reference and 1, 10, ..., 19, 2, 3, ..., Y in the VCF). The VCF header also includes ##contig= lines for 1, 10, ..., 19, 2, 3, ..., Y, in the same order as the variant data lines. Since I do not have a script for this, I cannot fix it easily.
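For anyone who wants to confirm this kind of mismatch before changing any files, here is a minimal Python sketch. It assumes the reference index is a samtools-style .fai file (contig name in the first tab-separated column) and that the VCF is uncompressed text. The function names are made up for illustration and are not part of GATK, Picard or any other tool.

```python
def fai_contigs(fai_lines):
    """Contig names from the lines of a .fai index, in file order."""
    return [line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()]


def vcf_contigs(vcf_lines):
    """Contig names from VCF data lines, in order of first appearance."""
    seen = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom = line.split("\t", 1)[0]
        if not seen or (chrom != seen[-1] and chrom not in seen):
            seen.append(chrom)
    return seen


def compare_contigs(fai_lines, vcf_lines):
    """Report naming and ordering differences between reference and VCF."""
    ref = fai_contigs(fai_lines)
    vcf = vcf_contigs(vcf_lines)
    shared_in_ref_order = [c for c in ref if c in vcf]
    shared_in_vcf_order = [c for c in vcf if c in ref]
    return {
        "only_in_vcf": [c for c in vcf if c not in ref],
        "only_in_reference": [c for c in ref if c not in vcf],
        # None means no contig name is shared, so order cannot be compared.
        "same_order": (shared_in_ref_order == shared_in_vcf_order
                       if shared_in_ref_order else None),
    }
```

To run it on real files, pass open file handles, for example `compare_contigs(open("genome.fa.fai"), open("snp137.vcf"))`. If every VCF contig shows up under "only_in_vcf", the names themselves differ (such as 1 versus chr1) and need to be fixed before the order can be checked.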
Hi Toni, I have faced similar issues and I can help you, but please take this as an opportunity to acquaint yourself with a little programming or some Unix commands. They will help you for sure. First of all, read this link: http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept. Now, the easiest way for you would be to add the "chr" string to the VCF file. If you use the script I have provided, it will give you a new VCF file with lines beginning with "chr". It will also take care of the header lines. After that, you need to sort the other files. I am writing a new answer to make this clearer. See below.
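To illustrate the idea (this is not the script mentioned above, only a rough sketch of the same approach), the Python function below adds a prefix to the contig names in both the ##contig header lines and the data lines, then sorts the records into the reference contig order. The names `rename_and_sort_vcf`, `ref_order` and `special` are hypothetical. It assumes an uncompressed VCF that fits in memory, and that `ref_order` is the list of contig names taken from the first column of genome.fa.fai.

```python
import re

# Matches the start of a contig header line and captures the contig ID.
CONTIG_ID = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_and_sort_vcf(vcf_lines, ref_order, prefix="chr", special=None):
    """Prefix contig names and sort records into the reference contig order.

    special: optional dict for names that do not follow the prefix rule.
    Returns the new VCF as a list of lines without trailing newlines.
    """
    special = special or {}
    rank = {name: i for i, name in enumerate(ref_order)}

    def rename(name):
        new = special.get(name, prefix + name)
        if new not in rank:
            raise ValueError("contig %r not found in reference" % new)
        return new

    meta, contigs, column_header, records = [], [], [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        match = CONTIG_ID.match(line)
        if match:
            new = rename(match.group(2))
            contigs.append((rank[new], match.group(1) + new + line[match.end():]))
        elif line.startswith("##"):
            meta.append(line)  # other meta lines are kept unchanged
        elif line.startswith("#"):
            column_header.append(line)  # the #CHROM column header line
        else:
            chrom, pos, rest = line.split("\t", 2)
            new = rename(chrom)
            records.append((rank[new], int(pos), new + "\t" + pos + "\t" + rest))

    contigs.sort(key=lambda item: item[0])
    records.sort(key=lambda item: (item[0], item[1]))
    return meta + [c[1] for c in contigs] + column_header + [r[2] for r in records]
```

The prefix is a parameter because the reference may spell it differently from "chr", so check the exact names in genome.fa.fai first. Contigs that do not follow the simple prefix rule (the mitochondrial contig is a common case) can be passed through `special`, for example `special={"MT": "chrM"}` if those are the names your files actually use. For a very large dbSNP file, an in-memory sort like this may need too much RAM, so treat it as a starting point only.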