Need Suggestions For The Sequencing Data Analysis From Mouse Community
2
2
Entering edit mode
11.1 years ago
Tonyzeng ▴ 310

HI, I am new in next generation sequencing data analysis (exome sequencing) in mouse community, Very appreciated for giving me any suggestions if you have done with mouse sequencing work or you are doing it.

Here is what i have done so far.

  1. Quality control (no problem)

  2. short reads alignment by BWA ( I downloaded indexed mouse reference data (build mm10) from Illunmina which included genome.fa, genome.fa.amb/ann/bwt/fai/pac) . It was running good with using these indexed reference data without any problems. ( i used the same version of BWA as used in Illunmina data)

  3. Mark PCR duplicates after sorting the SAM files and converting them into BAM files, I used PICARD to mark pcr duplicates.

  4. Creating indels table by RealignerTargetCreator using GATK software. (To use GATK, i generated genome.dict and genome.fai file using PICARD)

  5. realigning reads around indels. I used GATK with Indelrealigner and it run good so far.

  6. Quality score recalibration I used BaseRecalibrator of GATK to recalibrate base quality. Here, I downloaded mouse (snp137.vcf) variant data from Sanger. I was struck right here because the vcf data and reference data (genome.fa from Illumina) have incompatible contigs.

I need suggestions on the follows,

  1. Do I need to index the mouse mm10 reference data using BWA but giving up using Indexed data from Illumina from the beginning? or It is good to use the data downloaded from only one resource ? in my pipeline, you can see that I use the indexed reference data or snpdata from two different resources (Illunima and sanger)

  2. Where can I download the compatible or ready to use mouse reference data and VCF format snpdata (build mm10)? What I have collected of SNP data in my computer are mouse snp137.txt file (from Illumina), dbsnp137.vcf (from Sanger) and SC_MOUSE_GENOMES.genotype.vcf (from NCBI). As for reference data, I only use the genome.fa downloaded from Illumina.

I really need to make it consist which reference data (NCBI, Sanger or Illunmina) and dbsnp database (NCBI, Sanger or Illumina) are used in data analysis pipeline that will make my analysis more straight forward.

dbsnp reference mouse gatk vcf • 4.4k views
ADD COMMENT
1
Entering edit mode
11.1 years ago

The incompatible contigs mean either the chromosome names are different between BAM files and the vcf files. Most probably "chr" string. Another possibility is that the order of chromosomes may be different between BAM and the vcf file. As I have already mentioned earlier if your strain or any strain that is close to your strain is in Mouse Genome Project, you should only use its SNP calls from it to calibrate the quality scores. You can also try the SNP data for all the strains, it wont make a big difference.

ADD COMMENT
0
Entering edit mode

Thank you Ashutoshmits, for my vcf files and reference file, they do have difference on both chromosome name (Chr1 in Reference and 1 in Vcf) and the order of chromosomes (10,11,12....19,1, 2,....Y in reference and 1,10...19,2,3...Y in VCf) also the header of VCF includes ##contig= 1,10,,,,19,2,3...Y which has the same order with the data line of each variant. Since I do not have script for using, i can not do this easily.

ADD REPLY
1
Entering edit mode

Hi Toni, I had faced similar issues and I could help you but please take it as an opportunity to acquaint yourself with little programming or unix commands. They will help you for sure. First of all read this link http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept. Now the easiest way for you would be to add "chr" string in VCF file. If you use my script that I have provided it will give you new vcf file with lines beginning with "chr". it will also take care of the headings. Now you need to sort the other files. I am writing a new answer to make it more clear. See below.

ADD REPLY
1
Entering edit mode
11.1 years ago

See the input file description: http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept

Basically you have three files Reference fasta ( with chromosomes in some order), BAM files (with chromosomes in some order) and VCF file (with chromosomes in some order).

1) Normally, BAM file produced from aligner follows the chromsome order of the input reference fasta file. 2) But if you have sorted the BAM file , then it may have changed the chromosome order (different from reference fasta). 3) VCF file you downloaded from MGP is in different order and doesnt have "chr" charcter in the beginning of a line.

Easiest thing to do now would be to:

1) Check the order of chromosome in BAM file. Make sure the header and reads follow the same chromosome order. 2) Download the individual fasta files from UCSC genome browser and concatenate them in an order that follows the chromosomal order in BAM file. Index this reference fasta file. 3) Add "chr" in VCF file and sort it so that it follows the same chromosomal order as BAM file.

Now, it should work for you. Use new reference fasta file.

ADD COMMENT
1
Entering edit mode

I also searched on biostars and there are plenty of helpful posts. Try searching them. For example, thsi can be useful : VCF sort according to order of the reference file

ADD REPLY
0
Entering edit mode

Fantastic! Thank you, ashutoshmits

ADD REPLY
0
Entering edit mode

hi, Ashutoshmits,

I have checked the chromosome order of the HEADER of the BAM file and reference fasta, they are the same order and I do not have to change anything now. However, BAM file has both chromosome order of header and reads, how can I check the chromosome order of the read too? in addition, I made a new VCF format file which has replace chromosome number to Chr 1, 2 ,3 and I have reordered them perfectly according to the reference order. However, Contig=< ID order was still not the right order with reference file, is it necessary to change it and how?

Thanks

ADD REPLY
0
Entering edit mode

To check the order of chromosome followed by reads in BAM file try the command below. It assumes that you have already sorted bam file.

samtools view input.bam | grep -v "^#" | cut -f1 | uniq

Tell me the output.

ADD REPLY
0
Entering edit mode

Here is the output and it never stop...... like this (is this mean it starts from chromosome 7 in the read line)

$ /raid1/rzeng/softwares/samtools-0.1.19/samtools view K1_bam.marked.realigned.fixed.bam | grep -v "^#" | cut -f1 | uniq &

120920_SN743_0324_AC14GBACXX:7:2113:8031:5444:1#0
120920_SN743_0324_AC14GBACXX:7:2113:8031:5444:1#0/3
120920_SN743_0324_AC14GBACXX:7:2102:9117:86456:1#0/3
120920_SN743_0324_AC14GBACXX:7:2102:9117:86456:1#0
120920_SN743_0324_AC14GBACXX:7:1107:6619:5777:1#0
120920_SN743_0324_AC14GBACXX:7:1107:6619:5777:1#0/3
120920_SN743_0324_AC14GBACXX:7:1105:3372:25293:1#0
120920_SN743_0324_AC14GBACXX:7:1105:3372:25293:1#0/3
120920_SN743_0324_AC14GBACXX:7:1207:7253:81786:1#0
120920_SN743_0324_AC14GBACXX:7:1207:7253:81786:1#0/3
120920_SN743_0324_AC14GBACXX:7:2216:6169:57764:1#0
..............................................................................................
ADD REPLY

Login before adding your answer.

Traffic: 1791 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6