Mouse strains and Reference genome choice
9.3 years ago
tiago211287 ★ 1.5k

I have RNA-seq data from the BALB/c mouse strain.

Looking for the reference genome on Ensembl, I found that the most recent version, GRCm38, was built using the C57BL/6J strain.

I suppose that the PATCH files also contain haplotypes and variation from other strains, like BALB/c.

Is this information in the primary assembly file?

If I want to, how can I use the patch files for building the index? Just concatenate the files?

Or use the toplevel file for indexing?

Thank you.

reference-genome mouse-strains • 6.3k views
9.3 years ago

If your goal is to reduce allelic bias in RNA-seq read mapping, the PATCH files won't be of much help. You will have to generate a strain-specific reference genome to align your reads against, or use a sensitive aligner. BALB/c has been sequenced as part of the Mouse Genomes Project, and the variant calls (VCF) can be downloaded from the following page: http://www.sanger.ac.uk/resources/mouse/genomes/. There are around 4 million SNPs and around 0.8 million indels between C57BL/6J and BALB/c. Although most of these variants fall in intergenic regions, it would be good practice to align the reads in a haplotype-sensitive manner. You will have to create a customized reference genome by substituting SNPs and small indels, and then perform the alignment. Sanger provides one big VCF file for all the strains, so you will have to 1) extract the variants for the BALB/c strain and 2) substitute them into the reference genome. There are many relevant posts on Biostars that discuss both of these steps in detail.
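To make step 1 concrete, here is a minimal sketch in plain Python of what "extracting the variants for one strain" means for a multi-sample VCF: find the strain's column in the `#CHROM` header line and keep only the records where that strain carries a non-reference genotype. The function name `extract_sample_variants` and the sample names in the toy data (`BALB_cJ`, `129S1_SvImJ`) are assumptions for illustration; check the header of the file you actually download for the exact sample name. For a genome-scale file a dedicated VCF tool will be far faster than this.

```python
def extract_sample_variants(vcf_lines, sample):
    """Yield (chrom, pos, ref, alt) for records where `sample` is non-reference.

    `vcf_lines` is any iterable of VCF text lines (e.g. an open file handle).
    """
    sample_col = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Column index of the requested sample; ValueError if it is absent.
            sample_col = fields.index(sample)
            continue
        if sample_col is None:
            raise ValueError("no #CHROM header line found before the records")
        chrom, pos, _id, ref, alt = fields[:5]
        format_keys = fields[8].split(":")
        sample_values = fields[sample_col].split(":")
        genotype = sample_values[format_keys.index("GT")]
        alt_alleles = alt.split(",")
        # Report the first non-reference allele called in this sample, if any.
        for allele in genotype.replace("|", "/").split("/"):
            if allele not in (".", "0"):
                yield chrom, int(pos), ref, alt_alleles[int(allele) - 1]
                break


# Toy two-strain VCF (made-up records, hypothetical sample names).
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t129S1_SvImJ\tBALB_cJ",
    "1\t5\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t1/1",
    "1\t9\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/0",
    "1\t12\t.\tG\tGA\t.\tPASS\t.\tGT\t1/1\t1/1",
]
balb_variants = list(extract_sample_variants(demo_vcf, "BALB_cJ"))
```

In the toy data only the records at positions 5 and 12 are kept, because the record at position 9 is homozygous reference in the BALB/c column.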


Thank you. Can you link the relevant Biostars posts that discuss these steps?

I don't know how to start substituting the variants in the reference.


Great tool. Did not know until now. Thank you.


You can use vcf-subset to extract the variants for a particular strain from the big MGP VCF file, or you can read this post: Where To Download Mouse Mm10 Dbsnp Database With Vcf Format. Once you have the VCF file, you can use FastaAlternateReferenceMaker as Goutham suggested. Be aware that FastaAlternateReferenceMaker will not create a modified GTF file with the new coordinates, which matters if you are also substituting indels into the reference genome. You can use the Personal Genome Constructor (http://alleleseq.gersteinlab.org/tools.html) from the Gerstein Lab, which will also output a modified GTF file. However, if this is your first time dealing with all this, you may want to substitute only SNPs into the reference genome. That way you can keep using the original GTF file, since substituting SNPs does not change the positions of transcripts.
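To illustrate why SNP-only substitution leaves the GTF usable, here is a toy sketch in plain Python. It is not how FastaAlternateReferenceMaker is implemented; `apply_snps` is a hypothetical helper and the sequence and variants are made up. Each SNP overwrites exactly one base, so the sequence length and every downstream coordinate stay the same, whereas an indel would shift everything after it.

```python
def apply_snps(sequence, variants):
    """Return a copy of `sequence` with single-base substitutions applied.

    `variants` is an iterable of (pos, ref, alt) with 1-based positions,
    as in a VCF. Anything that is not a one-base-for-one-base change
    (i.e. an indel) is skipped so that coordinates are preserved.
    """
    bases = list(sequence)
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            continue  # indel: would shift coordinates, so leave it out
        if bases[pos - 1].upper() != ref.upper():
            raise ValueError(
                f"REF mismatch at {pos}: expected {ref}, found {bases[pos - 1]}"
            )
        bases[pos - 1] = alt
    return "".join(bases)


# Made-up reference fragment and variants: one SNP and one insertion.
reference = "ACGTACGTACGT"
variants = [(5, "A", "G"), (12, "T", "TA")]

strain_sequence = apply_snps(reference, variants)
same_length = len(strain_sequence) == len(reference)
```

The REF check is worth keeping in any real pipeline: a mismatch usually means the VCF and the FASTA come from different assemblies or use different contig names or ordering.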


I followed your advice but got stuck on an error with GATK. While trying to solve it, I ran into another problem.

I posted the problem here: Error when trying to fix the contigs order in the reference and vcf for FastaAlternateReferenceMaker
