I have RNA-seq data from the BALB/c mouse strain.
Looking for the reference genome on Ensembl, I found that the most recent version, GRCm38, was built using the C57BL/6J strain.
I suppose that the patch files contain haplotypes and variation from other strains as well, such as BALB/c.
Is this information in the primary assembly file?
If I want to, how can I use the patch files for building the index? Do I just concatenate the files?
Or should I use the toplevel file for indexing?
Thank you.
Thank you. Can you post the relevant posts on Biostars discussing these steps?
I don't know how to start substituting the variants in the reference.
FastaAlternateReferenceMaker
Great tool. I did not know about it until now. Thank you.
You can use vcf-subset to extract the variants for a particular strain from the big VCF file from MGP, or you can read this post: Where To Download Mouse Mm10 Dbsnp Database With Vcf Format. Once you have the VCF file, you can use FastaAlternateReferenceMaker as Goutham suggested. Be aware that FastaAlternateReferenceMaker will not create a modified GTF file with new coordinates for you, which matters if you are also substituting indels in the reference genome. You can use Personal Genome Constructor (http://alleleseq.gersteinlab.org/tools.html) from the Gerstein Lab, which will also output a modified GTF file. However, if this is your first time dealing with all this, you may want to substitute only SNPs in the reference genome. That way you will be able to use the original GTF file, because substituting SNPs does not change the length of the sequence and so does not change the positions of transcripts.
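To illustrate why substituting only SNPs keeps the original GTF usable, here is a minimal Python sketch. It is not how FastaAlternateReferenceMaker works internally; the function name `apply_snps`, the toy sequence, and the variants are all hypothetical and only meant to show the idea. Positions are 1-based, as in the POS column of a VCF record.

```python
def apply_snps(sequence, snps):
    """Return a copy of `sequence` with each SNP's ALT base substituted.

    `snps` is an iterable of (pos, ref, alt) tuples with 1-based positions,
    mirroring the POS/REF/ALT columns of a VCF record.
    """
    bases = list(sequence)
    for pos, ref, alt in snps:
        if len(ref) != 1 or len(alt) != 1:
            # Skip indels: they would shift all downstream coordinates.
            continue
        index = pos - 1  # convert the 1-based position to a 0-based index
        if bases[index].upper() != ref.upper():
            raise ValueError(
                f"REF mismatch at position {pos}: "
                f"expected {ref}, found {bases[index]}"
            )
        bases[index] = alt
    return "".join(bases)


# Toy data for illustration only (not real BALB/c variants).
reference = "ACGTACGTAC"
variants = [(2, "C", "T"), (5, "A", "G"), (7, "GT", "G")]

strain_sequence = apply_snps(reference, variants)
print(strain_sequence)                         # ATGTGCGTAC
print(len(strain_sequence) == len(reference))  # True
```

The deletion at position 7 is skipped, so the output has the same length as the input and every annotation coordinate still points at the same base. A REF mismatch raises an error, which is a cheap way to notice that the VCF and the FASTA come from different assemblies.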
I followed your advice but got stuck on an error with GATK. I tried to solve it and ran into another problem.
I posted the problem here: Error when trying to fix the contigs order in the reference and vcf for FastaAlternateReferenceMaker