Hi,
I'm following the Broad Institute's best practices for variant calling using the Genome Analysis Toolkit, and am having issues with the realignment around indels. Specifically, my issue pertains to the use of the RealignerTargetCreator.
Briefly, my preprocessing so far consists of mapping the reads to the mm10.fasta file (downloaded from UCSC) using BWA-mem, realigning the aligned BAM file by coordinate using Picard's SortSam function, and dedupping with Picard's MarkDuplicates function. I'm using data from FVB/N mice, sequenced using Agilent's Mouse All Exon Kit on an Illumina HiSeq.
In order to realign around indels, the RealignerTargetCreator tool from GATK must first be used (see documentation here). In order to do this I use the following code:
java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R mm10.fa \
-I input.bam \
-known mgp.v3.indels.rsIDdbSNPv137.vcf \
-o output.list
The "known" file is downloaded from the Sanger FTP, and is aligned to GRCm38, which as far as I understand should be identical to mm10. This however leads to the following error:
##### ERROR MESSAGE: Input files /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: No overlapping contigs found.
##### ERROR /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X]
##### ERROR reference contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1_GL456210_random, chr1_GL456211_random, chr1_GL456212_random, chr1_GL456213_random, chr1_GL456221_random, chr2, chr3, chr4, chr4_GL456216_random, chr4_JH584292_random, chr4_GL456350_random, chr4_JH584293_random, chr4_JH584294_random, chr4_JH584295_random, chr5, chr5_JH584296_random, chr5_JH584297_random, chr5_JH584298_random, chr5_GL456354_random, chr5_JH584299_random, chr6, chr7, chr7_GL456219_random, chr8, chr9, chrM, chrX, chrX_GL456233_random, chrY, chrY_JH584300_random, chrY_JH584301_random, chrY_JH584302_random, chrY_JH584303_random, chrUn_GL456239, chrUn_GL456367, chrUn_GL456378, chrUn_GL456381, chrUn_GL456382, chrUn_GL456383, chrUn_GL456385, chrUn_GL456390, chrUn_GL456392, chrUn_GL456393, chrUn_GL456394, chrUn_GL456359, chrUn_GL456360, chrUn_GL456396, chrUn_GL456372, chrUn_GL456387, chrUn_GL456389, chrUn_GL456370, chrUn_GL456379, chrUn_GL456366, chrUn_GL456368, chrUn_JH584304]
So in order to fix this, I've considered two possible solutions. The GATK website recommends doing a liftover using Picard's LiftoverVCF; I don't see this as being possible, seeing as it's already meant to be using the correct reference genome. Second, I've tried using the method suggested by Ashutosh Pandey, here; as far as I can tell this however simply appends "chr" to the chromosome number, but it doesn't remedy the issue with the chrUn_xxxxx(_random) chromosomes.
So, does anyone have any recommendations as to how I can get the indel file to work with sequence data aligned to mm10?
Thanks!
Note: I've also considered using the FVB/N specific list of Indels (and SNPs) found here, which would likely be sufficient for my purposes. It is however aligned to mm9, and I have tried, unsuccessfully, to use Picard's LiftoverVcf and the "mm9ToMm10.over.chain" file, and the mm10.fa reference. While this doesn't give me an error per se, these are the results of the liftover:
INFO 2016-03-22 22:05:46 LiftoverVcf Processed 5137790 variants.
INFO 2016-03-22 22:05:46 LiftoverVcf 5137790 variants failed to liftover.
INFO 2016-03-22 22:05:46 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO 2016-03-22 22:05:46 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.
If anyone believes pursuing this approach would be better, I'd be happy to hear any suggestions as to why the Liftover may not be working.
Thanks so much for the answer! Unfortunately, when I run this code:
I get the following error:
I'm looking into the issue, but I'm unfamiliar with the 'awk' command, so I haven't figured it out yet. Any idea what the issue might be?
Hm, somehow the site's formatting makes that awk command go crazy. Here it is again, outside of the code block:
awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<id=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.indels.dbsnp142.vcf="" >="" mm10.indels.vcf<="" p="">
OK well that didn't fix it, and more importantly the content changed even though it wasnt in a code block this time. Here's the "raw" command: http://paste.ofcode.org/m9HQPEDfsUkfQzYgk5uGYQ
...as if i needed another reason to avoid awk -_-
Hmm, I'm still getting an error when I try running it:
Any other ideas? Maybe take a screenshot, and I can type it myself? Thanks again for the help!!
I don't know why it isn't working man, sorry :(
Fullscreen: http://i.imgur.com/HvxB7ji.png
Success! Don't know what the issue was, but I ran the code from paste.ofcode on my lab's server (instead of my own computer) and it worked, so it must have been something with my runtime environment. Trying the RealignerTargetCreator now and it seems to be working. Thanks a lot!
Another issue is the version of awk/mawk/gawk.
For me, gawk works, mawk does not.