I have RNA-seq BAM files on which I need to call somatic variants. The problem is that GATK is very strict about how the BAM is formatted (karyotypically sorted, no 'chr' notation, read groups present).
Because my BAM files were aligned against an Ensembl reference, I keep running into validation errors. For example, I have to change the chromosome notation in the header, which I am hesitant to do after many failures (samtools view --> sed --> reheader), and I am also stuck on this error:
"Discordant contig lengths: read MT LN=16571, ref MT LN=16569" (note that I was using GATK's Homo sapiens hg19 reference)
Does anyone have an Ensembl reference and its corresponding dbSNP file that are usable with GATK? I can access the Ensembl FTP, but I am quite lost as to which files are the right ones. Thank you very much for your help.
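A mismatch like the one in the error above can be spotted before running GATK by comparing the contig lengths in the BAM header with those in the reference index. The following is only a rough sketch: it assumes the header text comes from `samtools view -H` and the index text from the `.fai` file written by `samtools faidx`, and the function names are made up for illustration.

```python
def parse_sam_header_contigs(header_text):
    """Return {contig name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields look like "SN:MT" and "LN:16571", separated by tabs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def parse_fai_contigs(fai_text):
    """Return {contig name: length} from the text of a .fai index."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            # The first two tab-separated columns are name and length.
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def find_discordant(bam_contigs, ref_contigs):
    """List (name, bam_length, ref_length) for contigs that disagree.

    ref_length is None when the contig is missing from the reference.
    """
    problems = []
    for name, length in bam_contigs.items():
        if name not in ref_contigs:
            problems.append((name, length, None))
        elif ref_contigs[name] != length:
            problems.append((name, length, ref_contigs[name]))
    return problems


# Toy input using the lengths from the error message (other values are illustrative).
header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16571\n"
fai = "1\t249250621\t52\t60\t61\nMT\t16569\t253404903\t60\t61\n"

mismatches = find_discordant(parse_sam_header_contigs(header), parse_fai_contigs(fai))
print(mismatches)
```

Any contig reported here means the BAM was aligned against a different sequence than the reference being passed to GATK, so renaming contigs in the header alone would not fix it.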
See ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/README. You may want the "toplevel" version.
I downloaded Homo_sapiens.GRCh37.75.dna.toplevel.fa but it is lexicographically sorted.
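To illustrate the difference between the two orderings, here is a small sketch of a sort key that puts contig names in karyotypic order. It assumes Ensembl-style names without a 'chr' prefix and an order of numbered chromosomes, then X, Y, MT, then everything else; the function name is hypothetical, and the exact order you need is whatever your chosen reference's sequence dictionary uses.

```python
def karyotypic_key(name):
    """Sort key: numbered chromosomes first, then X, Y, MT, then other contigs."""
    if name.isdigit():
        # Compare numbered chromosomes as integers so "2" sorts before "10".
        return (0, int(name), "")
    special = {"X": 1, "Y": 2, "MT": 3}
    if name in special:
        return (1, special[name], "")
    # Unplaced scaffolds and patches go last, ordered by name (an assumption).
    return (2, 0, name)


names = ["1", "10", "11", "2", "GL000192.1", "MT", "X", "Y"]

lexicographic = sorted(names)
karyotypic = sorted(names, key=karyotypic_key)

print(lexicographic)
print(karyotypic)
```

This only orders the names; the FASTA records themselves would still have to be rewritten in that order and the file re-indexed.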
If you really want something that works with GATK without any extra effort, you can download the GATK resource bundle.
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
Your choice of reference(s) will be limited, though.