Hi all,
I tried several reference genomes, such as Homo_sapiens_assembly38.fasta and Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa, but I still get the error below. Do you have any suggestions? The link in the error message doesn't work. Thank you so much.
gatk BaseRecalibrator -I Library_1Aligned.out.sorted.bam -R /home/user/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz -O recal_data.table
A USER ERROR has occurred: Fasta dict file file:///oak/Homo_sapiens.GRCh38.dna_sm.primary_assembly.dict for reference file:///oak/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa does not exist.
Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.
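For what it's worth, the error itself is not about which genome build you picked: GATK looks for a sequence dictionary (.dict) next to the FASTA, and usually a .fai index as well. Below is a minimal sketch of how those two files could be created, assuming gatk and samtools are on your PATH and that the reference sits at the path given in the command above (the error message reports a different, resolved location, so adjust ref to wherever the file really lives). The variable names are only for illustration.

```shell
# Reference path taken from the command above; adjust to the real location.
ref="/home/user/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"

# GATK expects <basename>.dict (extension replaced), not <name>.fa.dict.
dict="${ref%.*}.dict"
# samtools writes the index as <name>.fa.fai (extension appended).
fai="${ref}.fai"

# Create the sequence dictionary only if it is missing and gatk is installed.
if [ ! -e "$dict" ] && command -v gatk >/dev/null 2>&1; then
    gatk CreateSequenceDictionary -R "$ref" -O "$dict"
fi

# Create the FASTA index only if it is missing and samtools is installed.
if [ ! -e "$fai" ] && command -v samtools >/dev/null 2>&1; then
    samtools faidx "$ref"
fi

echo "dict: $dict"
echo "fai:  $fai"
```

Once both files exist next to the FASTA, the same BaseRecalibrator command should get past this particular error.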
In my book I am writing a chapter on human genome variation calling by reproducing a published paper with different methods.
I found that bcftools, with almost default settings, outperforms GATK's lengthy and tedious "best practices". Moreover, bcftools runs in a fraction of the time and needs a small fraction of the resources.
In my opinion, the so-called "GATK best practices" (marking duplicates, base recalibration, etc.) are a bit outdated. They get cited and referred to a lot because they were the default approach at the Broad Institute, but the method is complicated, has many moving parts and, as you note, is obtuse and tedious to run.
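To give a sense of what "almost default settings" can look like, here is a rough sketch of a bcftools calling run. The file names (reference.fa, sample.sorted.bam, calls.vcf.gz) are hypothetical placeholders, and this is not necessarily the exact command used in the book chapter.

```shell
ref="reference.fa"          # hypothetical reference FASTA
bam="sample.sorted.bam"     # hypothetical coordinate-sorted BAM
out="calls.vcf.gz"          # hypothetical output name

if command -v bcftools >/dev/null 2>&1 && [ -e "$ref" ] && [ -e "$bam" ]; then
    # Pile up reads against the reference, then call variants with the
    # multiallelic caller (-m), keeping variant sites only (-v),
    # and write bgzipped VCF (-Oz).
    bcftools mpileup -f "$ref" "$bam" | bcftools call -mv -Oz -o "$out"
    # Index the compressed VCF for downstream tools.
    bcftools index "$out"
else
    echo "skipping: bcftools, $ref, or $bam not found"
fi
```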
If accuracy is of utmost importance and you have the computational resources, then run Google's DeepVariant; it is substantially better than GATK, and simpler to run as well.
Thank you for the suggestion! I tried nf-core/sarek, but I hit an error that the dev team is still working on, so I am looking for tools that get the job done in the meantime.
Ah yes, the most important advice is to stay away from Nextflow and the like.
These workflow platforms were never designed to teach you how to run anything. As you yourself experienced, all you end up doing is endlessly chasing nonexistent documentation and fighting the platform instead of learning bioinformatics.
Workflow management platforms are to be used only once you know a bioinformatics process so well you are bored and annoyed that you must retype commands.
To learn a bioinformatics tool, look at the tool's own documentation; DeepVariant, for one, is exquisitely well documented.
You said you used STAR. Are you working on RNAseq or DNAseq?