What is the best source for obtaining the hg19 exome reference and its index? This reference will be used with BWA. Do I need to build the index if I’m using a specific version of BWA? Thank you!
You should always align DNA-seq data to the entire genome. For hg19, download the hg19.2bit file from here: http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips/
Then, convert it to FASTA format with twoBitToFa: http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips/
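The download and conversion steps above could be sketched roughly as follows. This assumes wget and twoBitToFa are already installed and on your PATH; the function name fetch_hg19_fasta is just illustrative, and the URL is the directory linked above with the file name appended.

```shell
# Sketch only: assumes wget and twoBitToFa are on PATH.
# fetch_hg19_fasta is a hypothetical helper name.
fetch_hg19_fasta() {
    base_url="http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips"

    # Download the whole-genome 2bit file
    wget "$base_url/hg19.2bit"

    # Convert 2bit to FASTA (input file first, output file second)
    twoBitToFa hg19.2bit hg19.fa
}

# Usage: fetch_hg19_fasta
```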
Be aware, also, that GRCh38 / hg38 is the latest release of the human genome reference. hg19 has 'issues' (as does hg38...); see: A: Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy.
Request for clarification.
I was able to successfully run ./twoBitToFa hg19.2bit hg19.fa and confirmed that hg19.fa was generated. I need both an index and the actual reference sequence. What is the best way to proceed so that both the reference sequence and its associated index are generated?
Thank you.
Hello. To index the FASTA genome reference with bwa, use the bwa index command. It will produce a few different files, none of which you will have to reference directly again, provided they are kept in the same directory as your FASTA reference file.
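As a minimal sketch of the indexing step, assuming bwa is on your PATH and the FASTA is the hg19.fa produced earlier in this thread (index_reference is a hypothetical helper name):

```shell
# Sketch only: assumes bwa is installed and on PATH.
# index_reference is a hypothetical helper name.
index_reference() {
    ref="$1"

    # -a bwtsw selects the index construction algorithm meant for
    # large genomes such as human
    bwa index -a bwtsw "$ref"
}

# Usage: index_reference hg19.fa
# The index is written next to the FASTA as hg19.fa.amb, hg19.fa.ann,
# hg19.fa.bwt, hg19.fa.pac and hg19.fa.sa
```

Indexing a whole human genome takes a while and only needs to be done once per reference.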
Then, I would use bwa mem for the alignment if your reads are >70bp in length. For shorter reads, you should use one of the previous bwa algorithms (like we used to do...) or something like bowtie, which is more tailored to shorter reads.
Prior to alignment, you may consider performing some QC of your reads and 'trimming' them, in order to eliminate junk that would not have aligned anyway or that could result in false variant calls further down the line due to low-quality bases. For a full idea of a pipeline involving trimming, alignment (bwa), generation of QC metrics, and then variant calling (mostly using tools from the Wellcome Trust Sanger Institute in the UK rather than the Broad Institute), take a look at my GitHub pipeline: https://github.com/kevinblighe/ClinicalGradeDNAseq (in particular, you may look at AnalysisMasterVersion1.sh for the code).
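As a rough sketch of the bwa mem alignment step mentioned above, assuming paired-end reads and an already indexed reference: the function name align_reads, the FASTQ file names, and the thread count of 4 are all placeholders, not taken from the pipeline linked above.

```shell
# Sketch only: assumes bwa is on PATH and the reference has been indexed.
# align_reads and all file names are hypothetical.
align_reads() {
    ref="$1"   # FASTA path; the index files must sit alongside it
    r1="$2"    # forward reads (FASTQ)
    r2="$3"    # reverse reads (FASTQ)
    out="$4"   # output alignments in SAM format

    # -t sets the number of threads; bwa mem writes SAM to stdout
    bwa mem -t 4 "$ref" "$r1" "$r2" > "$out"
}

# Usage: align_reads hg19.fa sample_R1.fastq.gz sample_R2.fastq.gz sample.sam
```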
Kevin