I'm learning the basics of preprocessing, and I can't find a source anywhere that explains the difference between preprocessing a genome and preprocessing an exome for variant calling (VC). Do I use a reference genome? If so, are there any extra steps to implement?
I was told somewhere that I might need a padded BED file (or some BED file) containing genomic coordinates for the exonic regions if I use a reference genome.
Extrapolating from the genome preprocessing pipeline, I know I have to:
- Obtain a ref
- bwa index the ref
- FastQC the samples
- Align the samples to the ref with bwa mem (maybe I add the mythical BED in this command?)
- Obtain mapstats
- Convert SAM to BAM
- Sort exome BAM
- MarkDup with Picard
- Create .dict for ref and knownsites in order to recalibrate and apply BQSR
- Recalibrate and apply BQSR
Am I missing any step for exome VC (before HaplotypeCaller, of course)? Any feedback will be highly appreciated!
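For reference, the steps listed above could be sketched as an ordered list of shell commands. This is only a rough sketch, not a validated pipeline: the function name, file names, and read-group values are hypothetical placeholders, `samtools faidx` and the `-R` read group are additions that GATK tools generally need, and passing the BED to BaseRecalibrator with `-L` is one common choice for exomes rather than something stated in this thread. The known-sites VCF is assumed to be indexed already.

```python
def build_exome_preprocessing_commands(ref, r1, r2, sample, known_sites, targets_bed):
    """Return shell command strings, in order, for one paired-end exome sample.

    Nothing is executed here; the strings are meant to be reviewed and then
    run in a shell or a workflow manager.
    """
    # GATK tools expect a read group; these values are placeholders.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"

    return [
        # Reference preparation: aligner index, FASTA index, sequence dictionary.
        f"bwa index {ref}",
        f"samtools faidx {ref}",
        f"gatk CreateSequenceDictionary -R {ref}",
        # Read-level QC.
        f"fastqc {r1} {r2}",
        # Alignment against the whole reference; no BED file is involved here.
        f"bwa mem -R '{read_group}' {ref} {r1} {r2} > {sam}",
        # Sorting also converts SAM to BAM (format inferred from the .bam extension).
        f"samtools sort -o {sorted_bam} {sam}",
        f"samtools flagstat {sorted_bam} > {sample}.flagstat.txt",
        f"gatk MarkDuplicates -I {sorted_bam} -O {dedup_bam} -M {sample}.dup_metrics.txt",
        # Exome-specific part: restrict recalibration to the target intervals.
        f"gatk BaseRecalibrator -R {ref} -I {dedup_bam} "
        f"--known-sites {known_sites} -L {targets_bed} -O {recal_table}",
        f"gatk ApplyBQSR -R {ref} -I {dedup_bam} "
        f"--bqsr-recal-file {recal_table} -O {recal_bam}",
    ]
```

Printing the returned list, one command per line, gives a script skeleton that can be compared against what WARP or Sarek actually run.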
Align to the full reference genome so that reads are not forced to align to the wrong place. Maybe consider using or studying the WARP or Sarek variant calling pipelines.
Isn't there a risk of alignments to pseudogenes in non-transcribed regions? So your advice is not to use the BED file during alignment?
You should use it, but the BED file has nothing to do with the alignment itself. It belongs to a downstream step, where you can use it for QC or for masking calls outside your regions of interest.
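To make "masking calls outside your regions of interest" concrete, here is a minimal Python sketch that keeps only VCF records falling inside (optionally padded) BED targets. The function names are made up for illustration, and in practice you would more likely hand the BED to the caller itself (for example with GATK's `-L` argument) than filter by hand. The one detail worth noticing is the coordinate shift: BED is 0-based and half-open, VCF positions are 1-based.

```python
from bisect import bisect_right


def load_bed(lines, padding=0):
    """Parse BED lines into {chrom: sorted, merged [(start, end), ...]}.

    Coordinates stay 0-based, half-open. `padding` extends each interval
    on both sides before overlapping intervals are merged.
    """
    raw = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw.setdefault(chrom, []).append(
            (max(0, int(start) - padding), int(end) + padding)
        )
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlaps or touches the previous interval
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_targets(targets, chrom, pos):
    """Return True if the 1-based position `pos` lies inside a target."""
    intervals = targets.get(chrom, [])
    p = pos - 1  # convert the 1-based VCF position to 0-based
    i = bisect_right(intervals, (p, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= p < intervals[i][1]


def mask_vcf(vcf_lines, targets):
    """Yield header lines unchanged and only those records inside the targets."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_targets(targets, chrom, int(pos)):
            yield line
```

Calling `load_bed(open("targets.bed"), padding=100)` would give the "padded BED" behaviour mentioned in the question; the file name and the padding value are just examples.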
Maybe. But if a read maps to multiple locations equally well, a typical aligner will assign it to one of them at random. You will still have some coverage even if there are pseudogenes that manage to attract alignments.
Thank you! Could I ask you to specify where in my pipeline I apply this downstream step for QC or masking calls?
I would study the WARP GitHub repo, where they use calling_interval_list, evaluation_interval_list, target_interval_list, and bait_interval_list.
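Those inputs are Picard-style interval lists rather than BED files. As a rough illustration of how the two relate, the sketch below converts BED lines into interval_list rows: a SAM-style header taken from the reference .dict, followed by 1-based, end-inclusive rows of sequence, start, end, strand, and name. The function name and the fallback naming scheme for unnamed targets are assumptions of this sketch; Picard ships a BedToIntervalList tool that is the safer choice for real data.

```python
def bed_to_interval_list(bed_lines, dict_header_lines):
    """Convert BED lines to interval_list lines.

    BED is 0-based and half-open; interval_list is 1-based and closed,
    so only the start coordinate shifts by one.
    """
    # The header is copied from the reference sequence dictionary (.dict).
    out = [h.rstrip("\n") for h in dict_header_lines if h.startswith("@")]
    n = 0
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        n += 1
        # Hypothetical fallback name when the BED has no name column.
        name = fields[3] if len(fields) > 3 else f"target_{n}"
        strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "+"
        out.append(f"{chrom}\t{start + 1}\t{end}\t{strand}\t{name}")
    return out
```

For example, the BED line `chr1 100 200 exon1` becomes the interval_list row `chr1 101 200 + exon1` (tab-separated in both cases).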
you the reference genome. because: Exome Sequencing: Masking The Non-Genic Sequences?; Why Shouldn't I Use Masking When Doing A Reference Alignment?
Your answer got cut off. I think your info is important; do you mind completing the previous part of your response? Thanks a lot!