Exome alignment and preprocessing: When I perform an exome alignment, should I use a Ref Genome or Ref Exome .fasta?
17 months ago
javiflaja ▴ 50

I'm learning the basics of preprocessing, and I can't find a source anywhere that explains the difference between preprocessing a genome and preprocessing an exome for variant calling (VC). Do I use a reference genome? If so, are there any extra steps to implement?

I was told somewhere that, if I use a reference genome, I might need a padded BED file (or some BED file) containing the genomic coordinates of the exonic regions.

Extrapolating from the genome preprocessing pipeline, I know I have to:

  1. Obtain a ref
  2. bwa index the ref
  3. FastQC the samples
  4. Align the samples to the ref with bwa mem (maybe I add the mythical BED in this command?)
  5. Obtain mapstats
  6. Convert SAM to BAM
  7. Sort exome BAM
  8. MarkDup with Picard
  9. Create .dict for ref and knownsites in order to recalibrate and apply BQSR
  10. Recalibrate and apply BQSR

Am I missing any step for exome VC (before HaplotypeCaller, of course)? Any feedback will be highly appreciated!

vcf pipeline exome-alignment bed • 2.1k views

Align to the whole genome so that reads originating elsewhere are not forced onto exonic sequence. Maybe consider using or studying the WARP or Sarek variant calling pipelines.


Isn't there a risk of alignments to pseudogenes in non-transcribed regions? So your advice is not to use the BED file during alignment?


"So your advice is not to use the BED file during alignment?"

You should use it, but the BED file has nothing to do with the alignment itself. It belongs to the downstream steps, where you can use it for QC or for masking calls outside your regions of interest.
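To make the split concrete, here is a minimal sketch that only builds the command lines (it does not run them). The file names, the sample name, and the padding value are hypothetical; `-L` and `--interval-padding` are the usual GATK interval arguments, and applying them at the HaplotypeCaller step is one common choice rather than something this thread prescribes.

```python
from typing import List, Optional


def alignment_command(ref: str, fq1: str, fq2: str, sample: str) -> List[str]:
    """Build a bwa mem call: reference and reads only, no BED file."""
    # bwa expects the literal two characters "\t" inside the read group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return ["bwa", "mem", "-R", read_group, ref, fq1, fq2]


def haplotypecaller_command(
    ref: str,
    bam: str,
    out_vcf: str,
    bed: Optional[str] = None,
    padding: int = 0,
) -> List[str]:
    """Build a GATK HaplotypeCaller call, restricted to the BED if one is given."""
    cmd = ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", out_vcf]
    if bed is not None:
        cmd += ["-L", bed]
        if padding > 0:
            # Padding extends each target by this many bases on both sides.
            cmd += ["--interval-padding", str(padding)]
    return cmd


# Hypothetical file names, for illustration only.
align = alignment_command("ref.fa", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1")
call = haplotypecaller_command(
    "ref.fa", "s1.bqsr.bam", "s1.vcf.gz", bed="targets.bed", padding=100
)
print(" ".join(align))
print(" ".join(call))
```

The alignment command never sees the BED; only the calling step does.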

"Isn't there a risk of alignments to pseudogenes in non-transcribed regions?"

Maybe. But if a read maps to multiple locations equally well, a typical aligner will assign it to one of them at random. You will still have some coverage even if there are pseudogenes that manage to attract alignments.
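One way to gauge how much this matters in your own data is to count ambiguously placed reads. The sketch below assumes the aligner reports equally good placements with mapping quality 0, as bwa mem commonly does; the function name and the example records are made up for illustration.

```python
from typing import Iterable, Tuple


def count_ambiguous(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (mapped_reads, mapq_zero_reads) from SAM-formatted text lines."""
    mapped = 0
    mapq_zero = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        if int(fields[4]) == 0:  # column 5 is MAPQ
            mapq_zero += 1
    return mapped, mapq_zero


# Three made-up records: one unique hit, one MAPQ 0 hit, one unmapped read.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr1\t300\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(count_ambiguous(example))
```

In practice you would feed this the text output of something like `samtools view`, and a high MAPQ 0 fraction inside a gene of interest is a hint that a pseudogene or paralog is competing for its reads.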


Thank you! Could I ask you to specify where in my pipeline I apply this downstream step for QC or masking calls?


I would study the WARP GitHub repo, where they use calling_interval_list, evaluation_interval_list, target_interval_list, and bait_interval_list.


Your answer got cut off. I think your info is important; would you mind completing the previous part of your response? Thanks a lot!

17 months ago
amy__ ▴ 190

If you have WES data, you will need the BED file that the sequencing company used to specify the regions to be sequenced. For WGS you won't have this BED file. The BED file comes in handy at the variant calling stage and for determining coverage over those regions. These BED files are usually available online, but you need to make sure you get the same one that was used for sequencing.
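As an illustration of using the BED at the variant stage, here is a rough sketch that pads and merges BED intervals and then tests whether a VCF position falls inside them. It assumes standard BED coordinates (0-based, half-open) and 1-based VCF positions; the function names and the example intervals are hypothetical, and the padding is not clipped at chromosome ends.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Intervals = Dict[str, List[Tuple[int, int]]]


def load_bed(lines: Iterable[str], padding: int = 0) -> Intervals:
    """Parse BED lines into sorted, merged intervals per chromosome."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((max(0, int(start) - padding), int(end) + padding))
    merged: Intervals = {}
    for chrom, ivs in raw.items():
        ivs.sort()
        out = [ivs[0]]
        for start, end in ivs[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:  # overlapping or adjacent: extend
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_targets(targets: Intervals, chrom: str, pos: int) -> bool:
    """Check a 1-based VCF POS against 0-based, half-open BED intervals."""
    ivs = targets.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= the position.
    i = bisect_right(ivs, (zero_based, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= zero_based < ivs[i][1]


bed = ["chr1\t100\t200", "chr1\t190\t250", "chr2\t10\t20"]
targets = load_bed(bed)
print(targets["chr1"])                    # merged into a single interval
print(in_targets(targets, "chr1", 101))   # first base of the target
print(in_targets(targets, "chr1", 100))   # one base before the target
```

Dedicated tools (for example bedtools or the variant caller's own interval options) do the same job at scale; the sketch is only meant to show the coordinate logic.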


So I'm not going to supply the BED in the alignment command? Would you mind specifying which step requires this BED: pileup (coverage) or HaplotypeCaller? Thanks for the help!


Hey, I've not used the GATK Best Practices before, but in my experience the BED file has come in when I used Qualimap to work out coverage over those regions and when running my variant callers. I used DeepVariant, which takes the BED file as an input too.
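For a feel of what "coverage over those regions" means, here is a small sketch that computes mean depth and the fraction of target bases at or above a threshold. It is not how Qualimap works internally; it assumes per-base depth lines in the three-column form that `samtools depth` prints (chromosome, 1-based position, depth), non-overlapping BED intervals, and an arbitrary threshold of 20. Positions missing from the depth input count as depth 0.

```python
from typing import Dict, Iterable, List, Tuple


def target_coverage(
    bed_lines: Iterable[str],
    depth_lines: Iterable[str],
    min_depth: int = 20,
) -> Tuple[float, float]:
    """Return (mean depth, fraction of target bases with depth >= min_depth)."""
    targets: Dict[str, List[Tuple[int, int]]] = {}
    total_bases = 0
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        targets.setdefault(chrom, []).append((int(start), int(end)))
        total_bases += int(end) - int(start)
    if total_bases == 0:
        return 0.0, 0.0

    depth_sum = 0
    bases_ok = 0
    for line in depth_lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.split()[:3]
        zero_based = int(pos) - 1  # depth positions are 1-based
        # Linear scan over targets: fine for a sketch, slow for a real exome.
        if any(s <= zero_based < e for s, e in targets.get(chrom, [])):
            depth_sum += int(depth)
            if int(depth) >= min_depth:
                bases_ok += 1
    return depth_sum / total_bases, bases_ok / total_bases


bed = ["chr1\t0\t4"]  # covers 1-based positions 1 to 4
depth = ["chr1\t1\t10", "chr1\t2\t30", "chr1\t3\t20", "chr1\t5\t50"]
print(target_coverage(bed, depth))
```

Position 4 is absent from the depth lines and so contributes depth 0, and position 5 lies outside the target and is ignored.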
