Hello. I'm building my first human exome variant call pipeline, and I'm learning the basics.
I encountered this issue for the first time when trying to obtain a per-base coverage. I realised that my reference genome (HG38) contains alternate haplotypes, random regs, onto which my bam aligned (bad mapping quality, more SNP's compared with the same gene in the canonical chr).
Naturally, since BWA aligned my reads onto the canonical seq and all the "_alt"/"_fix"/"_random" the MAPQ scores for those reads are low or 0.
My samples were obtained from Agilent SureSelectXT Human All Exon V6. The Agilent .bed files do not contain targeted regions for those seqs. Wouldn't it be detrimental to my study if I have reads aligned to coding regions in these seqs, which may contain variants? Or should I use my judgement and circunscribe myself to the canonical coverage/variant call? Does the MAPQ score influence HaplotypeCaller in any way? Thanks all!
Careful here. There is two different things:
The _alt contigs are alternative haplotypes, so for the same genomic locus a different sequence. These should be excluded from the genome prior to index building unless you have a alt-aware aligner, which is not standard these days. Having these in will result in multimappers since for a given "canonical" region there will be an ALT with different but largely similar sequence. Then there is this
_chrU/_random_
sort of contigs which are random / unplaced contigs. These should be included. One simply does not know where exactly they belong into the genome.Thank you so much for your feedback ATpoint ! Do you know of a script or tool to remove only the _alt seqs from the ref genome .fasta? Additionally, the _chrU/_random_ contigs are not accounted for in the bed files provided by Agilent. How does a researcher get data analysis for those portions? Should I manually edit the padded bed file adding the intervals? It would be hard as I don't have the chr start or end positions for the bed.
Depending on where you got your genome sequence from you should be able to get a file that does not have the ALT sequence. From Ensembl it would be the file called "primary" sequence: http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
From NCBI it would be the "no_alt" file: https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
These are GRCh38 builds. If the Exome kit you used is based on GRCh37 you should get corresponding files instead.
GenoMax So it would be impossible to analyse my set of samples with an updated consortium refgenome and with updated variant Known_Sites, given that the kit was based on a previous assembly? that's a bummer! Is there a way to do it with updated data?
Thank you for your response!
You should be able to use GRCh38. Did you get the correct BED file for Agilent v6? There are files available for both GRCh38 and hg19 (GRCh37) on Agilent site.
I did, mate! And I used Agilent's main portal, however I couldnt find a way to tell which is which from the SureSelect XT Human All Exon V6+UTR. Does this kit contain intervals in its .beds accounting for the random seqs?
THANKS FOR THE FEEDBACK!
If you look at the "targets" file it only contains genes/Ensembl/NCBI transcripts. Only "main" chromosomes are in the "covered" BED file. So you will need to realign the data.