I have HLA typed whole genome sequencing data using HLA-VBSeq. Reads were first aligned to hg19 with BWA-MEM, then reads aligning to all HLA loci and unmapped reads were extracted from the bam file using samtools. The extracted reads were then combined and re-aligned to the collection of all the genomic HLA allele sequences in the IMGT/HLA database using BWA-MEM, allowing for multiple alignments to the reference sequences.
Now, I would like to subset the reads of a specific HLA-E gene allele, however, using samtools view input.bam HLA:HLA00936
, I see that all of the sequences have a MAPQ score of 0. I know that this is specific to BWA and indicates that the reads are mapping to multiple locations. Does this mean that the alignment results for this allele are insignificant and should be disregarded? The researcher interested in this allele would like to know the HLA-E sequences for gene editing purposes, but I feel as though the coverage is far too low. Any input or advice is much appreciated.
Thanks for your input. The estimated HLA type from the software shows my sample is heterozygous E01:03:01:01/E01:01:01:01. I subset the reads matching E*01:03:01:01 by
samtools view -h input.bam HLA:HLA00936 > HLA00936.bam
and then used MUSCLE to align those sequences against the nuclear coding sequences of HLA-E alleles, as you suggested: http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=muscle-I20170309-184416-0547-98835251-pg&analysis=alignments.I also aligned the sequences from the bam against E*01:03:01:01 fasta: http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=muscle-I20170309-185005-0383-51440835-pg&analysis=alignments. Do you have any input as to how to retrieve the unique sequences? I'm not exactly sure how the algorithm is making the allele calls, but the developer has stated that the read alignment between similar HLA alleles needs to be optimized.
That's not what I suggested. I suggested that you take the two HLA alleles and see how similar they are to each other with MUSCLE. Using E01:03:01:01 and E01:01:01:01 shows that they are both 1077 nucleotides long and that the only sequence difference between them is one SNP. Therefore, I'd expect almost all genome sequencing reads (assuming you have short reads such as 100 bp) mapping to HLA-E would be classified as multi-mapping because 1076 nucleotides are the same for both varieties. That explains the reads with MAPQ of 0. You can simply count the number of reads overlapping the A SNP (in HLA00934) and the G SNP (in HLA00936) to get your HLA-E allele frequencies, since that single nucleotide difference is specifically what defines them.
I misunderstood - my mistake. That makes sense. Thank you.