Question

HLA typing WGS data

1

Entering edit mode

8.3 years ago

emblake ▴ 90

I have HLA typed whole genome sequencing data using HLA-VBSeq. Reads were first aligned to hg19 with BWA-MEM, then reads aligning to all HLA loci and unmapped reads were extracted from the bam file using samtools. The extracted reads were then combined and re-aligned to the collection of all the genomic HLA allele sequences in the IMGT/HLA database using BWA-MEM, allowing for multiple alignments to the reference sequences.

Now, I would like to subset the reads of a specific HLA-E gene allele, however, using samtools view input.bam HLA:HLA00936, I see that all of the sequences have a MAPQ score of 0. I know that this is specific to BWA and indicates that the reads are mapping to multiple locations. Does this mean that the alignment results for this allele are insignificant and should be disregarded? The researcher interested in this allele would like to know the HLA-E sequences for gene editing purposes, but I feel as though the coverage is far too low. Any input or advice is much appreciated.

bwa hla wgs samtools • 4.4k views

ADD COMMENT • link updated 8.3 years ago by dario.garvan ▴ 520 • written 8.3 years ago by emblake ▴ 90

score 2 · Accepted Answer · 2017-03-09

2

Entering edit mode

8.3 years ago

dario.garvan ▴ 520

If you have multiple inferred HLA-E alleles, then it is expected to have most reads mapping to multiple sequences, because HLA-E is one of the least polymorphic HLA genes. Many sequence segments will be identical between the alleles. You can get the nucleotide sequences of the HLA-E alleles as a FASTA file and see how similar they are with the multiple sequence alignment tool MUSCLE (choose HTML output format). You must have some reads mapping uniquely to the allele you are investigating because you inferred that the allele exists from the data.

ADD COMMENT • link 8.3 years ago by dario.garvan ▴ 520

0

Entering edit mode

Thanks for your input. The estimated HLA type from the software shows my sample is heterozygous E01:03:01:01/E01:01:01:01. I subset the reads matching E*01:03:01:01 by samtools view -h input.bam HLA:HLA00936 > HLA00936.bam and then used MUSCLE to align those sequences against the nuclear coding sequences of HLA-E alleles, as you suggested: http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=muscle-I20170309-184416-0547-98835251-pg&analysis=alignments.

I also aligned the sequences from the bam against E*01:03:01:01 fasta: http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=muscle-I20170309-185005-0383-51440835-pg&analysis=alignments. Do you have any input as to how to retrieve the unique sequences? I'm not exactly sure how the algorithm is making the allele calls, but the developer has stated that the read alignment between similar HLA alleles needs to be optimized.

ADD REPLY • link 8.3 years ago by emblake ▴ 90

0

Entering edit mode

That's not what I suggested. I suggested that you take the two HLA alleles and see how similar they are to each other with MUSCLE. Using E01:03:01:01 and E01:01:01:01 shows that they are both 1077 nucleotides long and that the only sequence difference between them is one SNP. Therefore, I'd expect almost all genome sequencing reads (assuming you have short reads such as 100 bp) mapping to HLA-E would be classified as multi-mapping because 1076 nucleotides are the same for both varieties. That explains the reads with MAPQ of 0. You can simply count the number of reads overlapping the A SNP (in HLA00934) and the G SNP (in HLA00936) to get your HLA-E allele frequencies, since that single nucleotide difference is specifically what defines them.