Question

HLA genotyping of whole genome sequencing data


9 months ago

Biomed-jeh ▴ 70

Hello,

I'm having trouble getting started with an analysis whose goal is to create an immune-compatible stem cell line by knocking out selected HLA genes. The first task is to remap the whole genome sequencing reads to identify the HLA sequences.

Initially, I attempted to map the whole genome sequencing data (paired-end, approximately 45 GB per fastq.gz file) to the human chromosome 6 reference sequence. However, I quickly realized that this approach was not fruitful.

Subsequently, I downloaded all known HLA sequences from the following database: https://www.ebi.ac.uk/ipd/imgt/hla/. However, with approximately 29,000 unique HLA sequences available, it became evident that managing this volume of data would be challenging without a tool for visualization.
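One way to get an overview of a sequence set this large without a viewer is to count the records per HLA locus. The sketch below is only an illustration: it assumes FASTA headers of the form `>HLA:HLA00001 A*01:01:01:01 1098 bp` (accession, allele name, length), and the example records and the function name `count_alleles_per_locus` are made up for demonstration.

```python
from collections import Counter


def count_alleles_per_locus(fasta_lines):
    """Count FASTA records per HLA locus, e.g. 'A' for allele 'A*01:01:01:01'."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        # Assumed header layout: accession, allele name, length, "bp".
        if len(fields) < 2 or "*" not in fields[1]:
            continue  # header does not match the assumed layout, skip it
        locus = fields[1].split("*", 1)[0]
        counts[locus] += 1
    return counts


# Toy records for illustration only (sequences are placeholders).
example = [
    ">HLA:HLA00001 A*01:01:01:01 1098 bp",
    "ATGGCC",
    ">HLA:HLA00132 B*07:02:01:01 1089 bp",
    "ATGCTG",
    ">HLA:HLA00002 A*01:01:02 546 bp",
    "GCTCCC",
]
print(count_alleles_per_locus(example))
```

For a real download, the same function can be given an open file handle instead of the toy list, since it only iterates over lines.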

Currently, I find myself at an impasse. I experimented with an approach mentioned in a paper (https://www.sciencedirect.com/science/article/pii/S1934590919300475?via%3Dihub), which unfortunately relies on Python 2, and I could not make it work on the HPC that I have access to.

Consequently, I am reaching out for assistance. Does anyone know of a tool supported by Bioconductor that could aid in identifying HLA sequences within my whole genome sequencing data?

Thank you for any comments, recommendations, and/or solutions.

HLA genotyping WGS • 714 views



Have you already looked at arcasHLA (https://github.com/RabadanLab/arcasHLA)?

9 months ago by DBScan ▴ 450


Here is a benchmarking study for HLA typing tools using different data types.

9 months ago by dthorbur ★ 2.5k

score 0 · Answer 1 · 2024-02-28

Hi dthorbur and DBScan

Thank you very much for your replies. I took a look at the benchmarking study and found that arcasHLA, linked by DBScan, ranks quite well. I have been working with it and ran into multiple issues, which I would like to share in case someone reads this post in the future.

I usually align sequences to a reference genome using HISAT2 or STAR (depending on whether it is bulk or single-cell sequencing data), but for whole genome sequencing those tools do not work (or maybe I am not setting some parameters correctly). I am currently trying kallisto, as I saw it mentioned in the arcasHLA manual.
Building the kallisto index requires a lot of memory (I had to request 64 GB to index the hg38 reference genome), and kallisto also requires a processor that supports AVX instructions.
I use Anaconda to create environments. Make sure you install kallisto version > 0.50, because for some reason kallisto v. 0.44 is installed by default. The same goes for Python: make sure you install a version > 3.6, otherwise numpy will cause issues.
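The three requirements above (kallisto version, Python version, AVX support) can be checked before submitting a job. The following is a minimal sketch, not part of arcasHLA or kallisto: the function names are hypothetical, the thresholds are read from this post as "at least 0.50" and "at least 3.6" (an assumption), and the AVX check assumes a Linux node where the text comes from `/proc/cpuinfo`.

```python
import re
import sys


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_avx(cpuinfo_text):
    """True if a 'flags' line of Linux /proc/cpuinfo text lists the avx flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            if "avx" in line.split(":", 1)[1].split():
                return True
    return False


def check_environment(kallisto_version_text, cpuinfo_text,
                      min_kallisto=(0, 50), min_python=(3, 6)):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    # Thresholds are assumptions taken from the forum post, not official limits.
    if sys.version_info[:2] < min_python:
        problems.append("Python is older than the assumed minimum")
    if parse_version(kallisto_version_text)[:2] < min_kallisto:
        problems.append("kallisto is older than the assumed minimum")
    if not has_avx(cpuinfo_text):
        problems.append("CPU does not report AVX support")
    return problems


# Example with made-up inputs: an old kallisto on an AVX-capable CPU.
print(check_environment("kallisto, version 0.44.0",
                        "flags\t\t: fpu sse sse2 avx avx2"))
```

On a cluster node, the first argument would be whatever the installed kallisto prints when asked for its version, and the second the contents of `/proc/cpuinfo`. Both are passed in as plain text so the checks can be tried without kallisto installed.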

If you have any recommendations for HLA aligner tools, please feel free to share :)