Question

reading scRNAseq .h5 file into anndata format from GEO - No clear matrix?

10 months ago

charlieclark1ee ▴ 20

Hi,

I'm trying to parse .h5 files containing scRNAseq data from this GEO entry https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148073

However, I'm not sure how I can access the count matrix, sparse or dense.

Listing the top-level keys with

with h5py.File(testpath, 'r') as f: print(list(f.keys()))

I can only find these datasets in the .h5 files: ['barcode', 'barcode_corrected_reads', 'conf_mapped_uniq_read_pos', 'gem_group', 'gene', 'gene_ids', 'gene_names', 'genome', 'genome_ids', 'metrics', 'nonconf_mapped_reads', 'reads', 'umi', 'umi_corrected_reads', 'unmapped_reads']

But it's not immediately obvious to me by their shapes how I can extract any kind of read matrix from this data. I'm expecting about 3000 cells per sample (and each sample is one .h5 file).

Shapes:

<HDF5 dataset "barcode": shape (33395910,), type "<u8">
<HDF5 dataset "barcode_corrected_reads": shape (33395910,), type "<u4">
<HDF5 dataset "conf_mapped_uniq_read_pos": shape (33395910,), type "<u4">
<HDF5 dataset "gem_group": shape (33395910,), type "<u2">
<HDF5 dataset "gene": shape (33395910,), type "<u4">
<HDF5 dataset "gene_ids": shape (33694,), type "|S15">
<HDF5 dataset "gene_names": shape (33694,), type "|S19">
<HDF5 dataset "genome": shape (33395910,), type "|u1">
<HDF5 dataset "genome_ids": shape (1,), type "|S6">
<HDF5 group "/metrics" (0 members)>
<HDF5 dataset "nonconf_mapped_reads": shape (33395910,), type "<u4">
<HDF5 dataset "reads": shape (33395910,), type "<u4">
<HDF5 dataset "umi": shape (33395910,), type "<u4">
<HDF5 dataset "umi_corrected_reads": shape (33395910,), type "<u4">
<HDF5 dataset "unmapped_reads": shape (33395910,), type "<u4">

Any help would be really appreciated. Thank you!

ADD COMMENT • link updated 10 months ago by bk11 ★ 3.0k • written 10 months ago by charlieclark1ee ▴ 20

GenoMax · Answer 1 · 2024-02-12

10 months ago

Radu Tanasa ▴ 140

Have you tried using scanpy?

import scanpy as sc
adata = sc.read_10x_h5("path")

Radu Tanasa

Thank you for the speedy reply!

I did, but it throws an error. According to the docs, this function reads 10x-Genomics-formatted HDF5 files. I'm not sure the .h5 files I have here are laid out the way most 10x HDF5 files are: they were definitely created with Cell Ranger, but they don't contain a 'matrix' path as specified in https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-h5-matrices

adata = sc.read_10x_h5(testpath, genome='genome')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/scanpy/readwrite.py", line 195, in read_10x_h5
    adata = _read_legacy_10x_h5(filename, genome=genome, start=start)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/scanpy/readwrite.py", line 221, in _read_legacy_10x_h5
    _collect_datasets(dsets, f[genome])
  File "/opt/homebrew/lib/python3.11/site-packages/scanpy/readwrite.py", line 253, in _collect_datasets
    for k, v in group.items():
                ^^^^^^^^^^^
AttributeError: 'Dataset' object has no attribute 'items'
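
For what it's worth, the traceback hints at what is going on: the file has no top-level 'matrix' group, so scanpy falls back to its legacy reader, which expects f[genome] to be a group holding the matrix datasets. Here 'genome' is a plain dataset, hence the AttributeError. The key names look like a legacy Cell Ranger molecule-info file (one record per molecule) rather than a count-matrix file. Below is a minimal sketch of a key-based check. The function name guess_10x_h5_kind and the key sets are my own assumptions for illustration and are not part of scanpy or Cell Ranger.

```python
# Heuristic only: key sets are assumptions based on the keys listed in this thread.
LEGACY_MOLECULE_INFO_KEYS = {"barcode", "umi", "gene", "reads"}


def guess_10x_h5_kind(top_level_keys):
    """Guess the kind of 10x HDF5 file from its top-level key names."""
    keys = set(top_level_keys)
    if "matrix" in keys:
        # Layout described in the 10x "HDF5 feature-barcode matrix" docs.
        return "matrix"
    if LEGACY_MOLECULE_INFO_KEYS <= keys:
        # Flat per-molecule columns: no matrix is stored in the file.
        return "molecule_info"
    return "unknown"


# Keys reported for the GEO file in the question.
geo_keys = [
    "barcode", "barcode_corrected_reads", "conf_mapped_uniq_read_pos",
    "gem_group", "gene", "gene_ids", "gene_names", "genome", "genome_ids",
    "metrics", "nonconf_mapped_reads", "reads", "umi",
    "umi_corrected_reads", "unmapped_reads",
]
kind = guess_10x_h5_kind(geo_keys)
```

With h5py, the keys could be passed in as list(f.keys()) from an open file handle.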

ADD REPLY • link updated 10 months ago by GenoMax 148k • written 10 months ago by charlieclark1ee ▴ 20

score 0 · Answer 2 · 2024-02-13

If you want to do it in R, you can read the file and create a Seurat object as follows:

library(Seurat)
library(DropletUtils)

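# Read the per-molecule records (one row per barcode/UMI/gene combination)
tmp <- DropletUtils::read10xMolInfo("GSM4453632_T1D5_HPAP032_molecule_info.h5")
# Sum `value` per gene/cell pair; value = reads gives read counts,
# omit `value` to count molecules (UMIs) instead
mtx <- DropletUtils::makeCountMatrix(tmp$data$gene, tmp$data$cell, value = tmp$data$reads)
# Wrap the count matrix in a Seurat object
data.SO <- CreateSeuratObject(counts = mtx)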
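
If you would rather stay in Python and end up with AnnData, the same idea can be sketched with numpy, h5py, scipy and anndata. This is only a sketch under several assumptions that are worth checking against the file: each row is one (barcode, UMI, gene) molecule, a gene index equal to the number of genes means "no gene", only molecules with reads > 0 should be counted, and barcodes are 2-bit packed with a length of 16 bases. The function names count_molecules, decode_barcode and read_molecule_info are made up for this example.

```python
import numpy as np


def count_molecules(barcode, gem_group, gene, n_genes, reads=None):
    """Collapse per-molecule rows into (cell_keys, cell_idx, gene_idx, counts)."""
    barcode = np.asarray(barcode, dtype=np.uint64)
    gem_group = np.asarray(gem_group, dtype=np.uint64)
    gene = np.asarray(gene, dtype=np.int64)

    # Assumption: gene == n_genes marks a molecule not assigned to any gene.
    keep = gene < n_genes
    if reads is not None:
        # Assumption: only molecules with confidently mapped reads count.
        keep &= np.asarray(reads) > 0
    barcode, gem_group, gene = barcode[keep], gem_group[keep], gene[keep]

    # One cell = one unique (barcode, gem_group) pair.
    cells = np.stack([barcode, gem_group], axis=1)
    cell_keys, cell_idx = np.unique(cells, axis=0, return_inverse=True)
    cell_idx = cell_idx.ravel().astype(np.int64)

    # Number of rows per (cell, gene) pair = UMI count for that pair.
    pair, counts = np.unique(cell_idx * n_genes + gene, return_counts=True)
    return cell_keys, pair // n_genes, pair % n_genes, counts


def decode_barcode(code, length=16):
    """Decode a 2-bit packed barcode (A=0, C=1, G=2, T=3), last base in the lowest bits."""
    code = int(code)
    bases = []
    for _ in range(length):
        bases.append("ACGT"[code & 3])
        code >>= 2
    return "".join(reversed(bases))


def read_molecule_info(path, barcode_length=16):
    """Build a cells x genes AnnData of UMI counts from a legacy molecule_info.h5."""
    import anndata as ad
    import h5py
    from scipy import sparse

    with h5py.File(path, "r") as f:
        gene_ids = f["gene_ids"][:].astype(str)
        gene_names = f["gene_names"][:].astype(str)
        cell_keys, rows, cols, counts = count_molecules(
            f["barcode"][:], f["gem_group"][:], f["gene"][:],
            len(gene_ids), reads=f["reads"][:],
        )

    shape = (len(cell_keys), len(gene_ids))
    adata = ad.AnnData(sparse.csr_matrix((counts, (rows, cols)), shape=shape))
    adata.obs_names = [f"{decode_barcode(b, barcode_length)}-{g}" for b, g in cell_keys]
    adata.var_names = gene_ids
    adata.var["gene_symbols"] = gene_names
    return adata
```

Usage would be something like adata = read_molecule_info("GSM4453632_T1D5_HPAP032_molecule_info.h5"). As far as I know, a molecule-info file keeps every observed barcode rather than only called cells, so expect far more than ~3000 rows until you filter (for example, by total UMI count per barcode).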