Reading scRNA-seq .h5 file from GEO into AnnData format - no clear matrix?
10 months ago

Hi,

I'm trying to parse .h5 files containing scRNAseq data from this GEO entry https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148073

However, I'm not sure how I can access the count matrix, sparse or dense.

Listing the top-level keys with

with h5py.File(testpath, 'r') as f:
    print(list(f.keys()))

shows only these entries in the .h5 files: ['barcode', 'barcode_corrected_reads', 'conf_mapped_uniq_read_pos', 'gem_group', 'gene', 'gene_ids', 'gene_names', 'genome', 'genome_ids', 'metrics', 'nonconf_mapped_reads', 'reads', 'umi', 'umi_corrected_reads', 'unmapped_reads']

But it's not obvious to me from their shapes how to extract any kind of count matrix from this data. I'm expecting about 3000 cells per sample (each sample is one .h5 file).

Shapes:

<HDF5 dataset "barcode": shape (33395910,), type "<u8">
<HDF5 dataset "barcode_corrected_reads": shape (33395910,), type "<u4">
<HDF5 dataset "conf_mapped_uniq_read_pos": shape (33395910,), type "<u4">
<HDF5 dataset "gem_group": shape (33395910,), type "<u2">
<HDF5 dataset "gene": shape (33395910,), type "<u4">
<HDF5 dataset "gene_ids": shape (33694,), type "|S15">
<HDF5 dataset "gene_names": shape (33694,), type "|S19">
<HDF5 dataset "genome": shape (33395910,), type "|u1">
<HDF5 dataset "genome_ids": shape (1,), type "|S6">
<HDF5 group "/metrics" (0 members)>
<HDF5 dataset "nonconf_mapped_reads": shape (33395910,), type "<u4">
<HDF5 dataset "reads": shape (33395910,), type "<u4">
<HDF5 dataset "umi": shape (33395910,), type "<u4">
<HDF5 dataset "umi_corrected_reads": shape (33395910,), type "<u4">
<HDF5 dataset "unmapped_reads": shape (33395910,), type "<u4">

Any help would be really appreciated. Thank you!

RNA-seq • 1.0k views
10 months ago
Radu Tanasa ▴ 140

Have you tried using scanpy?

import scanpy as sc
adata = sc.read_10x_h5("path")

Radu Tanasa

Thank you for the speedy reply!

I did, but it throws an error. According to the docs, this function reads 10x-Genomics-formatted HDF5 files. I'm not sure the .h5 files I have are laid out the way most such files are: they were definitely created using Cell Ranger, but they don't contain a 'matrix' path as specified in https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-h5-matrices

adata = sc.read_10x_h5(testpath, genome='genome')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/scanpy/readwrite.py", line 195, in read_10x_h5
    adata = _read_legacy_10x_h5(filename, genome=genome, start=start)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/scanpy/readwrite.py", line 221, in _read_legacy_10x_h5
    _collect_datasets(dsets, f[genome])
  File "/opt/homebrew/lib/python3.11/site-packages/scanpy/readwrite.py", line 253, in _collect_datasets
    for k, v in group.items():
                ^^^^^^^^^^^
AttributeError: 'Dataset' object has no attribute 'items'
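
The traceback suggests scanpy took its legacy code path and tried to open f['genome'] as a group, whereas in this file 'genome' is a per-molecule dataset. A minimal sketch for telling the layouts apart from the top-level keys alone; the helper name classify_10x_h5, the label strings, and the choice of which keys count as a molecule-info signature are assumptions made for illustration, based only on the key listing above and the 10x page linked above:

```python
# Hypothetical helper: guess the kind of Cell Ranger HDF5 file from its top-level keys.
# Assumption: these per-molecule datasets together indicate a molecule-info file.
MOLECULE_INFO_KEYS = {"barcode", "gem_group", "gene", "umi", "reads"}


def classify_10x_h5(keys):
    """Return a rough label for a 10x HDF5 file given its top-level key names."""
    keys = set(keys)
    if "matrix" in keys:
        # Layout described on the 10x page linked above.
        return "feature-barcode matrix"
    if MOLECULE_INFO_KEYS <= keys:
        # Per-molecule table: counts have to be aggregated by the reader.
        return "legacy molecule info"
    return "unknown"


# Top-level keys copied from the question.
keys_from_post = [
    "barcode", "barcode_corrected_reads", "conf_mapped_uniq_read_pos",
    "gem_group", "gene", "gene_ids", "gene_names", "genome", "genome_ids",
    "metrics", "nonconf_mapped_reads", "reads", "umi",
    "umi_corrected_reads", "unmapped_reads",
]
print(classify_10x_h5(keys_from_post))  # prints "legacy molecule info"
```

With h5py, the same check would be classify_10x_h5(f.keys()) inside the with block. Older matrix files that keep everything under a genome-named group come back as "unknown" from this sketch.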
10 months ago
bk11 ★ 3.0k

If you want to do it in R, you can read the file and create a Seurat object as follows:

library(Seurat)
library(DropletUtils)

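# Read the per-molecule table (cell barcode, gene index, read count, ...).
tmp <- DropletUtils::read10xMolInfo("GSM4453632_T1D5_HPAP032_molecule_info.h5")
# Aggregate molecules into a gene x cell matrix. With value = reads, each entry sums reads;
# leaving value at its default should instead count one per molecule (UMI counts).
mtx <- DropletUtils::makeCountMatrix(tmp$data$gene, tmp$data$cell, value = tmp$data$reads)
data.SO <- CreateSeuratObject(counts = mtx)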
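
For the AnnData route asked about in the question, a rough Python counterpart of the R approach above might look like the sketch below. It is not a scanpy or anndata API: the function names are made up, and it assumes that each row of the file is one unique molecule, that a gene index equal to the number of genes marks a molecule not assigned to any gene, and that barcodes are 2-bit packed 16-base sequences. The file-reading function has not been run against the GEO files.

```python
import numpy as np
from scipy import sparse


def decode_barcode(code, length=16):
    """Decode a 2-bit packed barcode (A=0, C=1, G=2, T=3; last base in the lowest bits).

    Assumption: the packing order and the 16-base default are not confirmed by the post.
    """
    code = int(code)
    bases = []
    for _ in range(length):
        bases.append("ACGT"[code & 3])
        code >>= 2
    return "".join(reversed(bases))


def molecules_to_counts(barcode, gem_group, gene, reads, n_genes):
    """Collapse per-molecule rows into a sparse (barcodes x genes) UMI count matrix."""
    barcode = np.asarray(barcode, dtype=np.uint64)
    gem_group = np.asarray(gem_group, dtype=np.uint64)
    gene = np.asarray(gene, dtype=np.int64)
    reads = np.asarray(reads)
    # Assumption: gene == n_genes flags molecules not assigned to a gene,
    # and rows with zero confidently mapped reads should not be counted.
    keep = (gene < n_genes) & (reads > 0)
    # One droplet is identified by its (barcode, gem_group) pair.
    pairs = np.stack([barcode[keep], gem_group[keep]], axis=1)
    cells, cell_idx = np.unique(pairs, axis=0, return_inverse=True)
    cell_idx = np.asarray(cell_idx).ravel()
    # Assumption: each remaining row is one unique molecule, so it adds 1 UMI.
    ones = np.ones(cell_idx.size, dtype=np.int32)
    counts = sparse.coo_matrix(
        (ones, (cell_idx, gene[keep])), shape=(len(cells), n_genes)
    ).tocsr()  # converting to CSR sums duplicate (cell, gene) entries
    return counts, cells


def read_legacy_molecule_info(path, barcode_length=16):
    """Build an AnnData object from a legacy molecule_info.h5 file (untested sketch)."""
    # Imported here so the helpers above only need numpy and scipy.
    import anndata as ad
    import h5py
    import pandas as pd

    with h5py.File(path, "r") as f:
        gene_ids = f["gene_ids"][:].astype(str)
        gene_names = f["gene_names"][:].astype(str)
        counts, cells = molecules_to_counts(
            f["barcode"][:], f["gem_group"][:], f["gene"][:], f["reads"][:],
            n_genes=len(gene_ids),
        )
    obs_names = [f"{decode_barcode(b, barcode_length)}-{g}" for b, g in cells]
    obs = pd.DataFrame(index=obs_names)
    var = pd.DataFrame({"gene_symbols": gene_names}, index=gene_ids)
    return ad.AnnData(X=counts, obs=obs, var=var)
```

Usage would be adata = read_legacy_molecule_info(testpath). The result holds every barcode seen in the file, which is likely far more than the roughly 3000 cells expected, so some filtering (for example by total UMIs per barcode) would still be needed before analysis.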