Hi!
I retrieved single-cell data from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3489183). The file is in .h5 format, produced by the CellRanger V2.0 pipeline (10x Genomics). To open it and look at the datasets inside, I used the following Python code:
import h5py
import pandas as pd
import numpy as np
f = h5py.File('GSM3489183_IPF_01_filtered_gene_bc_matrices_h5.h5', 'r')
list(f.keys())
['GRCh38']
dset = f['GRCh38']
list(dset)
['barcodes', 'data', 'gene_names', 'genes', 'indices', 'indptr', 'shape']
According to the CellRanger manual, the dataset called 'data' should contain the nonzero UMI counts in column-major order, and the 'shape' dataset is a tuple of (# rows, # columns) indicating the matrix dimensions. Each of these datasets has 1 column. To see the corresponding data I used the code:
a = np.array(f['GRCh38/data'])
pd.DataFrame(a)
However, I don't see how to retrieve, from this data, a table in which genes are rows and cells are columns. The 'data' dataset must hold the expression values for each gene in each cell, but since it has only 1 column, I don't see how to build a table with cells as columns and the corresponding values for each gene. Do you have experience with this type of file? Thank you in advance!
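One possible reading, assuming the 'GRCh38' group follows the compressed sparse column (CSC) layout: 'data' holds the nonzero counts, 'indices' holds the row (gene) index of each count, and 'indptr' marks where each column (cell) starts and ends within 'data'. The sketch below uses a made-up toy matrix rather than the actual file, and the helper name csc_to_dense is hypothetical; it only illustrates how three 1-column arrays plus 'shape' could encode a genes x cells table.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand CSC-style arrays into a dense table (rows = genes, columns = cells)."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # The nonzero entries of column `col` are data[indptr[col]:indptr[col + 1]]
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


# Toy values (invented for illustration): 3 genes x 2 cells
data = [5, 1, 2]      # nonzero counts, column by column
indices = [0, 2, 1]   # row (gene) index of each count
indptr = [0, 2, 3]    # column 0 uses data[0:2], column 1 uses data[2:3]
shape = (3, 2)

table = csc_to_dense(data, indices, indptr, shape)
for row in table:
    print(row)
```

If that assumption holds for the real file, scipy.sparse.csc_matrix((data, indices, indptr), shape=shape) accepts the same four pieces without building a dense table by hand, and 'gene_names' and 'barcodes' would presumably serve as the row and column labels.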