I'm working on a project involving datasets from GEO. My project requires having a matrix with counts vs genes. However, a dataset (GSE57152) I am working with is formatted in a less than useful way.
It has a normalization matrix in RPKM instead of counts, and no gene names alongside the values. There are separate .txt files listing the genes that were measured, and separate .txt files with the raw data for each sample (these files do not contain the corresponding gene names). What is the best way to get a matrix with counts and genes for this dataset?
When you open up the raw data .txt files for each sample this is what you get:
GEO datasets often have additional supplementary files along with the main data. At the bottom of the page you linked, "GSE57152_readme.txt" has details explaining that you use the supplementary file "gene_list.txt" to determine what gene each row corresponds to:
Gene_list.txt List of associated RefSeq gene names corresponding to rows in Sample*.txt files (see below).
Sample*.txt Sample*.txt files contain RPKM values of genes for each of the SQUARE matrix fields. Each field represents a transcript with
one of the 12 different ending dinucleotides: 'AC' 'AG' 'CA'
'GT' 'CT' 'GA' 'GC' 'GG' 'AA' 'AT' 'CC' 'CG'
Columns correspond to matrix fields. Rows correspond to genes
according to Gene_list.txt file. Gene list order is identical to all
Sample*.txt files. First row is the 12 SQUARE ending dinucleotides.
NormMatrix.txt Scaling factor based on the mean PCR field yield after patient normalization. These values should be used to normalize
RPKM values per patient per field.
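To make the row-to-gene pairing described in the readme concrete, here is a minimal Python sketch. It assumes the Sample*.txt files are whitespace-delimited, that the first row holds the dinucleotide field labels, and that Gene_list.txt has one gene name per line with no header; I have not checked the actual files, so adjust the parsing if they differ. The function name `label_sample_rows` and the toy data are made up for illustration.

```python
def label_sample_rows(gene_lines, sample_lines):
    """Pair each data row of a Sample*.txt file with its gene name.

    gene_lines:   iterable of lines from Gene_list.txt (assumed: one gene per line)
    sample_lines: iterable of lines from one Sample*.txt file (assumed: first
                  row is the field header, remaining rows follow gene order)
    Returns a list of (gene, {field: value}) tuples, kept as a list so that
    repeated gene names are not silently overwritten.
    """
    genes = [line.strip() for line in gene_lines if line.strip()]
    rows = [line.split() for line in sample_lines if line.strip()]
    header, data = rows[0], rows[1:]

    # The readme says the gene order is identical across sample files,
    # so the row counts must match for positional pairing to be valid.
    if len(data) != len(genes):
        raise ValueError(
            f"{len(data)} data rows but {len(genes)} genes; check the file layout"
        )

    table = []
    for gene, values in zip(genes, data):
        if len(values) != len(header):
            raise ValueError(f"row for {gene} has {len(values)} values, expected {len(header)}")
        table.append((gene, dict(zip(header, map(float, values)))))
    return table


# Toy stand-ins for the real files (three fields instead of twelve).
gene_text = "GENE_A\nGENE_B\n"
sample_text = "AC\tAG\tCA\n1.5\t0.0\t2.0\n0.0\t3.25\t0.5\n"
table = label_sample_rows(gene_text.splitlines(), sample_text.splitlines())
```

With the real data you would pass open file handles instead of the toy strings, once per sample file, using whatever the downloaded files are actually called.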
I'm not super familiar with microarray/expression data in this format, but I believe the RPKM data in the Sample*.txt files plus the scaling factors in "NormMatrix.txt" should allow you to back-calculate the original counts for each sample. The readme also gives details on how the analysis was done, which might help:
Details for read alignment Sequence and quality files, in Lifetech proprietary .xsq binary format, were mapped against the GRCh37/hg19
version of the Homo sapiens genome using the Lifetech Lifescope 2.5.1
whole Transcriptome analysis pipeline. The files produced by this
analytical pipeline were coverage; alignment (.bam files); exon
junction; gene expression in RPKM with reference to the RefSeq gene
structure; read counts with reference to each gene. Quality control
metrics were generated both with the Lifetech suite and the
Integromics SeqSolve analysis suite on all the samples. With the
TopHat 2.0.11 and Cufflinks 2.1.1 suite against GRCh37 ENSEMBL hg19
genome sequence and associated ENSEMBL exon / transcript annotation
in .gtf format, hence excluding 'ab initio' assembled transcripts.
Tables of read counts per gene were generated from the alignments
using the HTSEQ package. Read lengths were 75 nucleotides, fragments,
with a percentage of genome alignments over the whole sequence length
over 80%. The minimal sequence base quality value selected for further
processing was 10 (Phred score). Bases with a quality value below
this parameter were replaced with 'N'. Progressive alignment method
was selected. The minimum genome alignment quality value for an
alignment to be processed was again 10 (Phred value). Only primary
alignments were considered for gene counts and quantification. The
minimal identity seed for alignment extension was 25 nucleotides. The
genome mapping percentage of the libraries was always between 60% and
80% of the initial transcripts.
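On the back-calculation itself: by the standard definition, RPKM is the read count divided by the gene length in kilobases and by the total mapped reads in millions, so inverting it needs both of those numbers in addition to the RPKM value. The sketch below shows only that generic inversion. The gene lengths and per-sample mapped read totals are not in the files described above and would have to come from elsewhere (for example the RefSeq annotation), and I have not worked out how the NormMatrix.txt scaling factors enter, so they are left out here. The function names are made up for illustration.

```python
def rpkm_to_counts(rpkm, gene_length_bp, total_mapped_reads):
    """Invert RPKM = count * 1e9 / (gene_length_bp * total_mapped_reads).

    gene_length_bp and total_mapped_reads are assumed inputs; neither is
    provided by the Sample*.txt files themselves.
    """
    return rpkm * gene_length_bp * total_mapped_reads / 1e9


def counts_to_rpkm(count, gene_length_bp, total_mapped_reads):
    """Forward direction, included so the inversion can be sanity-checked."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Toy numbers: a 2 kb gene in a library with 5 million mapped reads.
estimated = rpkm_to_counts(10.0, 2000, 5_000_000)
```

The result is an estimate rather than the exact original count: rounding in the published RPKM values and any difference between your gene lengths and the ones used by the original pipeline will both shift it.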