Getting an Expression Matrix from a GDC Manifest File
8 days ago
cthangav ▴ 110

Using a manifest file from a paper, I am trying to obtain an expression matrix from GDC, with genes on one axis and samples on the other.

I tried TCGAbiolinks, but GDCprepare won't accept a manifest file. It wants the samples specified through a "query" created with GDCquery, or through a SummarizedExperiment object.

The manifest looks like this. Is there a way to turn it into a GDCquery object?

id  filename    md5 size    state
752efdfa-92e4-4f0d-8c9f-23791ac82eae    90985f9d-0403-4f49-9cd5-020d220ac220.rna_seq.augmented_star_gene_counts.tsv 0934480ded0ec7be97fc0407a9b1da11    4230676 released
f8ba326c-a068-4d8c-a544-07fd13c39cbd    b79c9932-bc2b-4c87-b954-3b6efd0d76e5.rna_seq.augmented_star_gene_counts.tsv a5ba87a5d2a3b70ef6f90c71d5b69ec9    4245285 released
0cf3b928-c697-4d9d-a920-519db2d3d060    b4da5920-c3b0-434b-9fb8-e2909d898b3a.rna_seq.augmented_star_gene_counts.tsv 6088855cd134a064acd793a8fbfae906    4211903 released
a3ed21c8-e48e-4a4c-83ea-13d378909970    5b4ed5af-d39b-432e-9d41-9862403c9208.rna_seq.augmented_star_gene_counts.tsv e6b6601dd5353457a692d609840b92ce    4228730 released
58e2eaf8-916c-4601-85d0-0a01ffcbb9ef    b1ddf742-6c66-4f3d-a405-5d26d03b431a.rna_seq.augmented_star_gene_counts.tsv fbb3cd3557cd141d8141488597c2b665    4239375 released
196cdd75-bb1f-4778-83f5-e26986ed2b2f    04464a56-e420-4e61-aa79-f1afacdb3c91.rna_seq.augmented_star_gene_counts.tsv e0556cbe056f239a98a5152d3fd02155    4254799 released
650165ae-5691-4dc1-b36f-1d9fb92ec7f1    7720992f-1f3e-46c0-a8f2-11149d70dd4a.rna_seq.augmented_star_gene_counts.tsv e645772a9ace36e794a7f5567fe66497    4239486 released
839e6752-8bce-4eab-8b31-91661aab52f9    f6ec3da4-b8e6-4f35-8473-7a8bb9bf5cc8.rna_seq.augmented_star_gene_counts.tsv 9dcff604aa9d505cf5ae1b7769827bc9    4202828 released
54532364-a3ea-4f72-990e-a173198139f9    ddbb58a4-beb7-49f8-8f82-38fa4ea61642.rna_seq.augmented_star_gene_counts.tsv 26912c1419ef897c31a8c3ff1e62b507    4241592 released
6c2b6438-4faf-4e1c-bee4-8dbcace35871    fd342f63-b31b-4d95-bb94-029aff2b4ed0.rna_seq.augmented_star_gene_counts.tsv 48c67baa080721d1e2dfdf61f2717b7d    4237262 released
b25942f0-b57a-4a96-820c-4115cf471572    2dca88b8-727f-446f-899a-86d8871aa148.rna_seq.augmented_star_gene_counts.tsv b444069c4a08d3121ce7d9118e79decf    4259327 released
6 days ago
Zhenyu Zhang ★ 1.2k

The downloaded files are laid out as ./id/filename. It's probably 20 minutes' work to collect a particular column from these files into a single file. You can either use bash paste and cut together, or just write an R loop that reads each TSV (for example with read.delim).
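
The loop approach described above could look roughly like the sketch below, written here in Python rather than R or bash. It rests on several assumptions you should check against your own files: the manifest is tab-delimited with the header shown in the question, the downloads sit under ./id/filename, each counts file may start with a "#" comment line before its header, the columns are named gene_id and unstranded, and summary rows start with "N_". The function names (read_counts, build_matrix, write_matrix) are made up for illustration.

```python
import csv
import os


def read_counts(path, column="unstranded"):
    """Return {gene_id: value} for one counts file.

    Assumes a tab-delimited file whose header has 'gene_id' and `column`;
    '#' comment lines and 'N_*' summary rows are skipped.
    """
    with open(path, newline="") as handle:
        lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(lines, delimiter="\t")
        return {
            row["gene_id"]: row[column]
            for row in reader
            if not row["gene_id"].startswith("N_")
        }


def build_matrix(manifest_path, download_dir=".", column="unstranded"):
    """Collect one column from every manifest entry into a genes x files table."""
    with open(manifest_path, newline="") as handle:
        manifest = list(csv.DictReader(handle, delimiter="\t"))
    file_ids = [entry["id"] for entry in manifest]
    columns = [
        read_counts(os.path.join(download_dir, entry["id"], entry["filename"]), column)
        for entry in manifest
    ]
    # Gene order is taken from the first file; genes missing elsewhere become "NA".
    genes = list(columns[0]) if columns else []
    rows = [[gene] + [col.get(gene, "NA") for col in columns] for gene in genes]
    return ["gene_id"] + file_ids, rows


def write_matrix(header, rows, out_path):
    """Write the table as a tab-delimited file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
```

Calling build_matrix("manifest.txt", ".") and passing the result to write_matrix would give one row per gene and one column per manifest entry. Note that the columns are labelled with the file UUIDs from the manifest's id column, not with sample barcodes, so mapping them to samples needs a separate lookup.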


The original paper used around 700 files. Is downloading them all and running a read loop really the best approach in terms of time and disk space?
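
For a back-of-the-envelope figure on space: the size column of the manifest gives bytes per file, and the rows shown in the question are about 4.2 MB each, so around 700 such files would come to something on the order of 3 GB on disk. A minimal sketch for summing that column is below; it assumes a tab-delimited manifest with the header shown in the question, and the function name manifest_total_bytes is hypothetical.

```python
import csv


def manifest_total_bytes(manifest_path):
    """Sum the 'size' column (bytes) of a tab-delimited GDC manifest."""
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return sum(int(row["size"]) for row in reader)
```

Dividing the result by 1e9 gives the approximate download size in GB. The final matrix is much smaller than the downloads, since only one column per file is kept.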
