Problem matching file names to patient IDs in TCGA
6.5 years ago

Hi all,

I have downloaded all the CNV files for a cancer type from the GDC portal.

I also have the clinical data for all patients; however, I cannot map the file names to the submitter IDs.

The file name is something like "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt", while the submitter ID is something like "TCGA-DJ-A2QA".

Can anybody guide me on how to match these two names?

Thank you in advance

Nazanin

TCGA CNV problem in matching • 6.2k views

Could you give us an example, for one patient, of what you have downloaded, with links and/or pictures, please?


I downloaded all the CNV files using the TCGA2bed software.

These are some of the downloaded CNV files:

"AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt"

"AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A02_735476.hg18.seg.txt"

I want to map these files to the clinical file that I downloaded previously.

In the clinical file, only the submitter ID and patient ID are available.


The sample names might be inside the text files. Did you check the headers of the files?


Hi,

No, the header just contains the column names, followed by the results. Something like this:

Sample                                                  Chromosome  Start    End      Num_Probes  Segment_Mean
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406  1           51598    9250000  4679        0.0076
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406  1           9250070  9324990  55          0.5138


Do you have the annotations.txt file that comes with the CNV files?

In this file you will find the entity_id, which can also be found in the clinical files.


Hi, yes, I have also downloaded the annotation file. However, it does not include the CNV file names, so I cannot use it for matching. The following is the header of the annotation file, with its first few rows:

"category   classification  entity_type created_datetime    annotation_id   case_submitter_id   project/project_id  entity_submitter_id id
Alternate sample pipeline   Notification    case    2012-11-13T00:00:00 29ba39af-b266-547a-b2c9-7795eba2e202    TCGA-AB-2822    TCGA-LAML   TCGA-AB-2822    29ba39af-b266-547a-b2c9-7795eba2e202
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 3d086829-de62-5d08-b848-ce0724188ff0    TCGA-AG-A014    TCGA-READ   TCGA-AG-A014    3d086829-de62-5d08-b848-ce0724188ff0
Center QC failed    CenterNotification  aliquot 2012-07-20T00:00:00 5cf05f41-ce70-58a3-8ecb-6bfaf6264437    TCGA-13-0913    TCGA-OV TCGA-13-0913-02A-01R-1564-13    5cf05f41-ce70-58a3-8ecb-6bfaf6264437
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 c53f22b1-677b-5528-a438-39d5390e2c68    TCGA-21-1077    TCGA-LUSC   TCGA-21-1077    c53f22b1-677b-5528-a438-39d5390e2c68
"

Along with your AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt file you have an annotation file in which you can find an entity_id. I think in this case it is 29ba39af-b266-547a-b2c9-7795eba2e202, which should correspond to the case_id in your clinical file.

This still needs to be checked.


The problem is that I have downloaded the CNV files for 507 patients with TCGA2bed. I know that I can find the patient or submitter ID via the GDC, but I cannot do this manually for all 507 cases, so I am looking for a way to find the corresponding patient or submitter IDs automatically.

In other words, I want to find the patient or submitter ID based on "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt". In the annotation file there is no column that contains any part of this file name.


What commands did you use?


TCGA2bed is a graphical tool in which you can select between annotation and experiment. After selecting the tumor type, you have to select the type of data: CNV, RNASeq, ...


As I don't know this API and it's not open source, I can't really help you more. Your CNV files contain sample names, so you can try to get a list of them.
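
Getting that list of sample names could be sketched as follows in base R. This is only an illustration under a few assumptions: the segment files are taken to be tab-delimited with a "Sample" column as in the header posted earlier in the thread, and `read_seg_samples` is a hypothetical helper name, not part of any package. The temporary file just stands in for a real download.

```r
# Sketch: collect the unique values of the "Sample" column from segment files.
# `read_seg_samples` is a hypothetical helper; tab-delimited input is assumed.
read_seg_samples <- function(paths) {
  samples <- lapply(paths, function(p) {
    seg <- read.delim(p, header = TRUE, stringsAsFactors = FALSE)
    unique(seg$Sample)
  })
  unique(unlist(samples))
}

# Small temporary file mimicking the layout quoted in the thread
tmp <- tempfile(fileext = ".seg.txt")
writeLines(c(
  "Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean",
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406\t1\t51598\t9250000\t4679\t0.0076",
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406\t1\t9250070\t9324990\t55\t0.5138"
), tmp)

sample_names <- read_seg_samples(tmp)
print(sample_names)
```

For a real download, `paths` would be the full paths of the .seg.txt files, for example from `list.files(..., full.names = TRUE)`.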

I also found this R package (https://cran.r-project.org/web/packages/TCGAretriever/TCGAretriever.pdf), with which I think you can query the TCGA database using your list of sample names.

Or you can try to contact the authors of this publication (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1419-5)


Hi,

I'm getting an error:

 Error in UseMethod("filter") : 
  no applicable method for 'filter' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"

Can anyone assist please?


Please show the actual code that produced the error.

6.5 years ago

Edit: the original function, written by Bioinfo (via Sean Davis' blog), translates file UUIDs into TCGA barcodes (C: Sample names for TCGA data from GDC-legacy archive). The version below translates file names into TCGA barcodes.

A manual lookup of 507 samples is not that bad if the desire to get the work done is really there. I have done manual lookups of >1000 TCGA samples, back when there were no automated services.

The one solution that I thought would work was this function:

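library(GenomicDataCommons)
library(magrittr)

# Translate GDC file names into TCGA sample barcodes.
# file_names: character vector of file names as stored in the GDC
# legacy:     passed on to files() to query the GDC legacy archive
TCGAtranslateID = function(file_names, legacy = TRUE) {
  # Namespace-qualified so that a filter()/select() from another attached
  # package (e.g. dplyr) is not picked up instead of the GDCQuery methods
  info = files(legacy = legacy) %>%
    GenomicDataCommons::filter( ~ file_name %in% file_names) %>%
    GenomicDataCommons::select('cases.samples.submitter_id') %>%
    results_all()

  # Pull the sample barcode(s) out of the nested 'cases' element of each hit
  id_list = lapply(info$cases, function(a) {
    a[[1]][[1]][[1]]
  })

  barcodes_per_file = sapply(id_list, length)

  # NOTE: row.names = file_names assumes that every file was found, in the
  # same order as the input, with exactly one barcode per file
  return(
    data.frame(
      file_id = rep(ids(info), barcodes_per_file),
      submitter_id = unlist(id_list),
      row.names = file_names))
}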

TCGAtranslateID('AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt')

Output

                                 file_id                               submitter_id
AMAZE_p_TCGASNP_b86_87...seg.txt 6352ceaf-99f4-4b74-94a2-dc5e405543f0  TCGA-BJ-A0Z9-01A
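
Note that the submitter_id returned here is a sample-level barcode (TCGA-BJ-A0Z9-01A), whereas the clinical file in the question uses the shorter case-level form (like TCGA-DJ-A2QA). In a TCGA barcode the first 12 characters identify the participant, so one possible way to line the two up is to truncate the barcode before joining. The snippet below is a minimal sketch; the `clinical` data frame and its column names are hypothetical stand-ins for the real clinical table.

```r
# Sample-level barcode as returned above
sample_barcodes <- c("TCGA-BJ-A0Z9-01A")

# The first 12 characters (project-TSS-participant) give the case-level ID
case_ids <- substr(sample_barcodes, 1, 12)

# Hypothetical stand-in for the clinical table; column names are assumptions
clinical <- data.frame(
  submitter_id = c("TCGA-BJ-A0Z9", "TCGA-DJ-A2QA"),
  stringsAsFactors = FALSE
)

# Row of the clinical table that corresponds to each sample barcode
clinical_row <- match(case_ids, clinical$submitter_id)
print(data.frame(sample_barcodes, case_ids, clinical_row))
```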

Hi Kevin,

Thank you so much. It worked.

I'll never forget your help.


No problem. This function should also accept an entire vector of filenames, like:

c("filename1", "filename2", "filename3", "filename4",...)
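
One way to build such a vector, assuming all the segment files sit in a single folder, is `list.files()` with a pattern. This is only a sketch: `cnv_dir` is a hypothetical path, and the empty files created here merely stand in for a real download so that the snippet runs on its own.

```r
# Hypothetical folder; point this at the real download directory instead
cnv_dir <- file.path(tempdir(), "cnv_demo")
dir.create(cnv_dir, showWarnings = FALSE)

# Empty stand-in files so that the example is self-contained
invisible(file.create(file.path(cnv_dir, c(
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt",
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A02_735476.hg18.seg.txt",
  "notes.txt"
))))

# Keep only the segment files; base names are returned, without the path
file_names <- list.files(cnv_dir, pattern = "\\.seg\\.txt$")
print(file_names)

# The vector could then be passed on (this step needs network access):
# barcodes <- TCGAtranslateID(file_names)
```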

Hi Kevin, I ran the code successfully.

However, I have run into another problem. I have 1026 file names, but only 1020 IDs were found.

Moreover, I did not get the file names (AMAZE_p_TCGASNP_b86_87...seg.txt) in the results, so I cannot map them back to my original input. The results include file_id (fda02baa-b6ba-47cd-88d2-20bd14a193a4), submitter_ID (fda02baa-b6ba-47cd-88d2-20bd14a193a4) and a third column (TCGA-BJ-A0Z9-01A).

Do I have to include "file_name" in "return(data.frame(file_id=rep(ids(info),barcodes_per_file), submitter_id=unlist(id_list), row.names=file_names))"?
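
For reference, once the file name is available as a column in the result, the inputs without a match can be listed with `setdiff()`, and a left join keeps every input row. The sketch below uses made-up placeholder names and barcodes throughout; the `file_name` column in `result` is an assumption, since the function as posted carries file names only as row names.

```r
# Placeholder input file names (made up for illustration)
input_files <- c("a.seg.txt", "b.seg.txt", "c.seg.txt")

# Placeholder result: one file missing, rows in a different order.
# A 'file_name' column is assumed here; the barcodes are invented.
result <- data.frame(
  file_name    = c("c.seg.txt", "a.seg.txt"),
  submitter_id = c("TCGA-XX-0003-01A", "TCGA-XX-0001-01A"),
  stringsAsFactors = FALSE
)

# Input files for which nothing was returned
missing_files <- setdiff(input_files, result$file_name)

# Left join: every input file is kept, unmatched ones get NA
mapped <- merge(
  data.frame(file_name = input_files, stringsAsFactors = FALSE),
  result, by = "file_name", all.x = TRUE
)
print(missing_files)
print(mapped)
```

Joining on the file name rather than relying on row order avoids mislabeling when some files are not found.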


I was finally able to get the full description of my files.

Thank you all for your help and comments.
