About TCGA CNV data preprocessing
1
1
Entering edit mode
6.1 years ago
Eric Wang ▴ 50

Hello all, I am looking at the Level 3 CNV files on TCGA. I have a few questions:

I download copy number variance data from TCGA database and mapped genomic regions to gene symbols using this method(https://www.biostars.org/p/311199/#311746). Now i get a matrix that its rows are genes and its columns are samples. If I want to use this data to cluster samples, how do I pre process the data? (p.s. for probes mapped to the same gene, I averaged their Segment_mean values, right?)

Thanks for any help,

SNP CNV • 7.4k views
ADD COMMENT
4
Entering edit mode
6.1 years ago

Hey,

You are referring to a post that I made. From where did you obtain the original data? - Broad Firebrowse (somatic copy number alterations) or just downloaded the original files from GDC?

If you followed the data processing exactly as follows:

  • Part I - download segmented sCNA data for any TCGA cohort from Broad Institute's FireBrowse server and identify recurrent sCNA regions in these with GAIA
  • Part II - plot recurrent sCNA gains and losses from GAIA
  • Part III - annotate the recurrent sCNA regions (this post, just below)
  • Part IV - generate heatmap of recurrent sCNA regions over your cohort

Then, the statistically significant recurrent somatic copy number alterations (sCNA) are held in the *.igv.gistic files. You can extract statistically significant regions from this file and then pull out the original copy number over these on a per sample basis using GenomicRanges - the copy number that you take is indeed the segment mean from the original copy number program that was used (in the case of TCGA data, likely DNAcopy (R)).

If you do that, then you can build a matrix of:

  • statistically significant recurrent sCNAs in a group of patients as rows
  • patients as columns
  • Segment Mean over each region as the values

With that, I generated this and identified clusters of patients based on recurrent sCNA via Partitioning Around Medoids (PAM)::

fgh

[from: https://www.ncbi.nlm.nih.gov/pubmed/29682207]

Of course, you don't have to use that data, exactly, but you really have to know to what your data relates.

Kevin

ADD COMMENT
0
Entering edit mode

Hi, Thanks for your help! I just downloaded the CMV data from TCGA database and mapped genomic regions to gene symbols using the following code:

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
df = read.csv("BRCA-CNV.csv") # CNV data downloaded from TCGA database.
cnv =  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
hits <- findOverlaps(genes_GR, cnv, type="within")
df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
df_ann <- unique(df_ann)
write.table(df_ann, file="BRCA-CNV-withGeneSymbol.txt", quote = F, sep = "\t", row.names = F)

Then I constructed the CNV matrix using the following code:

cnv <- as.matrix(read.table("BRCA-CNV-withGeneSymbol.txt"))
cnv <- cnv[,c(7:9)]
samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
colnames(tumor) <- samples.tumor
rownames(tumor) <- genes
for(i in 1:nrow(tumor)){
    for(j in 1:ncol(tumor)){
        gene.tmp <- rownames(tumor)[i]
        sample.tmp <- colnames(tumor)[j]
        cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
        tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
    }
}

At last, I got the CNV matrix. Is there any error in this process? The final data is like this:

                                gene1      gene2       gene3       gene4      gene5
TCGA-LD-A9QF-01A-32D-A41E-01 -0.3258360  0.2291041 -0.29244881  0.4456383  0.2990137
TCGA-EW-A1OX-01A-11D-A141-01  5.4529917 -0.6737190 -0.17351193  0.2469372  0.4106143
TCGA-A2-A0YT-01A-11D-A107-01  0.5855575 -1.4167641 -0.09580651 -1.9142909 -0.7032367
TCGA-E2-A14N-01A-31D-A134-01 -0.5166007 -0.4208664  0.64834184  0.1254564  0.3653708
TCGA-BH-A18F-01A-11D-A12A-01 -1.4072964 -0.5098043  0.22809823  0.1885511 -0.1663476
ADD REPLY
1
Entering edit mode

You just took the tumour samples, merged them together, and then summarised copy number per gene? There's nothing wrong with that; however, if you try to publish it, a reviewer would ask if these copy number events are somatic or germline. Without subtracting the normal copy number profiles, you cannot know whether or not these are true somatic events. It depends on what exactly you want to do, of course.

  • Germline copy number event = copy number variation
  • Somatic copy number event = copy number alteration

The last time that I checked, the open access (level 3) TCGA copy number data is output per sample and is available separately for tumour and normal at the Genomic Data Commons (GDC). This is the benefit of just taking the Broad FireBrowse data, i.e., because they have re-processed this level 3 data to produce somatic copy number events, available as files with names like this: 'genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19' ( http://firebrowse.org/?cohort=BRCA&download_dialog=true )

ADD REPLY
0
Entering edit mode

I have downloaded the masked CNV data filtered germline variation as input data and get the average of all probes per gene. If I just want to use CNV data for cluster analysis, is this right? Thank you for your sharing. Now I know the difference between variation and alteration. Thanks again!

ADD REPLY
0
Entering edit mode

Hey, it was not I who up-voted your comment. I am reading your comment for the first time right now. If that is the data that you have, then it seems okay!

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hello Kevin, if i want to extract the genes from both tumor and normal samples That will make more sense instead or no

Actually, my data have "01A", "10A and "11A" and i also want to built the CNV matrix, I tried @Django code but an error is shown?

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hello Kevin Could you plz answer my question?

ADD REPLY
1
Entering edit mode

You have not shown the error message, so, how can I advise further?

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hi, Kevin sorry for that here is the piece of code i tried and the following error

    cnv <- as.matrix(read.table("CNV-withGeneSymbol.txt"))
    cnv <- cnv[,c(7:9)]
    samples <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", cnv[,1])))
    samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
    tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
    colnames(tumor) <- samples.tumor
    rownames(tumor) <- genes
    for(i in 1:nrow(tumor)){
        for(j in 1:ncol(tumor)){
            gene.tmp <- rownames(tumor)[i]
            sample.tmp <- colnames(tumor)[j]
            cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
            tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
    }
    }

Error shown here; Error in [<-(*tmp*, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, : subscript out of bounds

ADD REPLY
1
Entering edit mode

Hey, thank you! Can you show the first few lines and columns of the object cnv? Just paste them here, highlight them, and then click the 101 010 button

ADD REPLY
0
Entering edit mode

@Kevin Blighe, I'm working on RGui(64-bit) and the objects are not shown here. I have posted the full code i applied, and i can share the data as well if y want ?

library(GenomicRanges)
> mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
> mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
> genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
> genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
> xidx <- which(genes[,2]=="X")
> yidx <- which(genes[,2]=="Y")
> genes[xidx, 2] <- 23
> genes[yidx, 2] <- 24
> genes[,2] <- sapply(genes[,2],as.integer)
> genes <- genes[order(genes[,3]),]
> genes <- genes[order(genes[,2]),]
> colnames(genes) <- c("GeneSymbol","Chr","Start","End")
> df<- read.table("E:/results/Stage1_CNV_data_v2.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)
> colnames(df) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
> cnv <-  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
> hits <- findOverlaps(genes_GR, cnv, type="within")
> df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
> df_ann <- unique(df_ann)
> write.table(df_ann, file="outputfile-GeneSymbol.txt", quote = F, sep = "\t", row.names = F)
> cnv <- as.matrix(read.table("outputfile-GeneSymbol.txt"))
> cnv <- cnv[,c(7:9)]
> samples <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", cnv[,1])))
> samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
> tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
> colnames(tumor) <- samples.tumor
> rownames(tumor) <- genes
> for(i in 1:nrow(tumor)){
+     for(j in 1:ncol(tumor)){
+         gene.tmp <- rownames(tumor)[i]
+         sample.tmp <- colnames(tumor)[j]
+         cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
+         tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
+ }
+ }
Error in `[<-`(`*tmp*`, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[,  : 
  subscript out of bounds
ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode

I am not exactly sure what you are aiming to do with this for loop. However, the way to debug them is, after they crash and produce the error, check the last value of i and j, and then take a look at those indices in tumor and cnv. Then you will be able to debug it.

ADD REPLY
0
Entering edit mode

@Kevin Blighe hi Kevin, So, i have to change the values of i and j and then run the for loop again or what? I didn't get exactly what do you mean by taking a look and then debug it again?

Actually, i want to construct the CNV matrix (rows represent the samples and columns represent the annotated genes) .

Exactly, i want to integrate this data with others multi-omics datasets such as gene expression and DNA methylation for further analysis. All these datasets have such format except CNV data which i get stuck to reformat it now. That's why i used that for loop to get the matrix and maybe after that i can merge all of these datasets together. Is this for loop right or no to make such matrix?

ADD REPLY
1
Entering edit mode

Yes, but what is the value of i and j after the error is produced?

You may need to do the following:

  1. define a unique list of regions (chr:start:end) across all of your samples
  2. create an empty data-frame with just the unique list of regions as row-names - this data-frame will eventually hold your final output
  3. loop over your input data per sample (i), and output 'segment mean' where there is overlap with each segment (j). Overlap can be determined by GenomicRanges
ADD REPLY
0
Entering edit mode

@Kevin Blighe Hi Kevin, Thanks for yr quick reply! your helpful suggestions!

The last values of i and j are 1L.

I want my output like a matrix of samples rows and genes in the columns with their corresponding segment values.

I'm more familiar with MATLAB, i used the output file produced from your annotation code and i write a piece of code that can allow allocating this matrix but my code there doesn(t work too. I tried the code in R which i share with you yesterday and also no results!

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hello. I'm currently working with that same goal (to create a dataframe containing samples in rows and genes in columns). I somehow managed to reproduce the code released in the first comment, though I'm not sure about the variables format. My cnv matrix is made up of ("Samples", "Chromosome", "Start", "End", "Num_probes", "Segment_mean", "GeneSymbol", "Chr", "Start" and "End"). I think I understood how to do the first and second point, though I'm not sure about the third one.

    cnv <- as.matrix(read.table("CNV-withGeneSymbol.txt"))
cnv <- cnv[,c(7:9)]
samples <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", cnv[,1])))
samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
colnames(tumor) <- samples.tumor
rownames(tumor) <- genes
for(i in 1:nrow(tumor)){
    for(j in 1:ncol(tumor)){
        gene.tmp <- rownames(tumor)[i]
        sample.tmp <- colnames(tumor)[j]
        cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
        tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
}
}

In this code, could you please specify the third column of cnv (after Samples and Genes) ? I'm not sure on which value you calculate the mean to produce the tumor matrix.

Thank you very much for your help :)

ADD REPLY
0
Entering edit mode

Hi Juliette, for this, I actually used GenomicRanges, which makes it a lot easier. I can try to 'dig out' the code if you need it.

The steps were:

  1. define a list of unique regions across all samples
  2. determine copy number across each unique region per sample
  3. perform clustering and create heatmap

Are you familiar with GenomicRanges?

ADD REPLY
0
Entering edit mode

@Kevin Blighe I discovered GenomicRanges with your code. I tried to recreate the code, I wrote this

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)

df = brca.nocnv # obtained via the function GDCquery and GDCprepare
df[which(df$Chromosome=="X"),"Chromosome"] <- 23
df[which(df$Chromosome=="Y"),"Chromosome"] <- 24
cnv =  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)

hits <- findOverlaps(genes_GR, cnv, type="within")
df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
df_ann <- unique(df_ann)
write.table(df_ann, file="BRCA-CNV-withGeneSymbol.txt", quote = F, sep = "\t", row.names = F)

Then, I was planning to do the for loop, that I copied in my previous comment.

Dropping your code would be very helpful, thank you !

ADD REPLY
0
Entering edit mode

Hey juliette, I took a look and here is what I did (below). So, I use the .igv.gistic file that is ultimately produced for the entire cohort being analysed, and overlap these regions, which are statistically significant, with the original copy number segment mean per sample. The matrix that's produced from this process, in the heat object, ultimately forms the heatmap.

Let me know if you can follow this and adapt it to your own code.

  require(GenomicRanges)
  recCNV.Template.GR <- makeGRangesFromDataFrame(
    read.table("Tumor.txt.igv.gistic", sep="\t",
      header=TRUE), keep.extra.columns=TRUE)

  # Form a new data frame that will eventually hold all regions from the
  # teplate cnvMatrix and the respective CN segment value per sample
  heat <- data.frame(
    pastedata.framerecCNV.Template.GR)$seqnames,
      data.framerecCNV.Template.GR)$start,
      data.framerecCNV.Template.GR)$end, sep = '.'),
    data.framerecCNV.Template.GR))
  colnames(heat)[1] <- 'Key'

  # Read in the individal / original sample CN segment values
  CN.tumor.recCNV <- CN.tumor
  colnames(CN.tumor.recCNV) <- c('Sample','Chr','Start','End',
    'Num_Probes','Segment_Mean')

  # Get a sample listing
    samples <- unique(CN.tumor.recCNV$Sample)

  # Each loop overlaps individual sample CN segment calls with the
  # template cnvMatrix regions. If present, the CN segment value is added.
  mylist <- list()
  for (i in 1:length(samples)) {
    GR <- makeGRangesFromDataFrame(
      CN.tumor.recCNV[CN.tumor.recCNV$Sample==samples[i],],
      keep.extra.columns = TRUE)
    hits <- findOverlapsrecCNV.Template.GR, GR, type = "any")

    mylist[[i]] <- data.frame(
      recCNV.Template.GR[queryHits(hits),],
      GR[subjectHits(hits),])

    df.tmp <- mylist[[i]]
    df.tmp$key <- paste(df.tmp$seqnames, df.tmp$start, df.tmp$end,
      sep = '.')
    df.tmp <- df.tmp[match(heat$Key, df.tmp$key),]
    heat<- cbind(heat, df.tmp$Segment_Mean)
    colnames(heat)[11+i] <- samples[i]
}
heat[is.na(heat)] <- 0
ADD REPLY
0
Entering edit mode

@Kevin Blighe

Thank you very much for your answer. I tried to adapt it to what I already had. I used your pipeline to get RecCNV. Here is the code :

recCNV.Template.GR <- makeGRangesFromDataFrame(RecCNV, keep.extra.columns = TRUE)

heat <- data.framepastedata.framerecCNV.Template.GR)$seqnames,
                         data.framerecCNV.Template.GR)$start,
                         data.framerecCNV.Template.GR)$end, sep = '.'),
                   data.framerecCNV.Template.GR))
colnames(heat)[1] <- 'Key'

CN.tumor.recCNV <- cnvMatrix
colnames(CN.tumor.recCNV) <- c('Sample','Chr','Start','End','Num_Probes','Segment_Mean')

samples <- unique(CN.tumor.recCNV$Sample)

mylist <- list()


for (i in 1:length(samples)) {
  GR <- makeGRangesFromDataFrame(CN.tumor.recCNV[CN.tumor.recCNV$Sample==samples[i],],
                                 keep.extra.columns = TRUE)
  hits <- findOverlapsrecCNV.Template.GR, GR, type = "any")

  mylist[[i]] <- data.framerecCNV.Template.GR[queryHits(hits),],
                            GR[subjectHits(hits),])
  df.tmp <- mylist[[i]]
  df.tmp$key <- paste(df.tmp$seqnames, 
                      df.tmp$start, 
                      df.tmp$end,
                      sep = '.')
  df.tmp <- df.tmp[match(heat$Key, df.tmp$key),]
  heat<- cbind(heat, df.tmp$Segment_Mean)
  colnames(heat)[10+i] <- samples[i]
}
heat[is.na(heat)] <- 0

Then, I obtain heat with the columns (Key, seqnames, start, end, width, strand, aberration, q.value, df.tmp$Segment_Mean).

However, I'm not sure what you want to do with the command colnames(heat)[11+i] <- samples[i] (it seems that in my case, it would be colnames(heat)[10+i] but I'm not sure).

Again, thank you so much for taking the time to help !

ADD REPLY
0
Entering edit mode

Great. That colnames() command is not critical, so, probably best to use your own judgement about how to use it for your own data.

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hi, I just wanted to let you know that I managed to get the heat matrix after all. Concerning that colnames command, I had only tried the loop for specific values of 'i', and therefore did not understand why you wanted to name the columns with samples. Anyway, thank you very much for your code and explanation, you helped a lot !

ADD REPLY
0
Entering edit mode

That is great that you got it complete!

ADD REPLY
0
Entering edit mode

Hi @Kevin Blighe, Dear Kevin , is there any mistake in this command ? it doesnt work, with me

 heat <- data.frame(
        pastedata.framerecCNV.Template.GR)$seqnames,
          data.framerecCNV.Template.GR)$start,
          data.framerecCNV.Template.GR)$end, sep = '.'),
        data.framerecCNV.Template.GR))
ADD REPLY
1
Entering edit mode

Thank you, i fix it

     heat <- data.frame(paste
data.framerecCNV.Template.GR)$seqnames,
    data.framerecCNV.Template.GR)$start,
    data.framerecCNV.Template.GR)$end, sep = '.'),
    data.framerecCNV.Template.GR))
ADD REPLY
0
Entering edit mode

@Kevin Blighe Hi Kevin, i have checked as you said but i couldn't fix it This is the first few lines and columns of the object cnv

 1
    GeneSymbol
    Chr
    Start
    2
    ARHGEF16
    1
    3370990
    3
    ARHGEF16
    1
    3370990
    4
    ARHGEF16
    1
    3370990
    5
    ARHGEF16
    1
    3370990
    6
    ARHGEF16
    1
    3370990
    7
    ARHGEF16
    1
    3370990

I already explain to you why i used this for Loop, hope you can help me plz!

ADD REPLY
0
Entering edit mode

@Django i try your code but an error is shown Error in [<-(*tmp*, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, : subscript out of bounds Can you help me to fix it ?

ADD REPLY
1
Entering edit mode

I think the Error code means there is no element in "sample.tmp". And which() will return nothing.

ADD REPLY
0
Entering edit mode

@Django Itried your code and i think it has some missing line before samples, where is the object samples? could y plz fix it

ADD REPLY
0
Entering edit mode

Hello @Kevin Blighe can you plz tell me how to open this file "Tumor.All.txt.igv.gistic" and how to plot the heatmap using which package( ComlpexHeatmap package in your paper) (Can you plz provide the code for this step too).

With my respect to you!

ADD REPLY
1
Entering edit mode

Hey bro, you can read back in the file like this:

require(GenomicRanges)
recCNV.Template.GR <- makeGRangesFromDataFrame(read.table("Tumor.All.txt.igv.gistic", sep="\t", header=TRUE), keep.extra.columns=TRUE)

This reads it into a GenomicRanges object.

Do you also have an object called CN.tumor in your R session?

ADD REPLY
0
Entering edit mode

@Kevin Blighe yes it should be there bcz i just follow your whole process step by step and everything work smoothly but only with FFPE file.

ADD REPLY
0
Entering edit mode

Hi,is the object CN.tumor just the df? if not, df="Tumor.All.txt.igv.gistic"? thanks!

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

@Kevin Blighe , Hi dear Kevin, how you generated these 2 heatmaps please? i was wondering if the code that you used for the genration is availaibe and if you can share with me? i wloud like to apply it on my data, if that possible?

ADD REPLY
1
Entering edit mode

Hey, you should read through this thread: C: About TCGA CNV data preprocessing

ADD REPLY
0
Entering edit mode

@Kevin Blighe , Hi dear Kevin, thanks for your reply, that thread i already check it and it works well for me, but i mean the heatmap visualization how you make those beautiful clusters, As far i know you have used ComplexHeatmap package. For the Dendrograms you used Euclidean distance and Ward’s linkage. Could you please share these parts if that possibe, i would like to apply it for my data ?

ADD REPLY
1
Entering edit mode

Hey, I have actually recently posted a complete tutorial here: https://github.com/kevinblighe/E-MTAB-6141

ADD REPLY
1
Entering edit mode

@ Kevin Blighe, Hi dear Kevin, Wow! Great job, I'm really proud of you, i will read the tutorial carefully, Thank you very very much!

ADD REPLY

Login before adding your answer.

Traffic: 1950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6