Hello all,
I am looking at the Level 3 CNV files on TCGA. I have a few questions:
I download copy number variance data from TCGA database and mapped genomic regions to gene symbols using this method(https://www.biostars.org/p/311199/#311746). Now i get a matrix that its rows are genes and its columns are samples.
If I want to use this data to cluster samples, how do I pre process the data? (p.s. for probes mapped to the same gene, I averaged their Segment_mean values, right?)
You are referring to a post that I made. From where did you obtain the original data? - Broad Firebrowse (somatic copy number alterations) or just downloaded the original files from GDC?
If you followed the data processing exactly as follows:
Part I - download segmented sCNA data for any TCGA cohort from Broad Institute's FireBrowse server and identify recurrent sCNA regions in these with GAIA
Part II - plot recurrent sCNA gains and losses from GAIA
Part III - annotate the recurrent sCNA regions (this post, just below)
Part IV -
generate heatmap of recurrent sCNA regions over your cohort
Then, the statistically significant recurrent somatic copy number alterations (sCNA) are held in the *.igv.gistic files. You can extract statistically significant regions from this file and then pull out the original copy number over these on a per sample basis using GenomicRanges - the copy number that you take is indeed the segment mean from the original copy number program that was used (in the case of TCGA data, likely DNAcopy (R)).
If you do that, then you can build a matrix of:
statistically significant recurrent sCNAs in a group of patients as
rows
patients as columns
Segment Mean over each region as the values
With that, I generated this and identified clusters of patients based on recurrent sCNA via Partitioning Around Medoids (PAM)::
You just took the tumour samples, merged them together, and then summarised copy number per gene? There's nothing wrong with that; however, if you try to publish it, a reviewer would ask if these copy number events are somatic or germline. Without subtracting the normal copy number profiles, you cannot know whether or not these are true somatic events. It depends on what exactly you want to do, of course.
Germline copy number event = copy number variation
Somatic copy number event = copy number alteration
The last time that I checked, the open access (level 3) TCGA copy number data is output per sample and is available separately for tumour and normal at the Genomic Data Commons (GDC). This is the benefit of just taking the Broad FireBrowse data, i.e., because they have re-processed this level 3 data to produce somatic copy number events, available as files with names like this: 'genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19' ( http://firebrowse.org/?cohort=BRCA&download_dialog=true )
I have downloaded the masked CNV data filtered germline variation as input data and get the average of all probes per gene. If I just want to use CNV data for cluster analysis, is this right? Thank you for your sharing. Now I know the difference between variation and alteration. Thanks again!
Hey, it was not I who up-voted your comment. I am reading your comment for the first time right now. If that is the data that you have, then it seems okay!
Hey, thank you! Can you show the first few lines and columns of the object cnv? Just paste them here, highlight them, and then click the 101 010 button
@Kevin Blighe, I'm working on RGui(64-bit) and the objects are not shown here.
I have posted the full code i applied, and i can share the data as well if y want ?
I am not exactly sure what you are aiming to do with this for loop. However, the way to debug them is, after they crash and produce the error, check the last value of i and j, and then take a look at those indices in tumor and cnv. Then you will be able to debug it.
@Kevin Blighe hi Kevin,
So, i have to change the values of i and j and then run the for loop again or what? I didn't get exactly what do you mean by taking a look and then debug it again?
Actually, i want to construct the CNV matrix (rows represent the samples and columns represent the annotated genes) .
Exactly, i want to integrate this data with others multi-omics datasets such as gene expression and DNA methylation for further analysis.
All these datasets have such format except CNV data which i get stuck to reformat it now.
That's why i used that for loop to get the matrix and maybe after that i can merge all of these datasets together. Is this for loop right or no to make such matrix?
Yes, but what is the value of i and jafter the error is produced?
You may need to do the following:
define a unique list of regions (chr:start:end) across all of your samples
create an empty data-frame with just the unique list of regions as
row-names - this data-frame will eventually hold your final output
loop over your input data per sample (i), and output 'segment
mean' where there is overlap with each segment (j). Overlap can be
determined by GenomicRanges
@Kevin Blighe Hi Kevin, Thanks for yr quick reply! your helpful suggestions!
The last values of i and j are 1L.
I want my output like a matrix of samples rows and genes in the columns with their corresponding segment values.
I'm more familiar with MATLAB, i used the output file produced from your annotation code and i write a piece of code that can allow allocating this matrix but my code there doesn(t work too.
I tried the code in R which i share with you yesterday and also no results!
@Kevin Blighe
Hello. I'm currently working with that same goal (to create a dataframe containing samples in rows and genes in columns). I somehow managed to reproduce the code released in the first comment, though I'm not sure about the variables format. My cnv matrix is made up of ("Samples", "Chromosome", "Start", "End", "Num_probes", "Segment_mean", "GeneSymbol", "Chr", "Start" and "End").
I think I understood how to do the first and second point, though I'm not sure about the third one.
In this code, could you please specify the third column of cnv (after Samples and Genes) ? I'm not sure on which value you calculate the mean to produce the tumor matrix.
Hey juliette, I took a look and here is what I did (below). So, I use the .igv.gistic file that is ultimately produced for the entire cohort being analysed, and overlap these regions, which are statistically significant, with the original copy number segment mean per sample. The matrix that's produced from this process, in the heat object, ultimately forms the heatmap.
Let me know if you can follow this and adapt it to your own code.
require(GenomicRanges)
recCNV.Template.GR <- makeGRangesFromDataFrame(
read.table("Tumor.txt.igv.gistic", sep="\t",
header=TRUE), keep.extra.columns=TRUE)
# Form a new data frame that will eventually hold all regions from the
# teplate cnvMatrix and the respective CN segment value per sample
heat <- data.frame(
pastedata.framerecCNV.Template.GR)$seqnames,
data.framerecCNV.Template.GR)$start,
data.framerecCNV.Template.GR)$end, sep = '.'),
data.framerecCNV.Template.GR))
colnames(heat)[1] <- 'Key'
# Read in the individal / original sample CN segment values
CN.tumor.recCNV <- CN.tumor
colnames(CN.tumor.recCNV) <- c('Sample','Chr','Start','End',
'Num_Probes','Segment_Mean')
# Get a sample listing
samples <- unique(CN.tumor.recCNV$Sample)
# Each loop overlaps individual sample CN segment calls with the
# template cnvMatrix regions. If present, the CN segment value is added.
mylist <- list()
for (i in 1:length(samples)) {
GR <- makeGRangesFromDataFrame(
CN.tumor.recCNV[CN.tumor.recCNV$Sample==samples[i],],
keep.extra.columns = TRUE)
hits <- findOverlapsrecCNV.Template.GR, GR, type = "any")
mylist[[i]] <- data.frame(
recCNV.Template.GR[queryHits(hits),],
GR[subjectHits(hits),])
df.tmp <- mylist[[i]]
df.tmp$key <- paste(df.tmp$seqnames, df.tmp$start, df.tmp$end,
sep = '.')
df.tmp <- df.tmp[match(heat$Key, df.tmp$key),]
heat<- cbind(heat, df.tmp$Segment_Mean)
colnames(heat)[11+i] <- samples[i]
}
heat[is.na(heat)] <- 0
Then, I obtain heat with the columns (Key, seqnames, start, end, width, strand, aberration, q.value, df.tmp$Segment_Mean).
However, I'm not sure what you want to do with the command colnames(heat)[11+i] <- samples[i] (it seems that in my case, it would be colnames(heat)[10+i] but I'm not sure).
Again, thank you so much for taking the time to help !
@Kevin Blighe
Hi, I just wanted to let you know that I managed to get the heat matrix after all. Concerning that colnames command, I had only tried the loop for specific values of 'i', and therefore did not understand why you wanted to name the columns with samples.
Anyway, thank you very much for your code and explanation, you helped a lot !
@Django i try your code but an error is shown
Error in [<-(*tmp*, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, :
subscript out of bounds
Can you help me to fix it ?
Hello @Kevin Blighe can you plz tell me how to open this file "Tumor.All.txt.igv.gistic" and how to plot the heatmap using which package( ComlpexHeatmap package in your paper) (Can you plz provide the code for this step too).
@Kevin Blighe , Hi dear Kevin, how you generated these 2 heatmaps please? i was wondering if the code that you used for the genration is availaibe and if you can share with me? i wloud like to apply it on my data, if that possible?
@Kevin Blighe , Hi dear Kevin, thanks for your reply, that thread i already check it and it works well for me, but i mean the heatmap visualization how you make those beautiful clusters, As far i know you have used ComplexHeatmap package. For the Dendrograms you used Euclidean distance and Ward’s linkage. Could you please share these parts if that possibe, i would like to apply it for my data ?
Hi, Thanks for your help! I just downloaded the CMV data from TCGA database and mapped genomic regions to gene symbols using the following code:
Then I constructed the CNV matrix using the following code:
At last, I got the CNV matrix. Is there any error in this process? The final data is like this:
You just took the tumour samples, merged them together, and then summarised copy number per gene? There's nothing wrong with that; however, if you try to publish it, a reviewer would ask if these copy number events are somatic or germline. Without subtracting the normal copy number profiles, you cannot know whether or not these are true somatic events. It depends on what exactly you want to do, of course.
The last time that I checked, the open access (level 3) TCGA copy number data is output per sample and is available separately for tumour and normal at the Genomic Data Commons (GDC). This is the benefit of just taking the Broad FireBrowse data, i.e., because they have re-processed this level 3 data to produce somatic copy number events, available as files with names like this: 'genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19' ( http://firebrowse.org/?cohort=BRCA&download_dialog=true )
I have downloaded the masked CNV data filtered germline variation as input data and get the average of all probes per gene. If I just want to use CNV data for cluster analysis, is this right? Thank you for your sharing. Now I know the difference between variation and alteration. Thanks again!
Hey, it was not I who up-voted your comment. I am reading your comment for the first time right now. If that is the data that you have, then it seems okay!
@Kevin Blighe Hello Kevin, if i want to extract the genes from both tumor and normal samples That will make more sense instead or no
Actually, my data have "01A", "10A and "11A" and i also want to built the CNV matrix, I tried @Django code but an error is shown?
@Kevin Blighe Hello Kevin Could you plz answer my question?
You have not shown the error message, so, how can I advise further?
@Kevin Blighe Hi, Kevin sorry for that here is the piece of code i tried and the following error
Error shown here; Error in
[<-
(*tmp*
, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, : subscript out of boundsHey, thank you! Can you show the first few lines and columns of the object
cnv
? Just paste them here, highlight them, and then click the101 010
button@Kevin Blighe, I'm working on RGui(64-bit) and the objects are not shown here. I have posted the full code i applied, and i can share the data as well if y want ?
Here is the data
I am not exactly sure what you are aiming to do with this for loop. However, the way to debug them is, after they crash and produce the error, check the last value of
i
andj
, and then take a look at those indices intumor
andcnv
. Then you will be able to debug it.@Kevin Blighe hi Kevin, So, i have to change the values of i and j and then run the for loop again or what? I didn't get exactly what do you mean by taking a look and then debug it again?
Actually, i want to construct the CNV matrix (rows represent the samples and columns represent the annotated genes) .
Exactly, i want to integrate this data with others multi-omics datasets such as gene expression and DNA methylation for further analysis. All these datasets have such format except CNV data which i get stuck to reformat it now. That's why i used that for loop to get the matrix and maybe after that i can merge all of these datasets together. Is this for loop right or no to make such matrix?
Yes, but what is the value of
i
andj
after the error is produced?You may need to do the following:
i
), and output 'segment mean' where there is overlap with each segment (j
). Overlap can be determined by GenomicRanges@Kevin Blighe Hi Kevin, Thanks for yr quick reply! your helpful suggestions!
The last values of i and j are 1L.
I want my output like a matrix of samples rows and genes in the columns with their corresponding segment values.
I'm more familiar with MATLAB, i used the output file produced from your annotation code and i write a piece of code that can allow allocating this matrix but my code there doesn(t work too. I tried the code in R which i share with you yesterday and also no results!
@Kevin Blighe Hello. I'm currently working with that same goal (to create a dataframe containing samples in rows and genes in columns). I somehow managed to reproduce the code released in the first comment, though I'm not sure about the variables format. My cnv matrix is made up of ("Samples", "Chromosome", "Start", "End", "Num_probes", "Segment_mean", "GeneSymbol", "Chr", "Start" and "End"). I think I understood how to do the first and second point, though I'm not sure about the third one.
In this code, could you please specify the third column of cnv (after Samples and Genes) ? I'm not sure on which value you calculate the mean to produce the tumor matrix.
Thank you very much for your help :)
Hi Juliette, for this, I actually used GenomicRanges, which makes it a lot easier. I can try to 'dig out' the code if you need it.
The steps were:
Are you familiar with GenomicRanges?
@Kevin Blighe I discovered GenomicRanges with your code. I tried to recreate the code, I wrote this
Then, I was planning to do the for loop, that I copied in my previous comment.
Dropping your code would be very helpful, thank you !
Hey juliette, I took a look and here is what I did (below). So, I use the .igv.gistic file that is ultimately produced for the entire cohort being analysed, and overlap these regions, which are statistically significant, with the original copy number segment mean per sample. The matrix that's produced from this process, in the
heat
object, ultimately forms the heatmap.Let me know if you can follow this and adapt it to your own code.
@Kevin Blighe
Thank you very much for your answer. I tried to adapt it to what I already had. I used your pipeline to get RecCNV. Here is the code :
Then, I obtain heat with the columns (Key, seqnames, start, end, width, strand, aberration, q.value, df.tmp$Segment_Mean).
However, I'm not sure what you want to do with the command colnames(heat)[11+i] <- samples[i] (it seems that in my case, it would be colnames(heat)[10+i] but I'm not sure).
Again, thank you so much for taking the time to help !
Great. That
colnames()
command is not critical, so, probably best to use your own judgement about how to use it for your own data.@Kevin Blighe Hi, I just wanted to let you know that I managed to get the heat matrix after all. Concerning that colnames command, I had only tried the loop for specific values of 'i', and therefore did not understand why you wanted to name the columns with samples. Anyway, thank you very much for your code and explanation, you helped a lot !
That is great that you got it complete!
Hi @Kevin Blighe, Dear Kevin , is there any mistake in this command ? it doesnt work, with me
Thank you, i fix it
@Kevin Blighe Hi Kevin, i have checked as you said but i couldn't fix it This is the first few lines and columns of the object cnv
I already explain to you why i used this for Loop, hope you can help me plz!
@Django i try your code but an error is shown Error in
[<-
(*tmp*
, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, : subscript out of bounds Can you help me to fix it ?I think the Error code means there is no element in "sample.tmp". And which() will return nothing.
@Django Itried your code and i think it has some missing line before samples, where is the object samples? could y plz fix it
Hello @Kevin Blighe can you plz tell me how to open this file "Tumor.All.txt.igv.gistic" and how to plot the heatmap using which package( ComlpexHeatmap package in your paper) (Can you plz provide the code for this step too).
With my respect to you!
Hey bro, you can read back in the file like this:
This reads it into a GenomicRanges object.
Do you also have an object called
CN.tumor
in your R session?@Kevin Blighe yes it should be there bcz i just follow your whole process step by step and everything work smoothly but only with FFPE file.
Hi,is the object CN.tumor just the df? if not, df="Tumor.All.txt.igv.gistic"? thanks!
Hi! That object is created here: C: How to extract the list of genes from TCGA CNV data
@Kevin Blighe , Hi dear Kevin, how you generated these 2 heatmaps please? i was wondering if the code that you used for the genration is availaibe and if you can share with me? i wloud like to apply it on my data, if that possible?
Hey, you should read through this thread: C: About TCGA CNV data preprocessing
@Kevin Blighe , Hi dear Kevin, thanks for your reply, that thread i already check it and it works well for me, but i mean the heatmap visualization how you make those beautiful clusters, As far i know you have used ComplexHeatmap package. For the Dendrograms you used Euclidean distance and Ward’s linkage. Could you please share these parts if that possibe, i would like to apply it for my data ?
Hey, I have actually recently posted a complete tutorial here: https://github.com/kevinblighe/E-MTAB-6141
@ Kevin Blighe, Hi dear Kevin, Wow! Great job, I'm really proud of you, i will read the tutorial carefully, Thank you very very much!