How to extract the list of genes from TCGA CNV data
2
1
Entering edit mode
6.7 years ago
Chaimaa ▴ 260

Hi guys, I have got a CNV data from TCGA as shown below and my goal to extract the list of the genes by using GISTIC Could anyone plz told me how can i deal with this data by using GISTIC and how to apply GISTIC.

  • TCGA-2A-A8VL-10A-01D-A379-01 1 51598 1500664 226 0.1646
  • TCGA-2A-A8VL-10A-01D-A379-01 1 1617778 1653196 12 -0.4115
  • TCGA-2A-A8VL-10A-01D-A379-01 1 1653256 15362197 7748 0.0056
  • TCGA-2A-A8VL-10A-01D-A379-01 1 15362212 15362449 6 0.7626

https://postimg.cc/image/g4kusd56j/

TCGA CNV genes • 23k views
ADD COMMENT
0
Entering edit mode

Hi,

I want to check for the frequency of CNV (amplification or deletion) in the given cancer datasets from TCGA. please kindly help, suggest tool or method, Thanks is advance.

ADD REPLY
1
Entering edit mode

Maftools does that pretty easily

ADD REPLY
0
Entering edit mode

Hello, Thank you for your kind response. I want to check the frequency of particular deletion/amplification in no of patients.(like: how many patients have same gene deleted or amplified, exact no of patients). please suggest. thank you

ADD REPLY
1
Entering edit mode

Do you have GISTIC data? If you have that the exact number of altered samples is very simple by maftools If you have segment data, by annotating the genes in copy number range again you can say each specific gene has been altered in which samples

ADD REPLY
0
Entering edit mode

Hello A, I am trying to do CNV analysis. I have gistic 2 and segnment files. I have no idea how to do the analysis. I want to copmare cnv between two groups. Can you help me with this. Is this possible to share the r code for maftools? Thanks

ADD REPLY
0
Entering edit mode

If you can, please follow this: https://f1000research.com/articles/5-1542

...or my code, starting from here: C: How to extract the list of genes from TCGA CNV data

ADD REPLY
0
Entering edit mode

Thank you Kevin, I do not know how to customize these to my data. I have all primary tumor (not normal) and I want to do CNV to compare between females and males. can you help me with this please? Thanks

ADD REPLY
0
Entering edit mode

I am sorry. I am obviously quite limited in how I can help you from my current position. You will have to help me so that I help you, if you understand me.

ADD REPLY
0
Entering edit mode

Hi Kevin,

I download TCGA BRCA CNV data and ran your R codes in part A, B, and C and also generated results.all files, however, R showed

warning message: 
1: In if (tmp_list[[1]] == -1) { ... :
  the condition has length > 1 and only the first element will be used.

the results file looks correct, with 6 columns and 6862 rows, in the aberration kind, 1=gain and 0=loss of CNV?

thank you very much!!

Yuanchun Ding from California of USA

> #Run GAIA, which looks for recurrent aberrations in your input file
> n <- length(unique(cnvMatrix[,1]))
> cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
Loading Copy Number Data
................................................
Done
> results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing

Done

Computing Discontinuity Matrix
........................
Done
Computing Probability Distribution
................................................................................................
Done
Assessing the Significance of Observed Data
................................................
Done
Writing Tumor.All.txt.igv.gistic File for Integrative Genomics Viewer (IGV) Tool
................................................
Done
Running Homogeneous peel-off Algorithm With Significance Threshold of 0.15 and Homogenous Threshold of 0.12
................................................
Done

Writing Output File 'Tumor.All.txt' Containing the Significant Regions

File 'Tumor.All.txt' Saved

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In if (tmp_list[[1]] == -1) { ... :
  the condition has length > 1 and only the first element will be used
ADD REPLY
0
Entering edit mode

Hi Yuanchun, I am not sure about this warning message.

Yes, 0 is loss, and 1 is gain. These are encoded in Part C, here: C: How to extract the list of genes from TCGA CNV data

ADD REPLY
20
Entering edit mode
6.7 years ago

Credits to Tiago Silva; https://f1000research.com/articles/5-1542

Update April 5, 2020:

To be clear, this pipeline has little to do with GISTIC. Here, the segmented copy number data that is obtained from Broad Institute is processed via an R package, gaia, which performs the same function as GISTIC, i.e., given a cohort of samples, identifies recurrent copy number events.

Update October 14, 2018:

This thread has become a sort of focal point, so, I wanted to make it absolutely clear the process to follow. If your goal is to obtain somatic copy number alterations (sCNA) for a group of TCGA patients and/or identify recurrent sCNA in these patients, then follow these steps:

  • Part I - download segmented sCNA data for any TCGA cohort from Broad Institute's FireBrowse server and identify recurrent sCNA regions in these with gaia (R)
  • Part II - plot recurrent sCNA gains and losses from GAIA
  • Part III - annotate the recurrent sCNA regions (this post, just below)
  • Part IV - generate heatmap of recurrent sCNA regions over your cohort

----------------------

.

Part III

A

Create a reference dataset of all genes and save it as a GenomicRanges object

library(biomaRt)
library(GenomicRanges)

#Set up an gene annotation template to use
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
xidx <- which(genes[,2]=="X")
yidx <- which(genes[,2]=="Y")
genes[xidx, 2] <- 23
genes[yidx, 2] <- 24
genes[,2] <- sapply(genes[,2],as.integer)
genes <- genes[order(genes[,3]),]
genes <- genes[order(genes[,2]),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)

B

Store your own data as a GenomicRanges object:

NB - if following from Part II, the df used here is the RecCNV object, produced in Part II

colnames(df) <- c("chr", "start", "end", "extra1", "extra2")
df
chr    start      end extra1  extra2
  1    51598  1500664    226  0.1646
  1  1617778  1653196     12 -0.4115
  1  1653256 15362197   7748  0.0056

df_GR <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
df_GR
GRanges object with 4 ranges and 3 metadata columns:
      seqnames               ranges strand |                      
         <Rle>            <IRanges>  <Rle> | 
  [1]        1 [   51598,  1500664]      * | 
  [2]        1 [ 1617778,  1653196]      * | 
  [3]        1 [ 1653256, 15362197]      * | 
         extra1    extra2
      <integer> <numeric>
  [1]       226    0.1646
  [2]        12   -0.4115
  [3]      7748    0.0056
  -------

C

Overlap your regions with the reference dataset that you created

hits <- findOverlaps(genes_GR, df_GR, type="within")
df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
head(df_ann)
chr   start     end extra1 extra2 GeneSymbol Chr
  1   51598 1500664    226 0.1646     OR4G4P   1
  1   51598 1500664    226 0.1646    OR4G11P   1
  1   51598 1500664    226 0.1646      OR4F5   1
Start    End
52473  54936
62948  63887
69091  70008

D

To make filtering easier, we can further manipulate the data to show the precise co-ordinates of each gene and the region co-ordinates in which it was found to have CNA / CNV:

AberrantRegion <- paste0(df_ann[,1],":", df_ann[,3],"-", df_ann[,4])
GeneRegion <- paste0(df_ann[,7],":", df_ann[,8],"-", df_ann[,9])
Final_regions <- cbind(df_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Final_regions[seq(1, nrow(Final_regions), 20),]
extra2 chr extra1     AberrantRegion
0.1646   1    226    1:51598-1500664
0.1646   1    226    1:51598-1500664
0.1646   1    226    1:51598-1500664
0.1646   1    226    1:51598-1500664

       GeneRegion
   OR4G4P:1-52473
TUBB8P11:1-808672
FAM132A:1-1177826
TMEM240:1-1470554
ADD COMMENT
0
Entering edit mode

@Kevin Blighe Thank you too much for your reply but I'm not familiar with the language of your source code and I'm sure that I have to use GISTIC as mentioned in many papers that i have already read in order to extract a more reliable list of genes and their loci

ADD REPLY
0
Entering edit mode

if anyone familiar with GISTIC plz give me some guidance

ADD REPLY
1
Entering edit mode

Please explain the exact source from where you downloaded your data. There are various types of copy number data in various states of processing available from different web-sites. The data is usually available as somatic copy number alteration data (SCNA), not copy number variation (CNV) - CNVs are more spoken of in terms of the germline and 'natural' variation in copy number. GISTIC is just one of a few programs that are commonly used to analyse SCNA data for the purposes of 'summarising' the regions by grouping them into large regions more amenable for downstream analysis and interpretation.

Have you taken a look at the documentatation to see what exactly you require to execute the program? - ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

The code that I put above is code using base R functions (with the exception of biomaRt) that can be used to identify the genes overlapping any type of segment data (chr : startbp : endbp).

Thank you.

ADD REPLY
1
Entering edit mode

ADD REPLY
0
Entering edit mode

@Kevin Blighe Thanks a lot for your interest and quick reply.

ADD REPLY
0
Entering edit mode

@Kevin Blighe, I checked your paper and thread through the section" Somatic mutation and copy number aberrations' in the methods"

I got The CNV data level3 from this website:http://gdac.broadinstitute.org and specifically from this link http://firebrowse.org/?cohort=PRAD&download_dialog=true.

I'm interested in getting the list of the significant genes from this data by using GISTIC.

As i know from many papers Gistic2.0 can be used to analyze the copy number dataset (Level 3) for the identification of recurrent regions of copy number alteration and the copy number of genes, and this my goal too.

So the data i got have no gene names.

Additionally, i found that the link from where i got my data have GISTIC results http://firebrowse.org/?cohort=PRAD&download_dialog=true# but i really don't know which file i will download and how to deal with them.

Finally, my main goal is to get a matrix of m*n in which m represent the list of patients(Samples) and n represent the list of genes from this dataset.

@Kevin Blighe Plz, i hope you can help me out to do this job, and i really really appreciate your help!

ADD REPLY
1
Entering edit mode

No problem.

Yes, recurrent somatic copy number alterations (recSCNA) are the target (for analysis), and the genes that overlap these. I am almost certain that I also started with the data from Broad Firebrowse, but I will double check when I arrive home.

I will go through the analysis steps one-by-one for you.

ADD REPLY
0
Entering edit mode

Sorry @Kevin, my question is silly but how I could install GISTIC?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

@ Kevin Blighe how to save these final regions into a file?

ADD REPLY
1
Entering edit mode

Hey, you can do something like:

write.table(Final_regions, "FinalRegions.csv", sep=",", quote=FALSE, row.names=FALSE)
ADD REPLY
0
Entering edit mode

So in this case is the df used in part A-D have only Tumor copy number data or normal as well? Thank you very much for your many great posts!

ADD REPLY
0
Entering edit mode

Hey, you're welcome. This post (above) is purely about annotating any type of segment data (requires minimum of chr start end). In the context of the related posts, it is recurrent somatic copy number alteration (recurrent sCNA) segments that have had the normal subtracted out. The pipeline is a mess but goes in this order:

ADD REPLY
0
Entering edit mode

I have completed the first part of the pipeline and produced tumor.all.igv.gistic. However this file does not contain individual data rather it gives me a measure of the significance of the copy number alteration for regions of the genome. To produce a FinalRegions.csv file that matches yours should I be using the cn.tumor dataframe (as shown in Part I) as the dataframe I submit in the code "df_GR <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)"? Could you also tell me how you restructured your CNV data from the structure of df_ann or GenomicRegions to a matrix that can be read into the ComplexHeatmap package which requires barcode as columns and genomic regions as rows? Thanks very much for your help with this.

ADD REPLY
0
Entering edit mode

I see. The df in the code above is the same object as RecCNV from Part II. Do you have that object? I will update the code above to reflect this.

The files that are output by the runGAIA function are the ones that I use in Part IV to produce the heatmaps, but I have not put the code on Biostars (it is complex...).

ADD REPLY
0
Entering edit mode

I understand. I will save RecCNV to file and then run the code producing FinalRegions. Will the data structure of FinalRegions be barcodes as columns and genes as rows? If not, do you recommend using the reshape package to produce such a matrix? Thanks very much for your help with this.

ADD REPLY
1
Entering edit mode

In my project code, the output is like this:

GeneSymbol  Aberration  q-value AberrantRegion  GeneRegion
RNU6-904P   Del 0.0280807421052632  2:141915235-142013612   2:141924826-141924926
PRR14   Amp 0.0684173368421053  16:30618633-30986671    16:30662038-30667761
FBRS    Amp 0.0684173368421053  16:30618633-30986671    16:30669752-30682135
RNU6-416P   Amp 0.0684173368421053  16:30618633-30986671    16:30686621-30686723
SRCAP   Amp 0.0684173368421053  16:30618633-30986671    16:30709530-30755602
RNU6-1043P  Amp 0.0684173368421053  16:30618633-30986671    16:30712658-30712756
SNORA30 Amp 0.0684173368421053  16:30618633-30986671    16:30721858-30721986
PHKG2   Amp 0.0684173368421053  16:30618633-30986671    16:30759591-30772490

My project code is different from the above, slightly, because I originally gave the above answer as an independent answer. That was my worry about linking up these steps on Biostars, i.e., they were each independent answers and don't 'harmoniously' link up together. Even a moderately skilled R user should be able to get the exact data that they want, though.

ADD REPLY
0
Entering edit mode

Sorry Kevin

I know my question is too non-sense

I was given a list of files (one for each patient), for example like this

LP2000104-DNA_A01_vs_LP2000101-DNA_A01.copynumber.caveman.csv

Inside them I have

Chromosome  Start   End Total_CN    Minor_CN
1   10583   1017587 3   1
1   1018144 2466425 5   2
1   2475113 7770901 3   1

Likely I should find genes under positive selection; I looked at GISTIC documentation but what I have does not look like what GISTIC needs at all.

What do you think please? I mean you think I am at which part of CNV analysis by these files?

Thank you for any help

ADD REPLY
0
Entering edit mode

From where did you obtain the files?

ADD REPLY
0
Entering edit mode

From our collaborator in Cambridge

ADD REPLY
0
Entering edit mode

Cambridge, England or Cambridge, Massachusetts, USA? In either case, this is not quite what I was looking for... I was implying about the ultimate source of the files, i.e., how they were produced. You should ask the collaborator. The data seems to be non-standard format, but could be implemented in the CNV process that I use in this thread.

ADD REPLY
0
Entering edit mode

Cambridge in England. They processed their own data and sent us our requested samples. They published their results here. The question behind my files is different with what they have done in paper.

The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic
ADD REPLY
0
Entering edit mode

Sorry,

My files look the same as the input OP provided screen shot of that only my last column is minor _CN differs with OP's file

ADD REPLY
0
Entering edit mode

@Kevin

When I try to keep my sample names with my seg file like below

sample  chr start   end marker_num  seg_mean
s1  1   136962  534547  11  1.311879592
s23 1   562020  672823  16  -0.010048503
s3  1   672948  2692196 1310    0.574913998

Your tutorial finally gives like this some mess

> head(Final_regions[1:2,1:5])
            seg chr marker                  AberrantRegion
9185 -0.4132345   1   3526 s1:10583-5726928
5591  0.1215680   1   3525 s2:10583-5725898
          GeneRegion
9185 DDX11L1:1-11869
5591 DDX11L1:1-11869
>

How I can keep sample name from the scratch without disturbing your code?

ADD REPLY
1
Entering edit mode

You should use my code as a guide to help you through this analysis, but please make modifications where you see that they are required.

ADD REPLY
0
Entering edit mode

@Kevin Blighe, Hi, kevin i have a question please? if after we determine the significant regions and then their annotated genes (from GeneRegion object) , could we conduct Cross validation on these genes, I wonder how we can do it? Thanks in advance

ADD REPLY
1
Entering edit mode

Hi Chaimaa, you could try a 'sensitivity analysis', that is, remove one or more samples, and then repeat everything to see how it changes the results. You can do this multiple times with different samples. If the results change too much, then the analysis is not 'robust' to sample exclusion.

ADD REPLY
1
Entering edit mode

@ Kevin Blighe, Hi Kevin, thank you too much great idea i will do that, im glad to know a bionformatics expert like you , your comments realy really always help me, Once again Thank you !

ADD REPLY
0
Entering edit mode

Hi Kevin, do you know if gaia package is still available? I'm trying to do the CNV analysis following your detailed steps, but I couldn't even get the package installed.....

it gave me warning message: " Warning message: package ‘gaia’ is not available for Bioconductor version '3.16' "

my Rstudio has Bioconductor version 3.16 (BiocManager 1.30.19), R 4.2.0 (2022-04-22)

ADD REPLY
0
Entering edit mode

Unfortunately it seems that gaia has been deprecated. You could still try to install a previous version of it, or set up a new conda environment with an older version of R that dates to when gaia was still in use. https://bioconductor.org/packages/release/bioc/html/gaia.html

ADD REPLY
1
Entering edit mode

That's unfortunate..... thank you for the information!

ADD REPLY
7
Entering edit mode
6.7 years ago

Credits to Tiago Silva; https://f1000research.com/articles/5-1542

Hello, so, here are the steps that I did, for endometrial cancer:

First downloaded copy number data from here: http://firebrowse.org/?cohort=UCEC# (access the data by clicking on the green bar for SNP6 CopyNum). Specifically, the file that you should obtain is for hg19 and will have 'scna' and 'minus_germline_cnv' as parts of its filename.

In my case, the file was called UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt

Then, there are a series of steps that you should do in order to process this data:

Part I

A, separate out the tumour from nomals

require(data.table)
CN <- read.table("UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)

#   Tumor types range from 01 - 09
#   Normal types from 10 - 19
#   Control samples from 20 - 29
types <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN[,1])))
unique(types)

# NB - in the next 2 lines, it is important to account for all types that are listed from unique(types) command
matched.normal <- c("10A", "10B", "11A", "11B")
tumor <- c("01A", "01B", "06A")

#Divide up the data into tumor and normal
tumor.indices <- which(types %in% tumor)
matched.normal.indices <- which(types %in% matched.normal)
CN.tumor <- CN[tumor.indices,]
CN.matched.normal <- CN[matched.normal.indices,]

#Change IDs to allow for matching to metadata
CN.tumor[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.tumor[,1])
CN.matched.normal[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.matched.normal[,1])

--------------------

B, create a markers object

Uses same markers file as GISTIC would use, but, here, we will eventually use gaia, not GISTIC, to identify recurrent sCNA.

require(gaia)

# Retrieve probes meta file from Broad Institute
gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/"
markersMatrix <- read.delim(paste0(gdac.root,"genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt"), as.is=TRUE, header=FALSE)
colnames(markersMatrix) <- c("Probe.Name", "Chromosome", "Start")

# Change sex chr names
unique(markersMatrix$Chromosome)
markersMatrix[which(markersMatrix$Chromosome=="X"),"Chromosome"] <- 23
markersMatrix[which(markersMatrix$Chromosome=="Y"),"Chromosome"] <- 24
markersMatrix$Chromosome <- sapply(markersMatrix$Chromosome, as.integer)

# Create a marker ID
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))

#Remove duplicates
markersMatrix <- markersMatrix[-which(duplicated(markerID)),]

# Filter markersMatrix for common CNV
# The 'common CNV' is derived from a Broad Institute panel of normals
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))
commonCNV <- read.delim(paste0(gdac.root,"CNV.hg19.bypos.111213.txt"), as.is=TRUE)
commonCNV[,2] <- sapply(commonCNV[,2], as.integer)
commonCNV[,3] <- sapply(commonCNV[,3], as.integer)
commonID <- apply(commonCNV, 1, function(x) paste0(x[2], ":", x[3]))
table(commonID %in% markerID)
table(markerID %in% commonID)
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]

# Create the markers object
markers_obj <- load_markers(markersMatrix_fil)

-------------------------------

C, determine recurrent aberrations in tumour

#Prepare CNV matrix
cnvMatrix <- CN.tumor

#Add label (0 for loss, 1 for gain)
#A segment mean of 0.3 is defined as the cut-off
cnvMatrix <- cbind(cnvMatrix, Label=NA)
cnvMatrix[cnvMatrix$Segment_Mean < -0.3,"Label"] <- 0
cnvMatrix[cnvMatrix$Segment_Mean > 0.3,"Label"] <- 1
cnvMatrix <- cnvMatrix[!is.na(cnvMatrix$Label),]

#Remove segment mean as we now go by the binary classification of gain or loss
cnvMatrix <- cnvMatrix[,-6]
colnames(cnvMatrix) <- c("Sample.Name", "Chromosome", "Start", "End", "Num.of.Markers", "Aberration")

#Substitute Chromosomes "X" and "Y" with "23" and "24"
xidx <- which(cnvMatrix$Chromosome=="X")
yidx <- which(cnvMatrix$Chromosome=="Y")
cnvMatrix[xidx,"Chromosome"] <- 23
cnvMatrix[yidx,"Chromosome"] <- 24
cnvMatrix$Chromosome <- sapply(cnvMatrix$Chromosome, as.integer)

#Run GAIA, which looks for recurrent aberrations in your input file
n <- length(unique(cnvMatrix[,1]))
cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)
ADD COMMENT
0
Entering edit mode

@ Kevin Blighe Hello, can you help me again, I have installed R and I'm running your code line by line. I arrived in the line

#Create the markers object
markers_obj <- load_markers(markersMatrix_fil) 
Error in load_markers(markersMatrix_fil) : 
  could not find function "load_markers", plz fix this line to continue the other lines plz
ADD REPLY
2
Entering edit mode

Hey dude, ensure that the gaia package loaded correctly. load_markers() is part of that package.

ADD REPLY
0
Entering edit mode

Do you mean the line require (gaia)? I 'm following your code line by line and now i fail to continue...

ADD REPLY
1
Entering edit mode

That's indeed what Kevin means.

ADD REPLY
1
Entering edit mode

Yep, ensure that gaia installed correctly (source("https://bioconductor.org/biocLite.R"); biocLite("gaia"))

load_markers
Erro: objeto 'load_markers' não encontrado

Load package and then try again:

require(gaia)
Carregando pacotes exigidos: gaia
Warning message:
package ‘gaia’ was built under R version 3.2.5 


load_markers

function (marker_matrix) 
{
    message("Loading Marker Informations")
    chromosomes <- sort(unique(marker_matrix[, 2]))
    chromosome_marker_list <- list()
    end_position <- FALSE
    if (ncol(marker_matrix) == 4) {
        end_position <- TRUE
    }
    for (i in as.numeric(chromosomes)) {
        chr_ids <- which(marker_matrix[, 2] == i)
        tmp_matrix <- matrix(0, 2, length(chr_ids))
        tmp_matrix[1, ] <- marker_matrix[chr_ids, 3]
        if (end_position) {
            tmp_matrix[2, ] <- marker_matrix[chr_ids, 4]
        }
        else {
            tmp_matrix[2, ] <- tmp_matrix[1, ]
        }
        chromosome_marker_list[[i]] <- tmp_matrix
        names(chromosome_marker_list)[[i]] <- i
        message(".", appendLF = FALSE)
    }
    message("\nDone")
    return(chromosome_marker_list)
}
<environment: namespace:gaia>
ADD REPLY
0
Entering edit mode

Kevin Blighe Thanks, i have installed all the packages successfully. Now, i have the following 2 issues in these corresponding lines

markers_obj <- load_markers(markersMatrix_fil)

Loading Marker Informations ......................Error in chromosome_marker_list[[i]] <- tmp_matrix :

attempt to select more than one element in integerOneIndex

In addition: Warning message:

In load_markers(markersMatrix_fil) : NAs introduced by coercion

colnames(df) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")

Error in colnames<-(*tmp*, value = c("Barcode", "chr", "start", "end", :

attempt to set 'colnames' on an object with less than two dimensions

ADD REPLY
0
Entering edit mode

Hey Chief, let me take a look later.

ADD REPLY
0
Entering edit mode

@ Kevin Blighe, sure, take your time and thanks a lot for your reply.

ADD REPLY
1
Entering edit mode

Just to be sure, you are running these commands first, starting with the data that you download from Broad FireBrowse:

Then, you run the steps that I originally mentioned:

It is very important that the load_markers function does not give any warnings. A successful execution of this function will produce:

markers_obj <- load_markers(markersMatrix_fil)
Loading Marker Informations
........................
Done

Sometimes, available memory is an issue.

ADD REPLY
0
Entering edit mode

@Kevin Blighe Yes I'm following your steps one by one and by the correct order.

ADD REPLY
1
Entering edit mode

Which OS? How much RAM?

Can you definitely 100% confirm that each of these steps has completed successfully:

require(gaia)

#Retrieve probes meta file from broadinstitute website
gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/"
markersMatrix <- read.delim(paste0(gdac.root,"genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt"), as.is=TRUE, header=FALSE)
colnames(markersMatrix) <- c("Probe.Name", "Chromosome", "Start")

#Change sex chromosome names
unique(markersMatrix$Chromosome)
markersMatrix[which(markersMatrix$Chromosome=="X"),"Chromosome"] <- 23
markersMatrix[which(markersMatrix$Chromosome=="Y"),"Chromosome"] <- 24
markersMatrix$Chromosome <- sapply(markersMatrix$Chromosome, as.integer)

#Create a marker ID
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))

#Remove duplicates
markersMatrix <- markersMatrix[-which(duplicated(markerID)),]

#Filter markersMatrix for common CNV
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))
commonCNV <- read.delim(paste0(gdac.root,"CNV.hg19.bypos.111213.txt"), as.is=TRUE)
commonCNV[,2] <- sapply(commonCNV[,2], as.integer)
commonCNV[,3] <- sapply(commonCNV[,3], as.integer)
commonID <- apply(commonCNV, 1, function(x) paste0(x[2], ":", x[3]))
table(commonID %in% markerID)
table(markerID %in% commonID)
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]

#Create the markers object
markers_obj <- load_markers(markersMatrix_fil)
ADD REPLY
0
Entering edit mode

@ Kevin Blighe I do every step and run it carefully, i can see the output variables in the Global environment of R right corner.

Can i share my data with you and you can test your code on it?

My OS is windows10 and RAM is 12GB and my data size is 6.87 MB.

ADD REPLY
0
Entering edit mode

@ Kevin Blighe, i have fixed the first error. Now, these 2 parts are not working

    cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
Loading Copy Number Data
Error in matrix(0, length(samples), ncol(markers_list[[chromosomes[i]]])) : 
  non-numeric matrix extent



results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing
Error in runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt",  : 
  object 'cnv_obj' not found

    colnames(df) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
Error in `colnames<-`(`*tmp*`, value = c("Barcode", "chr", "start", "end",  : 
  attempt to set 'colnames' on an object with less than two dimensions
ADD REPLY
1
Entering edit mode

There must be something wrong with your cnvMatrix. In my example, I us the endometrial cancer (UCEC) data. Which cancer's data do you have? Just check the output of each command in Steps 1 and 3 in order to ensure that it works.

If you want to tell me the file that you downloaded, then I can also take a look here.

Peace.

ADD REPLY
0
Entering edit mode

@ Kevin Blighe ok the cancer Data which i'm working on is the prostate adenocarcinoma (PRAD).

ADD REPLY
0
Entering edit mode

@ Kevin Blighe, Hi i know that i disturb you a lot I apologize for that and i really really appreciate your help! Did y check your code with my data, i really need to do this job so pls give help me out.

ADD REPLY
1
Entering edit mode

Hey chief - no problem. I went here: http://firebrowse.org/?cohort=PRAD&download_dialog=true

I then downloaded the file: 'genome_wide_snp_6-FFPE-segmented_scna_minus_germline_cnv_hg19' and then used that as the data input.

I then followed all of my code and it worked fine. The only thing that I notice is that there is no copy number data for normal samples in the PRAD dataset, but this is not important because the germline copy number is already subtracted by Broad Firebrowse.

cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
Loading Copy Number Data
................................................
Done

Then:

results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing

Done

Computing Discontinuity Matrix
........................
Done
Computing Probability Distribution
................................................................................................
Done
Assessing the Significance of Observed Data
................................................
Done
Writing Tumor.All.txt.igv.gistic File for Integrative Genomics Viewer (IGV) Tool
................................................
Done
Running Homogeneous peel-off Algorithm With Significance Threshold of 0.15 and Homogenous Threshold of 0.12
................................................
Done

Writing Output File 'Tumor.All.txt' Containing the Significant Regions

File 'Tumor.All.txt' Saved

The problem that you have may be an issue with a new version of one of the packages. Dude, to help, I have put the output files and my R session online for you. You can download them (3 files) here: https://app.box.com/s/992llscqeabyw3rzrjg5stxd6945r73r (you may have to open an account).

The Tumor.All.txt file contains the significant recurrent somatic copy number alterations (recSCNA). You will want to input these back into R and then annotate them using the first steps that I mention in this thread. You will have to rename the columns to have at least "chr", "start", "end"

ADD REPLY
0
Entering edit mode

@Kevin Blighe well done Bro!

But when i only downloaded the FFPE file like you and so my question is why do you use FFPE and not "genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19"? Can you plz answer my last question and i really really appreciated your help.

BEST WISHES.

ADD REPLY
1
Entering edit mode

Hey,

I just chose a file 'at random'. I was only testing. You should use the file more suitable to your experiment.

FFPE tissue is, of course, lower quality.

ADD REPLY
1
Entering edit mode

Wait. Allow me to re-run that using the non-FFPE

ADD REPLY
0
Entering edit mode

OK i can wait .

ADD REPLY
1
Entering edit mode

Here you go, chief (same files but for PRAD.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt): https://app.box.com/s/992llscqeabyw3rzrjg5stxd6945r73r

Dude, the PRAD dataset is very big, so, the Rdata file is ~140 megabytes

ADD REPLY
0
Entering edit mode

@Kevin Blighe i'm very thankful and happy for the great help you gave me!

Plz can you also do the annotation Step for me I tried in my computer but it show me this error

    df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
Error: cannot allocate vector of size 124.9 Mb
ADD REPLY
0
Entering edit mode

I have a size issue actually.

ADD REPLY
1
Entering edit mode

Getting it to you now - hang on.

ADD REPLY
1
Entering edit mode

I have uploaded the new Rdata file, and also a list of the final annotated regions. In the 'type' column, the following is true:

  • deletion=0
  • amplification=1

https://app.box.com/s/992llscqeabyw3rzrjg5stxd6945r73r

And that is it: recurrent somatic copy number alterations (recSCNA) in the TCGA PRAD dataset. If you want to cite how the data was processed, then just cite my recent publication: https://www.ncbi.nlm.nih.gov/pubmed/29682207

ADD REPLY
0
Entering edit mode

@ Kevin Blighe ,Thanks a lot. Yes definitely i will cite your recent publication in my manuscript and Plz if you have any other publications share with me via e-mail or link. Your topic is of great interest for me . go ahead......

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hi,
Finally, I got a server and start installing the packages but I failed to install "Gaia" package

**package ‘gaia’ is not available (for R version 3.2.3)**
ADD REPLY
1
Entering edit mode

Perhaps open a new question, and show that it has a definitive relationship to bioinformatics.

ADD REPLY
0
Entering edit mode

Ok, Kevin Blighe, but I think this problem has no relation to bioinformatics!

  source("https://www.bioconductor.org/biocLite.R")
    Error in file(filename, "r", encoding = encoding) :
      cannot open the connection
    In addition: Warning message:
    In file(filename, "r", encoding = encoding) :
      unable to connect to 'bioconductor.org' on port 80

. The probem seems related to the server may be

ADD REPLY
0
Entering edit mode

Try: source("https://bioconductor.org/biocLite.R") (without www)

If that still does not work, then try: source("http://bioconductor.org/biocLite.R") (without http)

If that does not work, then you can post on Stack Exchange

ADD REPLY
1
Entering edit mode

@Kevin Blighe I tried with http and with https and nothing work!
I will try in Stack Exchange thanks always for your great help Kevin

ADD REPLY
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. It has been done for you this time.
code_formatting

ADD REPLY
0
Entering edit mode

@Kevin Blighe Hi, Kevin, Is there any possibility to output The Tumor.All.txt file with the samples too, means the significant recurrent somatic copy number alterations (recSCNA) with their corresponding samples from the original input file or Barcode name.

Then, I can annotate this file?

Plz, reply me?

ADD REPLY
1
Entering edit mode

Hi, it would be dishonest and mal-practice for me to continually help you to that extent, and that would defeat the purpose of this website. You will have to find a way using your local resources, preferably in conjunction with your supervisor.

ADD REPLY
0
Entering edit mode

Sorry Kevin

I have had called copy number variations by ascatNgs and tried to find significant aberrations by GISTIC.2.0 and this lines of my segmentation file

Sample  Chr Start   End Num_Prob    Segment_Mean
t_005   1   13116   2063094 996 -0.722466024
t_005   1   2065339 5926507 2675    -10.68825031
t_005   1   5927738 28678709    14307   -0.722466024
t_005   1   28681947    28707612    19  -10.68825031
t_005   1   28708846    29440873    425 -0.722466024
t_005   1   29442291    31262499    1196    -10.68825031
t_005   1   31263510    121484236   57505   -0.722466024

where I did not find any genes significantly amplified or deleted. My boss now has asked me to give him a list of amplified and deleted regions (and/or genes?) in each sample but I don't know how to do that; Can you please give me a clue by having such a such and where GISTIC not giving any recurrent amplifications/deletions, what should I do? Shall I still go through your tutorial to annotate this data to gene name?

Thank you for any help

ADD REPLY
1
Entering edit mode

Yes, you could annotate these regions with genes. You could also just define a cut-off for Segment_Mean for amplification and deletion. For example, -10 is surely a deletion.

ADD REPLY
0
Entering edit mode

Thank you so much for your time and paying attention

For one sample I obtained this

> head(Final_regions)
    GeneSymbol   start    extra2 AberrantRegion        GeneRegion
1      C1orf86 2065339 -10.68825 1:5926507-2675 1:2115903-2144159
1.1        SKI 2065339 -10.68825 1:5926507-2675 1:2160134-2241558
1.2      MORN1 2065339 -10.68825 1:5926507-2675 1:2252692-2323146
1.3       RER1 2065339 -10.68825 1:5926507-2675 1:2323267-2336883
1.4      PEX10 2065339 -10.68825 1:5926507-2675 1:2336236-2345236
1.5      PLCH2 2065339 -10.68825 1:5926507-2675 1:2357419-2436969
> 


> tail(Final_regions)
        GeneSymbol    start    extra2    AberrantRegion
193.712       ARSA 18194188 -10.68825 22:51240820-21994
193.713     SHANK3 18194188 -10.68825 22:51240820-21994
193.714  RNU6-409P 18194188 -10.68825 22:51240820-21994
193.715        ACR 18194188 -10.68825 22:51240820-21994
193.716  RPL23AP82 18194188 -10.68825 22:51240820-21994
193.717     RABL2B 18194188 -10.68825 22:51240820-21994
                  GeneRegion
193.712 22:51061182-51066607
193.713 22:51112843-51171726
193.714 22:51129688-51129791
193.715 22:51176624-51183762
193.716 22:51195376-51239737
193.717 22:51205929-51222091
> 
> 


> dim(Final_regions)
[1] 31500     5

In this data frame only 90 genes have positive (+) sign for extra2 column (segment mean) and rest have negative(-) sign for extra2 column; Does that mean all these genes with negative(-) sign for extra2 column all have been deleted? Such a huge difference is normal?

ADD REPLY
0
Entering edit mode

I do not know how you have produced that output. Generally, though, yes, the negative means lower copy number in tumour when compared to normal. It follows that positive means higher copy number in tumour when compared to normal.

ADD REPLY
0
Entering edit mode

Thank you, I followed your tutorial for annotation; About output, you mean you are not sure how I have called copy number variations?

ADD REPLY
1
Entering edit mode

Yes, but, now it is okay because you have told me that you have followed my tutorial. In fact, if you go to Part 1C ( C: How to extract the list of genes from TCGA CNV data ), you will see that we define the following:

  • Segment mean < -0.3 = loss
  • Segment mean > 0.3 = gain
ADD REPLY
0
Entering edit mode

sorry @Kevin

I only have followed your tutorial for annotation here https://www.biostars.org/p/311199/#311444; I mean I have used my own segment file here

https://www.dropbox.com/s/amhlvjbk6mx91tj/segmentationfile.txt?dl=0

I followed your tutorial on my own segment file to identify recurrent sCNA regions but I did not find anything as already GISTIC also did not give me any significant sCNA.

So, my boss asked me a list of amplified or deleted genes (even not significant), I used your tutorial for getting such a list by annotating segment file

ADD REPLY
0
Entering edit mode

Sir,

as you mentioned pre-computed GISTIC 2.0 data for "UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt" this file. I am not understanding whether this is input or output of GISTIC 2.0. Is this file compiled version of tcga cnv data? can i use this file for gistic analysis?

Thank you

ADD REPLY
1
Entering edit mode

I do not believe the data in UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt is the output of GISTIC; however, it can be used as the input to GISTIC.

The data in UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt is likely produced from Circular Binary Segmentation (CBS).

ADD REPLY
0
Entering edit mode

Sorry @Kevin Blighe

Please, can you help me with a confusion?

In whole genome sequencing, if

Total copy number (major + minor allele copy number by SCAT R package) for SMAD4 in a sample = 4

Th estimated ploidy by SCAT for this sample = 4.043696

So, the segment mean (required for GISTIC) for SMAD4 for this given position of genome in the given sample = log2(4/4.043696) = -0.01567461

We can consider this as a loss of copy number for SMAD4

But with GISTIC threshold -td = 0.1 this is neutral copy number change (if I am not wrong)

So, If I want to ask GISTIC to give me

amplification: copy number ≥2 × average ploidy​
gain:  copy number >1.25 × average ploidy
loss: copy number <0.75 ×average ploidy
deletion: copy number 0

If I put -ta = 1.25 and -td = 0.75 , do I achieve the above?

I am seriously struggling with this intuition of relationship between segment mean and GISTIC copy number threshold

Thank you for any help

ADD REPLY
0
Entering edit mode

@Kevin Blighe, Hi dear Kevin, hope you are doing well?, please i have a small confused question about the qvalue threshold=0.15 You set this value to look for recurrent aberration, plotting and annotation, if we set this value to 0.05 is it possibe or too small to find regions in this case?, we can consider the recurrent abberrations and the genes annotated from them as significant (known to be 0.05) genes ? could you please explain this point for me if its possibe please?

ADD REPLY
0
Entering edit mode

@Kevin Blighe, Dear Kevin could you please answer my question if thats convenient for you?

ADD REPLY
1
Entering edit mode

Hey Chaimaa. You can use qvalue = 0.05, if you wish. It is set to 0.15, initially, in order to retain more regions.

If you have identified many regions at a threshold of 0.05, then that is good.

ADD REPLY
0
Entering edit mode

@Kevin Blighe Thank you so muck Kevin, as i remember i have tried before a number less than 0.1 and get no regions so the qvalue is much depended on the signficant regions and not genes right? how we can say that the annotated genes at 0.15 are signficant?, this is the last qestion please and sorry to bother you

ADD REPLY
1
Entering edit mode

We can still say, for example, 'we identified XYZ statistically significant regions at a threshold of qvalue 0.15'. Then, you are being 100% open / honest about the threshold [for statistical significance] that is being used

ADD REPLY
0
Entering edit mode

@Kevin Blighe that is the genes on that regions are signficant too? Thank you Kevin for your clarification

ADD REPLY

Login before adding your answer.

Traffic: 4234 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6