Entering edit mode
6.3 years ago
nazaninhoseinkhan
▴
530
Dear all,
I am trying to perform CNV analysis on TCGA data. I have around 1000 CNV files and I need to map the coordination to the gene names. The coordination of files differ from each other.
Is there any annotation system to do this automatically? Something like BiomaRt that we use for mapping Ensembl Ids to gene symbols?
I am looking forward to your comments
Nazanin Hosseinkhan
It is unclear how the data you have looks like, please share an example.
VCF or bedpe files, perhaps?
Here is the header of one of my files:
I guess you could convert this in a bed file and take a look at bedtools and annotate the intervals using a bed file off all exons/genes.
To annotate, just use GenomicRanges, as I do here: C: How to extract the list of genes from TCGA CNV data
However, it looks like you have downloaded the CN data separately for each patient? - you have "1000" CNV files.
You can download pre-computed GISTIC 2.0 scores from here (assuming that you're looking at breast cancer): http://firebrowse.org/?cohort=BRCA#
Thank u. I want to compare copy number variation between different cancer stages. Can I separate samples from different stages in this file and perform analysis as u suggested to me for each individual stages?
Yes, I think that sample information is retained in the GISTIC file. You could then separate it into different stages and identify the recurrent copy number alterations separately for each stage. That would be interesting.
The hypothesis, I guess, would be that higher frequency CN alterations correlate with higher tumour stage.
Thank u so much.
Wishing every thing goes well with you
خواهش مي كنم
wow, you know Farsi? It is great!
No, just an interest in all cultures, and I have many Iranian, Iraqi, Saudi Arabian friends.
I guessed you may have some Iranian colleagues. The true spelling of خواهش می کنم was so interesting for me:-)
Hi Kevin, In the last step of running the code I get this error message: "> results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)
Performing Data Preprocessing Error in if (length(aberrations) > 0 && aberrations[1] == 0) { : missing value where TRUE/FALSE needed In addition: Warning message: In runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt", : NAs introduced by coercion"
I tried to remove NAs from cnv_obj and markers_obj, but nothing changed. Can u guide me how to solve this problem?
Maybe try to increase the adjusted P value threshold from
threshold=0.15
tothreshold=0.2
?What is the output of
head(cnv_obj)
?I repeated the last command with adjusted p-value threshold 0.2,0.25,0.3 .... 1. But I got the same error message.
I could not get the head(cnv_obj) completely because it seems so huge.
Even though I separated different stages, but the whole process is so slow on my laptop. However the only error message that I received is that I already sent you
That command requires a lot of memory. For the TCGA breast cancer data (~1000 samples), it neither runs on my laptop with 16GB RAM. Do you have access to a supercompute cluster?
I will try to rerun this command on the server next week. However that server also does not have enough memory I guess.
I tried to increase R memory using "memory.limit(size=100000), however I got the same error message.
Hi Kevin, The following is the results of "Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)":
Is this what I have to get?
I am a little confused, because in " C: How to extract the list of genes from TCGA CNV data", you suggested two different codes. I got the above results from the first code.
By the way, when I got the head of "cnv_obj" from the second code, it was empty, just had the colunm label from X1 to X16384.
Okay, but did you solve the problem with memory / RAM? Also, can you confirm from where you obtained this data? - the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/ )?
If you obtained the data direct from the GDC Data Portal, then that explains why you have ~1000 files, for, I assume, the breast cancer (BRCA) dataset. These files would be copy number regions identified in each sample. From these, the task would be:
The other question thread became somewhat confusing.
Even though you have downloaded your ~1000 files, you do not need them. You can skip the first step that I mention (above) and just download the pre-computed SCNA data for BRCA by going here: http://firebrowse.org/?cohort=BRCA# (go to the SNP6 CopyNum and download the file genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19*). Then, start from Step 2.
I downloaded the thyroid cancer cnv data (scna minus germline) from FireBrowse as you suggested to me. I categorized the original file in to four stages according to the clinical data provided by FireBrowse.
Then I started to run this code from your comments on the R.3.5.0 installed on our Server:
. . . . . xidx <- which(cnvMatrix$Chromosome=="X") yidx <- which(cnvMatrix$Chromosome=="Y") cnvMatrix[xidx,"Chromosome"] <- 23 cnvMatrix[yidx,"Chromosome"] <- 24 cnvMatrix$Chromosome <- sapply(cnvMatrix$Chromosome, as.integer) n <- length(unique(cnvMatrix[,1])) cnv_obj <- load_cnv(cnvMatrix, markers_obj, n) results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.2)
However in the last step ( "results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.2"), I got the
When I read your post (C: How to extract the list of genes from TCGA CNV data ) again, I noticed that the first part of your post suggested different codes:
This time it seemed that everything goes well and I got
Can you tell me what is the difference between these two punch of codes? Thanks again for your prompt responses Kevin
The error message is from the
runGAIA()
command, and it is most likely due to insufficient amount of RAM / memory. This command is supposed to identify the recurrent SCNA regions in your data, i.e., the regions that are more frequently amplified or deleted in your samples.The output of
runGAIA()
is then used with the biomaRt code in order to annotate the regions identified byrunGAIA()
Currently, the results that you obtained:
These are just your original Firebrowse regions and are not the recurrent SCNA regions.
Of course, you do not have to run the
runGAIA()
function and can just leave your data as the annotated SCNA regions (withrunGAIA()
, you would have recurrent SCNA regions).Thank you so much for the comment. I will ask for larger amount of memory for the Server. I prefer to use runGAIA() to identify the recurrent SCNA.
Hi Kevin, I got the final result :"Tumor.all.txt". Now I want to use Biomart to annotate the significant regions.
Do I have to use "Tumor.all.txt" directly? I introduced "Tumor.all.txt" to Biomart, however I got an error message:
Can u tell me what threshold should I consider for the qvalue column of the Tumor.all.txt"? Is 0.15 a standard cutoff?or I have use different cutoff?
For the annotation, you will have to just check the output of each command to ensure that it is working. I can see that some of your chromosome ('seqnames') in CN_GR are labelled as 0, when they should be 1-24 (chr1-22, chrX, chrY).
The cut-off of FDR Q < 0.15 is not really standard - there is no real standard. You can equally adjust it to Q < 0.05 or Q < 0.20. It relates to the False Discovery Rate.
Thank you Kevin, Can I use VEP(variant effect predictor) for annotating the regions? I tried the VEP and everything seems OK
Sure, VEP is also a good option outside of R.
Hi Kevin, If I understood correctly, we can exclude common variations from copy number alteration data using this punch of code:
If you remember I decided to annotate the resulted CNA using VEP. Do I have to filter my data using " exclude common variants" in VEP? or they have already been excluded by using above code?
In this code (above), the common variants are excluded.
Thank you. So I can consider the entire results as novel
Hi Kevin, Sorry me for my nonstop questions. I got the annotation results of my copy number alterations from VEP. Now I want to get the CNV plot of my results.
Can you advise me what is the best way to do this? Nazanin
Hey, at this point, you may want to open a new question and show examples of the data that you have, and an example of the result that you what. This thread is very long, at this stage.
Sure. I will open a new question and show some parts of my data
hI @nazaninhoseinkhan how you divided the original file into four stages plz share this information, i need it ???
Hi, I used the clinical data provided for my cancer of interest. I filter the clinical data for the specific stage and then use the TCGA codes for separating the original CNV files.
@nazaninhoseinkhan Hi, There are many samples in CNV data and every sample correspond to 23 chromosomes. which sample i have to take and compare it with the clinical data samples with their corresponding stages??? Plz help me in this issue and if you have any codes plz share it, friend,
If I remember, there were 4 different files for each sample.
normal hg19, tumor hg19, normal hg18 and tumor hg18.
Depends on what reference genome version (hg18 or hg19) you want to use you can select them.
from where can i get normal hg19, tumor hg19?
On Broad Firebrowse, there should be just a single file, which contains the somatic copy number alterations for all samples in a specific TCGA dataset. These relate to the copy number regions called in the sampels, minus the calls in the matched germline / normal samples.
Conversely, if you download the data directly from the GDC website, then you will have to apply a lot of extra processing on these sampels. The GDC contains a single file per sample.
Final: The data on the Broad Firebrowse has already processed the GDC data for you, and saves you a lot of time.
Hi @nazaninhoseinkhan, could you plz tell me how you solve the memory issue?? I have faced the same problem.
I re-run my analysis on a cluster (server) and the problem was solved.
is this server available in your LAB or what?
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below: