Hi there,
I have RNA sequencing data from patient-matched primary breast tumours that have metastasised to distant organs such as brain and bone.
I would like to use the new import-rna feature in CNVkit to calculate copy number variations in these samples.
I ran import-rna using:
cnvkit.py import-rna ./salmon-counts/*-bone-*.txt \
--gene-resource cnvkit/data/ensembl-gene-info.hg38.tsv \
--correlations cnvkit/data/tcga-skcm.cnv-expr-corr.tsv \
--output bone-cnv-summary.tsv --output-dir out
In the documentation at https://cnvkit.readthedocs.io/en/stable/rna.html, it states that:
The --correlations input is not required but is strongly recommended.
The TCGA melanoma cohort correlations can be used for analysis of any tissue type, not just neoplastic melanocytes.
However, best results will usually be achieved with a correlations table specific to the test cohort.
The script cnv_expression_correlate.py generates this table from input tables of per-gene and per-sample copy number and expression levels, typically retrieved from cBioPortal for TCGA cancer-specific cohorts.
Therefore, I would like to run the cnv_expression_correlate.py script on TCGA BRCA data and pass the result as the --correlations input to import-rna.
The docstring at the top of the Python script also states:
"""Get correlation coefficients for matched copy number and expression data.
cBioPortal offers a nice feature in which you can download a summary of many
large-scale sequencing studies. In this summary are two files that contain
the copy number and expression values of every gene in the study for every
sample. This summary is available for nearly every TCGA study, and the data
is intuitive to access, therefore I have designed this pre-processing script
to accept these as inputs. Of course, the user can calculate their own
Pearson values from other sources of data if they prefer -- in this case,
the user should formate their data to match the output of this prepocessing
script.
"""
However, neither the cBioPortal website nor the cgdsr R package seems to let you download the expression and CNV data, with Entrez IDs, for all genes at once.
What would be the best way to approach this?
I was thinking of using RTCGAToolbox to pull the data:
tcga.brca <- RTCGAToolbox::getFirehoseData(dataset = "BRCA",
RNASeqGene = TRUE,
RNASeq2GeneNorm = TRUE,
CNASeq = TRUE,
clinical = TRUE)
I would then use biomaRt to retrieve the Entrez gene IDs for the HUGO gene symbols and use the resulting files as input to the cnv_expression_correlate.py script.
Would that be the correct way to approach it?
Thanks a million!
I got this to work.
At first glance it was not obvious where to get the summary data files from cBioPortal, but they are indeed there:
For TCGA BRCA (Breast Cancer)
http://www.cbioportal.org/study?id=brca_tcga#summary
To the right of "Breast Invasive Carcinoma (TCGA, Provisional)" there is a download data button.
This downloads a compressed archive called brca_tcga.tar.gz.
Extract the archive.
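The extraction step can also be done from Python with the standard-library tarfile module. This is only a minimal sketch: extract_archive is a hypothetical helper name, and the destination directory is an assumption; the archive name is the one mentioned above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the list of member names so the caller can confirm that
    files such as data_CNA.txt are actually present.
    (Hypothetical helper, not part of CNVkit or cBioPortal.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

Calling extract_archive("brca_tcga.tar.gz") and inspecting the returned names is a quick way to check where data_CNA.txt and the expression file ended up before running the next step.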
I then ran the following command:
python cnv_expression_correlate.py -o tcga-brca.cnv-expr-corr.tsv brca_tcga/data_CNA.txt brca_tcga/data_RNA_Seq_v2_expression_median.txt
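For anyone wondering what kind of calculation this involves, here is a minimal sketch of a per-gene Pearson correlation between a copy number table and an expression table, using pandas. It is not the actual implementation in cnv_expression_correlate.py. It assumes both tables are tab-separated with Hugo_Symbol and Entrez_Gene_Id as the first two columns followed by one column per sample; the function names and the toy data are made up for illustration.

```python
import io

import pandas as pd


def load_gene_table(handle):
    """Read a gene-by-sample table and index it by Entrez gene ID.

    Assumes 'Hugo_Symbol' and 'Entrez_Gene_Id' columns (an assumption
    about the input layout, not something checked against CNVkit).
    """
    table = pd.read_csv(handle, sep="\t")
    table = table.dropna(subset=["Entrez_Gene_Id"]).assign(
        Entrez_Gene_Id=lambda t: t["Entrez_Gene_Id"].astype(int)
    )
    # Keep the first row when a gene ID appears more than once.
    table = table.drop_duplicates(subset="Entrez_Gene_Id")
    return table.set_index("Entrez_Gene_Id").drop(columns=["Hugo_Symbol"])


def per_gene_pearson(cna, expr):
    """Correlate copy number with expression, gene by gene."""
    genes = cna.index.intersection(expr.index)
    samples = cna.columns.intersection(expr.columns)
    cna = cna.loc[genes, samples].astype(float)
    expr = expr.loc[genes, samples].astype(float)
    # axis=1 pairs up matching rows (genes) across the two tables.
    return cna.corrwith(expr, axis=1, method="pearson")


# Toy data standing in for the two downloaded files.
cna_text = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\tS3\tS4\n"
    "GENE_A\t1\t-1\t0\t1\t2\n"
    "GENE_B\t2\t0\t0\t1\t1\n"
)
expr_text = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\tS3\tS4\n"
    "GENE_A\t1\t10\t20\t30\t40\n"
    "GENE_B\t2\t8\t6\t4\t2\n"
)

corr = per_gene_pearson(
    load_gene_table(io.StringIO(cna_text)),
    load_gene_table(io.StringIO(expr_text)),
)
print(corr)
```

In the toy data GENE_A expression rises in step with copy number and GENE_B expression falls as copy number rises, so the two genes get a positive and a negative coefficient respectively. Restricting both tables to the shared genes and samples first matters with real downloads, where the two files do not necessarily cover exactly the same samples.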