Question

Identifying 5p and 3p in miRNA isoform expression data from TCGA for feature selection

5

6.1 years ago

jiaqiwu ▴ 50

I've downloaded some miRNA expression data from TCGA (for CHOL) and the isoform quantification files look like this:

miRNA_ID    isoform_coords  read_count  reads_per_million_miRNA_mapped  cross-mapped    miRNA_region 
hsa-let-7a-1    hg38:chr9:94175939-94175962:+   1   0.706072    N   precursor 
hsa-let-7a-1    hg38:chr9:94175942-94175962:+   1   0.706072    N   precursor 
hsa-let-7a-1    hg38:chr9:94175961-94175984:+   2   1.412144    N   mature,MIMAT0000062 
hsa-let-7a-1    hg38:chr9:94175962-94175981:+   45  31.773244   N   mature,MIMAT0000062

However, in other projects and papers, I always see selected features labeled as hsa-let-7a-1-3p or hsa-let-7a-5p, etc. Where is the 3p/5p coming from? Does it correspond with the +/- strand?

Additionally, how do I pool this data across samples so I can run differential expression analysis between CHOL samples and other cancer types (e.g., BRCA)? My end goal is to apply feature selection methods and then use the selected features to predict cancer type, but I am unsure how to process this data.

Thanks in advance.

tcga microrna mirna isoform data processing • 2.6k views

updated 3.9 years ago by andresllucena ▴ 40 • written 6.1 years ago by jiaqiwu ▴ 50

0

Did you find a way to do this? I want to figure out the 3p/5p forms from the isoform quantification files too, but don't know how or where to begin!

4.1 years ago by ginny • 0

score 2 · Answer 1 · 2021-01-19

Here is how I solved it:

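The 3p/5p suffix does not come from the +/- strand; it names the arm of the precursor hairpin (5' or 3') that the mature miRNA is processed from. Each mature arm has its own MIMAT ID, which is what you see in the miRNA_region column. So you sum the read counts that share the same MIMAT ID within each sample, and that gives you the raw counts you need for differential expression analysis and so on.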

First, drop the "precursor" and "stemloop" rows, because we only want counts for the mature strands:

# grepl() is safer than -grep(): -grep() would drop every row if nothing matched
mature = your_data[!grepl("precursor|stemloop", your_data$miRNA_region), ]

I also strip the "mature," prefix from the MIMAT IDs, because at the end we'll have to convert the MIMAT IDs to mature strand names using miRBaseConverter:

mature$miRNA_region = sub("^mature,", "", mature$miRNA_region)

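Now sum the counts that share the same MIMAT ID and patient barcode (this assumes your table has a barcode column identifying each sample, as in the output below):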

x = aggregate(read_count ~ miRNA_region + barcode, data=mature, sum)

Then you get something like this:

 miRNA_region                      barcode    read_count
 MIMAT0000062 TCGA-05-4244-01A-01T-1108-13      14492
 MIMAT0000063 TCGA-05-4244-01A-01T-1108-13       8767
 MIMAT0000064 TCGA-05-4244-01A-01T-1108-13        610
 MIMAT0000065 TCGA-05-4244-01A-01T-1108-13        750
 MIMAT0000066 TCGA-05-4244-01A-01T-1108-13        804
 MIMAT0000067 TCGA-05-4244-01A-01T-1108-13       4748
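
Once the counts are keyed by MIMAT ID, the IDs can be mapped to mature names such as hsa-let-7a-5p. The following is only a sketch: the small `lookup` vector is hand-written for illustration, and the guarded block assumes the miRBaseConverter package is installed and that its `miRNA_AccessionToName` function with miRBase version "v22" is the conversion you want.

```r
# Aggregated counts in the same shape as the table above (two rows only).
x <- data.frame(
  miRNA_region = c("MIMAT0000062", "MIMAT0000063"),
  barcode      = "TCGA-05-4244-01A-01T-1108-13",
  read_count   = c(14492, 8767),
  stringsAsFactors = FALSE
)

# Hand-written lookup, for illustration only (not a complete mapping).
lookup <- c(MIMAT0000062 = "hsa-let-7a-5p",
            MIMAT0000063 = "hsa-let-7b-5p")
x$mature_name <- unname(lookup[x$miRNA_region])

# If miRBaseConverter is available, take the names from miRBase instead.
# Assumption: version "v22" is the annotation you want to report.
if (requireNamespace("miRBaseConverter", quietly = TRUE)) {
  conv <- miRBaseConverter::miRNA_AccessionToName(x$miRNA_region,
                                                  targetVersion = "v22")
  x$mature_name <- as.character(
    conv$TargetName[match(x$miRNA_region, conv$Accession)]
  )
}

print(x)
```

Keeping the MIMAT ID as the key during aggregation and converting only at the end avoids problems when the same mature miRNA is produced by more than one precursor (for example hsa-let-7a-1, -2 and -3).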

Unfortunately, I don't yet know how to separate the counts by sample barcode.

edit1: I asked for help on this issue here
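
One possible way to handle the step left open above, separating counts by barcode, is to pivot the long table into a matrix with one row per MIMAT ID and one column per sample. The sketch below runs the whole pipeline on toy data: the barcodes and counts are invented, and the `samples` list stands in for isoform quantification files you would read yourself. It keeps only rows whose miRNA_region starts with "mature,", which is slightly stricter than excluding "precursor" and "stemloop".

```r
# Toy per-sample tables standing in for parsed isoform quantification files.
# Barcodes and counts are invented for illustration.
samples <- list(
  "TCGA-AA-0001" = data.frame(
    miRNA_region = c("precursor", "mature,MIMAT0000062",
                     "mature,MIMAT0000062", "mature,MIMAT0000063"),
    read_count   = c(1, 2, 45, 10),
    stringsAsFactors = FALSE
  ),
  "TCGA-BB-0002" = data.frame(
    miRNA_region = c("mature,MIMAT0000062", "stemloop", "mature,MIMAT0000063"),
    read_count   = c(7, 3, 5),
    stringsAsFactors = FALSE
  )
)

# Stack the tables, recording which sample each row came from.
stacked <- do.call(rbind, lapply(names(samples), function(b) {
  df <- samples[[b]]
  df$barcode <- b
  df
}))

# Keep mature rows only and strip the "mature," prefix.
mature <- stacked[grepl("^mature,", stacked$miRNA_region), ]
mature$miRNA_region <- sub("^mature,", "", mature$miRNA_region)

# Sum isoform counts per MIMAT ID and sample (long format).
long <- aggregate(read_count ~ miRNA_region + barcode, data = mature, FUN = sum)

# Pivot to a miRNA x sample matrix; missing combinations become 0.
counts <- with(long, tapply(read_count, list(miRNA_region, barcode), sum))
counts[is.na(counts)] <- 0

print(counts)
```

A matrix in this shape (features in rows, samples in columns) is the usual input for count-based differential expression tools. Samples from different projects such as CHOL and BRCA can be added to the `samples` list before stacking, as long as you keep a separate record of which barcode belongs to which cancer type.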