Question

TCGA miRNA Seq data analysis

2

6.7 years ago

amandal ▴ 30

I have downloaded miRNA-Seq data from the TCGA GDC data portal. Each file is of the type "13.mirna.quantification.txt" and has 4 columns: "miRNA_ID", "read_count", "reads_per_million_miRNA_mapped" and "cross-mapped". There are 1046 miRNA IDs in total, but around 800 of them have a read_count of zero in all samples. I do not understand what I should do with these ~800 zero-count miRNA IDs. Do they indicate missing values? Should I apply a missing-value imputation method, or should I just exclude them from my experiments?

miRNA Seq TCGA Data zero read_count missing value • 3.8k views

ADD COMMENT • link 6.7 years ago by amandal ▴ 30

score 3 · Answer 1 · 2018-03-30

3

6.7 years ago

Kevin Blighe 88k

Hello amandal,

They are only missing in the sense that there was no read quantification over them. So, it's more a case of 'not transcribed', as opposed to 'missing'. I, therefore, do not think that imputation is necessary. Instead, you can safely remove them from the analysis prior to normalisation. You need to specify some cut-off, though, such as:

mean across all samples = 0
mean across all samples < 10

Either of those would be fine, with the second threshold being more stringent of course.
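As a rough illustration of the two cut-offs above, here is a minimal Python sketch using only the standard library. The miRNA IDs and counts are made-up toy values, and `drop_low_count` is a hypothetical helper name, not part of any TCGA or GDC tool.

```python
from statistics import mean

# Hypothetical toy data: miRNA_ID -> read_count in each sample.
counts = {
    "hsa-mir-21": [5200, 4800, 6100],
    "hsa-mir-141": [12, 3, 9],
    "hsa-mir-1234": [0, 0, 0],
}

def drop_low_count(counts, min_mean):
    """Keep miRNAs whose mean read_count is above 0 and at least min_mean."""
    kept = {}
    for mirna_id, values in counts.items():
        m = mean(values)
        if m > 0 and m >= min_mean:
            kept[mirna_id] = values
    return kept

# First cut-off: remove miRNAs with mean across all samples = 0.
nonzero = drop_low_count(counts, min_mean=0)
# Second, more stringent cut-off: remove miRNAs with mean < 10.
stringent = drop_low_count(counts, min_mean=10)

print(sorted(nonzero))
print(sorted(stringent))
```

With this toy input, the all-zero miRNA is removed by both cut-offs, and the one with a mean of 8 is removed only by the second.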

---------------------------

Out of curiosity, which cancer is this? I recently analysed UCEC (Uterine Corpus Endometrial Carcinoma) but only ~250 micro-RNAs had 0 across all samples. The files that I used were each called 'mirnas.quantification.txt'.

Another thing of which you need to be aware: if you have just downloaded the Level 3 (open access) data from the GDC, then it may contain normal samples mixed with the tumours. In the UCEC dataset, for example, there are 22 normals mixed with 501 tumours.
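One way to separate the two groups, assuming you have mapped each file to its TCGA barcode (the downloaded files themselves are not named by barcode, so this mapping is an assumption here), is to read the two-digit sample-type code in the fourth barcode field: 01 to 09 are tumour types, 10 to 19 are normal types and 20 to 29 are controls. The barcodes below are invented for illustration, and `sample_group` is a hypothetical helper.

```python
def sample_group(barcode):
    """Classify a TCGA barcode by its two-digit sample-type code."""
    parts = barcode.split("-")
    code = int(parts[3][:2])  # e.g. "01A" -> 1, "11A" -> 11
    if 1 <= code <= 9:
        return "tumour"
    if 10 <= code <= 19:
        return "normal"
    return "control"

# Hypothetical barcodes, not real TCGA samples.
barcodes = [
    "TCGA-AA-0001-01A-11R-0000-13",
    "TCGA-AA-0002-11A-11R-0000-13",
]
groups = {b: sample_group(b) for b in barcodes}
for barcode, group in groups.items():
    print(barcode, group)
```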

Kevin

ADD COMMENT • link 6.7 years ago by Kevin Blighe 88k

0

Thank you, sir, for your reply. I have downloaded the lung cancer data sets (LUAD and LUSC).

ADD REPLY • link 6.7 years ago by amandal ▴ 30

0

Great - I have also just processed LUAD and LUSC.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

@kevin I have been looking for normal samples for AML (LAML in TCGA), but, if I'm not wrong, there are none. How do I make a comparison for the disease? I need normals, and these data are RSEM-normalised; I don't think I can use normal samples that were normalised with DESeq2 for the comparison. Do you have any suggestions for TCGA LAML?

ADD REPLY • link 6.7 years ago by 1769mkc ★ 1.2k

0

You are just looking at bulk RNA-seq, right (krushnach80)? There should be some normal samples. I missed your post 7 weeks ago - maybe I was very busy.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

1

I'm always elated when you respond. Yes, I'm looking at bulk RNA-seq. I do have normal samples, but what I found a while back is that the TCGA data are all RSEM-normalised. So, what I'm thinking is to align my normal samples with the STAR aligner and quantify them with RSEM so that they are comparable. I hope this would work...

ADD REPLY • link 6.5 years ago by 1769mkc ★ 1.2k

0

The RSEM files should include a 'raw_count' column, which you can regard as the raw counts prior to any normalisation.
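As a minimal sketch of pulling that column out, assuming the file is tab-delimited with a header row: only the 'raw_count' column name comes from this thread, while the other column names and the two rows below are placeholders for illustration.

```python
import csv
import io

# Hypothetical excerpt of an RSEM results file; apart from 'raw_count',
# the column names and values are assumptions for this example.
example = (
    "gene_id\traw_count\tscaled_estimate\n"
    "GENE_A\t153.00\t1.2e-05\n"
    "GENE_B\t0.00\t0\n"
)

def read_raw_counts(handle, id_column="gene_id"):
    """Return {feature ID: raw_count} from a tab-delimited file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    # RSEM estimates can be fractional, so parse them as floats.
    return {row[id_column]: float(row["raw_count"]) for row in reader}

raw = read_raw_counts(io.StringIO(example))
print(raw)
```

For a real file, an opened file handle would be passed in place of the `io.StringIO` object.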

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Okay, I will look at the STAR manual, and if there is any issue I will write back to you.

ADD REPLY • link 6.5 years ago by 1769mkc ★ 1.2k

0

No, well, the RSEM counts are available as open access data from the TCGA's GDC Legacy Archive. You select the samples that you want and then download a 'Manifest', which is then used with the GDC client (a program) for the purposes of downloading the selected files.

For example, here is a selection of all RNA-seq files for LAML: https://portal.gdc.cancer.gov/legacy-archive/search/f?filters=%7B%22op%2...

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Hi! I'd be really grateful if you could help me out on this issue further because I'm so stuck here. I have two conditions in my dataset, and I find that in one condition (the "control") ~560 miRNAs have 0 read counts in all the samples, whereas the treatment condition has counts in the thousands for the same miRNAs. How do I analyse such data in DESeq2? These miRNAs seem to be biologically relevant (expressed only in one condition), but since one condition has no counts at all, should I skip using them in my DESeq2 analysis, in case they get filtered out? Or will DESeq2 carry out the analysis as I want and provide a reasonable fold change that would actually work? Thanks for your help!

ADD REPLY • link 4.1 years ago by ginny • 0