Question

Which file to use for analysis?

0

7.6 years ago

Learner ▴ 280

Hello,

I am trying to analyse RNA-seq data. After downloading the data, I have three types of file:

htseq.counts

FPKM

FPKM-UQ

Are these different? Should I use only one type, or can I use all three together in the analysis?

for example, please have a look at this https://portal.gdc.cancer.gov/files/92e73892-811b-4edd-b3db-d452bc5d28e0

Can someone tell me which type of RNA-seq data this is? (I mean, how was it generated and how should I interpret it?)

Thanks

RNA-Seq • 1.7k views

ADD COMMENT • link updated 7.6 years ago by Kevin Blighe 89k • written 7.6 years ago by Learner ▴ 280

0

An update (6th October 2018):

You should abandon RPKM / FPKM. They are unsuitable when cross-sample differential expression analysis is your aim; indeed, they leave samples incomparable for that purpose:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
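To make the "relative measurement" point concrete, here is a minimal base R sketch. The counts are hypothetical toy numbers: three genes are unchanged between the two samples and one gene (g4) is strongly up in sample2. Per-million scaling makes the unchanged genes look down-regulated, whereas a between-sample method (here a hand-rolled median-of-ratios calculation, the idea behind DESeq2's size factors, not DESeq2's own code) leaves them equal.

```r
# Hypothetical toy counts: g1-g3 are unchanged, g4 is strongly up in sample2
counts <- cbind(
  sample1 = c(g1 = 100, g2 = 100, g3 = 100, g4 = 100),
  sample2 = c(g1 = 100, g2 = 100, g3 = 100, g4 = 700)
)

# Within-sample scaling only: counts per million of each sample's own total
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Between-sample scaling: median-of-ratios size factors
# (ratio of each count to the gene's geometric mean across samples,
#  then the median of those ratios within each sample)
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_means))
})
normalised <- t(t(counts) / size_factors)

print(cpm)         # g1 appears lower in sample2 although its count is identical
print(normalised)  # g1 is equal in both samples
```

This toy matrix has no zero counts; with real data, genes containing a zero have to be excluded before taking logs.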

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

score 2 · Accepted Answer · 2017-12-28

2

7.6 years ago

Kevin Blighe 89k

The htseq.counts files contain raw counts and therefore provide you with maximum flexibility in terms of analysis.

FPKM and FPKM-UQ are both normalised counts, but the method of normalisation used in both has been slowly falling out of fashion. Most likely, both of these types of normalised counts would have been derived from the htseq.counts raw counts.
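For a feel of how the two normalised units relate to the raw counts, here is a rough base R sketch. It follows the formulas as I understand them from the GDC documentation of the time: FPKM divides by the total protein-coding read count, while FPKM-UQ divides by the 75th percentile count instead. The counts and gene lengths are hypothetical, every gene is assumed to be protein-coding, and the GDC may compute the quantile differently from R's default.

```r
# Hypothetical raw counts and gene lengths (bp) for a single sample
raw_counts     <- c(geneA = 100, geneB = 200, geneC = 300, geneD = 400)
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 1000, geneD = 4000)

# FPKM: scale by the total number of (protein-coding) reads and by gene length
total_reads <- sum(raw_counts)
fpkm <- raw_counts * 1e9 / (total_reads * gene_length_bp)

# FPKM-UQ: same idea, but the total is replaced by the upper-quartile count
upper_quartile <- unname(quantile(raw_counts, 0.75))
fpkm_uq <- raw_counts * 1e9 / (upper_quartile * gene_length_bp)

print(round(cbind(raw_counts, fpkm, fpkm_uq)))
```

Both units are rescalings of the same raw counts within one sample, which is why the raw counts are the more flexible starting point.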

If you want me to simply give you advice on which to use, then my answer is htseq.counts. Read these counts into edgeR or DESeq2 and then Bob's your uncle.
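As a starting point, the per-sample files first have to be merged into one gene-by-sample matrix. Below is a minimal base R sketch under a few assumptions: the two files are toy stand-ins written to a temporary directory, the file names are hypothetical, and the layout is the usual two-column, tab-separated htseq-count output, whose summary rows (those starting with `__`, such as `__no_feature`) are dropped.

```r
# Write two hypothetical toy files standing in for downloaded htseq.counts files
demo_dir <- tempfile("htseq_demo_")
dir.create(demo_dir)
writeLines(c("ENSG00000000003.13\t10", "ENSG00000000005.5\t0", "__no_feature\t500"),
           file.path(demo_dir, "sampleA.htseq.counts"))
writeLines(c("ENSG00000000003.13\t25", "ENSG00000000005.5\t3", "__no_feature\t420"),
           file.path(demo_dir, "sampleB.htseq.counts"))

files <- list.files(demo_dir, pattern = "\\.htseq\\.counts$", full.names = TRUE)

# Read one file and return a named vector of counts, without the "__" summary rows
read_counts <- function(path) {
  x <- read.delim(path, header = FALSE, col.names = c("gene_id", "count"),
                  stringsAsFactors = FALSE)
  x <- x[!grepl("^__", x$gene_id), ]
  setNames(x$count, x$gene_id)
}

count_list <- lapply(files, read_counts)

# All files must list the same genes in the same order before column-binding
same_genes <- vapply(count_list,
                     function(v) identical(names(v), names(count_list[[1]])),
                     logical(1))
stopifnot(all(same_genes))

count_matrix <- do.call(cbind, count_list)
colnames(count_matrix) <- sub("\\.htseq\\.counts$", "", basename(files))
print(count_matrix)
```

A matrix like this, together with a table describing the samples, is the kind of input that `DESeqDataSetFromMatrix()` in DESeq2 or `DGEList()` in edgeR expects.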

Further information straight from TCGA's web domain:

Further information on processing htseq raw (and other) counts with DESeq2: Analyzing RNA-seq data with DESeq2

Kevin

PS - the exact file to which you've linked is the FPKM-UQ counts for a breast cancer primary tumour sample from the TCGA-BRCA study.

ADD COMMENT • link 7.6 years ago by Kevin Blighe 89k

0

@Kevin Blighe do you know how to annotate them too? Is there a package in Python, Perl, R or another programming language? If you also know of a paper, that would help a lot. Thanks

ADD REPLY • link 7.6 years ago by Learner ▴ 280

0

You can do gene annotation conversions using the biomaRt package in R, but it's rarely straightforward due to some genes only being annotated in one database, or due to the existence of duplicate or redundant IDs, etc.

If you want to try this yourself, then do something like:

library(biomaRt)

# Connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)

# Map the annotations: one row per unique combination of attributes
annots <- getBM(mart=mart,
  attributes=c("ensembl_gene_id", "hgnc_symbol", "gene_biotype", "external_gene_name", "refseq_mrna", "refseq_ncrna"),
  filters="ensembl_gene_id",
  values=ensembl.gene,
  uniqueRows=TRUE)

ensembl.gene contains your Ensembl Gene IDs to convert.
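Two practical wrinkles are worth a sketch in base R. First, the gene IDs in the GDC count files carry a version suffix (e.g. `.13`), which needs stripping before an `ensembl_gene_id` lookup. Second, a lookup can return several rows per gene, so you have to decide how to collapse them. The `annots` data frame below is a hypothetical stand-in for a lookup result (the RefSeq values are placeholders, not real accessions), and keeping the first row per gene is just one simple choice.

```r
# Versioned IDs as they appear in the count files
versioned_ids <- c("ENSG00000000003.13", "ENSG00000000005.5", "ENSG00000000419.11")

# Strip the trailing ".<version>" to get plain Ensembl gene IDs
ensembl.gene <- sub("\\.[0-9]+$", "", versioned_ids)

# Hypothetical stand-in for an annotation result: one gene appears twice,
# one gene is missing entirely (placeholder RefSeq values)
annots <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000003", "ENSG00000000419"),
  hgnc_symbol     = c("TSPAN6", "TSPAN6", "DPM1"),
  refseq_mrna     = c("NM_PLACEHOLDER_1", "NM_PLACEHOLDER_2", "NM_PLACEHOLDER_3"),
  stringsAsFactors = FALSE
)

# Keep the first row per gene, then align to the original order of IDs;
# genes without an annotation come back as NA
one_per_gene <- annots[!duplicated(annots$ensembl_gene_id), ]
idx <- match(ensembl.gene, one_per_gene$ensembl_gene_id)
symbols <- one_per_gene$hgnc_symbol[idx]

print(data.frame(versioned_ids, ensembl.gene, symbols))
```

Using `match()` keeps the result in the same order and length as the input IDs, which makes it safe to attach to a count matrix.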

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

@Kevin Blighe I have a few questions. First, can you explain what your code above does, especially the first lines? Also, I would like to know what you have done in your own recent analyses. Did you also check for mutations? If not, do you know how to find mutations across several samples?

ADD REPLY • link 7.6 years ago by Learner ▴ 280