You should abandon RPKM / FPKM. They are not suitable when cross-sample differential expression analysis is your aim; indeed, they render samples incomparable for that purpose:
The Total Count and RPKM [FPKM] normalization methods, both of which are
still widely in use, are ineffective and should be definitively
abandoned in the context of differential analysis.
The first thing one should remember is that without between sample
normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
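To make the "relative measurement" point concrete, here is a minimal sketch in R using made-up numbers. The gene names, lengths, and counts are purely hypothetical, and fpkm() is a helper written for this illustration, not a function from any package. Genes A to C have identical raw counts in both samples, yet their FPKM values differ because gene D inflates the total read count of the second sample.

```r
# Toy FPKM calculation: counts / (gene length in kb) / (total reads in millions).
# All numbers below are invented for illustration only.
fpkm <- function(counts, length_bp) {
  per_million <- colSums(counts) / 1e6   # library size per sample, in millions
  per_kb <- length_bp / 1e3              # gene length in kilobases
  # Divide each row by its gene length, then each column by its library size
  sweep(counts / per_kb, 2, per_million, "/")
}

counts <- cbind(
  s1 = c(A = 100, B = 200, C = 400, D = 300),
  s2 = c(A = 100, B = 200, C = 400, D = 9300)  # only gene D changes
)
length_bp <- c(A = 1000, B = 2000, C = 4000, D = 1000)

fpkm_values <- fpkm(counts, length_bp)

# Genes A-C have the same raw counts in s1 and s2, but their FPKM values
# differ by the ratio of the two library sizes (10000 / 1000)
ratio <- fpkm_values[c("A", "B", "C"), "s1"] / fpkm_values[c("A", "B", "C"), "s2"]
print(ratio)
```

The within-sample scaling is all that FPKM does, so one strongly expressed gene shifts every other gene's value in that sample. Between-sample normalisation of the kind applied by edgeR or DESeq2 is designed to deal with this.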
The htseq.counts files contain raw counts and therefore provide you with maximum flexibility in terms of analysis.
FPKM and FPKM-UQ are both normalised counts, but the method of normalisation used in both has been slowly falling out of fashion. Most likely, both of these types of normalised counts would have been derived from the htseq.counts raw counts.
If you want me to simply give you advice on which to use, then my answer is htseq.counts. Read these counts into edgeR or DESeq2 and then Bob's your uncle.
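As a rough sketch of what "read these counts into DESeq2" can look like: the helper read_htseq_counts() below is hypothetical (written for this example), the two toy files and the "normal" / "tumour" labels are made up, and the code assumes each file has two tab-separated columns (gene ID, count) followed by the "__" summary rows that htseq-count appends.

```r
# Read several htseq-count output files into one gene-by-sample matrix.
# 'files' is a named character vector: names are sample IDs, values are paths.
read_htseq_counts <- function(files) {
  per_sample <- lapply(files, function(f) {
    tab <- read.delim(f, header = FALSE, col.names = c("gene", "count"),
                      stringsAsFactors = FALSE)
    # Drop htseq-count summary rows such as "__no_feature" and "__ambiguous"
    tab <- tab[!grepl("^__", tab$gene), ]
    setNames(tab$count, tab$gene)
  })
  # Assumption: every file lists the same genes (same pipeline and annotation)
  genes <- names(per_sample[[1]])
  counts <- sapply(per_sample, function(x) x[genes])
  rownames(counts) <- genes
  counts
}

# Write two tiny made-up count files so that the example runs on its own
write_toy <- function(path, counts) {
  lines <- c(paste(names(counts), counts, sep = "\t"),
             "__no_feature\t5", "__ambiguous\t1")
  writeLines(lines, path)
}
files <- c(sampleA = file.path(tempdir(), "sampleA.htseq.counts"),
           sampleB = file.path(tempdir(), "sampleB.htseq.counts"))
write_toy(files[["sampleA"]], c("ENSG00000000003.13" = 10L, "ENSG00000000005.5" = 0L))
write_toy(files[["sampleB"]], c("ENSG00000000003.13" = 25L, "ENSG00000000005.5" = 3L))

counts <- read_htseq_counts(files)
print(counts)

# Build a DESeq2 object if the package is installed; the condition labels
# are hypothetical and would come from your own sample sheet
if (requireNamespace("DESeq2", quietly = TRUE)) {
  coldata <- data.frame(condition = factor(c("normal", "tumour")),
                        row.names = colnames(counts))
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)
}
```

With real data and replicates per group, DESeq(dds) followed by results(dds) would be the next step. DESeq2 also provides DESeqDataSetFromHTSeqCount() for reading such files directly, and for edgeR the same matrix can be passed to DGEList(counts = counts).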
Further information straight from TCGA's web domain:
@Kevin Blighe do you know how to annotate them too? Is there any package in Python, Perl, R or other programming languages? If you also have any paper, it would help a lot. Thanks.
You can do gene annotation conversions using the biomaRt package in R, but it's rarely straightforward due to some genes only being annotated in one database, or due to the existence of duplicate or redundant IDs, etc.
If you want to try this yourself, then do something like:
library(biomaRt)
# Connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
# Map the annotations; ensembl.gene is a character vector of Ensembl gene IDs
annots <- getBM(mart=mart,
  attributes=c("ensembl_gene_id", "hgnc_symbol", "gene_biotype", "external_gene_name", "refseq_mrna", "refseq_ncrna"),
  filters="ensembl_gene_id",
  values=ensembl.gene,
  uniqueRows=TRUE)
ensembl.gene contains your Ensembl Gene IDs to convert.
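Two practical wrinkles are worth a small sketch. First, if your IDs carry a version suffix (as in ENSG00000000003.13), they will most likely need to be stripped before the lookup. Second, the returned table can have several rows per gene, which is the duplicate-ID problem mentioned above. In the example below, the annots data frame is a hand-made stand-in for what getBM() returns, with placeholder RefSeq values, so it runs without an internet connection; keeping the first row per gene is just one possible way of resolving duplicates.

```r
# Example IDs in a versioned style; strip the ".NN" suffix before querying
ensembl.versioned <- c("ENSG00000000003.13", "ENSG00000000005.5",
                       "ENSG00000000419.11")
ensembl.gene <- sub("\\..*$", "", ensembl.versioned)

# Hand-made stand-in for the getBM() result (RefSeq values are placeholders).
# One gene appears twice and one gene is missing, as can happen in practice.
annots <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000003", "ENSG00000000419"),
  hgnc_symbol     = c("TSPAN6", "TSPAN6", "DPM1"),
  refseq_mrna     = c("NM_placeholder_1", "NM_placeholder_2", "NM_placeholder_3"),
  stringsAsFactors = FALSE
)

# Keep the first row per Ensembl ID, then align to the original ID order;
# IDs without an annotation come back as NA
annots.unique <- annots[!duplicated(annots$ensembl_gene_id), ]
symbol <- annots.unique$hgnc_symbol[match(ensembl.gene,
                                          annots.unique$ensembl_gene_id)]

print(data.frame(ensembl.versioned, ensembl.gene, symbol))
```

Using match() rather than merge() keeps the result in the same order as your count matrix rows, which makes it easy to attach the symbols as an extra column.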
@Kevin Blighe I have a few questions to ask. First, can you explain your code above, in particular the first lines?
Also, I would like to know what you did in your own recent analysis. Did you also check for mutations? If not, do you know how to find mutations across several samples?
An update (6th October 2018):
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units