I work on RNA sequencing at the moment. I've used tophat and cufflinks before, so now I'd like to do it in different way. My new RNA-seq workflow is like below
BWA -> HTSeq-count -> TMM
However, I have no idea how to TMM after I finished BWA and HTSeq-count. I've found that I have to use edgeR package in R and the function for TMM is calcNormFactors()
. I've tried it already but there were some errors as I'm not much familiar with it.
Now I have about 10 bam files as a result of BWA, and 10 HTSeq-count results from that 10 bam files. Each HTSeq-count result has two columns, one for gene symbols and the other for the counts (and last 4 rows are values about no feature, ambiguous, too low aQual, not aligned, alignment not unique)
I wonder
- How to TMM with HTSeq-count results.
- What is the input format that I can make from those 10 HTSeq-count results for
calcNormFactors()
in R.
You should use a splice-aware aligner for RNA-Seq data (e.g. STAR, Tophat, HiSat,...)
Do you mean that I sould use Tophat or something else than BWA?
Yes because bwa does not support reads alignment over splice junction
Do you mean "TPM" or "CPM"? edgeR has a function for for calculating CPM.
As I know, I can put some options like
Is it TMM normalization?
Don't specify anything and you'll get TMM normalization.
If you want to perform TMM normalization then use edgeR, since that's where TMM comes from. If that gives you an error then post it.
calcNormFactors() takes a matrix of counts.
I tried to put 10 HTSeq-count results together, so I made one matrix with them.
It has 11 columns; 1st column for gene symbols and other columns are about count values of 11 HTSeq-count results.
First error I got before was about that column(x): 'x' must be numeric.
Try putting the gene names into rownames, e.g. if your data is in my.counts:
The count matrix must be numeric.
Thank you for your help, I tried it, and got 1 column with 10 rows as a result like below.
It's quite confusing because I expected that the result will be a matrix which has the same number of rows and columns with input data (about 30,000 rows and 11 rows, in this case) with normalized values... Is this result right? or am I mistaken?..
No, something is wrong here. What does dim(x) and head(x) give before these operations?
before these operations,
dim(x)
gives[1] 33126 10
andhead(x)
givesAnother little issue, and I am not sure how HTseq count deals with it, but did you check that the count files were all compatible, that means that the rownames are identical and in the same order in all files? You are probably fine, but better to check...
This is the default, isn't it? Btw, TMM (trimmed mean of M-values) normalization is
Robinson, M.D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25.
I guess I have never changed the defaults...
Did you follow through edgeR's manual? Did you create a DGEList?
What are the errors you are getting?
Finally, NicoBxl is right, use STAR (which will give you counts) or Subread+featureCounts or HISAT2 to do the mapping.