Entering edit mode
8.9 years ago
ChIP
▴
600
Hi,
I am curious to know, how can I normalise number of reads per gene/gene_id generated by the STAR aligner (ReadsPerGene file).
In case it is non stranded sequencing and my data looks like this (first few lines)
N_unmapped 663817 663817 663817
N_multimapping 1232697 1232697 1232697
N_noFeature 5408215 8176752 5438049
N_ambiguous 108068 315 106321
NM_214429 0 0 0
NM_214220 0 0 0
NM_001143697 51 0 51
NM_001164649 1 0 1
NM_001044603 2 0 2
NM_001244242 479 1 478
NM_001244241 286 1 285
NM_001177907 428 0 428
NM_001171752 79 0 79
NM_001123219 13 0 13
NM_001256148 198 2 196
NM_001001771 9068 17 9051
NM_001123181 132 0 132
NM_001258303 2678 7 2671
NM_001244322 282 0 282
NM_001317080 1445 3 1442
NM_001195372 0 0 0
NM_214313 3752 2 3750
NM_001253922 0 0 0
NM_001007195 717 2 715
NM_001244473 1930 1 1929
NM_001044590 0 0 0
NM_001001538 0 0 0
NM_001128489 0 0 0
NM_213885 125 0 125
NM_001114280 181 0 181
Thank you in advance
Hi,
Thank you. This is strand specific RNA-Seq. Where column 1 is gene name, column 2 total number of reads, column 3 reads on strand 1 and column 4 reads on 2 strand.
Will your R script still work?
I am sure that this will work. Because even if it strand specific the gene will normalize in strand specific manner. you will get normalization values for each strand based on the libraries of corresponding strands.
As stated by zhangchao3, use edgeR. It includes a function to normalize to RPKMs directly, rpkm(), so there's no need to reinvent the wheel. Having said that, it's much better to normalize to TPMs (transcripts per million); it's the only one that holds between samples (look for this in PubMed), as compared to RPKMs and RPKs.
I do not think there will be big difference in this case since all conditions will maintain the same gene length. Choosing the method is also depends on your goal. Whether you are looking for within sample variation (Ranking or finding enriched genes within samples) or you want to compare between different samples (differential expression between conditions). Here is the blog which clearly explains difference between RPKM, FPKM and TPM,
http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
You can modify the above code to calculate RPKM,FPKM or TPM. There will be just slight difference.
Example: To make the code work for TPM
First normalize the count for 'count/gene_length' and then by normalized lib_size
(sum(count/gene_length)/1,000,000)
(opposite to RPKM). Finally you divide it by 1,000,000 =[(sum(count/gene_length)/1,000,000) / 1,000,000]