Information on "sample_name.cnt" obtained by an RSEM analysis
4.2 years ago
umekage • 0

Hello,

After running RSEM 1.3.3, I found a file named sample_name.cnt in the newly created sample_name.stat directory. Its contents are shown below. What do these numbers mean?

Thank you in advance for your kindness.

0 2726098 0 2726098
1534055 1192043 1993977
9793897 1
0       0
1       732121
2       410181
3       513309
4       610475
5       90206
6       81551
7       63620
8       44947
9       33029
10      21745
11      22282
12      21545
13      13324
14      17247
..
..
..
RNA-Seq RSEM • 2.5k views
4.2 years ago
umekage • 0

The format and meaning of each field are described in "cnt_file_description.txt" in the RSEM directory.

http://deweylab.github.io/RSEM/rsem-calculate-expression.html#OUTPUT

https://github.com/bli25broad/RSEM_tutorial

Here is the text of that file.

  # '#' marks the start of comments (till the end of the line)
  # *.cnt file contains alignment statistics based purely on the alignment results
  # obtained from aligners

  N0 N1 N2 N_tot
  # N0, number of unalignable reads; N1, number of alignable reads; N2, number
  # of filtered reads due to too many alignments; N_tot = N0 + N1 + N2

  nUnique nMulti nUncertain
  # nUnique, number of reads aligned uniquely to a gene; nMulti, number of reads
  # aligned to multiple genes; nUnique + nMulti = N1;
  # nUncertain, number of reads aligned to multiple locations in the given reference
  # sequences, which include isoform-level multi-mapping reads

  nHits read_type
  # nHits, number of total alignments.
  # read_type: 0, single-end read, no quality score; 1, single-end read, with quality
  # score; 2, paired-end read, no quality score; 3, paired-end read, with quality score

  # The next section counts reads by the number of alignments they have. Each line
  # contains two values separated by a TAB character. The first value is number of
  # alignments.
  # 'Inf' refers to reads filtered due to too many alignments. The second value is the
  # number of reads that contain such many alignments

  0                             N0
  ...
  number_of_alignments          number_of_reads_with_that_many_alignments
  ...
  Inf                           N2
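For anyone who wants to read these fields programmatically, here is a minimal Python sketch that follows the layout described above. The names `CntStats` and `parse_cnt` are hypothetical (they are not part of RSEM), and the miniature input at the bottom is made up for illustration. The description says the histogram columns are TAB-separated; the sketch splits on any whitespace so it also accepts space-separated text like the copy pasted in the question.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CntStats:
    """Fields of a *.cnt file, named after cnt_file_description.txt."""
    n0: int           # unalignable reads
    n1: int           # alignable reads
    n2: int           # reads filtered due to too many alignments
    n_tot: int        # N0 + N1 + N2
    n_unique: int     # reads aligned uniquely to a gene
    n_multi: int      # reads aligned to multiple genes
    n_uncertain: int  # reads aligned to multiple locations in the reference
    n_hits: int       # total number of alignments
    read_type: int    # 0..3, see the description above
    # number of alignments -> number of reads; the 'Inf' row is keyed by math.inf
    reads_by_alignments: dict = field(default_factory=dict)


def parse_cnt(text: str) -> CntStats:
    """Parse the text of a *.cnt file (hypothetical helper, not an RSEM API)."""
    rows = []
    for line in text.splitlines():
        # Drop '#' comments, then skip lines that are empty afterwards.
        fields = line.split("#", 1)[0].split()
        if fields:
            rows.append(fields)

    n0, n1, n2, n_tot = (int(x) for x in rows[0])
    n_unique, n_multi, n_uncertain = (int(x) for x in rows[1])
    n_hits, read_type = (int(x) for x in rows[2])

    histogram = {}
    for key, count in rows[3:]:
        histogram[math.inf if key == "Inf" else int(key)] = int(count)

    return CntStats(n0, n1, n2, n_tot, n_unique, n_multi, n_uncertain,
                    n_hits, read_type, histogram)


# Made-up miniature file, not real RSEM output.
example = "0 10 1 11\n6 4 7\n19 1\n0\t0\n1\t3\n2\t5\n3\t2\nInf\t1\n"
stats = parse_cnt(example)
print(stats.n_tot, stats.reads_by_alignments)
```

For a real file, something like `parse_cnt(open("sample_name.stat/sample_name.cnt").read())` should work, assuming the file contains only the numeric rows and no elided ".." lines.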

Hi umekage,

What is the difference between N_tot and nHits? Why aren't they the same? Where does the difference of about 7 million come from? How is an aligned read not equal to an alignment?

2,726,098 N_tot

1,534,055 Unique 1,192,043 Multi 1,993,977 Uncertain

9,793,897 nHits

Unique + Multi = N_tot, so these are the aligned reads. But even adding in Uncertain still gives far less than the 9.7M. Also, why is Uncertain not counted in Multi if it's aligning to multiple places?

Thanks, Lindsay
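A quick way to look at the arithmetic in this comment is sketched below in Python. It reads the format description quoted earlier literally: N_tot counts reads, while nHits counts alignments, so a read with several alignments is counted once in the first and several times in the second. The three-read toy case is made up; the other numbers are copied from the posted file.

```python
# Numbers copied from the posted sample_name.cnt.
n_tot = 2_726_098
n_unique, n_multi, n_uncertain = 1_534_055, 1_192_043, 1_993_977
n_hits = 9_793_897

# Every aligned read falls into exactly one of nUnique / nMulti.
assert n_unique + n_multi == n_tot

# Toy case (made up): three reads with 1, 2 and 4 alignments each.
toy_alignments_per_read = [1, 2, 4]
toy_reads = len(toy_alignments_per_read)  # counted like N1
toy_hits = sum(toy_alignments_per_read)   # counted like nHits

# In the posted file, each read has this many alignments on average.
mean_alignments_per_read = n_hits / n_tot
print(f"{toy_reads} reads, {toy_hits} alignments in the toy case")
print(f"{mean_alignments_per_read:.2f} alignments per read in the posted file")
```

Under that reading, nHits is not expected to equal nUnique + nMulti + nUncertain. The posted file averages roughly 3.6 alignments per read.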


Hi Lindsay,

Did you ever get this question answered? I have the same one lol

13 months ago
  • The read counts in the number-of-alignments section sum to N_tot.
  • The read counts for every row except row 1 sum to nUncertain. I assume these are reads with multiple alignments that can't be assigned to one gene, so why aren't they nMulti?
  • N_tot minus nUnique is nMulti. I assume these reads may not fall in a gene region and so can't be assigned to a transcript/gene, which seems like it should be nUncertain.
  • The read count for row 1 is not equal to nUnique. I assume row 1 counts unique sites and nUnique counts the total number of alignments at those sites.
  • I still don't know the difference between alignable reads and nHits, or why nHits is so much larger than the number of alignable reads.

Hope this helps clarify at least some things.
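The identities in the list above can be checked against the numbers posted in the question with a short Python sketch. Only rows 0 to 14 of the histogram were shown in the post, so the dictionary below is incomplete. The last part rests on an assumption that the thread does not confirm: that nHits is the sum over rows of (number of alignments × number of reads).

```python
# Values copied from the posted file; rows after 14 were elided in the post.
n_tot, n_unique, n_multi, n_uncertain = 2_726_098, 1_534_055, 1_192_043, 1_993_977
n_hits = 9_793_897
visible = {
    0: 0, 1: 732_121, 2: 410_181, 3: 513_309, 4: 610_475,
    5: 90_206, 6: 81_551, 7: 63_620, 8: 44_947, 9: 33_029,
    10: 21_745, 11: 22_282, 12: 21_545, 13: 13_324, 14: 17_247,
}

# N_tot minus nUnique is nMulti.
assert n_tot - n_unique == n_multi
# Reads with more than one alignment (everything except rows 0 and 1) give nUncertain.
assert n_tot - visible[0] - visible[1] == n_uncertain

# ASSUMPTION: nHits = sum of (alignments per read * reads) over all rows.
visible_reads = sum(visible.values())
visible_hits = sum(k * n for k, n in visible.items())
hidden_reads = n_tot - visible_reads  # reads in the elided rows
hidden_hits = n_hits - visible_hits   # alignments left for those reads

print(f"visible rows: {visible_reads} reads, {visible_hits} alignments")
print(f"elided rows:  {hidden_reads} reads, {hidden_hits} alignments")
print(f"mean alignments per elided read: {hidden_hits / hidden_reads:.1f}")
```

If the assumption holds, the reads in the elided rows should average at least 15 alignments each, since those rows start at 15. The posted numbers are consistent with that, which would explain why nHits is so much larger than the read count.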
