Question

Incompatibility between the output of STAR and RSEM

0

Entering edit mode

7.4 years ago

wayj86 ▴ 40

Hi all,

I aligned my data with STAR, and then calculated the expression with RSEM:

STAR --runMode alignReads --runThreadN 2 --genomeDir ~/ref_ensemble_ERCC/star_ref/ --readFilesIn ~/WJC-RNAseq/WJC-3/WJC-3_clean_r1.fq  ~/WJC-RNAseq/WJC-3/WJC-3_clean_r2.fq --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outSAMreadID Number --outSAMattrIHstart 0 --outSAMattributes All --outFilterMismatchNoverReadLmax 0.05 --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outFileNamePrefix ./WJC-3

rsem-calculate-expression --alignments --paired-end -p 2 --estimate-rspd ~/WJC-RNAseq/WJC-3/WJC-3Aligned.toTranscriptome.out.bam ~/ref_ensemble_ERCC/rsem_ref/ensemble_ERCC ./WJC-3

The "WJC-3ReadsPerGene.out.tab" generated by STAR contains the read counts mapped to genes, while the "WJC-3.genes.results" contains the FPKM values. However, I found Incompatibility between the output of STAR and RSEM:

gene_id read_counts FPKM
ENSG00000275708 44146   0
ENSG00000207442 1263    0
ENSG00000264462 738 0
ENSG00000207616 638 0
ENSG00000201512 425 0
ENSG00000264063 422 0
ENSG00000206688 354 0
ENSG00000269688 315 0
ENSG00000228437 311 0
ENSG00000248347 290 0
ENSG00000210196 279 0
ENSG00000207554 248 0
ENSG00000224574 248 0
ENSG00000258380 246 0
ENSG00000282732 235 0
ENSG00000266897 217 0
ENSG00000240970 215 0
ENSG00000225091 213 0
ENSG00000207549 212 0
ENSG00000269981 206 0
ENSG00000236871 181 0
ENSG00000226942 180 0
ENSG00000243433 180 0
...

As shown in the table above, there were 4776 genes with read_counts > 0 but FPKM = 0; meanwhile there were 800 genes with read_counts n = 0 but FPKM > 0 (data not shown). When using the different expression tools (e.g. DEseq, EdgeR), which require the read counts N, those genes whose read counts > 0 would be regarded as expressed; when using WGCNA, which requires FPKM or TPM values, those genes with read counts > 0 but FPKM = 0 would be regarded as no expression.

So comes the question: how should I get the read counts for DEseq, EdgeR? Extract it from the "ReadsPerGene.out.tab" file? Or recalculated it with featureCounts? Or extract it from the "genes.results" file, which also contains the "expected_count" values?

Thank you very much for your attention.

Stanley

RNA-Seq • 2.5k views

ADD COMMENT • link updated 7.4 years ago by WouterDeCoster 48k • written 7.4 years ago by wayj86 ▴ 40

0

Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 7.4 years ago by WouterDeCoster 48k

0

Entering edit mode

Thanks. I have marked my post up. This is the my first time to post a topic in Biostars: )

ADD REPLY • link 7.4 years ago by wayj86 ▴ 40