Question

How do I normalize my RNA-seq data across different samples in different conditions?

7.0 years ago

crzy_azn_sean • 0

Hi, I am pretty new to this whole computational process. I have no experience with any R packages; I have only used commercial software (CLC Genomics Workbench). It would be awesome if you could give me detailed suggestions and links so I can practice.

Before I begin let me describe my samples.

Mapping results to reference genome

Method: cDNA library, 100 bp paired-end -> Illumina HiSeq -> raw data processed (TPM) with CLC Genomics Workbench

GOAL: I am trying to compare expression levels between samples, Treated vs. Not Treated (with replicates).

Here are the questions I have:

1. My treated sample had a small population when I isolated the total RNA for RNA-seq. I observed a big difference in mapping % between the treated and non-treated samples (the non-treated control mapped around 17 times more than the treated sample). Is there any way to normalize for this difference? It's not like I can just multiply the treated samples by 17 to make up for the low mapping %, right?

2. When I normalize the raw data (FASTQ), which normalization process should I consider? I have used TPM to account for transcript length and sequencing depth, but I am new to RNA-seq and I need some serious help. Any suggestions?

I don't have much experience with computational processing. I would be grateful if you could give me detailed suggestions or methods!

rna-seq normalization mapping library size • 11k views

score 5 · Answer 1 · 2017-12-02

My advice would be to avoid FPKM and TPM. In addition, people should cease the use of Cufflinks (unless they are bound by some legacy data produced by Cufflinks) and move toward HISAT2 / StringTie, which are major upgrades of TopHat2 / Cufflinks.

One worry I have is that you allude to your lack of experience in R. However, the pipeline that I'm about to describe below is well documented and there is virtually an entire tutorial for you to follow (to which I link at the end).

Thus, a simpler workflow for you:

--------------------------

1, Quantify read count abundances directly from FASTQ

From your FASTQ files, quantify read count abundances per sample using Kallisto or Salmon. As your reference transcriptome (against which reads will be counted), you can use the GENCODE reference FASTA files, either just the protein coding RNAs (~21,000 transcripts) or the 'comprehensive' reference FASTA (~200,000 transcripts and isoforms), which includes protein coding RNA, all known non-coding RNA, nonsense-mediated decay transcripts, and both processed and unprocessed pseudogenes.

Protein coding RNA reference FASTA (direct link)
Comprehensive RNA reference FASTA (direct link)

These, and others including GTF files, are available here
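As a rough sketch of what step 1 looks like in practice, the snippet below only builds (it does not run) one `kallisto quant` command per paired-end sample. The sample names, FASTQ file names, index name `gencode.idx`, and output folder `quant/` are all hypothetical placeholders; the index itself would be built beforehand from the reference FASTA with `kallisto index`.

```python
def kallisto_quant_command(index, out_dir, fastq_r1, fastq_r2, bootstraps=0):
    """Build the argument list for `kallisto quant` on one paired-end sample."""
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir]
    if bootstraps > 0:
        # -b sets the number of bootstrap samples (optional)
        cmd += ["-b", str(bootstraps)]
    # Paired-end reads are passed as consecutive R1/R2 files
    cmd += [fastq_r1, fastq_r2]
    return cmd


# Hypothetical sample sheet: sample name -> (R1 file, R2 file)
samples = {
    "treated_1": ("treated_1_R1.fastq.gz", "treated_1_R2.fastq.gz"),
    "control_1": ("control_1_R1.fastq.gz", "control_1_R2.fastq.gz"),
}

commands = {
    name: kallisto_quant_command("gencode.idx", f"quant/{name}", r1, r2)
    for name, (r1, r2) in samples.items()
}

for name, cmd in commands.items():
    print(" ".join(cmd))
```

Each printed line can be pasted into a terminal, or the lists can be handed to `subprocess.run` once Kallisto is installed.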

2, Import the counts into R using tximport
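To give a feel for what gene-level summarisation means here, this is a simplified, language-neutral illustration in Python (not tximport itself, which is an R package and also tracks transcript lengths). It sums transcript-level estimated counts per gene using a transcript-to-gene map. The transcript and gene IDs are made up, and skipping unmapped transcripts is an assumption of this sketch.

```python
from collections import defaultdict


def summarise_to_gene(tx_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level counts."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is None:
            # Assumption of this sketch: transcripts absent from the map are skipped
            continue
        gene_counts[gene] += count
    return dict(gene_counts)


# Made-up estimated counts for one sample (quantifiers report fractional counts)
tx_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 20.0, "tx4": 3.0}
# Made-up transcript-to-gene map; tx4 is deliberately missing
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarise_to_gene(tx_counts, tx2gene)
print(gene_counts)
```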

3, Normalise and conduct differential expression analysis in DESeq2

DESeq2's normalisation method estimates one size factor per sample: for each transcript it takes the geometric mean of the counts across all samples in your dataset, and the size factor of a sample is the median of the ratios of its counts to those geometric means. This deals very well with differences in library size and also with low count transcripts.
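The idea behind this median-of-ratios normalisation can be sketched in a few lines of plain Python. This is a toy illustration, not DESeq2's actual code: the counts are invented so that the control is sequenced exactly 17 times deeper than the treated sample, and transcripts with a zero count in any sample are left out of the size factor estimate.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict of sample name -> list of raw counts (same transcript order).
    """
    samples = list(counts)
    n_tx = len(counts[samples[0]])

    # Log geometric mean per transcript; None if any sample has a zero count
    log_geo_means = []
    for i in range(n_tx):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor per sample = median ratio of its counts to the geometric means
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][i]) - lgm
            for i, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Invented counts: control is exactly 17x deeper than treated
counts = {
    "control": [1700, 3400, 170, 0, 850],
    "treated": [100, 200, 10, 5, 50],
}

factors = size_factors(counts)
normalised = {s: [c / factors[s] for c in counts[s]] for s in counts}
print(factors)
```

Dividing each sample by its size factor puts both samples on a common scale, which is why there is no need to multiply anything by 17 by hand. With real data the depth difference is never this clean, and a very shallow sample still carries less information no matter how it is scaled.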

The best tutorial that you could follow is this one by the developers of DESeq2, recently updated in time for Christmas (a few days ago): Analyzing RNA-seq data with DESeq2. In the tutorial, they allude to the use of Kallisto and tximport.

--------------------

I'm sure that you'll have further queries, in which case ask them here or open a new question.

For further reading on RNA-seq normalisation, read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

Kevin

score 0 · Answer 2 · 2017-12-01

7.0 years ago

Hussain Ather ▴ 990

Reads can be normalized using fragments per kilobase of transcript per million mapped reads (FPKM). You can use Cufflinks to compare samples and find differentially expressed genes using FPKM.
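For reference, here is a minimal sketch of how FPKM, and the closely related TPM mentioned in the question, are computed from counts and transcript lengths. The counts and lengths are made up, and the sum of the counts stands in for the number of mapped fragments; real tools typically use effective lengths and their own fragment totals.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts)  # stand-in for the mapped fragment total
    return [
        c / (length / 1e3) / (total_fragments / 1e6)
        for c, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Made-up fragment counts and transcript lengths (bp) for one sample
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 3500]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values)
```

TPM values always sum to one million within a sample, whereas the FPKM total varies from sample to sample. Both are within-sample scalings of the same length-normalised rates.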