Question

Best practices for differential expression analysis

0

5.4 years ago

flogin ▴ 280

I'm looking for best practices for differential expression analysis, and I found this paper: https://www.ncbi.nlm.nih.gov/pubmed/24300110 (the paper most closely related to my questions).

However, that paper discusses methods for differential expression analysis that start from expression data as input, like the input for the DESeq package, right?

But, I'm thinking about the whole project, for example:

If I have only a few sequences (e.g. genes) and not the whole assembled genome, can I perform a differential expression analysis?
Which is the best tool for it (considering the situation above)?
Is some kind of normalization necessary before the gene expression analysis?

With my knowledge, I designed an experiment like this:

Sequence reference: A fasta file with nucleotide information of 5 genes.
RNA-seq libraries: fastq files from RNA-seq experiments with the following conditions: Control, Treatment 1, Treatment 2, Treatment 3.
Mapping: Bowtie2
Output conversion using bam files information to make a table with a count of alignments of each mapping analysis.
DESeq analysis using as input the output created in the previous step.

Is that right? I have no idea whether a simple mapping analysis with Bowtie2, using just the gene sequences as reference, can be used to infer differences in gene expression.

Best,

transcriptome RNA-Seq mapping bowtie best • 3.4k views

ADD COMMENT • link updated 5.4 years ago by Bastien Hervé 5.9k • written 5.4 years ago by flogin ▴ 280

2

With only 5 genes of interest, why aren't you using qPCR?

ADD REPLY • link 5.4 years ago by Friederike 9.0k

0

Because I'm working with public data from a lot of different species.

ADD REPLY • link 5.4 years ago by flogin ▴ 280

0

So does that mean you're not actually going to perform the sequencing yourself, but you're going to download data that other people have sequenced and deposited in a public repo?

ADD REPLY • link 5.4 years ago by Friederike 9.0k

0

You're working with lots of public NGS data with only 5 genes?

ADD REPLY • link 5.4 years ago by swbarnes2 14k

1

A couple of points: 3 replicates per condition is the bare minimum. Also, DESeq uses information from all the genes to estimate dispersion, so that step might behave strangely with only a handful of genes being measured.

ADD REPLY • link 5.4 years ago by swbarnes2 14k

score 4 · Answer 1 · 2019-06-26

Some comments:

Sequence reference: A fasta file with nucleotide information of 5 genes. RNA-seq libraries: fastq files from RNA-seq experiments with the following conditions: Control, Treatment 1, Treatment 2, Treatment 3.

Since you are doing RNA-Seq, you should either align (STAR, HISAT2) against the whole genome (e.g. hg38 for human) rather than only your genes of interest and then count the number of reads per gene (featureCounts, htseq-count), or use a pseudo-aligner directly on the transcriptome (kallisto, Salmon).

Also, you should have more than one control sample; otherwise it will be impossible to infer any statistical significance.

Mapping: Bowtie2

You can use Bowtie2, but only if you align to a transcriptome. For whole-genome alignment, use a splice-aware aligner such as STAR or HISAT2.
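To make the genome-alignment route concrete, here is a minimal sketch that only builds the command lines for HISAT2 and featureCounts; it does not run them. The index prefix and all file names are hypothetical placeholders, and single-end reads are assumed.

```python
# Sketch only: the index prefix and file names below are hypothetical placeholders.
from typing import List


def hisat2_command(index_prefix: str, fastq: str, sam_out: str, threads: int = 4) -> List[str]:
    """Build a HISAT2 command for single-end reads (splice-aware genome alignment)."""
    return ["hisat2", "-p", str(threads), "-x", index_prefix, "-U", fastq, "-S", sam_out]


def featurecounts_command(gtf: str, counts_out: str, bams: List[str], threads: int = 4) -> List[str]:
    """Build a featureCounts command that counts reads per gene using a GTF annotation."""
    return ["featureCounts", "-T", str(threads), "-a", gtf, "-o", counts_out] + list(bams)


align = hisat2_command("genome_index", "control_rep1.fastq.gz", "control_rep1.sam")
count = featurecounts_command("annotation.gtf", "counts.txt", ["control_rep1.bam"])
print(" ".join(align))
print(" ".join(count))
```

Each list could be passed to subprocess.run(cmd, check=True) once the tools are installed. The SAM output would usually be sorted and converted to BAM (e.g. with samtools) before counting; that step is left out here.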

Output conversion using bam files information to make a table with a count of alignments of each mapping analysis.

Use featureCounts or htseq-count with the correct annotation file (GTF). The ENSEMBL ones are pretty good (check the "Gene sets" column: http://www.ensembl.org/info/data/ftp/index.html).
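As an illustration of the "table of counts" step, the sketch below merges per-sample htseq-count style output (one gene and count per line, tab-separated, with summary rows starting with "__") into a single gene-by-sample table. The sample names and counts are made up, and the helper names are hypothetical.

```python
from typing import Dict, List


def parse_htseq_counts(text: str) -> Dict[str, int]:
    """Parse htseq-count style output: one 'gene<TAB>count' pair per line."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("__"):
            continue  # skip blank lines and summary rows such as __no_feature
        gene, value = line.split("\t")
        counts[gene] = int(value)
    return counts


def build_count_matrix(samples: Dict[str, Dict[str, int]]) -> List[List[str]]:
    """Return a header row plus one row per gene; genes missing in a sample count as 0."""
    names = list(samples)
    genes = sorted({gene for counts in samples.values() for gene in counts})
    rows = [["gene"] + names]
    for gene in genes:
        rows.append([gene] + [str(samples[name].get(gene, 0)) for name in names])
    return rows


# Made-up toy data standing in for two htseq-count output files.
control = parse_htseq_counts("geneA\t10\ngeneB\t0\n__no_feature\t5\n")
treated = parse_htseq_counts("geneA\t25\ngeneB\t3\n__no_feature\t7\n")
matrix = build_count_matrix({"control_1": control, "treatment1_1": treated})
for row in matrix:
    print("\t".join(row))
```

The resulting table (genes in rows, samples in columns, raw integer counts) is the general shape that count-based tools such as DESeq2 expect as input.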

DESeq analysis using as input the output created in the previous step.

DESeq2 to be precise ;)

score 2 · Answer 2 · 2019-06-26

AFAIK, there are currently no established "best practices". When I have time, I try to run the same workflow with different software and compare the results.

If I have only a few sequences (e.g. genes) and not the whole assembled genome, Can I make a differential expression analysis?

I assume you used cDNA capture or a related capture technique to extract your RNA of interest. If so, it is totally fine to do gene expression analysis with your data.

RNA-seq libraries: fastq files from RNA-seq experiments with the following conditions: Control, Treatment 1, Treatment 2, Treatment 3.

You only have n=1? The statistical power of your experiment will be very low, so be careful when interpreting the results.

First, check your read quality using FastQC or fastp to get an overview of your sequencing data.

You can align your reads to a complete reference genome to check whether they fall within your gene coordinates, which is a good validation of your capture.

If you are aligning to a genome, do not use a non-splice-aware aligner such as Bowtie2. With default options, Bowtie2 is not aware of the splice events present in your genes; prefer HISAT2 or STAR. You can also take a look at pseudo-alignment tools such as kallisto or Salmon. If you are aligning to a transcriptome, Bowtie2 will be fine.
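A small illustration of why n=1 is a problem: the within-condition variance, which differential expression tests rely on, cannot be estimated from a single observation. The counts below are made up.

```python
import statistics

control_single = [120]         # made-up count for one gene, n=1
control_reps = [120, 95, 140]  # made-up counts for the same gene, n=3

try:
    statistics.variance(control_single)
    single_ok = True
except statistics.StatisticsError:
    single_ok = False  # sample variance is undefined for a single observation

print("variance estimable with n=1:", single_ok)
print("variance with n=3:", statistics.variance(control_reps))
```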

For the counting part, you can use featureCounts or htseq-count, or get estimated counts directly from kallisto or Salmon.

If you want to look at expression variation between gene A and gene B in the same condition, TPM normalization will be enough.
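For reference, TPM can be sketched as follows: divide each count by the gene length, then scale so that the values sum to one million. The gene names, counts and lengths are made-up toy values.

```python
from typing import Dict


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Transcripts per million: length-normalize each count, then scale to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Toy data: same read count, but geneB is four times longer than geneA.
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 4000}
values = tpm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value))
```

With equal counts, the shorter gene gets the higher TPM. Note that if the reference contains only 5 genes, the values sum to a million over those 5 genes alone, so they are not comparable to TPM computed against a whole transcriptome.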

If you are looking at variation of gene A across conditions, tools like edgeR, DESeq2 or Sleuth will help you.

See also, for normalization: RNA-seq, why normalize for library size?
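To illustrate library-size normalization, here is a sketch of the median-of-ratios idea behind DESeq2's size factors. It shows the principle only and is not DESeq2's actual code; the counts are made up and the function name is hypothetical.

```python
import math
from statistics import median
from typing import Dict, List


def size_factors(counts: Dict[str, List[int]]) -> List[float]:
    """counts maps gene -> raw counts, one per sample, in the same sample order."""
    n_samples = len(next(iter(counts.values())))
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue  # a zero count makes the geometric mean unusable for this gene
        log_geomean = sum(math.log(c) for c in gene_counts) / n_samples
        for i, c in enumerate(gene_counts):
            log_ratios[i].append(math.log(c) - log_geomean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # Per sample: median ratio of its counts to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 was sequenced about twice as deeply as sample 1.
toy = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 100]}
factors = size_factors(toy)
print([round(f, 3) for f in factors])
```

Dividing each sample's counts by its size factor puts the samples on a common scale. With only a handful of genes, the median is taken over very few ratios, so the estimate can be unstable, which is one more reason to quantify against the whole genome or transcriptome.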