RNA-Seq analysis using STAR and Salmon
7.2 years ago

Hello! I am having some trouble figuring out how to use Salmon. I have around 30 samples, which I trimmed using BBMap and then aligned using STAR, so I have a BAM file for each sample. Should I merge them all before running Salmon or, because they are all unique samples, should I run them separately? I aligned them to the UMD3.1 cattle genome. Is this OK for Salmon, or should I align to a different reference? I am very new to all this and am trying to teach myself as I go, so if anyone knows of any other sites that could help me out, that would be great!

Thanks!

RNA-Seq • 20k views

There are a few conceptual issues that might help you in the analysis:

  1. You said you aligned the reads to the UMD3.1 cattle genome with STAR. Salmon cannot quantify from genome alignments like these; it needs to work with a transcriptome, available from ftp://ftp.ensembl.org/pub/release-90/fasta/bos_taurus/ (you'll want to download the cDNA).

  2. If your interest is in finding unannotated splice sites or transcripts in the cow then you ought to be aligning to the genome as you did; you could then run a variety of tools to analyze the results; salmon doesn't do that.

  3. Salmon can take STAR alignments to the transcriptome and quantify them. That is, you could feed it STAR alignments of the reads to the Ensembl cDNA. It can also quantify directly from the reads using its own lightweight mapping, often loosely called pseudoalignment (the distinction is explained here https://liorpachter.wordpress.com/2015/11/01/what-is-a-read-mapping/). I don't know of a benchmark that has tested whether STAR alignment followed by Salmon is better or worse than Salmon with its own mapping.

  4. You asked about whether to quantify the samples jointly or not. That depends on the downstream analysis you will perform with them.
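
To make points 1 and 3 concrete, here is a rough Python sketch that only builds the command lines, one set per sample, without running anything. It is not a tested pipeline. The sample names, paths and function names are hypothetical. Using STAR's `--quantMode TranscriptomeBAM` to obtain transcript-coordinate alignments is my assumption about how one might get such a BAM; it requires a STAR index built with a GTF annotation, and the transcript FASTA given to Salmon has to match the transcripts in that BAM. The `-l A` library type is also an assumption; replace it with the real library type if auto-detection is not available in your Salmon version or mode.

```python
def star_transcriptome_cmd(sample, fastq, genome_dir, out_root):
    """Build a STAR command that also writes transcript-coordinate alignments.

    Assumes the STAR index in genome_dir was generated with a GTF annotation,
    which --quantMode needs.
    """
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--quantMode", "TranscriptomeBAM",
        "--outFileNamePrefix", f"{out_root}/{sample}_",
    ]


def salmon_alignment_cmd(sample, transcripts_fa, out_root, lib_type="A"):
    """Salmon alignment-based mode on one sample's transcriptome BAM."""
    # STAR writes this file as <prefix>Aligned.toTranscriptome.out.bam
    bam = f"{out_root}/{sample}_Aligned.toTranscriptome.out.bam"
    return [
        "salmon", "quant",
        "-t", transcripts_fa,  # transcript FASTA matching the BAM references
        "-l", lib_type,        # assumption: "A" asks Salmon to guess the type
        "-a", bam,
        "-o", f"{out_root}/{sample}_salmon_aln",
    ]


def salmon_mapping_cmd(sample, fastq, index_dir, out_root, lib_type="A"):
    """Salmon mapping-based mode straight from single-end reads, no STAR."""
    return [
        "salmon", "quant",
        "-i", index_dir,       # index built beforehand with `salmon index`
        "-l", lib_type,
        "-r", fastq,
        "-o", f"{out_root}/{sample}_salmon_map",
    ]


# Hypothetical samples: every sample gets its own commands, nothing is merged.
samples = {"cow01": "fastq/cow01.fq", "cow02": "fastq/cow02.fq"}
commands = {
    name: {
        "star": star_transcriptome_cmd(name, fastq, "star_index", "out"),
        "salmon_from_bam": salmon_alignment_cmd(name, "cdna.fa", "out"),
        "salmon_from_reads": salmon_mapping_cmd(name, fastq, "salmon_index", "out"),
    }
    for name, fastq in samples.items()
}

for name, steps in commands.items():
    for step, cmd in steps.items():
        print(name, step, " ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the paths are real. You would normally pick either the BAM route or the reads route, not both.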


For single-end RNA-seq reads we don't have information about the fragment length. Can we first map the reads with STAR and give the transcriptomic BAM to Salmon? Can Salmon then infer the fragment length by itself?

The deepTools documentation says: The “Size” is the fragment (or read, for single-end datasets)


Salmon has a default fragment length, which I think is roughly 200 bp, with a given standard deviation. For single-end reads it does not infer the fragment length; it simply uses that default or whatever the user provides via the flags (see the documentation). If you want an accurate value, ask the lab for the QC results. Usually a library is checked on a Bioanalyzer or TapeStation, and the fragment length of the cDNA is simply the length the instrument reports minus the adapter length. In most RNA-seq kits that comes to around 150-200 bp in my experience.
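
As a small illustration of that arithmetic, the sketch below subtracts the adapter length from a library peak size and passes the result to Salmon through `--fldMean` and `--fldSD`, which are the `salmon quant` options for the single-end fragment length mean and standard deviation. The numbers (a 320 bp peak, 120 bp of adapters, an SD of 25) and all paths are made-up placeholders, not recommendations; check `salmon quant --help` for the defaults in your version.

```python
def insert_length(library_peak_bp, adapter_total_bp):
    """cDNA fragment length = library size from QC minus total adapter length."""
    if adapter_total_bp >= library_peak_bp:
        raise ValueError("adapter length must be smaller than the library peak")
    return library_peak_bp - adapter_total_bp


def salmon_single_end_cmd(index_dir, fastq, out_dir, fld_mean, fld_sd):
    """Build a single-end `salmon quant` command with an explicit fragment length."""
    return [
        "salmon", "quant",
        "-i", index_dir,
        "-l", "A",                 # assumption: let Salmon guess the library type
        "-r", fastq,               # single-end reads
        "--fldMean", str(fld_mean),
        "--fldSD", str(fld_sd),
        "-o", out_dir,
    ]


# Placeholder QC values; ask the sequencing lab for the real ones.
mean_fragment = insert_length(library_peak_bp=320, adapter_total_bp=120)
cmd = salmon_single_end_cmd("salmon_index", "sample.fq", "sample_quant",
                            fld_mean=mean_fragment, fld_sd=25)
print(" ".join(cmd))
```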

7.2 years ago

Salmon would typically be used instead of STAR, not in addition to it.

The typical workflow is:

  1. raw read QC using FastQC
  2. trimming (if necessary)
  3. alignment, e.g. using STAR
  4. counting reads that overlap with genes, e.g. using featureCounts (alternatively, Salmon or kallisto skip step 3 and directly produce estimated read counts per transcript)
  5. differential gene expression analysis
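
Steps 1, 3 and 4 of this workflow could be laid out roughly as below. This is a hypothetical Python sketch that only assembles the commands in run order. Trimming is left out because it is optional and tool-specific, and step 5 usually happens in R. All paths and the function name are assumptions, and the flags shown should be checked against the FastQC, STAR and featureCounts documentation for your versions.

```python
def classic_workflow(samples, genome_dir, gtf, out_root, threads=4):
    """Return the QC, alignment and counting commands in the order to run them.

    samples maps a sample name to its gzipped single-end FASTQ file.
    """
    commands = []
    bams = []
    for name, fastq in samples.items():
        # Step 1: read QC (the output directory must already exist)
        commands.append(["fastqc", "-o", f"{out_root}/qc", fastq])
        # Step 3: genome alignment, one run per sample
        prefix = f"{out_root}/{name}_"
        commands.append([
            "STAR",
            "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--readFilesIn", fastq,
            "--readFilesCommand", "zcat",  # reads are gzipped
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", prefix,
        ])
        bams.append(f"{prefix}Aligned.sortedByCoord.out.bam")
    # Step 4: one featureCounts call over all BAMs gives one column per sample
    commands.append([
        "featureCounts",
        "-T", str(threads),
        "-a", gtf,
        "-o", f"{out_root}/gene_counts.txt",
        *bams,
    ])
    return commands


workflow = classic_workflow(
    {"cow01": "fastq/cow01.fq.gz", "cow02": "fastq/cow02.fq.gz"},
    genome_dir="star_index", gtf="genes.gtf", out_root="out",
)
for cmd in workflow:
    print(" ".join(cmd))
```

Note that the BAM files are never merged: they are passed together to a single counting step, which keeps the samples as separate columns.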

Bioconductor has a couple of great and detailed workflows: https://bioconductor.org/help/workflows/

RNA-seqlopedia is a very comprehensive source of information specifically for RNA-seq.

The course notes here are similarly detailed, but a bit more focused just on plain differential gene expression analysis.


Which reference transcriptome should I use for Salmon with Mus musculus: cDNA or CDS?

7.2 years ago
aka001 ▴ 190

Since you mention Salmon, I would guess you want to quantify at the transcript level. You should run it on the individual BAM files (without merging). As long as you have the reference (that is, the transcriptome sequences, not the genome file itself), Salmon should be usable.

If you want to count at the gene level, you would probably want to use featureCounts or STAR's --quantMode GeneCounts option.
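
For the STAR route, `--quantMode GeneCounts` writes a `ReadsPerGene.out.tab` file per sample. As far as I know it has four tab-separated columns (gene ID, then counts for unstranded, forward-stranded and reverse-stranded libraries) and starts with four `N_` summary rows. A minimal parsing sketch under that assumption, using made-up example data, could look like this:

```python
import csv
import io

# Made-up rows in the ReadsPerGene.out.tab layout:
# gene ID, unstranded counts, forward-stranded counts, reverse-stranded counts.
ROWS = [
    ("N_unmapped", 100, 100, 100),
    ("N_multimapping", 50, 50, 50),
    ("N_noFeature", 30, 400, 35),
    ("N_ambiguous", 5, 1, 6),
    ("GENE_A", 120, 2, 118),
    ("GENE_B", 0, 0, 0),
    ("GENE_C", 45, 1, 44),
]
EXAMPLE = "\n".join("\t".join(str(v) for v in row) for row in ROWS) + "\n"


def read_star_gene_counts(handle, column=1):
    """Split a ReadsPerGene table into summary rows and per-gene counts.

    column: 1 = unstranded, 2 = forward-stranded, 3 = reverse-stranded.
    """
    if column not in (1, 2, 3):
        raise ValueError("column must be 1, 2 or 3")
    summary, counts = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        target = summary if row[0].startswith("N_") else counts
        target[row[0]] = int(row[column])
    return summary, counts


summary, counts = read_star_gene_counts(io.StringIO(EXAMPLE), column=1)
print(summary)
print(counts)
```

With a real file, open it with `open(path, newline="")` and pass the handle in. Which column to use depends on the strandedness of the library.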


If you want gene abundances, you should consider using Salmon and then aggregating to the gene level with tximport; this will generally be more accurate than a read-counting pipeline.
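
tximport is an R/Bioconductor package, so the snippet below is not tximport. It is a simplified Python stand-in that shows the basic idea of summing transcript-level estimates per gene. It assumes Salmon's `quant.sf` columns (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`) and uses a hypothetical transcript-to-gene mapping and made-up values. tximport offers further options, such as length-aware scaling, that this sketch does not reproduce.

```python
import csv
import io
from collections import defaultdict

# Made-up quant.sf content; real TPM values would sum to one million.
QUANT_ROWS = [
    ("Name", "Length", "EffectiveLength", "TPM", "NumReads"),
    ("tx1", "1000", "800.0", "10.0", "50.0"),
    ("tx2", "1500", "1300.0", "30.0", "150.0"),
    ("tx3", "500", "300.0", "60.0", "20.0"),
]
QUANT_SF = "\n".join("\t".join(row) for row in QUANT_ROWS) + "\n"

# Hypothetical transcript-to-gene mapping, e.g. derived from a GTF file.
TX2GENE = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}


def summarize_to_genes(quant_handle, tx2gene):
    """Sum TPM and estimated read counts over the transcripts of each gene."""
    genes = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for row in csv.DictReader(quant_handle, delimiter="\t"):
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue  # transcript without a gene assignment is skipped
        genes[gene]["TPM"] += float(row["TPM"])
        genes[gene]["NumReads"] += float(row["NumReads"])
    return dict(genes)


gene_table = summarize_to_genes(io.StringIO(QUANT_SF), TX2GENE)
for gene, values in sorted(gene_table.items()):
    print(gene, values["TPM"], values["NumReads"])
```

For an actual differential expression analysis, the R route through tximport is the one recommended above.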
