Hi, I have got large RNASeq data sets from Illumina platform. I am looking to analyze these data in order to find differential gene expression, pathways analysis etc. Please suggest pipelines and workflow.
Hi, I have got large RNASeq data sets from Illumina platform. I am looking to analyze these data in order to find differential gene expression, pathways analysis etc. Please suggest pipelines and workflow.
IMO the edgeR package is amazing, has great documentation and is highly validated. A really good tutorial for differential expression is this one:
https://f1000research.com/articles/5-1438
For complete tutorial theres: https://github.com/griffithlab/rnaseq_tutorial
There's a massive amount of info online, google is your friend
1 - FastQC;
2 - STAR alignment;
3 - Get ReadCounts using Rsubreads;
4 - Differential expression using DEseq2 and EdgeR;
5 - Pathway analysis (Panther, GSEA, etc...);
I have used a couple of different genome aligners. Among all liked STAR the most and provide good and consistent results. Take a look at the following thread: HISAT2 V.S. STAR
STAR is better in general (but again depends on what you really want to do) for genome alignment and it is easy to use too! If your goal eventually is to do differential gene expression analysis and pathway analysis, I would probably use STAR for alignment.
If you are confirmable in using Hisat2, I don't see a problem using that either.
Simply, (1) Check the quality of raw sequences (FastQC), (2) Remove adapters and low quality reads (this is OPTIONAL and can use a tool like Cutadapt), (3) map to a reference genome of interest (STAR), (4) prepare count matrices (Subread featureCounts), (5) differential gene expression analysis using R package DESeq2 or edgeR and (6) pathway analysis using a program like clusterProfiler.
I have written a complete pipeline that I use for my RNA-seq data analyses: https://github.com/jkkbuddika/RNA-Seq-Data-Analyzer (This pipeline takes care of steps 1-4)
The repository also contain a detailed step-by-step user guide that would walk you through getting things started. The Python pipeline eventually generates count tables using featureCounts and additionally bedgraphs for IGV visualization using DeepTools. These final count tables can be used as an input for DESeq2 or edgeR to do the differential gene expression analysis. Let me know if you need any help.
Yes! Take a look at the User guide: https://github.com/jkkbuddika/RNA-Seq-Data-Analyzer/blob/master/USERGUIDE.md
User guide describes 4 simple steps to get you going. Yes! You have to have all the programs installed and have sufficient memory to run the pipeline. I usually install all programs using miniconda (see Step 1 of the user guide).
The pipeline analyze both single and paired end sequencing files. You just have to specify which mode you want to analyze. Again, let me know if you need any more assistance.
The pipeline itself will not perform the differential gene expression analysis. You have to use an R package such as DEseq2 or edgeR to do that. The pipeline generates count tables that are the input for these packages. Let me know if you need any help with this too.
Cheers!
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you all for your informative help.
Do I need to RUN any pipeline to remove adaptors before analyzing the data? I have a total 110 paired-end fastq files generated from Illumina.
Technically no. Most aligners will soft-clip adapters. One reason you may want to do it is to check if some samples are more affected than others. Taking a look at the FastQC reports (use
multiQC
to collate all FastQC reports) should give you a good idea.Hi, I have run STAR with following commands and got the output files. Please let me know which file should be used for featureCounts.
Use alignment BAM file.
Since, I have used gtf file in STAR index command. Should I use gtf file again in featureCount RUN command?
Here is the command for featureCount that I am using.
You post below that you have 180 fastq files. I honestly don't know why you're using STAR because that's going to take an incredibly long time (and lots of RAM) to process all of them. Use super-fast pseudoalignment methods (e.g. kallisto) instead -- benchmarks show that pseudoalignment methods (kallisto and salmon) outperform featureCounts in accuracy.
Previously, I used HISAT2 for less number of RNASeq data. However, I am not sure about STAR. If it is not usefull, please recommend any other to perform a large number of data. Yes! I have 180 very large files of RNASeq.
STAR and HISAT2 are good aligners but I recommend using pseudoalignment, especially when you have that much data.
Here's the manual for kallisto: https://pachterlab.github.io/kallisto/starting
Afterwards, for differential gene expression analysis, you can use DESeq2 or sleuth or limma etc. (there are many methods!).
Hi, I am not very familiar about kallisto. As I am taking a quick look at its manual, it requires transcriptome file. Therefore, where I can get the transcriptome file. I am also not sure about "transcripts.fasta.gz" in the following command. I have paired end reads instead.
Here is the command: kallisto index -i transcripts.idx transcripts.fasta.gz
Great question! This information is described in the manual. Briefly, you can download the ENSEMBL cDNA FASTA files (should look something like *.cdna.all.fa.gz) -- this will be your "transcripts.fasta.gz" file in the command you put in your comment. Or you can download a prebuilt index and skip the "kallisto index" step altogether.
Hi, I have RUN kallisto and got the output result for a sample. Finally, I have file called "abundance.tsv". Do I need to run featureCount. It seems .tsv file has already useful information, please let me know which column I need to consider for further use to identify differential gene expression analysis.
Yup! You're all good then -- you already have those counts (in est_counts). You're all good to go with differential gene expression analysis (see the differential gene expression analysis programs I linked to above).
Note: if you want to use DESeq2 or limma for your differential gene expression analysis, you should first import that file with tximport (see the kallisto section).
After running kallisto, I have got separate result files (abundance.tsv and abundance.h5) for each of my samples. I have just RUN for 6 samples so I have 6 .tsv and .h5 files.
However, I am not sure how to open all results files at tximport. Since I go through the kallisto section in tximport manual, in their command it illustrated one "abundance.h5" as input. How to import all files with tximport and marge into single file to use DESeq2.
Here is the R script that is in the manual: However, after running, it is showing default data. Could you please let me know, what is "samples.txt"?
The tximport manual is showing how to work with multiple abundance.tsv (or abundance.h5) files. The
files
variable is actually a list of multiple abundance.tsv files (each abundance.tsv file is in a separate directory).Just populate that
files
list in R with the paths to all your abundance.tsv files and after runningtximport
, you should get a genes x samples matrix.You don't need samples.txt -- just replace samples$run with a list of the names of your samples (the names of your samples are just the directory names where the abundance files are located).
Hi, I am trying following command, however it is showing error. Please correct me, I am pretty new on this platform.
ERROR:
No worries! OK, try the following:
If that doesn't work, double-check and make sure that those two files actually exist on your filesystem.
Now, it worked. I changed slightly in the coding.
However, I am not sure what should I do next. Do I need to make a metadata file for DESeq2? I am lost in DeSeq2 manual, could you please suggest some initial steps where should I start in the manual of DeSeq2.
Also, it shows gene ID in its column 1, how these ids will change genes name. It shows the following matrix after running the command (head(txi.kallisto$counts).
Yup! You're doing great; you've figured out how to create a genes x samples matrix.
From here, just follow the DESeq2 steps (create a sampleTable then use DESeqDataSetFromTximport) in the manual.
Also, by the way, you're using transcript IDs (ENST...) -- you need to change them into gene IDs (the tutorial explains how to make a
tx2gene
data frame to get you gene IDs from the transcript IDs). Once you have the gene IDs, you also have to convert them into gene names. It's a fairly involved process to go from transcript IDs -> gene IDs -> gene names, but many people have asked these questions in the past and you can find an answer by doing a Google search.You're almost there but you've got a few more steps to do before you can complete your analysis. This chain of comments has gotten pretty long so I recommend posting new questions on Biostars so that more of the community can see and respond. For example, if you can't figure out how to make a sampleTable for DESeqDataSetFromTximport, post that as a new question and mention where you're getting stuck. If you can't figure out how to get from transcript IDs to gene IDs or from gene IDs to gene names after Google searching, you can post that as another question on Biostars.
Sure! Thank you for your help.
Hi, please let me know how to RUN kallisto for multiple files. I am trying this loop, but I think it is not correct.
for file in *.fastq.gz; do kallisto quant -i transcripts.idx -b 100 -o "${file}-aligned" "${file}"; done
Your reads are paired-end (i.e. you have two fastq files per sample).
Let's say each sample has a _R1.fastq.gz and an _R2.fastq.gz file associated. You'd want to do this (basically, extracting all the sample names by removing the _R1.fastq.gz and _R2.fastq.gz via cut):
for file in $(ls *.fastq.gz|rev|cut -c 13-|uniq|rev); do kallisto quant -i transcripts.idx -b 100 -o "${file}-aligned" "${file}"_R1.fastq.gz "${file}"_R2.fastq.gz; done
FYI, if you want to speed things up, you can use more threads for kallisto via the -t flag. For example, if you want to use 8 parallel threads, you can do
kallisto quant -t 8
(followed by the rest of the kallisto command).Hi dsull, I have some more follow-up questions regarding RNA-Seq data analysis using Kallisto. Could you please answer my post here https://www.biostars.org/p/469769/?
Thanks, I'm following that thread -- it's a DESeq2-related question that ATPoint has just responded to. Let us know if any further issues arise.
Hi dull, I did not get my answer in the post. My initial question is how to import these two factors types of data with different numbers. I will use tx2gene after importing the data.
If you have 180 BAM files and you want to construct an alignment matrix from them you will need to use something similar to the command below:
This will create a matrix of counts with genes in rows and the sample in columns.
you can follow salmon-deseq2 and salmon-wasabi-sleuth pipeline for RNAseq here: https://digibio.blogspot.com/2018/05/rna-seq-data-analysis-using-salmon.html