Clarification of good pipeline for transcriptome assembly of RNA-seq data
1
1
Entering edit mode
8.1 years ago
nash.claire ▴ 510

Hi all,

I'm hoping to get some clarity on the best approach for analyzing my RNA-seq data. I've have read dozens of posts and articles on this subject but I feel it's made me more confused than when I started!

The background for my project is that I have RNA-seq data for the same rat tissue type in males and females. What I would like to do is look at the difference between what transcripts are present between males and females. Ie I'd like to see if known transcripts are present as well as any novel transcripts/splice variants in each sample and then compare them between my male and female tissues to see if I get different splice variants etc between the tissues.

I understand that there are loads of different algorithms out there each with their own advantages and disadvantages and each can be run in different modes. I have read that it's a good idea to run both de novo and genome-guided assemblies to get the most reliable results. (Combining de novo and genome guided transcriptome assembly for expression analysis?)

For genome-guided assembly, so far I have run Cufflinks (for this I supplied bam files previously aligned to the genome with subread as well as the known transcriptome gtf file for rat_rn6). This seems to give me what I'm looking for in that it gives me a gtf file that documents both known transcripts as well as seemingly novel transcripts/splice variants. However, I understand that Cufflinks is not the best algorithm. For that reason I have also run Stringtie again providing my aligned bam and the rat-rn6.gtf transcriptome file. This also gives me an output gtf file but this does not seem to output known transcripts as Cufflinks does, I seem to just get loads of small fragments per gene which is unlike what is shown in the stringtie paper. See image attached for comparison between Cufflinks and Stringtie results. I tried Scripture too but it seems to give me errors which I can't seem to fix when trying to run the whole genome through the latest version of the algorithm. enter image description here For de novo assembly, I am currently in the process of running Velvet/Oases as an example of de novo assembly (takes a long time) and was considering running IDBA-Tran and SOAPdenovo-Trans also. For these I am only supplying my raw fastq reads. I am unable to use Trinity due to lack of linux or a server and because Galaxy Trinity doesn't seem to work properly.

So my (somewhat general) questions are:

  • Am I applying the right algorithms to get my answer? Is running 2 x genome-guided (Cufflinks and Stringtie) and 3 x de novo (Velvet/Oases, IDBA, SOAP) enough or too much for my question?
  • Once I have the results from all these different algorithms, how can I interpret them to tell me whether I have known and novel common transcripts in my samples?
  • Is there an alternative to Stringtie that runs in a similar genome-guided fashion to Cufflinks or is there a way to get Stringtie to output what I want? This is so I have a couple of examples of transcripts obtained in a genome-guided fashion vs de novo.
  • Are there any examples of workflows or pipelines out there that address what I'm trying to address? I have a well annotated genome, I have an annotated transcriptome and I want to know if my samples contain these known transcripts as well as extras.

I'm a bit overwhelmed and confused. A lot of posts/answers are related to analyses with no reference genome. Not my case.

rna-seq Assembly sequencing • 5.0k views
ADD COMMENT
2
Entering edit mode

I recommend you this paper: Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393

In relation to your questions:

I work with RNA-seq Data and have never used these approaches. working with linux in a server really makes everything easier. I have aways aligned with STAR or Trinity (De Novo), quantified using Kallisto/RSEM and analysed using Bioconductor tools (R).

I have see people using velvet for genome but never saw for transcriptomics. And I am not a fan of Cufflinks.

ADD REPLY
1
Entering edit mode

Hi Tiago,

Thank you for the reply. I took a look at the paper you referenced but it seems to be a little vague/skim over the bit that I'm most interested in. There also seems to be a lot of emphasis on Cufflinks in there. I'm actually at this stage not too concerned about the quantification of the different transcripts, at this stage I'd just like to visualize the data in a genome browser (as i have just a couple of genes that we're really interested in to focus in on). Just to clarify, I assume you use STAR for genome-guided and Trinity for De novo assembly of transcripts? And then do you combine the results of the 2 somehow to get a final list of transcripts that you have confidence are real? With your approach, do you get out of the end a set of transcripts that are known and also some that are novel? Presumably you have to take your De novo transcripts and map them back on to known transcripts somehow?

ADD REPLY
0
Entering edit mode

I would recommend IGV or Golden Helix Genome Browse, you can load your BAM file in both, the problm with golden helix is that you cant add your own reference, so if you do not work with mouse or human is better use IGV. Using the UCSC genome browse, the best way is to producing bigWig files, put it in a http server and load the tracks (this can be little overwhelming). There is also a good R package for visualization named gviz.

I only used STAR for mouse since the genome is very well annotated. I use trinity for some neglected disease without good genome reference.

If you want to map the trinity output back into your reference you can use BLAT.

ADD REPLY
0
Entering edit mode

hi,

I use StringTie without the -e option to enable known as well as novel transcript assembly. And it works well in picking out both. At least the known ones I can confirm. And for a cell line we had knocked out a gene by CRISPR, StringTie could assemble the affected transcript isoform. The cmd and the assembled isoforms -

my/path/to/stringtie-1.2.2.Linux_x86_64/stringtie \
Aligned.sortedByCoord.out.bam \
-o StringTie_assembly_out/my_Sample_StringTie_w_novel.gtf \
-p 4 \
-G my/path/to/Homo_sapiens.GRCh37.73_Only-REF.gtf \
-B \
-A my_Sample_StringTie_gene_abundance.txt

(input is STAR aligned)

StringTie_assembly

Top blue is the canonical form, green is the control cell line and the next two are effect of knock-down

ADD REPLY
0
Entering edit mode

Hi Amitm,

Thank you for the reply. Do you have any experience using the newer version of Stringtie? It's just I seem to have used exactly the same parameters/commands as you and it gives me that output you see in the image above. I don't know why I'm not getting results like you have shown. I'm confused!

ADD REPLY
1
Entering edit mode

hi,

I have used the ver as noted in the code and older as well. All worked well. Haven't used newer ver. Maybe you could check the BAM file and see if you see spliced-alignment in IGV. A RNA-seq file aligned by a splice aware aligner, when loaded into IGV would show islands of reads linked by connecting bridges between the exons for any given gene. If you see this in your BAM on IGV then the transcript assembler should work. Also if you choose the Sashimi plot option in IGV you should see connecting loops (indicating split alignment). sashimi

ADD REPLY
1
Entering edit mode

Hi Amitm,

So you hit the nail on the head. There was a lack of communication in our lab. I thought that my reads had been aligned (by my colleague) with Subread using the "subjunc" function. Turns out it hadn't and therefore didn't have exon junction info. Once aligned properly and checked as you suggested via IGV, it ran perfectly with stringtie and I can see the results I was after. Feel like a bit of a dunce but thank you for helping me get to the bottom of my problem!

ADD REPLY
0
Entering edit mode

hi Amitm, kindly check this error.i did not get it and solve it .

./stringtie G1_sorted.bam -B -o G1.gtf -G Triticum_aestivum.IWGSC.42.gtf -p 4 -C G1.refs.gtf -A G1.abund.tab -WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences

Kindly set this command for my sample.

ADD REPLY
0
Entering edit mode

as it is written in the error message, check if -G annotation file uses the same naming convention for the genome sequences. If your bam files are aligned to a genome with chr prefix then your GTF should also have chr, it usually happens when the reads are aligned to a genome fasta from UCSC and the GTF is from ENSEMBL.

ADD REPLY
1
Entering edit mode
8.1 years ago

This answer may not reduce confusion: precise reconstruction genes is still an un-attained goal of genome informatics, with many approaches and differing interpretations of what are biologically accurate gene reconstructions. An expert curator can often look at evidence such as your gene map and decide, but none of the automated methods will return those expert decisions for all genes and alternates.

My approach in the EvidentialGene project is to use coding sequence measures for biological validation, and to include any and all the gene reconstruction methods you mention, then reduce that over-assembly of too many models by scoring their coding accuracy (homology to other species, plus other bits). If you accept that a consensus of genes across species is a good biological ruler, then this approach returns a more precise gene set than others I've tested, with or without reference chromosomes to guide gene reconstruction.

This is a recent summary of how well EvidentialGene works, and why, versus both chromosome-gene modelling, and de-novo gene assembly of animals and plants. This is a recent independent assessment of EvidentialGene with the spiny mouse.

With EvidentialGene, you should apply as many gene reconstruction/assembly methods as you can afford if they return some valid genes. I aim for around 100 models of each transcript, i.e. 10 million alternate transcript models, using different assemblers and different settings (many KMER steps), for complete and accurate gene set reconstructions. Your de-novo assembly methods are a good selection: Velvet/Oases is most accurate, when run properly; idba-tran second, and SOAP-trans thirdish, Trinity is fourth. The above evigene_bothgalmod1606iu documents this.

For the chromosome mapped methods, cufflinks makes more errors than de-novo assembly (ommission/commisson) last time I tested, but it does return some most accurate of the weakly expressed genes. I gave up on Scripture a while back as too error prone. I've not tested String-Tie.

EvidentialGene works with gene transcripts of any source (chromosome-based or not). It returns the subset of most accurate coding genes, classified into alternate transcripts/locus. The current version for transcripts doesn't use chromosome mapping to classify loci, but I'm working on that addition. The locus calls it makes by gene exon alignment are accurate, but for a small subset of high identity paralogs versus alternates. You can use one of various transcript x chromosome aligners to get back to your genome map views: GMAP, BLAT, even BLASTn.

Caveat: EvidentialGene documentation is still sparse, not in the easy to follow category. At the Galaxy meeting this summer I tried to interest those folks to help make this plug-in ready to use, so far no luck. But others with small informatics expertise do use this method with success (and griping at the docs :).

There is a trick to accurate de-novo assembly: use large KMER values (read shredding) up to size of your paired reads. Velvet/Oases uses this well: for complex genes, the most accurate subset are assembled with kmers in 60-90 bp range for 100 bp paired Illumina reads. And VelO runs much faster, with less memory, at these high KMER settings. You can use it for only a hi-K subset, other methods for low-K. Idba-tran also uses the kmer range well, and paper on it explains how/why this helps, esp. with alternate transcript reconstruction. This is a published trick, but one most have not noticed in this complex literature.

-- Don Gilbert

ADD COMMENT
0
Entering edit mode

Dear Gilbert I'm confused you talk about the absence of an easy mode. However, in the web site of EvidentialGene there is a script based approach which looks as easy as:

evigene/scripts/prot/tr2aacds.pl -mrnaseq Trinity.fasta

It works for me and I obtained a transcriptome reduction of a 90% (from 1.2M transcripts to more or less 120k). I have the folders with the okay (and okalt) and dropset datasets. All looks fine.

#t2ac: EvidentialGene tr2aacds.pl VERSION 2013.07.27
# Class Table for ../../Trinity.trclass
class           okay    drop    okay    drop
althi           1.8     9.1     18541   91851
althi1          0.5     3.7     5585    37501
althia2         0       1       0       10354
altmfrag        0       0.2     803     2626
altmfraga2      0       0       112     191
altmid          0.4     1.7     4930    17350
altmida2        0       0.1     498     1296
main            4.2     7.1     42273   72170
maina2          0       0.2     752     2577
noclass         8.1     55.7    81985   560451
noclassa2       0       0.1     443     1521
parthi          0       4.2     0       42710
parthi1         0       0.6     0       6316
parthia2        0       0.1     0       1915
---------------------------------------------
total           15.5    84.4    155922  848829
=============================================
# AA-quality for okay set of ../../Trinity.aa.qual (no okalt): all and longest 1000 summary
okay.top         n=1000; average=2037; median=1740; min,max=1363,9064; sum=2037028; gaps=0,0
okay.all         n=125453; average=188; median=112; min,max=41,9064; sum=23667671; gaps=0,0

I have missed something? I should tweak something? I think my only weakness is to use only Trinity and not add other assembler as you suggested.

I'm using this "curated" transcriptome to my downstream analysis. If it is a wrong procedure plz correct me I'm learning and I appreciate all feedback to improve my analysis.

-- Pablo

ADD REPLY

Login before adding your answer.

Traffic: 1805 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6