Question

How to calculate the different expression gene used PacBio full-length cDNA sequencing and Illumina sequencing?

0

Entering edit mode

7.4 years ago

xzpgocxx ▴ 20

Now, I have some data from PacBio full-length cDNA sequencing and Illumina sequencing. And I want to calculate the gene expression leave (FPKM). There is no available genome for my species. I can do it used Trinity if only have Illumina data, but now I don't know how to do it when adding PacBio full-length cDNA reads? Thanks.

RNA-Seq • 3.8k views

ADD COMMENT • link updated 7.1 years ago by tjduncan ▴ 280 • written 7.4 years ago by xzpgocxx ▴ 20

0

Entering edit mode

As far as I know, if you have the PacBio full-length cDNA , you already have the transcripts. Most of the PacBio analysis tools generates a GTF file at the end. I would generate the gene models from the transcript models and quantify the genes using illumina reads to get the expression levels of genes. Its bit tricky, I will leave it for you to think.

ADD REPLY • link 7.4 years ago by GouthamAtla 12k

score 5 · Accepted Answer · 2017-10-30

Since you don't have a reference genome/transcriptome you would need to generate your sample specific genemodels (ie: generate your own reference transcriptome from your pacbio data) then quantify the expression levels of your illumina data using your newly generated pacbio reference transcriptome with your preferred short-read gene expression pipeline.

To get started on making your own reference transcriptome from you pacbio data I would use their iso-seq pipeline. The link below that outlines the major steps of the pipeline, dependencies, and other tertiary pacbio iso-seq data analysis tools that may be useful.

-Iso-Seq Command Line Module from SMRT Link v4 https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki

"It includes includes three major steps:

CCS: Getting CCS (circular consensus sequence) reads out of subreads BAM file.
Classify: Identifying full-length CCS reads based on cDNA primers and polyA tail signal.
Cluster: Isoform-level clustering and polishing to generate high-quality, full-length, transcript isoform sequences."

Once you have completed step 3 you should have an the output: hq_isoforms.fastq

This output includes the transcript sequences that are:

"Full-length (as indicated by presence of cDNA primers)
High-quality (predicted accuracy by default is >= 99%)
Supported by 2 or more FL reads (unless you changed the default or are using older versions)."

From there you have several options. An overview can be seen: https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/What-to-do-after-Iso-Seq-Cluster%3F

For your goal of adding pacbio to illumina data to calculate gene expression levels without a reference you would likely want to collapse the high quality isoforms into a single set of unique isoforms. To do that you could use Cogent or CD-Hit: https://github.com/Magdoll/cDNA_Cupcake/wiki/Tutorial:-Collapse-redundant-isoforms-without-genome

From there you should have a reference trascriptome that you could preform your preferred illumina data pipeline on.