transcripts sequence preparation for RNAseq DE analysis
2
0
Entering edit mode
7.3 years ago
yifangt86 ▴ 60

I am confused with the technical part to discriminate the transcript-level and gene-level in RNA-seq analysis. The biggest part is to get transcripts sequences as reference for indexing step (kallisto/salmon) from the gff3 file.

In following example partial gff3 file, the exon1 is the same as utr5p1, and exon2 is the same as utr5p2 by the coordinates.

chr1A   Acsv1.0_UTR gene    11085   42819   31  +   .   ID=Acs1A01G010LC;name=Acs1A01G010LC;primconf=LC
 chr1A  Acsv1.0_UTR mRNA    11085   42819   .   +   .   ID=Acs1A01G010LC.1;Parent=Acs1A01G010LC;Name=Acs1A01G010LC.1;Note=Acs1A01G010LC;primconf=LC;secconf=LC2
 chr1A  Acsv1.0_UTR exon    11085   11501   .   +   .   ID=Acs1A01G010LC.1.exon1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  11085   11501   .   +   .   ID=Acs1A01G010LC.1.utr5p1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR exon    23483   23914   .   +   .   ID=Acs1A01G010LC.1.exon2;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  23483   23914   .   +   .   ID=Acs1A01G010LC.1.utr5p2;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 5'_UTR  40929   41201   .   +   .   ID=Acs1A01G010LC.1.utr5p3;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR exon    40929   42819   .   +   .   ID=Acs1A01G010LC.1.exon3;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR CDS     41202   41522   .   +   0   ID=Acs1A01G010LC.1.cds1;Parent=Acs1A01G010LC.1
chr1A   Acsv1.0_UTR 3'_UTR  41523   42819   .   +   .   ID=Acs1A01G010LC.1.utr3p1;Parent=Acs1A01G010LC.1

My questions are:

  1. Should the redundant exon/5'_UTR entries be excluded for the transcripts sequence? as I am not sure the redundant sequence will cause the reads counts doubled eventually for gene-level expression if I use kallisto/sleuth for DE analysis, for example.
  2. Similarly, should the CDS be excluded from this gff3 file to extract the transcripts for the reference sequence because I already have a separate fasta file for CDS sequences?

Thanks a lot!

rna-seq transcript-leve gene-level transcripts • 2.8k views
ADD COMMENT
0
Entering edit mode

It is generally better to align to the genome and then count as needed. You could also look at Salmon/Kallisto which do alignment free estimations of transcript abundance.

ADD REPLY
0
Entering edit mode

Thanks!

Kallisto manual indicates to use transcripts for index. Could you please share your idea to get transcript-level / gene-level from the indexed whole genome, if possible? Thanks again.

ADD REPLY
0
Entering edit mode

Once you do the alignments to genome you can use a program like featureCounts (or HTseq-count) to summarize your data at different levels. Take a look at the featureCounts manual to discover the options in chapter 6 (page 31).

ADD REPLY
1
Entering edit mode
7.3 years ago
h.mon 35k

If you intend to use kallisto or Salmon, there is no need to you should not remove the redundant regions shared between different transcripts of the same gene, as these programs use an expectation-maximization algorithm to best estimate how to apportion shared counts to transcripts. So you should have a fasta with one sequence per transcript, with its whole transcribed region.

ADD COMMENT
1
Entering edit mode
7.3 years ago
yifangt86 ▴ 60

Thanks genomax and h.mon!

I found out the solution for my problem with gffread, and I post it here for those who have the same problem.

   gffread $gff3_file -g $genome.fasta -w transcripts.fasta
    -w  write a fasta file with spliced exons for each GFF transcript   #So that the UTRs are included
    -x  write a fasta file with spliced CDS for each GFF transcript

The -w option resolved my issue, so that I can have one sequence including the UTR regions for each transcript for kallisto.

ADD COMMENT

Login before adding your answer.

Traffic: 2377 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6