I am confused with the technical part to discriminate the transcript-level and gene-level in RNA-seq analysis. The biggest part is to get transcripts sequences as reference for indexing step (kallisto/salmon) from the gff3 file.
In following example partial gff3 file, the exon1 is the same as utr5p1, and exon2 is the same as utr5p2 by the coordinates.
chr1A Acsv1.0_UTR gene 11085 42819 31 + . ID=Acs1A01G010LC;name=Acs1A01G010LC;primconf=LC
chr1A Acsv1.0_UTR mRNA 11085 42819 . + . ID=Acs1A01G010LC.1;Parent=Acs1A01G010LC;Name=Acs1A01G010LC.1;Note=Acs1A01G010LC;primconf=LC;secconf=LC2
chr1A Acsv1.0_UTR exon 11085 11501 . + . ID=Acs1A01G010LC.1.exon1;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR 5'_UTR 11085 11501 . + . ID=Acs1A01G010LC.1.utr5p1;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR exon 23483 23914 . + . ID=Acs1A01G010LC.1.exon2;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR 5'_UTR 23483 23914 . + . ID=Acs1A01G010LC.1.utr5p2;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR 5'_UTR 40929 41201 . + . ID=Acs1A01G010LC.1.utr5p3;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR exon 40929 42819 . + . ID=Acs1A01G010LC.1.exon3;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR CDS 41202 41522 . + 0 ID=Acs1A01G010LC.1.cds1;Parent=Acs1A01G010LC.1
chr1A Acsv1.0_UTR 3'_UTR 41523 42819 . + . ID=Acs1A01G010LC.1.utr3p1;Parent=Acs1A01G010LC.1
My questions are:
- Should the redundant exon/5'_UTR entries be excluded for the transcripts sequence? as I am not sure the redundant sequence will cause the reads counts doubled eventually for gene-level expression if I use kallisto/sleuth for DE analysis, for example.
- Similarly, should the CDS be excluded from this gff3 file to extract the transcripts for the reference sequence because I already have a separate fasta file for CDS sequences?
Thanks a lot!
It is generally better to align to the genome and then count as needed. You could also look at Salmon/Kallisto which do alignment free estimations of transcript abundance.
Thanks!
Kallisto manual indicates to use transcripts for index. Could you please share your idea to get transcript-level / gene-level from the indexed whole genome, if possible? Thanks again.
Once you do the alignments to genome you can use a program like featureCounts (or HTseq-count) to summarize your data at different levels. Take a look at the featureCounts manual to discover the options in chapter 6 (page 31).