Hi,
I'm working with RNAseq data from A. thaliana. Reads were mapped with STAR 2.6 using the TAIR10.1 genome from NCBI and its annotation in GFF3 format (link).
I have noticed some of the genes have not any 'exon' feature line and go directly from the 'gene' line to the 'CDS' line, and this is a problem since I would like to count the 'exon' features. I have always thought that a 'CDS' has to have a related 'exon'.
In lines below, gene38184 is a tRNA and has 'exon' features, however gene38185 and gene38186 (protein codificant) have not.
NC_000932.1 RefSeq gene 1717 4347 . - . ID=gene38184;Dbxref=GeneID:1466272;Name=trnK;gbkey=Gene;gene=trnK;gene_biotype=tRNA;locus_tag=ArthCt089
NC_000932.1 RefSeq tRNA 1717 4347 . - . ID=rna60172;Parent=gene38184;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq exon 4311 4347 . - . ID=id325622;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq exon 1717 1751 . - . ID=id325623;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq gene 2056 3570 . - . ID=gene38185;Dbxref=GeneID:844797;Name=matK;gbkey=Gene;gene=matK;gene_biotype=protein_coding;locus_tag=ArthCp003
NC_000932.1 RefSeq CDS 2056 3570 . - 0 ID=cds48182;Parent=gene38185;Dbxref=Genbank:NP_051040.2,GeneID:844797;Name=NP_051040.2;gbkey=CDS;gene=matK;product=maturase K;protein_id=NP_051040.2;transl_table=11
NC_000932.1 RefSeq gene 5084 6188 . - . ID=gene38186;Dbxref=GeneID:844798;Name=rps16;gbkey=Gene;gene=rps16;gene_biotype=protein_coding;locus_tag=ArthCp004
NC_000932.1 RefSeq CDS 6149 6188 . - 0 ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11
NC_000932.1 RefSeq CDS 5084 5283 . - 2 ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11
This is a problem since I would like to use 'exon' as a feature to count.
Is this normal? Should I use other feature to obtain the counts?
Thanks in advance.
I am not a plant guy, but checking these two examples, both are chloroplast genes, so not on the nuclear genome, e.g. gene38186.
You are right,
NC_000932.1
is the chloroplast chromosome. The case I describe also happens in the mitochondrial chromosome,NC_037304.1
. With the usual proceeding of counting 'exon' features these genes would be lost. In fact I'm not sure if people usually include these chromosomes in the analysis.chrM is typically not included in gene expression analysis (at least I do not do it for human and mouse). The reason that you do not have "exon" is probably the prokaryotic origin of these organelles which have operon rather than intron/exon structures.