Question

Number of discovered genes in genome does not correspond with published annotation

0

Entering edit mode

5.9 years ago

Jan Röslein ▴ 10

Greetings,

I have tried to run RNA-seq analysis for crucial carp samples on following genome: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Carassius_auratus/100/

Author claims to discover 53,065 coding genes, but GFF file contain only fraction of presented number:

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tCDS\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

3434

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\texon\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

4455

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tgene*\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

4092

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tmRNA\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

3419

Is there any secret? I apologize I am pretty new to any bioinformatics ...

Thank you for your time.

RNA-Seq NCBI genome annotation GFF • 1.8k views

ADD COMMENT • link updated 5.9 years ago by kashiff007 ★ 1.9k • written 5.9 years ago by Jan Röslein ▴ 10

1

Entering edit mode

Full information of your GFF file

awk '{print$3}' | sort | uniq -c your.gff

ADD REPLY • link 5.9 years ago by kashiff007 ★ 1.9k

0

Entering edit mode

What is this supposed to do? Please explain.

ADD REPLY • link 5.9 years ago by vkkodali_ncbi ★ 3.8k

1

Entering edit mode

It extract 3 column of gff file (which has information like gene, intron, exon, miRNA etc) and then count the number of occurrence in whole gff file. In other words, you will get the total number of genes in your gff file.

ADD REPLY • link 5.9 years ago by kashiff007 ★ 1.9k

1

Entering edit mode

It extract 3 column of gff file

No it doesn't. Please make sure your code works as intended before posting.

Even if it did, it will count the number of, for example, CDS lines in the GFF3 file which will be wildly off the actual number. That's because a given gene can have many alternately spliced transcripts with multiple coding exons, each of them represented in the GFF3 file as a single CDS row.

ADD REPLY • link 5.9 years ago by vkkodali_ncbi ★ 3.8k

1

Entering edit mode

Yes, you are right. Since your question mainly concerned about number of "discovered genes" in annotation file, it will for sure give the total count of genes (which may have more that one CDS for sure). Also, I have not mentioned in my previous comment that it will provide the total number of CDS for each genes, or count one CDS for one gene. Although, it will only provide the total number entity in whole gff file or in whole genome. I always double check my code before posting it. Thanks

ADD REPLY • link 5.9 years ago by kashiff007 ★ 1.9k

0

Entering edit mode

claims to discover 53,065 coding genes,

Are those total number of transcripts by any chance?

ADD REPLY • link 5.9 years ago by GenoMax 148k

score 2 · Answer 1 · 2019-02-16

You are counting the wrong column. Try this instead.

$ zgrep -v '^#' GCF_003368295.1_ASM336829v1_genomic.gff.gz \
    | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' \
    | grep -Po 'GeneID:\d+' \
    | sort -u \
    | wc -l
53614

Apparently, there are a bunch of CDS features without protein accessions. That is the reason for the discrepancy in the numbers (53065 vs 53614). You can get to that count by doing the following:

$ zgrep -v '^#' GCF_003368295.1_ASM336829v1_genomic.gff.gz \
    | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' \
    | grep 'protein_id=' \
    | grep -Po 'GeneID:\d+' \
    | sort -u \
    | wc -l
53065