Number of discovered genes in genome does not correspond with published annotation
1
0
Entering edit mode
5.8 years ago
Jan Röslein ▴ 10

Greetings,

I have tried to run RNA-seq analysis for crucial carp samples on following genome: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Carassius_auratus/100/

Author claims to discover 53,065 coding genes, but GFF file contain only fraction of presented number:

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tCDS\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

3434

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\texon\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

4455

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tgene*\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

4092

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tmRNA\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

3419

Is there any secret? I apologize I am pretty new to any bioinformatics ...

Thank you for your time.

RNA-Seq NCBI genome annotation GFF • 1.8k views
ADD COMMENT
1
Entering edit mode

Full information of your GFF file

awk '{print$3}' | sort | uniq -c your.gff
ADD REPLY
0
Entering edit mode

What is this supposed to do? Please explain.

ADD REPLY
1
Entering edit mode

It extract 3 column of gff file (which has information like gene, intron, exon, miRNA etc) and then count the number of occurrence in whole gff file. In other words, you will get the total number of genes in your gff file.

ADD REPLY
1
Entering edit mode

It extract 3 column of gff file

No it doesn't. Please make sure your code works as intended before posting.

Even if it did, it will count the number of, for example, CDS lines in the GFF3 file which will be wildly off the actual number. That's because a given gene can have many alternately spliced transcripts with multiple coding exons, each of them represented in the GFF3 file as a single CDS row.

ADD REPLY
1
Entering edit mode

Yes, you are right. Since your question mainly concerned about number of "discovered genes" in annotation file, it will for sure give the total count of genes (which may have more that one CDS for sure). Also, I have not mentioned in my previous comment that it will provide the total number of CDS for each genes, or count one CDS for one gene. Although, it will only provide the total number entity in whole gff file or in whole genome. I always double check my code before posting it. Thanks

ADD REPLY
0
Entering edit mode

claims to discover 53,065 coding genes,

Are those total number of transcripts by any chance?

ADD REPLY
2
Entering edit mode
5.8 years ago
vkkodali_ncbi ★ 3.8k

You are counting the wrong column. Try this instead.

$ zgrep -v '^#' GCF_003368295.1_ASM336829v1_genomic.gff.gz \
    | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' \
    | grep -Po 'GeneID:\d+' \
    | sort -u \
    | wc -l
53614

Apparently, there are a bunch of CDS features without protein accessions. That is the reason for the discrepancy in the numbers (53065 vs 53614). You can get to that count by doing the following:

$ zgrep -v '^#' GCF_003368295.1_ASM336829v1_genomic.gff.gz \
    | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' \
    | grep 'protein_id=' \
    | grep -Po 'GeneID:\d+' \
    | sort -u \
    | wc -l
53065
ADD COMMENT
0
Entering edit mode

Thank you very much! You are right!

ADD REPLY

Login before adding your answer.

Traffic: 1591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6