Greetings,
I have tried to run an RNA-seq analysis for crucian carp samples on the following genome: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Carassius_auratus/100/
The authors claim to have discovered 53,065 coding genes, but the GFF file contains only a fraction of that number:
cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tCDS\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l
3434
cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\texon\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l
4455
cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tgene*\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l
4092
cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tmRNA\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l
3419
Is there some trick I am missing? I apologize, I am pretty new to bioinformatics ...
Thank you for your time.
Full information of your GFF file
What is this supposed to do? Please explain.
It extracts the third column of the GFF file (which holds the feature type, such as gene, intron, exon, miRNA, etc.) and then counts the number of occurrences of each type in the whole GFF file. In other words, you will get the total number of genes in your GFF file.
No it doesn't. Please make sure your code works as intended before posting.
Even if it did, it would count the number of, for example, CDS lines in the GFF3 file, which will be wildly off from the actual number of genes. That's because a given gene can have many alternatively spliced transcripts with multiple coding exons, each of them represented in the GFF3 file as a single CDS row.
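To make the difference concrete, here is a minimal sketch on a made-up GFF3 file. The file name `toy.gff`, its records, and the helper names `row`, `count_type` and `seqs_with_type` are all hypothetical and only serve the illustration. It also shows a second pitfall: printing column 1 (the sequence ID) and de-duplicating it, as the commands in the question do, counts sequences that carry at least one such feature, not the features themselves.

```shell
# Write one tab-separated GFF3 row from 9 arguments (hypothetical helper).
row() { printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"; }

# Toy annotation (made-up records): 2 genes, 3 transcripts, 4 CDS rows, 1 sequence.
{
  echo '##gff-version 3'
  row chr1 toy gene 1    900  . + . 'ID=gene1'
  row chr1 toy mRNA 1    900  . + . 'ID=rna1;Parent=gene1'
  row chr1 toy CDS  50   300  . + 0 'ID=cds1;Parent=rna1'
  row chr1 toy CDS  600  850  . + 1 'ID=cds1;Parent=rna1'
  row chr1 toy mRNA 1    300  . + . 'ID=rna2;Parent=gene1'
  row chr1 toy CDS  50   300  . + 0 'ID=cds2;Parent=rna2'
  row chr1 toy gene 2000 2500 . + . 'ID=gene2'
  row chr1 toy mRNA 2000 2500 . + . 'ID=rna3;Parent=gene2'
  row chr1 toy CDS  2100 2400 . + 0 'ID=cds3;Parent=rna3'
} > toy.gff

# Number of rows whose feature type (column 3) equals $2.
count_type() {
  awk -F'\t' -v t="$2" '$3 == t { n++ } END { print n + 0 }' "$1"
}

# Number of distinct sequence IDs (column 1) that have at least one row of type $2.
seqs_with_type() {
  awk -F'\t' -v t="$2" '$3 == t && !seen[$1]++ { n++ } END { print n + 0 }' "$1"
}

echo "gene rows:          $(count_type toy.gff gene)"
echo "mRNA rows:          $(count_type toy.gff mRNA)"
echo "CDS rows:           $(count_type toy.gff CDS)"
echo "sequences with CDS: $(seqs_with_type toy.gff CDS)"
```

On this toy file the four numbers all differ, and only the count of `gene` rows matches the number of genes. On a real NCBI GFF you would probably also want to restrict the gene rows to protein-coding ones, for example via a `gene_biotype=protein_coding` attribute in column 9; treat that attribute name as an assumption and check it against your own file.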
Yes, you are right. Since the question was mainly concerned with the number of "discovered genes" in the annotation file, it will certainly give the total count of genes (each of which may have more than one CDS). Also, I did not say in my previous comment that it would provide the total number of CDS for each gene, or count one CDS per gene. It only provides the total number of each entity in the whole GFF file, that is, in the whole genome. I always double-check my code before posting it. Thanks.
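For reference, a per-type tally of the kind described here could be sketched as follows. This is not the snippet posted earlier in the thread, just an illustration assuming a standard tab-separated GFF3 file; `tally.gff`, its records, and the helpers `row` and `feature_counts` are made up.

```shell
# Write one tab-separated GFF3 row from 9 arguments (hypothetical helper).
row() { printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"; }

# Toy annotation with made-up records, for illustration only.
{
  echo '##gff-version 3'
  row chr1 toy gene 1   900 . + . 'ID=gene1'
  row chr1 toy mRNA 1   900 . + . 'ID=rna1;Parent=gene1'
  row chr1 toy exon 1   300 . + . 'ID=exon1;Parent=rna1'
  row chr1 toy exon 600 900 . + . 'ID=exon2;Parent=rna1'
  row chr1 toy CDS  50  300 . + 0 'ID=cds1;Parent=rna1'
  row chr1 toy CDS  600 850 . + 1 'ID=cds1;Parent=rna1'
} > tally.gff

# Print "<count><TAB><type>" for every feature type found in column 3,
# skipping comment lines and rows that do not have exactly 9 columns.
feature_counts() {
  awk -F'\t' '!/^#/ && NF == 9 { n[$3]++ }
              END { for (t in n) print n[t] "\t" t }' "$1" | sort -k2,2
}

feature_counts tally.gff
```

The `gene` line of that tally is the number of gene rows; the `CDS`, `exon` and `mRNA` lines count rows of those types and should not be read as gene counts.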
Are those the total numbers of transcripts, by any chance?