Drosophila melanogaster dm3 intron percentage too high
2
2
Entering edit mode
4 weeks ago
Christian ▴ 20

I am working from a Drosophila dm3 gtf file trying to infer different percentage compositions of genomic features of interest (UTRs, CDS, introns, etc.) Since there is no "intron" feature explicitly found in the file I decided to obtain them by:

  1. bedtools merge on file only containing "transcripts"
  2. bedtools merge on file containing the remaining features (CDS, exons, UTRs, start, and stop codons)
  3. bedtools subtract using - a "transcripts" file and -b "remaining_features" file
  4. Use awk '{total += $3 - $2} END {print total}' intron_file.txt to calculate total intron bp
  5. Total intron bp / Total Drosophila dm3 genome bp where total genome bp was obtained from (https://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&chromInfoPage=)

The value I get is usually >42% compared to the 30% mentioned in literature (Table 2 from Alexander, R. P., Fang, G., Rozowsky, J., Snyder, M., & Gerstein, M. B. (2010). Annotating non-coding regions of the genome. Nature Reviews Genetics, 11(8), 559-571. )

What could I be doing wrong? Things I should look out for? Thank you for the help!

drosophila sequence nucleotide intron genomics • 529 views
ADD COMMENT
1
Entering edit mode
4 weeks ago
GenoMax 149k

AGAT toolkit has a couple of stats generation programs which may be useful here.

Running one of then against the GTF file downloaded from NCBI and with a genome size of 180Mb you get the following. Output truncated to show only the total numbers. There are several other stats you can check by running this program yourself.

$ agat_sp_functional_statistics.pl --gff GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gtf -g 180000000 


Total gene length (bp)                                      102519537
Total transcript length (bp)                                344759301
Total cds length (bp)                                       61229169
Total exon length (bp)                                      92515118
Total five_prime_utr length (bp)                            9559304
Total start_codon length (bp)                               92253
Total stop_codon length (bp)                                92391
Total three_prime_utr length (bp)                           18208644
Total intron length per cds (bp)                            171509556
Total intron length per exon (bp)                           252244183
Total intron length per five_prime_utr (bp)                 71801045
Total intron length per start_codon (bp)                    230194
Total intron length per stop_codon (bp)                     9155
Total intron length per three_prime_utr (bp)                2849033
ADD COMMENT
0
Entering edit mode
4 weeks ago

One problem with these files is that it is a bit hard to understand what is inside of them and how they all interact. For example when two transcripts overlap and there are no other annotations in between, how is that area counted?

I would instead first extract the features of type gene then those of type exon or utr in another file.

Then subtract the exons from the genes and that would give me introns and sum those up.

# genes only
cat droso.gff3 | awk ' $3=="gene" { print }'  > gene.gff

# exons and UTRs
cat droso.gff3 | awk ' $3=="exon" || $3~"UTR" { print }'  > exon.gff

# Here is hoping these are introns
bedtools subtract -a gene.gff -b exon.gff > intron.gff

# Sum them up 
bedtools sort  -i intron.gff  | bedtools merge -i - | awk '{ print($3-$2) }' | datamash sum 1
53133198

now here depends on what we consider it to be the genome size, the fasta file indicates a genome size of 143 million so the total ratio comes out to be around 37%

ADD COMMENT
0
Entering edit mode

I was able to replicate your results using the dm6 release of the genome! (https://ftp.ensemblgenomes.ebi.ac.uk/pub/metazoa/release-60/gff3/drosophila_melanogaster/) Unfortunately, I am unable to find a gff3 file for the dm3 release of the Drosophila melanogaster genome. I need to use this version since a lot of my other datasets are based on this older genome assembly. This is where I've been taking my genom files from: https://hgdownload.cse.ucsc.edu/goldenPath/dm3/bigZips/genes/

Sample gtf file content (no "gene" category, only "transcript" and other features such as UTRs, CDS, exons, start, and stop codons)

For refGene format

chr2L   7528    7679    CG11023 .   +   refGene 5UTR    .   gene_id "CG11023"; transcript_id "NM_001169365"; exon_number "1"; exon_id "NM_001169365.1"; gene_name "CG11023";
chr2L   7528    7679    CG11023 .   +   refGene 5UTR    .   gene_id "CG11023"; transcript_id "NM_001272857"; exon_number "1"; exon_id "NM_001272857.1"; gene_name "CG11023";
chr2L   7528    7679    CG11023 .   +   refGene 5UTR    .   gene_id "CG11023"; transcript_id "NM_175941"; exon_number "1"; exon_id "NM_175941.1"; gene_name "CG11023";
chr2L   7528    8116    CG11023 .   +   refGene exon    .   gene_id "CG11023"; transcript_id "NM_001169365"; exon_number "1"; exon_id "NM_001169365.1"; gene_name "CG11023";
chr2L   7528    8116    CG11023 .   +   refGene exon    .   gene_id "CG11023"; transcript_id "NM_001272857"; exon_number "1"; exon_id "NM_001272857.1"; gene_name "CG11023";
chr2L   7528    8116    CG11023 .   +   refGene exon    .   gene_id "CG11023"; transcript_id "NM_175941"; exon_number "1"; exon_id "NM_175941.1"; gene_name "CG11023";
chr2L   7528    9484    CG11023 .   +   refGene transcript  .   gene_id "CG11023"; transcript_id "NM_001169365";  gene_name "CG11023";
chr2L   7528    9484    CG11023 .   +   refGene transcript  .   gene_id "CG11023"; transcript_id "NM_001272857";  gene_name "CG11023";
chr2L   7528    9484    CG11023 .   +   refGene transcript  .   gene_id "CG11023"; transcript_id "NM_175941";  gene_name "CG11023";
chr2L   7679    7682    CG11023 .   +   refGene start_codon 0   gene_id "CG11023"; transcript_id "NM_001169365"; exon_number "1"; exon_id "NM_001169365.1"; gene_name "CG11023";
chr2L   7679    7682    CG11023 .   +   refGene start_codon 0   gene_id "CG11023"; transcript_id "NM_001272857"; exon_number "1"; exon_id "NM_001272857.1"; gene_name "CG11023";
chr2L   7679    7682    CG11023 .   +   refGene start_codon 0   gene_id "CG11023"; transcript_id "NM_175941"; exon_number "1"; exon_id "NM_175941.1"; gene_name "CG11023";
chr2L   7679    8116    CG11023 .   +   refGene CDS 0   gene_id "CG11023"; transcript_id "NM_001169365"; exon_number "1"; exon_id "NM_001169365.1"; gene_name "CG11023";
chr2L   7679    8116    CG11023 .   +   refGene CDS 0   gene_id "CG11023"; transcript_id "NM_001272857"; exon_number "1"; exon_id "NM_001272857.1"; gene_name "CG11023";

For ensGene format

chr2L   ensGene transcript  7529    9484    .   +   .   gene_id "FBgn0031208"; transcript_id "FBtr0300689";  gene_name "FBgn0031208";
chr2L   ensGene exon    7529    8116    .   +   .   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "1"; exon_id "FBtr0300689.1"; gene_name "FBgn0031208";
chr2L   ensGene 5UTR    7529    7679    .   +   .   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "1"; exon_id "FBtr0300689.1"; gene_name "FBgn0031208";
chr2L   ensGene CDS 7680    8116    .   +   0   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "1"; exon_id "FBtr0300689.1"; gene_name "FBgn0031208";
chr2L   ensGene exon    8193    9484    .   +   .   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "2"; exon_id "FBtr0300689.2"; gene_name "FBgn0031208";
chr2L   ensGene CDS 8193    8607    .   +   1   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "2"; exon_id "FBtr0300689.2"; gene_name "FBgn0031208";
chr2L   ensGene 3UTR    8611    9484    .   +   .   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "2"; exon_id "FBtr0300689.2"; gene_name "FBgn0031208";
chr2L   ensGene start_codon 7680    7682    .   +   0   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "1"; exon_id "FBtr0300689.1"; gene_name "FBgn0031208";
chr2L   ensGene stop_codon  8608    8610    .   +   0   gene_id "FBgn0031208"; transcript_id "FBtr0300689"; exon_number "2"; exon_id "FBtr0300689.2"; gene_name "FBgn0031208";
chr2L   ensGene transcript  7529    9484    .   +   .   gene_id "FBgn0031208"; transcript_id "FBtr0300690";  gene_name "FBgn0031208";
ADD REPLY
1
Entering edit mode

I am unable to find a gff3 file for the dm3 release of the Drosophila melanogaster genome.

From Flybase: https://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r3.0/gff/whole-genome_annotation-feature-region_dmel_RELEASE3.GFF.gz

ADD REPLY
0
Entering edit mode

The link you sent me belongs to the release 3 of the Drosophila genome. Name's a bit confusing but the dm3 genome is actually the release 5 of the genome. However, I was able to find a similar file for the release I was looking for using the file path you provided! https://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.57_FB2014_03/gff/

Luckily this file does have the "intron" feature included so I was able to do some calculations and the total number of base pairs was 54992051. This value is similar to the one calculated by the response above. I think the problem I'm having now is what I consider to be the whole genome size.

Since the ATAC-seq data I'm using does not have annotated peaks in chrU, chrUextra, and chrM I decided to exclude them from the analysis. This brings the total genome size to be:

# total genome bp - chrU - chrUextra - chrM = 
168736537 - 10049037 - 29004656 - 19517 = 129663327

# intron bp/ 129663327
54992051/129663327 * 100 = 42.4
ADD REPLY
0
Entering edit mode

For dmel6, the genome size based on sequences contained in the FASTA is 143 million bp.

The paper below reports the genome size to be 180 million bases, with a 120-megabase euchromatic portion

https://www.science.org/doi/10.1126/science.287.5461.2185

ADD REPLY
1
Entering edit mode

you could turn those files into BED/GFF then operate as before

ADD REPLY
0
Entering edit mode

I found a GFF file for the release I was interested in and it had an "intron" feature annotated. The details of the calc can be found here Drosophila melanogaster dm3 intron percentage too high. I think the main problem is what I consider to be the "whole genome size"

ADD REPLY
0
Entering edit mode

D. mel genome is supposed to be 180,000,000 bases.

ADD REPLY

Login before adding your answer.

Traffic: 1840 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6