I need to study the genomic distribution of certain transposable elements. I first retrieve the coordinates of the elements from RepeatMasker in BED format (chr:start-end), then intersect them with an hg19 gene BED file. My goal now is to find out whether genes containing at least one such transposon are enriched for certain categories, using GO terms for example.
For instance:
GeneA: chr1:20000-50000
contains two copies of transposonD:
chr1:25000-26000
chr1:31000-32000
GeneB: chr3:40000-80000
contains one copy of transposonD:
chr3:60000-62000
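To make the example concrete, here is a minimal Python sketch of the counting step, using only the toy coordinates above. The names `genes`, `transposons`, `count_overlaps` and `summary` are made up for illustration, and I am assuming half-open intervals and that any overlap counts as "containing". In practice the intersection would come from a tool such as bedtools rather than a loop like this.

```python
# Toy data from the example above; intervals are treated as half-open [start, end).
genes = {
    "GeneA": ("chr1", 20000, 50000),
    "GeneB": ("chr3", 40000, 80000),
}
transposons = [
    ("chr1", 25000, 26000),
    ("chr1", 31000, 32000),
    ("chr3", 60000, 62000),
]

def count_overlaps(gene, elements):
    """Count elements on the same chromosome that overlap the gene interval."""
    chrom, g_start, g_end = gene
    return sum(
        1 for (c, s, e) in elements
        if c == chrom and s < g_end and e > g_start
    )

summary = {}
for name, gene in genes.items():
    n = count_overlaps(gene, transposons)
    length_kb = (gene[2] - gene[1]) / 1000
    # Raw count, gene length, and a length-normalized density (elements per kb).
    summary[name] = {"count": n, "length_kb": length_kb, "per_kb": n / length_kb}

for name, row in summary.items():
    print(name, row)
```

The `per_kb` column is only one possible way to look at the counts relative to gene length; it is not meant as the correct normalization.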
My question is: should gene length bias be taken into account? A very long gene is naturally more likely to contain transposon elements than a short one. Or does GO enrichment analysis already account for this?
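To illustrate why I am worried, here is a rough sketch under a strong simplifying assumption: insertions land uniformly at random along the genome, so the number of elements in a gene is Poisson with mean proportional to its length. The function `prob_at_least_one` and the rate value are hypothetical and not estimated from any data.

```python
import math

def prob_at_least_one(length_bp, rate_per_bp):
    """P(gene contains >= 1 element) under a uniform Poisson insertion model."""
    return 1.0 - math.exp(-rate_per_bp * length_bp)

# Hypothetical rate: one element per 50 kb on average (an assumption, not a measurement).
rate = 1 / 50_000

for length in (5_000, 50_000, 500_000):
    print(length, round(prob_at_least_one(length, rate), 3))
```

Under this model the chance of a gene being in my "contains a transposon" set rises steeply with length, so any GO category made up of long genes could look enriched for that reason alone.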
I searched the literature and found discussions of length bias for RNA-seq data, but nothing that addresses my problem here. Thanks.