Hi. I got the following errors when running StringTie with a RepeatMasker annotation GFF file and RNA-seq BAM files that were already sorted with samtools:
GFF Error: overlapping duplicate dispersed_repeat feature (ID=461)
GFF Error: overlapping duplicate dispersed_repeat feature (ID=712)
GFF Error: overlapping duplicate dispersed_repeat feature (ID=1013)
...
GFF Error: overlapping duplicate dispersed_repeat feature (ID=128998)
I generated the RepeatMasker annotation from the file at https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.repeatMasker.out.gz and converted it to GFF with rmOutToGFF3.pl.
When I checked the original hs1.repeatMasker.out, I found many duplicated values in the ID column (the rightmost column), such as 461 below.
   SW  perc perc perc  query    position in query             matching repeat        position in repeat
score  div. del. ins.  sequence   begin     end      (left)   repeat   class/family   begin   end (left)   ID
  321  23.6  6.2  0.0  chrX      306885  306990 (153952576) C L1M4c    LINE/L1       (4017)  2367   2255  459
  713  10.8  0.0  0.0  chrX      307028  307129 (153952437) C AluJo    SINE/Alu        (91)   221    120  460
 1486  18.5  5.6  4.3  chrX      307210  307577 (153951989) C MLT1C2   LTR/ERVL-MaLR   (47)   414     42  461
 1610  21.0  4.8  3.2  chrX      307562  307970 (153951596) C MLT1C2   LTR/ERVL-MaLR   (40)   421      6  461
 1171  22.5  5.0  3.0  chrX      307986  308315 (153951251) C MLT1C2   LTR/ERVL-MaLR  (124)   337      1  462
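For reference, the repeated IDs can be listed with a short script. This is only a minimal sketch: it assumes the usual whitespace-separated RepeatMasker .out layout with the ID in the 15th column, and the function name duplicated_ids is made up for illustration, not part of any tool.

```python
from collections import Counter


def duplicated_ids(lines):
    """Return {(sequence, ID): count} for IDs seen more than once.

    Assumes RepeatMasker .out rows: 15 whitespace-separated fields,
    query sequence in field 5 and ID in field 15 (an assumption about
    the layout, matching the rows shown above).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the two header lines, blank lines, and anything malformed.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        counts[(fields[4], fields[14])] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Possible usage on the compressed file (path is whatever you downloaded):
#   import gzip
#   with gzip.open("hs1.repeatMasker.out.gz", "rt") as handle:
#       print(len(duplicated_ids(handle)))
```

The two MLT1C2 rows that share ID 461 above also overlap in their coordinates (307210-307577 and 307562-307970), which seems to be what the "overlapping duplicate" message is complaining about.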
I am learning transposable element analysis from this article (https://www.nature.com/articles/s41588-019-0373-3). How do you think the authors dealt with this problem, and how should I deal with it? Thanks in advance.
Well ... here is their code availability section:
https://www.nature.com/articles/s41588-019-0373-3#code-availability
I would ask the authors for those scripts. Most likely they applied some combination of interval filtering with existing tools.
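In the meantime, one possible workaround is to make the IDs unique before running StringTie. As far as I understand, RepeatMasker reuses an ID for fragments it considers part of the same insertion, so the duplicates in the .out file are expected rather than a corruption. The sketch below is my own assumption, not the authors' method: it assumes the GFF3 written by rmOutToGFF3.pl has an ID= attribute in column 9 (as the error message suggests), and the name uniquify_gff3_ids is hypothetical.

```python
from collections import defaultdict


def uniquify_gff3_ids(lines):
    """Yield GFF3 lines with a running suffix appended to every ID.

    ID=461 becomes ID=461.1, ID=461.2, ... in order of appearance.
    Comment lines and lines without 9 tab-separated columns pass through.
    """
    seen = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            yield line
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            if attr.startswith("ID="):
                seen[attr[3:]] += 1
                attrs[i] = f"{attr}.{seen[attr[3:]]}"
                break  # only the first ID attribute is rewritten
        cols[8] = ";".join(attrs)
        yield "\t".join(cols)
```

Keep in mind that this throws away the link between fragments of the same element. If you need that link, merging the overlapping fragments that share an ID would be the alternative, and I have not tested either approach against StringTie.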
I will say it is pretty darn ridiculous that, in this day and age, you have to ask the authors to give you the scripts. I am going to test this out and ask the authors for the scripts.