Hi all,
I’m trying to identify transposable elements (TEs) in PacBio Iso-Seq data from a non-model plant species, for which we don’t have a reference genome. After researching possible tools, I decided to use RepeatMasker along with TE families from the Dfam database, but I’m uncertain if my approach is fully correct. I’d really appreciate it if someone could confirm whether my commands are correct or if I missed any steps!
Steps I’ve Taken:
1. Downloading the Dfam TE Families for Plants (Viridiplantae)
Since I’m working with non-model plants, I specified Viridiplantae to get the plant-specific TE families:
~/tools/miniconda3/share/RepeatMasker/famdb.py -i repeatMasker/share/RepeatMasker/Libraries/famdb/ info
The information about the TE database is shown below:
FamDB Directory : /gpfs1/SP11/iso-seq/TE/repeatMasker/share/RepeatMasker/Libraries/famdb
FamDB Generator : famdb.py v1.0
FamDB Format Version: 1.0
FamDB Creation Date : 2023-11-15 11:30:15.311827
Database: Dfam
Version : 3.8
Date : 2023-11-14
Dfam - A database of transposable element (TE) sequence alignments and HMMs.
2 Partitions Present
Total consensus sequences present: 472219
Total HMMs present : 472219
Partition Details
-----------------
Partition 0 [dfam38_full.0.h5]: root - Mammalia, Amoebozoa, Bacteria <bacteria>, Choanoflagellata, Rhodophyta, Haptista, Metamonada, Fungi, Sar, Placozoa, Ctenophora <comb jellies>, Filasterea, Spiralia, Discoba, Cnidaria, Porifera, Viruses
Consensi: 295552, HMMs: 295552
Partition 1 [ Absent ]: Obtectomera
Partition 2 [ Absent ]: Euteleosteomorpha
Partition 3 [ Absent ]: Sarcopterygii - Sauropsida, Coelacanthimorpha, Amphibia, Dipnomorpha
Partition 4 [ Absent ]: Diptera
Partition 5 [dfam38_full.5.h5]: Viridiplantae
Consensi: 176667, HMMs: 176667
Partition 6 [ Absent ]: Deuterostomia - Chondrichthyes, Hemichordata, Cladistia, Holostei, Tunicata, Cephalochordata, Cyclostomata <vertebrates>, Osteoglossocephala, Otomorpha, Elopocephalai, Echinodermata, Chondrostei
Partition 7 [ Absent ]: Hymenoptera
Partition 8 [ Absent ]: Ecdysozoa - Nematoda, Gelechioidea, Yponomeutoidea, Incurvarioidea, Chelicerata, Collembola, Polyneoptera, Tineoidea, Apoditrysia, Monocondylia, Strepsiptera, Palaeoptera, Neuropterida, Crustacea, Coleoptera, Siphonaptera, Trichoptera, Paraneoptera, Myriapoda, Scalidophora
2. Converting data to FASTA Format for RepeatMasker Compatibility
~/tools/miniconda3/share/RepeatMasker/famdb.py -i repeatMasker/share/RepeatMasker/Libraries/famdb/ families -f embl -a -d Viridiplantae >Viridiplantae.embl
~/tools/miniconda3/share/RepeatMasker/util/buildRMLibFromEMBL.pl Viridiplantae.embl >Viridiplantae2.fa
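A quick sanity check on the converted library before running RepeatMasker could look like the sketch below. It assumes the usual RepeatMasker custom-library convention, where each FASTA header has the form `>name#class/family`, so that hits can be classified in the output. `check_rm_lib` is a made-up helper name, not part of RepeatMasker or famdb.py.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with RepeatMasker): count FASTA records in
# a library and report how many headers lack a "#class" suffix.
# Assumption: headers follow the ">name#class/family ..." convention.
check_rm_lib() {
  awk '
    /^>/ { total++; if ($1 !~ /#/) missing++ }   # $1 is ">name#class"
    END  { printf "records=%d unclassified=%d\n", total, missing }
  ' "$1"
}
```

Called as `check_rm_lib Viridiplantae2.fa`, a nonzero `unclassified` count would suggest some families may be reported without a class in the RepeatMasker output.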
3. Running RepeatMasker
~/tools/miniconda3/bin/RepeatMasker -pa 16 -lib Viridiplantae2.fa -dir outputDir/ ./iso-seq/BJ202-01P0001.flnc.isoforms.fa
RepeatMasker seems to run without errors so far, but I’m not sure whether these commands are correct for my goals, or whether I have missed any important steps.
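Once the run finishes, one way to check whether the output makes sense is to tally hits per repeat class from the `.out` file. This is only a sketch: it assumes the standard whitespace-separated `.out` layout (data rows start with an integer Smith-Waterman score, column 5 is the query name, column 11 is the repeat class/family), and `summarize_rm_out` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RepeatMasker): for each repeat class/family
# print <class> <number of hits> <number of distinct transcripts with a hit>.
# Assumption: standard RepeatMasker .out columns (5 = query, 11 = class/family).
summarize_rm_out() {
  awk '
    # Data rows start with an integer score; header and blank lines do not.
    $1 ~ /^[0-9]+$/ && NF >= 15 {
      hits[$11]++
      if (!(($5, $11) in seen)) { seen[$5, $11] = 1; queries[$11]++ }
    }
    END {
      for (c in hits) printf "%s\t%d\t%d\n", c, hits[c], queries[c]
    }
  ' "$1" | LC_ALL=C sort
}
```

With the command above, the file to pass would be the `.out` file that RepeatMasker writes to `outputDir/`. Counting distinct transcripts separately from hits matters for Iso-Seq data, since a single isoform can carry several fragments of the same family.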
Thanks so much for any guidance or suggestions!
random aside but i wasn't aware that transposons were picked up from rna-seq-like experiments. looks like they are from cursory googling. one paper says "When not properly silenced, TEs can contribute a substantial portion to the cell's transcriptome, but are typically ignored in most RNA-seq data analyses." https://pubmed.ncbi.nlm.nih.gov/29508296/
Well, back in the days when polyA capture (enrichment) was quite standard, those were usually depleted from the transcript set, but nowadays very little enrichment is done and people just use total RNA (potentially depleted of rRNA), which can (and should) also contain TE-derived transcripts.