Question

identify transposable elements in iso-seq data

0


17 days ago

triplee0305 ▴ 10

Hi all,

I’m trying to identify transposable elements (TEs) in PacBio Iso-Seq data from a non-model plant species, for which we don’t have a reference genome. After researching possible tools, I decided to use RepeatMasker along with TE families from the Dfam database, but I’m uncertain if my approach is fully correct. I’d really appreciate it if someone could confirm whether my commands are correct or if I missed any steps!

Steps I’ve Taken:

1. Downloading the Dfam TE Families for Plants (Viridiplantae)

Since I’m working with non-model plants, I specified Viridiplantae to get the plant-specific TE families:

~/tools/miniconda3/share/RepeatMasker/famdb.py -i repeatMasker/share/RepeatMasker/Libraries/famdb/ info

The information about the TE database is shown below:

FamDB Directory     : /gpfs1/SP11/iso-seq/TE/repeatMasker/share/RepeatMasker/Libraries/famdb
FamDB Generator     : famdb.py v1.0
FamDB Format Version: 1.0
FamDB Creation Date : 2023-11-15 11:30:15.311827

Database: Dfam
Version : 3.8
Date    : 2023-11-14

Dfam - A database of transposable element (TE) sequence alignments and HMMs.

2 Partitions Present
Total consensus sequences present: 472219
Total HMMs present               : 472219


Partition Details
-----------------
 Partition 0 [dfam38_full.0.h5]: root - Mammalia, Amoebozoa, Bacteria <bacteria>, Choanoflagellata, Rhodophyta, Haptista, Metamonada, Fungi, Sar, Placozoa, Ctenophora <comb jellies>, Filasterea, Spiralia, Discoba, Cnidaria, Porifera, Viruses
     Consensi: 295552, HMMs: 295552

 Partition 1 [ Absent ]: Obtectomera

 Partition 2 [ Absent ]: Euteleosteomorpha

 Partition 3 [ Absent ]: Sarcopterygii - Sauropsida, Coelacanthimorpha, Amphibia, Dipnomorpha

 Partition 4 [ Absent ]: Diptera

 Partition 5 [dfam38_full.5.h5]: Viridiplantae
     Consensi: 176667, HMMs: 176667

 Partition 6 [ Absent ]: Deuterostomia - Chondrichthyes, Hemichordata, Cladistia, Holostei, Tunicata, Cephalochordata, Cyclostomata <vertebrates>, Osteoglossocephala, Otomorpha, Elopocephalai, Echinodermata, Chondrostei

 Partition 7 [ Absent ]: Hymenoptera

 Partition 8 [ Absent ]: Ecdysozoa - Nematoda, Gelechioidea, Yponomeutoidea, Incurvarioidea, Chelicerata, Collembola, Polyneoptera, Tineoidea, Apoditrysia, Monocondylia, Strepsiptera, Palaeoptera, Neuropterida, Crustacea, Coleoptera, Siphonaptera, Trichoptera, Paraneoptera, Myriapoda, Scalidophora

2. Converting data to FASTA Format for RepeatMasker Compatibility

~/tools/miniconda3/share/RepeatMasker/famdb.py -i repeatMasker/share/RepeatMasker/Libraries/famdb/ families -f embl -a -d Viridiplantae >Viridiplantae.embl

~/tools/miniconda3/share/RepeatMasker/util/buildRMLibFromEMBL.pl Viridiplantae.embl >Viridiplantae2.fa

3. Running RepeatMasker

~/tools/miniconda3/bin/RepeatMasker -pa 16 -lib Viridiplantae2.fa -dir outputDir/ ./iso-seq/BJ202-01P0001.flnc.isoforms.fa

RepeatMasker seems to run without errors so far, but I’m not sure whether these commands are correct for my goals or whether I have missed any important steps.

Thanks so much for any guidance or suggestions!

TE iso-seq RepeatMasker • 423 views

updated 5 days ago by lieven.sterck 15k • written 17 days ago by triplee0305 ▴ 10

0


Random aside, but I wasn't aware that transposons were picked up in RNA-seq-like experiments. From some cursory googling, it looks like they are. One paper says "When not properly silenced, TEs can contribute a substantial portion to the cell's transcriptome, but are typically ignored in most RNA-seq data analyses." https://pubmed.ncbi.nlm.nih.gov/29508296/

6 days ago by cmdcolin ★ 4.0k

1


Well, back in the days when polyA capture (enrichment) was quite standard, those were usually depleted from the transcript set, but nowadays very little enrichment is done and people just use total RNA (potentially depleted for rRNA), which can (and should) also contain TE-derived transcripts.

5 days ago by lieven.sterck 15k

score 0 · Answer 1 · 2024-11-04

Hi,

Yes, what you're doing here seems about right.

The only thing I would add is a back screen against NCBI's non-redundant (nr) database to remove potential false positives from your screen. Since you're using a dedicated TE database, it's quite possible that some transcripts are erroneously assigned as TEs because the 'real' match is not available in that database. A BLAST analysis of the results of your approach against the nr database can remove these false positives: any hit from your screen that has a better match against a non-TE protein (i.e. the description of that hit is NOT TE-related) should be removed again.
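
A minimal sketch of how that back screen could be scripted, under a few assumptions. It takes the standard whitespace-delimited RepeatMasker `.out` table, where column 5 is the query and column 11 is the repeat class/family. It also assumes the TE-annotated transcripts were searched against nr (e.g. with blastx) using `-outfmt "6 qseqid sseqid bitscore stitle"`. The keyword pattern, the list of non-TE classes and the function names are illustrative choices, not part of RepeatMasker or BLAST. As a simplification, "better match" is read here as "the top-scoring nr hit has a description that is not TE-related".

```python
import re

# Assumption: a rough keyword list for recognising TE-related hit descriptions.
# Adjust it after looking at the descriptions in your own BLAST results.
TE_KEYWORDS = re.compile(
    r"transpos|retroelement|reverse transcriptase|integrase|polyprotein|copia|gypsy|helitron",
    re.IGNORECASE,
)

# Assumption: RepeatMasker classes that should not count as TE annotations.
NON_TE_CLASSES = {"Simple_repeat", "Low_complexity", "Satellite", "rRNA", "tRNA", "snRNA"}


def te_transcripts_from_rm_out(lines):
    """Return the query IDs that carry at least one TE annotation in a RepeatMasker .out table."""
    ids = set()
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; the header rows do not.
        if len(fields) >= 11 and fields[0].isdigit():
            if fields[10] not in NON_TE_CLASSES:
                ids.add(fields[4])
    return ids


def best_hits(blast_lines):
    """Keep the highest-bitscore hit per query from 'qseqid sseqid bitscore stitle' rows."""
    best = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        qseqid, _sseqid, bitscore, stitle = line.split("\t", 3)
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, stitle)
    return best


def likely_false_positives(te_ids, best):
    """TE-annotated transcripts whose best nr hit does not look TE-related."""
    return {q for q in te_ids if q in best and not TE_KEYWORDS.search(best[q][1])}
```

In practice you would pass the open file handles for the `.out` file in `outputDir/` and for the BLAST table to the first two functions, then review the flagged transcripts by hand before dropping them, since keyword matching on hit descriptions is only a crude filter.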