Question

How to separate transposon sub-families from a FASTA file into separate files?


3.3 years ago

ANAM • 0

I'm working on the classification of transposable elements and want to write the sequences of each sub-class to a separate file. Is there any code or tool that can separate the sub-families? The dataset contains thousands of sequence entries for different species.

I really appreciate any help or suggestion!

DATASET SOURCE: https://pgsb.helmholtz-muenchen.de/plant/recat/index.jsp

For example:

I want to put the RLC sequences in their own file, and likewise for the other entries such as RLX and TXX.

>RLC_163294|LTR_Gr_chr_04_982|LTR/Copia|02.01.01.05|29730|Gossypium
tgttagagtagttagtaaagttgttagtagttaaaactgttgtacgttcagttaacagttgagctgttaaatagttgacctgttagttatgcattcatttgagtataaaactatgagaagtctgtacttaaagatatgagttttataatgaagaaattctaagtctttgtttttaagctgcttgtttagcttaacatggtatcag
>RLX_163369|LTR_Gr_chr_10_2326|LTR|02.01.01|29730|Gossypium
tgtcacgggcaaaagtgcaaagcccgtgaccatggcataagatgtgccccatggaggtctatcgattagacaaggaacatttagcccacgagaacttgcccgattcaaaaaactgttggagaagcctgtcagattgaagcctggttggcccgataatgaagacgtggcaacttaggccaattttggt
>TXX_174935|TXX_Gr_DX404975.1_8351|MobileElement|02|29730|Gossypium
atccgtgcccatgccatgtcccagacatggtcttatgggggactctcatctcggtgccaacgccatatcccagacatggtcttacatgggacctctcataatctcaattatgccaatgccatgtcccagacatggtcttacatgggatctctttacccaaatatcatgacatttgtatccattacattcccaatgtttcaacggggcttttatcactgattctctgtcatctcatacttgagttaacattagatattttcatgaaataaatacataattgctggaaaatagcagcattaa

fasta transposons • 1.1k views

updated 20 months ago by Ram 44k • written 3.3 years ago by ANAM • 0


With awk (not tested on a large dataset, so back up your data first; this assumes each sequence is on a single line):

$ awk -F '[>_]' '/^>/{getline seq; print $0"\n"seq > ($2".fa")}' test.fa

With seqkit (works for multi-line sequences and writes each sequence on a single line):

$ seqkit -w 0 split -i --id-regexp '(^[A-Z]{3})_*' test.fa -O new_files

Output files will be stored in a new directory "new_files" and are named test.id_RLC.fa, test.id_RLX.fa, test.id_TXX.fa. To strip the test.id_ prefix from the file names, run rename -n 's/test\.id_//' *.fa inside the new_files folder. With the Perl-based rename, -n only previews the changes; drop it to actually rename the files.

Please note that the regex is deliberately strict, to be on the safe side. If the subfamily IDs start with more than 3 letters, please change the regex accordingly.
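If the sequences span several lines and seqkit is not available, the awk idea above can be extended. The following is only a minimal sketch: `split_by_prefix` is a hypothetical helper name, and it assumes every header looks like `>PREFIX_...`, with the prefix (e.g. RLC) being everything before the first underscore.

```shell
# Hypothetical helper: write each record to <out_dir>/<PREFIX>.fa, where
# PREFIX is the header text between ">" and the first "_".
# Multi-line sequences are kept exactly as they appear in the input.
split_by_prefix() {
    in_fa=$1      # input FASTA file
    out_dir=$2    # output directory (created if missing)
    mkdir -p "$out_dir"
    awk -v dir="$out_dir" '
        /^>/ {
            # Header line: ">RLC_163294|..." gives prefix "RLC"
            prefix = substr($0, 2)
            sub(/_.*/, "", prefix)
            out = dir "/" prefix ".fa"
        }
        # Print header and sequence lines to the file chosen by the last header;
        # anything before the first header is skipped.
        out != "" { print > out }
    ' "$in_fa"
}

# Example call (assumed file names):
#   split_by_prefix test.fa new_files
```

Existing output files with the same names are overwritten on each run, so the backup advice above still applies.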

3.3 years ago by cpad0112 21k

score 2 · Accepted Answer · 2021-08-07


3.3 years ago

GenoMax 147k

You can linearize the fasta file (code by @Pierre), search for the pattern you want and then reformat back to fasta.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  your.fa | grep "^>RLC" | tr "\t" "\n" > RLC.fa
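To repeat this for several families, the same pipeline can be wrapped in a small function. This is a sketch under a few assumptions: `extract_family` is a hypothetical name, headers contain no tab characters, and the underscore is added to the grep pattern so that a prefix only matches whole family codes.

```shell
# Hypothetical wrapper around the linearize / grep / reformat pipeline.
# Prints all records whose header starts with ">PREFIX_" to standard output,
# with each sequence on a single line.
extract_family() {
    prefix=$1   # family code, e.g. RLC
    in_fa=$2    # input FASTA file
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$in_fa" |
        grep "^>${prefix}_" |
        tr "\t" "\n"
}

# Example loop over the three codes from the question (assumed input name):
#   for p in RLC RLX TXX; do extract_family "$p" your.fa > "$p.fa"; done
```

Each call rereads the whole input, so for many families a single-pass split is likely faster on a large dataset.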