2.9 years ago
Mason
Hi,
I have a multifasta file with one sequence per line that is sorted by scaffold then coordinate within the scaffold. Example below:
>scaffold1:1851005..1851521_LTR#LTR/unknown
TGTGACAGCGCCATCGT
>scaffold1:2063846..2064928_LTR#LTR/Gypsy
TGATAAGACAAGCTCTACGCTTG
>scaffold1:2064929..2073244_INT#LTR/Gypsy
TAAACTTTGGAGTGCGTTCAGACA
>scaffold1:2088220..2090221_LTR#LTR/Gypsy
TGTCATGAAATATTCTAATCCACAT
>scaffold2:1003216..1004021_LTR#LTR/Gypsy
TGTTACGGTGTTTTTATTGAGG
>scaffold2:1022539..1026480_INT#LTR/unknown
GAGGTAACTCTTTTGAAAGAAAAGATTACTAAAC
I am looking to rename the sequences so that each has its own unique identifier, except for those that flank one another on the same scaffold. Flanking sequences will share the same identifier but will be distinguished by _LTR or _INT. Given the example above, the output would look like this:
>ID1_LTR#LTR/unknown
TGTGACAGCGCCATCGT
>ID2_LTR#LTR/Gypsy
TGATAAGACAAGCTCTACGCTTG
>ID2_INT#LTR/Gypsy
TAAACTTTGGAGTGCGTTCAGACA
>ID3_LTR#LTR/Gypsy
TGTCATGAAATATTCTAATCCACAT
>ID4_LTR#LTR/Gypsy
TGTTACGGTGTTTTTATTGAGG
>ID5_INT#LTR/unknown
GAGGTAACTCTTTTGAAAGAAAAGATTACTAAAC
Does anyone have a solution to this problem? I have been trying to write my own shell and Perl scripts, with little success.
M.
Are they in order? Or does one have to do coordinate math to confirm/detect flanking? (It looks like the latter, which makes it a potentially much harder problem).
All of them are in order, but not all are flanking. I have made some progress using multiple bash commands and intermediate files, but it is not a full-blown solution. Posting my jank code in case it helps someone else.
I first grabbed the sorted FASTA headers and filtered them into a tab-delimited file.
Then I found all the flanking terms and their corresponding names in the original fasta using the following:
I then replaced the names in the FASTA file with those from TE_flank_table using awk, after splitting with cut. I still need to rename the rest of the elements, but that is not an issue for my downstream process.
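For anyone following along, the header-to-table step could be sketched in Python roughly as below. This is not the original command from this post. It assumes every header looks like the ones in the question (`>scaffold:start..end_TYPE#class`), and `header_to_row` is a made-up helper name.

```python
import re

# Assumed header layout, taken from the example in the question:
#   >scaffold:start..end_TYPE#class
HEADER_RE = re.compile(
    r"^>(?P<scaffold>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)_(?P<suffix>.+)$"
)

def header_to_row(header):
    """Split one FASTA header into tab-separated scaffold, start, end, suffix."""
    m = HEADER_RE.match(header.strip())
    if m is None:
        raise ValueError(f"unexpected header format: {header!r}")
    return "\t".join([m["scaffold"], m["start"], m["end"], m["suffix"]])

# One header from the question, turned into a tab-delimited row.
print(header_to_row(">scaffold1:2063846..2064928_LTR#LTR/Gypsy"))
```

Keeping the start and end as separate columns is what makes the later flanking check a simple subtraction.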
M.
Flanking is defined here as a difference of 1 between the previous entry's end and the current entry's start. You can change this distance to suit your data.
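A minimal Python sketch of that rule follows, assuming the input is already sorted by scaffold and coordinate and that headers follow the `>scaffold:start..end_TYPE#class` pattern from the question. The names `rename_headers` and `max_gap` are placeholders chosen for this example, not part of any existing tool.

```python
import re

# Assumed header layout: >scaffold:start..end_TYPE#class
HEADER_RE = re.compile(r"^>([^:]+):(\d+)\.\.(\d+)_(.+)$")

def rename_headers(lines, max_gap=1):
    """Yield FASTA lines with headers renamed to ID<n>_<suffix>.

    A record shares the previous record's ID when both are on the same
    scaffold and (current start - previous end) is between 1 and max_gap.
    """
    counter = 0
    prev_scaffold, prev_end = None, None
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER_RE.match(line)
        if m is None:
            # Sequence line (or a header that does not fit the assumed
            # pattern): pass it through unchanged.
            yield line
            continue
        scaffold, suffix = m.group(1), m.group(4)
        start, end = int(m.group(2)), int(m.group(3))
        flanking = scaffold == prev_scaffold and 0 < start - prev_end <= max_gap
        if not flanking:
            counter += 1
        yield f">ID{counter}_{suffix}"
        prev_scaffold, prev_end = scaffold, end

# Small demo on three records from the question.
example = [
    ">scaffold1:2063846..2064928_LTR#LTR/Gypsy",
    "TGATAAGACAAGCTCTACGCTTG",
    ">scaffold1:2064929..2073244_INT#LTR/Gypsy",
    "TAAACTTTGGAGTGCGTTCAGACA",
    ">scaffold1:2088220..2090221_LTR#LTR/Gypsy",
    "TGTCATGAAATATTCTAATCCACAT",
]
for out in rename_headers(example):
    print(out)
```

To run it on a real file, pass an open file handle instead of the `example` list. Note that a chain of adjacent records (for instance LTR, INT, LTR, each starting one base after the previous end) would all receive the same ID under this rule.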