9 months ago
sorrymouse
I have hundreds of files that look like this:
>contig_204_1:1363108-1362734_r2d2_1
NIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSLNLDDTLAELESQRSLIFTTYRELVKDPEVVKDYSVK
>contig_204_1:1363279-1362458_r2d2_1
IPPLYSGRSKRDSKHMAAIKLLKVLRTVPSFMDADKNSNSVKKHEEHHIIEDELSYSNIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSL
>contig_204_1:1363479-1363288_r2d2_1
DKTAVSQLHELCSRAKQGTPEFRYKESPDGGFHCEASLLSYACYGTGKILKDYIYYLLLSDFLR
>contig_495_1:194415-194203_r2d2_1
KTPVSILQELLSRRGIT-PGYELVQIEGAIHEPTFRFRVSFKDKDLSFTAMGAGRSKKEAKHTAARALIDKL
What I need to do is separate the sequences that come from the same region from those that don't. For this file, the output would be two files, split like this:
File 1
>contig_204_1:1363108-1362734_r2d2_1
NIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSLNLDDTLAELESQRSLIFTTYRELVKDPEVVKDYSVK
>contig_204_1:1363279-1362458_r2d2_1
IPPLYSGRSKRDSKHMAAIKLLKVLRTVPSFMDADKNSNSVKKHEEHHIIEDELSYSNIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSL
>contig_204_1:1363479-1363288_r2d2_1
DKTAVSQLHELCSRAKQGTPEFRYKESPDGGFHCEASLLSYACYGTGKILKDYIYYLLLSDFLR
File 2
>contig_495_1:194415-194203_r2d2_1
KTPVSILQELLSRRGIT-PGYELVQIEGAIHEPTFRFRVSFKDKDLSFTAMGAGRSKKEAKHTAARALIDKL
The sequences need to actually overlap in coordinates; they can't just match the contig name.
I can do all kinds of manipulations to make this easier, like removing the _r2d2_1 from the header or whatever, but I can't think of even the slightest direction to take in a way that wouldn't require any manual inspection.
Parse the names separately and create groups then use seqkit-like tools to extract sequences by name.
It looks more like a clustering problem.
Am I understanding this correctly, that what you are looking for is any sequence where any part of the coord range overlaps with any other part of another coord range? Even if that overlap were just one or two base positions?
Yes, that's correct. A lot of the responses are addressing problems other than the one I have. I don't want to cluster the sequences based on similarity, I just want to split the file based on coordinates.
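To make the "any overlap at all, even one position" rule concrete, here is a minimal sketch of the test in Python. The function name `overlaps` is made up for illustration, and it assumes coordinates are inclusive and that both ranges are already known to be on the same contig. Since the example headers list the larger coordinate first (e.g. 1363108-1362734), each pair is put in low/high order before comparing.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two inclusive ranges share at least one position.

    Assumes both ranges are on the same contig; that check is left
    to the caller.
    """
    # Headers may list the larger coordinate first, so normalise
    # each pair to (low, high) before comparing.
    a_lo, a_hi = sorted((a_start, a_end))
    b_lo, b_hi = sorted((b_start, b_end))
    return a_lo <= b_hi and b_lo <= a_hi


# First and second records from the example file
print(overlaps(1363108, 1362734, 1363279, 1362458))
# Second and third records from the example file
print(overlaps(1363279, 1362458, 1363479, 1363288))
```

One thing this shows about the example data: under a strict overlap rule the third contig_204_1 record (1363288-1363479) starts 9 positions after the second one ends (1363279), so it only lands in File 1 if nearby ranges are allowed to count as the same region.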
could refer to sequence composition and that is the reason we were suggesting clustering options.
Adding "using coordinate information present in fasta headers" would have made things clear.
You could convert the headers into a BED-like format and then use intersectBed to find overlaps. Once you identify the ID groups, you can pull the sequences out of the file using filterbyname.sh from the BBMap suite to make multiple files.
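If you would rather stay in a single script than go through BED files, the same idea can be sketched in plain Python: parse the contig and coordinates out of each header, sort the ranges per contig, and sweep through them, starting a new group whenever a range begins past the end of the current group. This is only a sketch under assumptions: the header layout is taken to be `>contig:start-end_suffix` as in the example, and the function names and the `max_gap` tolerance are made up for illustration (with `max_gap=0` only true overlaps are merged).

```python
import re
from collections import defaultdict

# Assumed header layout, based on the example: >contig:start-end_suffix
HEADER_RE = re.compile(r"^>(?P<contig>[^:]+):(?P<a>\d+)-(?P<b>\d+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def group_by_overlap(records, max_gap=0):
    """Group records whose ranges overlap on the same contig.

    max_gap is a hypothetical tolerance: ranges separated by at most
    this many positions are merged too (0 means strict overlap).
    """
    by_contig = defaultdict(list)
    for header, seq in records:
        m = HEADER_RE.match(header)
        if m is None:
            raise ValueError(f"unexpected header: {header}")
        # Coordinates may be written high-low, so normalise the order.
        lo, hi = sorted((int(m["a"]), int(m["b"])))
        by_contig[m["contig"]].append((lo, hi, header, seq))

    groups = []
    for contig in sorted(by_contig):
        current, current_hi = [], None
        # Sorted by start, so a range can only join the current group
        # if it begins at or before the furthest end seen so far.
        for lo, hi, header, seq in sorted(by_contig[contig]):
            if current and lo > current_hi + max_gap:
                groups.append(current)
                current, current_hi = [], None
            current.append((header, seq))
            current_hi = hi if current_hi is None else max(current_hi, hi)
        if current:
            groups.append(current)
    return groups


def write_groups(groups, prefix):
    """Write each group to its own file: prefix_1.fasta, prefix_2.fasta, ..."""
    for i, group in enumerate(groups, start=1):
        with open(f"{prefix}_{i}.fasta", "w") as out:
            for header, seq in group:
                out.write(f"{header}\n{seq}\n")
```

Usage would look something like `write_groups(group_by_overlap(read_fasta(open("input.fasta"))), "group")`, where the file name and prefix are placeholders. On the four example records, strict overlap gives three groups rather than two, because the third contig_204_1 range starts 9 positions past the end of the second; passing `max_gap=9` or more puts it back with the other two.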