Question

Parsing fasta file by coordinates


8 months ago

sorrymouse ▴ 120

I have hundreds of files that look like this:

>contig_204_1:1363108-1362734_r2d2_1
NIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSLNLDDTLAELESQRSLIFTTYRELVKDPEVVKDYSVK
>contig_204_1:1363279-1362458_r2d2_1
IPPLYSGRSKRDSKHMAAIKLLKVLRTVPSFMDADKNSNSVKKHEEHHIIEDELSYSNIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSL
>contig_204_1:1363479-1363288_r2d2_1
DKTAVSQLHELCSRAKQGTPEFRYKESPDGGFHCEASLLSYACYGTGKILKDYIYYLLLSDFLR
>contig_495_1:194415-194203_r2d2_1
KTPVSILQELLSRRGIT-PGYELVQIEGAIHEPTFRFRVSFKDKDLSFTAMGAGRSKKEAKHTAARALIDKL

What I need to do is separate the sequences that come from the same region from those that don't, so that for this file the output would be two files split like this:

File 1

>contig_204_1:1363108-1362734_r2d2_1
NIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSLNLDDTLAELESQRSLIFTTYRELVKDPEVVKDYSVK
>contig_204_1:1363279-1362458_r2d2_1
IPPLYSGRSKRDSKHMAAIKLLKVLRTVPSFMDADKNSNSVKKHEEHHIIEDELSYSNIDMLKELRDFCVRRKMPLPVIEIVQQCGTPDAPEFVACCSVATIKRYGKSDKKKDARQRAAINVLNVISNDCDKADEEKKGLVGLNSL
>contig_204_1:1363479-1363288_r2d2_1
DKTAVSQLHELCSRAKQGTPEFRYKESPDGGFHCEASLLSYACYGTGKILKDYIYYLLLSDFLR

File 2

>contig_495_1:194415-194203_r2d2_1
KTPVSILQELLSRRGIT-PGYELVQIEGAIHEPTFRFRVSFKDKDLSFTAMGAGRSKKEAKHTAARALIDKL

The sequences need to actually overlap in coordinates; they can't just match the contig name.

I can do all kinds of manipulations to make this easier, like removing the _r2d2_1 from the header or whatever, but I can't think of even the slightest direction to take in a way that wouldn't require any manual inspection.

linux fasta • 725 views

ADD COMMENT • link updated 8 months ago by GenoMax 148k • written 8 months ago by sorrymouse ▴ 120


Parse the names separately and create groups then use seqkit-like tools to extract sequences by name.
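For the "extract sequences by name" half of this suggestion, a minimal Python sketch could look like the following. It assumes the groups of header names have already been worked out some other way; the function names `read_fasta` and `extract_by_name` and the group labels are hypothetical, not part of seqkit or any other tool.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA text; the header excludes '>'."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_name(text, groups):
    """Return {group label: FASTA text} holding the records named in each group.

    `groups` maps an arbitrary label (hypothetical, e.g. an output file name)
    to a list of header names without the leading '>'.
    """
    records = dict(read_fasta(text))
    return {
        label: "".join(f">{name}\n{records[name]}\n" for name in names)
        for label, names in groups.items()
    }
```

Each value in the returned dictionary can then be written to its own output file.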

ADD REPLY • link 8 months ago by Ram 44k


It looks more like a clustering problem.

ADD REPLY • link 8 months ago by shenwei356 8.7k


Am I understanding this correctly, that what you are looking for is any sequence where any part of the coord range overlaps with any other part of another coord range? Even if that overlap were just one or two base positions?

ADD REPLY • link 8 months ago by Joe 21k


Yes, that's correct. A lot of the responses are addressing problems other than the one I have - I don't want to cluster the sequences based on similarity, I just want to split the file based on coordinates.

ADD REPLY • link 8 months ago by sorrymouse ▴ 120


"sequences that are from the same region from those that aren't" could refer to sequence composition, and that is why we were suggesting clustering options.

Mentioning that you wanted to use the coordinate information present in the fasta headers would have made things clear.

You could convert the headers into a BED-like format and then use intersectBed (bedtools intersect) to find overlaps. Once you have identified the ID groups, you can pull the sequences out of the file with filterbyname.sh from the BBMap suite to make multiple files.

>contig_204_1:1363108-1362734_r2d2_1

becomes

contig_204_1     1362734     1363108     r2d2_1

(The two coordinates are swapped here because BED expects the start to be no larger than the end.)
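If you would rather stay in one script, the header-to-interval conversion and the overlap grouping can be sketched in Python roughly as below. This is only a sketch under some assumptions: every header follows the `contig:start-end_suffix` pattern shown in the question, coordinates are treated as inclusive, and reverse-strand hits (start larger than end) are swapped so that start ≤ end. The names `header_to_interval`, `group_by_overlap` and the `max_gap` parameter are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed header layout: <contig>:<coord>-<coord>_<suffix>
HEADER_RE = re.compile(r"^(?P<contig>.+):(?P<a>\d+)-(?P<b>\d+)_(?P<tag>.+)$")


def header_to_interval(header):
    """Return (contig, start, end) with start <= end for one FASTA header."""
    m = HEADER_RE.match(header.lstrip(">"))
    if m is None:
        raise ValueError(f"unexpected header format: {header!r}")
    a, b = int(m["a"]), int(m["b"])
    return m["contig"], min(a, b), max(a, b)


def group_by_overlap(headers, max_gap=0):
    """Group headers whose intervals overlap on the same contig.

    With max_gap=0 two intervals must share at least one position.
    A positive max_gap (hypothetical knob) also joins intervals that are
    separated by at most that many positions.
    """
    by_contig = defaultdict(list)
    for header in headers:
        contig, start, end = header_to_interval(header)
        by_contig[contig].append((start, end, header))

    groups = []
    for contig in sorted(by_contig):
        current, current_end = [], None
        # Sweep the intervals in order of start coordinate.
        for start, end, header in sorted(by_contig[contig]):
            if current and start > current_end + max_gap:
                groups.append(current)
                current, current_end = [], None
            current.append(header)
            current_end = end if current_end is None else max(current_end, end)
        groups.append(current)
    return groups


if __name__ == "__main__":
    example = [
        "contig_204_1:1363108-1362734_r2d2_1",
        "contig_204_1:1363279-1362458_r2d2_1",
        "contig_204_1:1363479-1363288_r2d2_1",
        "contig_495_1:194415-194203_r2d2_1",
    ]
    for i, group in enumerate(group_by_overlap(example), start=1):
        print(f"File {i}: {group}")
```

One thing to watch for with the example in the question: contig_204_1:1363479-1363288 starts 9 positions past the end of contig_204_1:1363279-1362458, so with strict overlap (`max_gap=0`) it lands in a group of its own and the four headers split into three groups. Calling `group_by_overlap(example, max_gap=9)` gives the two-file split shown in the question.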

ADD REPLY • link 8 months ago by GenoMax 148k