Question

How to map ranges found in a concatenated FASTA file back to positions in each individual protein sequence

9.0 years ago

User 6777 ▴ 20

Hi all,

I have a protein fasta file (protein.txt) like:

>a
mnspq
>b
rstuvw
>c
mnqa

Note that the lengths of proteins a, b and c are 5, 6 and 4 respectively (total length = 15).

Now I have extracted some ranges (coordinates are based on the total concatenated length) and saved them (file1.txt) as:

2-3
4-10
11-14

The span of each protein within the concatenated sequence, as seen in the protein file, is saved in another file (file2.txt) as:

a  1-5
b  6-11
c  12-15

Now, using the file1 values, I want to split the ranges along the file2 boundaries and calculate the corresponding ranges within each individual protein sequence. For the above input, the output would be:

a   2-3,4-5
b   1-5, 6
c   2-5

In other words, if I first concatenate all my sequences and determine some ranges on the concatenated sequence, how can I find the corresponding ranges of positions in each individual protein sequence?

Thanks for your consideration.

fasta perl python • 2.1k views

Well, just write a script.
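
A minimal sketch of such a script in Python, assuming the ranges in file1.txt are 1-based and inclusive on the concatenated sequence, and that FASTA headers are unique. The function names (read_fasta_lengths, map_ranges) are made up for illustration, and the example data is inlined rather than read from protein.txt and file1.txt. Each global range is clipped against each protein it overlaps and then shifted by that protein's start offset.

```python
from bisect import bisect_right


def read_fasta_lengths(lines):
    """Return [(name, length), ...] in file order from an iterable of FASTA lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            records.append([line[1:].split()[0], 0])
        elif records:
            records[-1][1] += len(line)
    return [(name, length) for name, length in records]


def map_ranges(lengths, ranges):
    """Map 1-based inclusive ranges on the concatenation to per-protein ranges."""
    names, starts, ends = [], [], []
    offset = 0
    for name, length in lengths:
        names.append(name)
        starts.append(offset + 1)  # global position of the protein's first residue
        offset += length
        ends.append(offset)        # global position of the protein's last residue

    result = {name: [] for name in names}
    for g_start, g_end in ranges:
        # Index of the protein that contains g_start (clamped to the first protein).
        i = max(bisect_right(starts, g_start) - 1, 0)
        # Walk forward over every protein the range overlaps.
        while i < len(names) and starts[i] <= g_end:
            lo = max(g_start, starts[i])
            hi = min(g_end, ends[i])
            if lo <= hi:
                # Shift from global to protein-local coordinates.
                result[names[i]].append((lo - starts[i] + 1, hi - starts[i] + 1))
            i += 1
    return result


if __name__ == "__main__":
    fasta = [">a", "mnspq", ">b", "rstuvw", ">c", "mnqa"]
    file1 = ["2-3", "4-10", "11-14"]
    ranges = [tuple(int(x) for x in r.split("-")) for r in file1]
    mapped = map_ranges(read_fasta_lengths(fasta), ranges)
    for name, local in mapped.items():
        print(name, ",".join(f"{s}-{e}" for s, e in local), sep="\t")
```

With the example from the question this gives 2-3,4-5 for a, 1-5,6-6 for b, and 1-3 for c. The value for c differs from the 2-5 listed in the question; under the 1-based inclusive interpretation assumed here, global positions 12-14 fall on residues 1-3 of a 4-residue protein, so it may be worth double-checking the expected output.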

9.0 years ago by shenwei356 8.7k

As long as you have unique headers in your multi-fasta file, samtools faidx with a region argument should do the extraction part. See this: Extract User Defined Region From An Fasta File. @Matt Shirley also has a Python-based pyfaidx solution.

I am not exactly certain what you are trying to do in the subsequent steps.

Edit: Re-reading your original post, I am not sure this is what you need, but I will leave this here for now in case it helps.

9.0 years ago by GenoMax 153k