Question

Extract Sequence From Fasta and remove some characters before printing

0

Entering edit mode

7.7 years ago

Light ▴ 20

Hello,

I have a list of UniProt Id's in the first column and signal peptide length(eg 25 below) in the second column in a file (say seqid_len.txt). eg of a single entry

tr_GHAT8X                25
tr_GHAMNO                26

I want to extract fasta seq for each UniProt id which is a subset of seqid in a large fasta file(say fasta_db.fasta), print those seq in a new file after removing the length(from 1 to 25, in eq) of the signal peptides. How could I do this?

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
MJHJKAHKJHSKBABXNXNELRVYAISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY
GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein 
MJHAHLFLAJFLJALFNNCLAN;CNALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH
CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP

Output file sample(1to25aa removed)

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
ISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY
GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP

Please suggest me if there is an alternate way for the solution in python/perl.

Thanks

fasta python biopython • 5.9k views

ADD COMMENT • link updated 7.7 years ago by st.ph.n ★ 2.7k • written 7.7 years ago by Light ▴ 20

1

Entering edit mode

using seqkit and bash: input fasta file:

$ cat test.fa

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
MJHJKAHKJHSKBABXNXNELRVYAISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein 
MJHAHLFLAJFLJALFNNCLAN;CNALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP

input ids:

$ cat ids.txt 
tr_GHAT8X 25
tr_GHAMNO 26

code:

$ sed 's/_/|/g' ids.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fa | seqkit subseq -r ${line##[a-z]* }:-1 ; done

output:

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
AISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein 
ALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP

Note: Clean up your fasta. It has special characters (;) and spaces in between AA.

Assumptions:

Fasta sequences are linearized
IDs and signal peptide lenght in ids.txt are separated by space.

ADD REPLY • link 7.7 years ago by cpad0112 21k

0

Entering edit mode

Thank You @cpad0112 for ur help. I was hoping to use python for the purpose.

ADD REPLY • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

oh okay. Moving the post as comment.

ADD REPLY • link 7.7 years ago by cpad0112 21k

0

Entering edit mode

This wouldn't be too hard in (bio)python, do you have any experience in programming (in python)?

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

0

Entering edit mode

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

Please check if the displayed format is still accurate.

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

0

Entering edit mode

Thank You. Everything seems in the correct format. I recently started to learn python basics commands.

ADD REPLY • link 7.7 years ago by Light ▴ 20

score 2 · Answer 1 · 2017-11-07

LInearize fasta, and assumes tab-delimited seqid_len.txt.

#!/usr/bin/env python

# Empty dictionary; {key:value}; {seqid:position, seqid2:pos, ..} Calling lendict[seqid] give the pos in integer
lendict = {}

# open file with seid '\t' pos    
with open('seqid_len.txt', 'r') as lens:
    # for each line in the file
    for line in lens:
    # populate dictionary, with {seqid:pos}
        lendict[line.strip().split('\t')[0]] = int(line.strip().split('\t')[1]


# open intput fasta file for reading
with open('species.fasta', 'r') as f:
# open output file for writing
    with open('fasta_pruned.fasta', 'w') as out:
    # for line in input fasta
        for line in f:
        # if line starts with '>'; i.e is a header
            if line.startswith(">"):
            # print the header to the output file; and on the next line of the output file write the sequence, but use the position from the dictionary, based on key value; to print the sequence from that position to the end of the sequence
                out.write(line.strip, '\n', next(f)[lendict[('_'.join(line.strip().split('|')[:2])]:])

score 0 · Answer 2 · 2017-11-05

0

Entering edit mode

7.7 years ago

Asaf 10k

I think the best way would be to generate a bed file from the coordinates file you have, defining the region of the signal peptide, this could easily be done with awk for instance and then use bedtools maskfasta to remove these regions.

ADD COMMENT • link 7.7 years ago by Asaf 10k

0

Entering edit mode

Thanks @Asaf. But I wanted to perform using python.

ADD REPLY • link 7.7 years ago by Light ▴ 20

1

Entering edit mode

Then it will be much easier. Read the coordinates file to a dictionary and then read the sequences using SeqIO.parse and using the names of the sequences print the subsequences. Look at the biopython cookbook for more information

ADD REPLY • link 7.7 years ago by Asaf 10k

0

Entering edit mode

sounds easy but for new on python programming like me it looks a high wall :D

ADD REPLY • link 7.7 years ago by Light ▴ 20

1

Entering edit mode

I could just write the script but maybe if we do this step by step, then you'll learn more in the long run.

Step1:

We first need to read the file with the signal peptide lengths. For every line in the file we make an entry in a dictionary coupling the name to the length.

with open("seqid_len.txt") as seqlengths:
    # read the file line by line
    # construct the dictionary

We can go to step 2 if you have written the code for this - or let us know if you get stuck.

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

score 0 · Answer 3 · 2017-11-05

0

Entering edit mode

7.7 years ago

Light ▴ 20

for line in seqlengths:
     split_IDlength = line.split(' ')
     ID = split_IDlength[0]
     Length = split_IDlength[1]
dict = {"ID": , "Length":}

Thank You for the help. I am trying.

ADD COMMENT • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

Print out the dictionary, it's not what you want to get. The start of your code was good, but you need to look at how to make a dictionary again.

I would suggest something like:

lengthdict = dict() 
for line in seqlengths:
     split_IDlength  = line.strip().split(' ')
     lengthdict[split_IDlength[0]] = split_IDlength[1]

Typing code on my phone is a bit hard, hope I didn't make ridiculous mistakes. Will follow up tomorrow morning (CET).

You should also have a look at the biopython tutorial and cookbook (which is great!) on how to parse and slice a fasta file.

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

0

Entering edit mode

Ok, will look at it.

ADD REPLY • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

Light : Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 7.7 years ago by GenoMax 152k

0

Entering edit mode

Sorry about that. I will keep it in mind.

ADD REPLY • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

Let us know when you need further help.

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

2

Entering edit mode

from Bio import SeqIO
from Bio import Seq

f1 = open('fasta_pruned.fasta','w')

lengthdict = dict() 
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        split_IDlength  = line.strip().split(' ')
        lengthdict[split_IDlength[0]] =int( split_IDlength[-1])


with open("species.fasta","rU") as spe:
    for record in SeqIO.parse(spe,"fasta"):
        if record[0] == '>' :
            split_header = line.split('|')
            accession_ID = split_header[1]
            if accession_ID in lengthdict:
                f1.writestrseq_record.id) + "\n")
                f1.write(str(seq_record.seq[split_IDlength[1]-1:]))



f1.close()

For 2nd step, i want to read the fasta file and split the seq id line extract the unique id for comparison. But stuck how to use it dictionary and remove chars. doubt my syntax also :(

ADD REPLY • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

You are getting pretty close, but I would suggest the following:

for record in SeqIO.parse("species.fasta", "fasta"):
    accession_ID = record.id.split('|')[1]
    trim_length = int(lengthdict[accession_ID])
    record.seq = record.seq[trim_length:]
    print(record.format("fasta")

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

0

Entering edit mode

trim_length = int(lengthdict[accession_ID]

Trim_length will hold accession id?? not understanding it.

ADD REPLY • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

You do not know what a dictionary is eh...

Why don't you make the dictionary and play with it a bit to see how it works?

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

0

Entering edit mode

working on it now..Thanks

ADD REPLY • link 7.7 years ago by Light ▴ 20

0

Entering edit mode

I added comments to my answer above. hope it helps.

ADD REPLY • link 7.7 years ago by st.ph.n ★ 2.7k