Question

Removing gaps from fasta sequence file

0

Entering edit mode

6.3 years ago

mdsiddra ▴ 30

I am using python (3.6)/biopython(1.72) to read sequence files. I have an aligned sequence file in fasta format.

>Human
----------------------------MRLRVRLLKRTWPLEVPETEPTL-RSHLRQSLLCT-IPSSTDSEHSSLQN-NEQPSL
>Chimpanzee
----------------------------MRLRVRLLKRTWPLEVPETEPTL-RSRLRQSLLCT-IPSSTDSEHSSLQN-NEQPSL
>Dog
----------------------------MKLRVRLQKRTWPLDLPDAEPTL-RAHLSQALLPS-LPSSTDSEHSSLQN-NDPPSL
>Mouse
----------------------------MKLRVRLQKRTQPLEVPESEPTL-RAHLSQVLLPT-LPSSTDTEHSSLQD-NDQPSL

I need to remove the gaps '-' from the file and have the result file like this:

>Human
MRLRVRLLKRTWPLEVPETEPTLRSHLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Chimpanzee
MRLRVRLLKRTWPLEVPETEPTLRSRLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Dog
MKLRVRLQKRTWPLDLPDAEPTLRAHLSQALLPSLPSSTDSEHSSLQNNDPPSL
>Mouse
MKLRVRLQKRTQPLEVPESEPTLRAHLSQVLLPTLPSSTDTEHSSLQDNDQPSL

I have been trying this using python:

file_var = input ("Enter your file name: ")
sequences = []
for seq_record in SeqIO.parse(file_var, "fasta"):
    sequences.append(seq_record.seq)
print (sequences)
list2 = [] # list for extracting "-"
list3 = [] # list for sequence without "-"

for seq_record in alignment:
    if "-" in alignment:
        list2.append(seq_record)
    else:
        list3.append(seq_record)

But this outputs me the error:

    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.

Can I have any suggestions?? (P.S: I have been working with sequence file using windows OS, not linux)

Python biopython • 7.1k views

ADD COMMENT • link updated 3.1 years ago by Joe 21k • written 6.3 years ago by mdsiddra ▴ 30

1

Entering edit mode

you should be using ungap function in biopython from here: https://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html. some thing like seq.ungap() assuming that seq object holds the sequence.

from Bio.Seq import Seq
from Bio import SeqIO
with open("out.fa", "w") as o:
    for record in SeqIO.parse("test.fa", "fasta"):
        record.seq = record.seq.ungap("-")
        SeqIO.write(record, o, "fasta")

output (input is from OP fasta):

$ cat out.fa 
>Human
MRLRVRLLKRTWPLEVPETEPTLRSHLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Chimpanzee
MRLRVRLLKRTWPLEVPETEPTLRSRLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Dog
MKLRVRLQKRTWPLDLPDAEPTLRAHLSQALLPSLPSSTDSEHSSLQNNDPPSL
>Mouse
MKLRVRLQKRTQPLEVPESEPTLRAHLSQVLLPTLPSSTDTEHSSLQDNDQPSL

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

Entering edit mode

Hello mdsiddra!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/4846/removing-sequences-gaps-from-fasta-file

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 6.3 years ago by Benn 8.3k

0

Entering edit mode

Thankyou for bringing it to my notice, I will better be careful next time.

ADD REPLY • link 6.3 years ago by mdsiddra ▴ 30

0

Entering edit mode

Hi, you can also use the "Undo alignment" function of the SEDA software (https://www.sing-group.org/seda/manual/operations.html#undo-alignment). Regards.

ADD REPLY • link 6.3 years ago by Hugo ▴ 380

0

Entering edit mode

hello I want to remove columns that contain gaps in multiple sequence alignment like human ATGC-GTGC--- campanz GT-A-TTGC---

output: human ATCGTGC campanz GTATTGC

ADD REPLY • link 3.1 years ago by ramsha • 0

0

Entering edit mode

OK - and what about the solutions provided here don't resolve that for you?

ADD REPLY • link 3.1 years ago by Joe 21k

score 2 · Accepted Answer · 2018-08-06

2

Entering edit mode

6.3 years ago

finswimmer 16k

Hello mdsiddra,

there is more than one thing weired in your code:

file_var = input ("Enter your file name: ")

Of course it depends on what your program should do. But in bioinformatics it is more common to give inputs/outputs as parameters to the program.

for seq_record in alignment:

You've never defined alignment before.

if "-" in alignment:
        list2.append(seq_record)
    else:
        list3.append(seq_record)

If I assume you are thinking that in alignment there is the sequence string itself, than here you are checking whether there is a - in. But you never remove - from your sequence.

This should work better:

import sys
from Bio import SeqIO
from Bio.Seq import Seq

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(output_file, "w") as output:
    for record in SeqIO.parse(input_file, "fasta"):
        record.seq = Seq(str(record.seq).replace("-", ""))
        SeqIO.write(record, output, "fasta")

Run it like this:

$ remove.py input.fasta output.fasta

fin swimmer

ADD COMMENT • link 6.3 years ago by finswimmer 16k

0

Entering edit mode

@finswimmer #1: I will modify the input file to be as argument. But whenever I use this ;

input_file = sys.argv[1]
output_file = sys.argv[2]

I come across with error like this:

    input_file = sys.argv[1]
IndexError: list index out of range

#2: the "alignment" comes from;

alignment = AlignIO.read(open(file_var), "fasta")
for seq_record in alignment:
    sequences.append(seq_record.seq)
print (sequences)

#3: Would it be fine to read sequence file using AlignIO function of Biopython??

ADD REPLY • link 6.3 years ago by mdsiddra ▴ 30

2

Entering edit mode

#3: with alignIO.read:

with open("out.fa", "w") as o:
    for record in AlignIO.read("test.fa", "fasta"):
        record.seq = record.seq.ungap("-")
        SeqIO.write(record, o, "fasta")

ADD REPLY • link 6.3 years ago by cpad0112 21k

1

Entering edit mode

I guess first error is due to building code within IDE, without input. For eg. if you build the code in ST3, one would get the error. This would go away if you run the script with input and output from the shell. As fin or some other star here suggested, please do not use input in python code. @ mdsiddra

ADD REPLY • link 6.3 years ago by cpad0112 21k

1

Entering edit mode

mdsiddra you need to provide the arguments on the commandline, as finswimmer showed you. This was discussed in your last question.

If you want to provide the arguments 'interactively' you need to replace the lines:

input = sys.argv[1]
output = sys.argv[2]

with

input = str(input("Enter input file: "))
output = str(input("Enter output file: "))

This error:

input_file = sys.argv[1]
IndexError: list index out of range

Means that the command wasn't run correctly, as it didn't find the right number of commandline arguments.

ADD REPLY • link 6.3 years ago by Joe 21k

0

Entering edit mode

Yeah okay, I must follow the suggestions given. Thankyou for your kind response.

ADD REPLY • link 6.3 years ago by mdsiddra ▴ 30

1

Entering edit mode

To #1 and #3 other people already posted solution.

Concerning #2: You haven't posted this line in your initial post. Please always post your complete code. Or at least in example that we can run and see the same behaviour as you. Otherwise it's hard to help.

fin swimmer

ADD REPLY • link 6.3 years ago by finswimmer 16k

0

Entering edit mode

Alright, I will take care about these mistakes. Thankyou so much for giving me this much guidance.

ADD REPLY • link 6.3 years ago by mdsiddra ▴ 30

0

Entering edit mode

mdsiddra You have not been accepting answers to your questions. Please go back through your post history and accept any answers you like (you can accept more than one), by clicking the tick symbol on the left hand side - I have done some for you.

ADD REPLY • link 6.3 years ago by Joe 21k

0

Entering edit mode

I had been upvoting every answer to my question except one that I did'nt accept, for which you have made me attentive. Thankyou for this kind guidance.

ADD REPLY • link 6.3 years ago by mdsiddra ▴ 30