python script for deleting basepair in fasta format

0

Entering edit mode

10.7 years ago

bioinformaticssrm2011 ▴ 90

I have a multiple fasta format sequence in a notepad, I want to delete rest of the base above 500 bp, in each sequence. what script I can use.

thanks

sequence python • 4.3k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by bioinformaticssrm2011 ▴ 90

1

Entering edit mode

A simple sed command might do the trick if your fasta sequences are on one line (i.e. not wrapped):

sed --regexp-extended '/^>/! s/^(.{1,500}).*/\1/' input.fasta > output.fasta

ADD REPLY • link 10.7 years ago by Frédéric Mahé ★ 3.2k

2

Entering edit mode

Or using awk (I like awk!):

awk '{print /^>/ ? $0 : substr($1, 1, 500)}' input.fasta > output.fasta

ADD REPLY • link 10.7 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Thanks for the quick reply.

Can I use this command on CentOS Linux terminal directly or do I need to download any other package?

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by bioinformaticssrm2011 ▴ 90

0

Entering edit mode

These commands should work on most Linux or Unix computers without installing any additional package.

ADD REPLY • link 10.7 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Hi Frederic,

I have one more problem, Hope you can help me with that too

I have a single contig of around 5000 bp, which looks like

>contig 00001
ACTTTCAAAAAAATCAGCGTAAAAAACATACTAATTTGGGCAAATTCCCACCTGTTTTTAGGGACATTTT
TCTTTGAATTAGAGCCTCAGCAGCTCGTCATTGCTGAATTTTCTTGAAGTTGCCTCGTTTCTAGGCGCTT
ACGCTGGTCGTAGCCGGTTTCTGACTACGAATGATTTAAAAATTTAGTTTGTCATCTATGCTTCGTACAT
AGATAAAGATAATAATGCCTTTCCATCTATTTTAGATTGAGAGAGGCTTTTTCTGTCATGATTAAATTAG
GAAAGGAGTGAGCACTACTTTGGAAAATATTGAAGAATTATGGAGCGCTACATTAAAAAAAATTGAAGAA
AAATTGAGTAAACCTAGCTTTGATACGTGGCTTAAAAATACAAAAGCAGAAGCATTAGAAAAAGATACAC
TAATTATTTCTGCTCCAAATGAATTCGCAAGAGATTGGTTAGAAAATCAATACACGAACTTAATTTCTCA
GATGCTACTTGAAGTTACAGGATCTGAATTAAATACAAAATTTATCATTCCTGATTCATTAGAAGAAATA
GAAGAGCAAAAGCCTATGCCAAAACCAAAGCAGTCTACAGATACAGGTGACTCACCAAAATCGATGCTTA
ACTCAAAATATACTTTTGATACATTTGTCATTGGAGCTGGTAATCGTTTCGCTCATGCTGCTTCATTAGC

I want to have a multiple contigs of 100 bp, like

>contig 00001
ACTTTCAAAAAAATCAGCGTAAAAAACATACTAATTTGGGCAAATTCCCACCTGTTTTTAGGGACATTTTTCTTTGAATTAGAGCCTCAGCAGCTCGTCATTGCTGAATTTTCTTGAAGTTGCCTCGTTTCTAGGCGCTT
>contig 00002
ACGCTGGTCGTAGCCGGTTTCTGACTACGAATGATTTAAAAATTTAGTTTGTCATCTATGCTTCGTACATAGATAAAGATAATAATGCCTTTCCATCTATTTTAGATTGAGAGAGGCTTTTTCTGTCATGATTAAATTAG
>contig 00003
GAAAGGAGTGAGCACTACTTTGGAAAATATTGAAGAATTATGGAGCGCTACATTAAAAAAAATTGAAGAAAAATTGAGTAAACCTAGCTTTGATACGTGGCTTAAAAATACAAAAGCAGAAGCATTAGAAAAAGATACAC
>contig 00004
TAATTATTTCTGCTCCAAATGAATTCGCAAGAGATTGGTTAGAAAATCAATACACGAACTTAATTTCTCAGATGCTACTTGAAGTTACAGGATCTGAATTAAATACAAAATTTATCATTCCTGATTCATTAGAAGAAATA
>contig 00005
GAAGAGCAAAAGCCTATGCCAAAACCAAAGCAGTCTACAGATACAGGTGACTCACCAAAATCGATGCTTAACTCAAAATATACTTTTGATACATTTGTCATTGGAGCTGGTAATCGTTTCGCTCATGCTGCTTCATTAGC

and so on...

What command should I use?

Best!

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 9.7 years ago by bioinformaticssrm2011 ▴ 90

0

Entering edit mode

Hi, the thread A: How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Seque might answer your question.

ADD REPLY • link 9.7 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

thanks Frederic, it works... thanks alot.

ADD REPLY • link 10.7 years ago by bioinformaticssrm2011 ▴ 90

1

Entering edit mode

Cross posted (to Linkedin???)

ADD REPLY • link 10.7 years ago by Matt Shirley 10k

1

Entering edit mode

10.7 years ago

Cytosine ▴ 460

Don't know of any scripts that do that, but it's really easy to write up one yourself. Try this one with using BioPython...

from Bio import SeqIO
import sys

fastafile, length = sys.argv[1], int(sys.argv[2])
entries = SeqIO.parse(fastafile, "fasta")

for entry in entries:
    print ">"+entry.name
    print entry.seq[:length]

Save it to a file (e.g. parse_fasta.py) and then run it in a terminal (or cmd if you're on windows) with:

python parse_fasta.py <input.fasta> <sequence_lengths> > <output.fasta>

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by Cytosine ▴ 460

0

Entering edit mode

Or slightly more elegantly using SeqIO.write and a generator expression,

from Bio import SeqIO
import sys
fastafile, length = sys.argv[1], int(sys.argv[2])
trimmed = (entry[:length] for entry in SeqIO.parse(fastafile, "fasta"))
SeqIO.write(trimmed, sys.stdout, "fasta")

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by Peter 6.0k

0

Entering edit mode

I am not a bioinformatician, so I don't know where to use this command, I am using CentOS Linux system, if I open terminal window, can I directly type the above script or I need to download any package before using the script.

Thanks.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by bioinformaticssrm2011 ▴ 90

1

Entering edit mode

Running this is as simple as saving the script to a file (like parse_fasta.py from my post) and running the python interpreter over that script (with the commands I mentioned at the end of my post).

If you have a fresh installation, you'll probably be missing the Biopython library. In that case you can install it by following the steps mentioned here: http://centosn00b.blogspot.be/2012/05/biopython-tip.html

Remember to prepend "sudo" to the commands if you do not have administrator privileges.

ADD REPLY • link 10.7 years ago by Cytosine ▴ 460

1

Entering edit mode

Good information, but I would caution against advising beginning users to "make things work" but adding sudo.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by Matt Shirley 10k

0

Entering edit mode

10.7 years ago

Matt Shirley 10k

	from pyfaidx import Fasta
	fa = Fasta('multi.fasta', strict_bounds=True)
	for seq in fa:
	with open('multi.500bp.fasta, 'w') as out:
	out.write(seq[:500])

view raw trim_fasta.py hosted with ❤ by GitHub

pip install pyfaidx

python

And then paste the script, changing your file name.

ADD COMMENT • link 10.7 years ago by Matt Shirley 10k

Login before adding your answer.