Question

parsing fasta file

1

8.3 years ago

a.rex ▴ 350

I have a fasta file that is formatted in the following way:

> gene1 
atctgtctgct 
atcgtc
at

and I want to put it in a format like the following:

> gene1            atctgtctgctatcgtcat

I am struggling to get rid of the whitespace between the lines under the header. Does anyone have any idea how I can do this in Python?

fasta • 2.2k views

ADD COMMENT • link updated 8.3 years ago by rkostadi ▴ 60 • written 8.3 years ago by a.rex ▴ 350

3

Why? Why not leave it in fasta (a format most things already accept)?

ADD REPLY • link 8.3 years ago by Devon Ryan 105k

2

As Devon is getting at, whatever method you're planning to use to 'read' this FASTA file is probably not a good one. Is it, by any chance, awk/sed/tr/grep?

If you tell us what your bigger problem is that you're looking to solve, we might be able to help you there. But creating a new format, unless for a really good reason/implementation, is generally a bad idea for everyone.

ADD REPLY • link 8.3 years ago by John 13k

1

You should look into using Biopython instead of trying to use text parsing. See these pages:
http://www.bioinformatics.org/bradstuff/bp/tut/Tutorial002.html
http://biopython.org/wiki/SeqIO
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11
http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html

ADD REPLY • link 8.3 years ago by steve ★ 3.5k

score 2 · Answer 1 · 2016-09-30

Do you want to convert FASTA format to tab-delimited format?

Writing a Python script with Biopython, or a Perl script with BioPerl, is very convenient. It can also be done with one or more shell commands.

I'd like to introduce the FASTA/Q toolkit SeqKit, which can do this with one command:

seqkit fx2tab seqs.fa > formated.txt

Since spaces exist in the sequences of your sample data, a cleaning step is used to remove them. And because seqkit fx2tab outputs 3 columns for compatibility between FASTA and FASTQ, cut is used to drop the empty third column:

$ seqkit seq --remove-gaps seqs.fa | seqkit fx2tab | cut -f 1,2
gene1   atctgtctgctatcgtcat

After manipulating the tabular format, you can use seqkit tab2fx to convert it back to FASTA format.
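
Since the question asked for Python, here is a minimal standard-library sketch of the same cleaning and tabular conversion. It is not SeqKit's implementation; the function name fasta_to_tab is hypothetical, and it assumes the whole header line after ">" should be kept as the name (SeqKit itself may treat headers differently).

```python
import io


def fasta_to_tab(handle):
    """Yield (name, sequence) pairs from a FASTA file-like object.

    Wrapped sequence lines are joined and any whitespace inside them
    is removed. The name is the header text after '>' (an assumption).
    """
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                # a new header closes the previous record
                yield name, "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # drop spaces/tabs inside the sequence line
            chunks.append("".join(line.split()))
    if name is not None:
        yield name, "".join(chunks)


# The sample record from the question, including its stray spaces.
example = io.StringIO("> gene1 \natctgtctgct \natcgtc\nat\n")
for name, seq in fasta_to_tab(example):
    print(f"{name}\t{seq}")
```

For a real file, pass an open file object instead of the StringIO, e.g. with open("seqs.fa") as fh: and loop over fasta_to_tab(fh). For anything beyond a quick conversion, a tested parser such as Biopython's SeqIO is probably the safer choice.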

score 1 · Answer 2 · 2016-09-30

1

8.3 years ago

Matt Shirley 10k

$ pip install pyfaidx 
$ faidx --transform transposed input.fa | cut -f1,4 > out.tab

That is, if you really want to do this. But I agree it might be better to answer the real question you have not yet asked...

score 1 · Answer 3 · 2016-10-01

1

8.3 years ago

rkostadi ▴ 60

fold -w 60 input.fa