Help With Creating Fasta Files For Dnasp And Other Programs!
3
1
Entering edit mode
13.3 years ago
User 7433 ▴ 170

Hi there,

I have an excel file which contains DNA sequence information (over 7000 nucleotides) for 4000 chromosomes. I want to create a fasta file that gives the nucleotide sequence for each chromosome..like this

>chromosome1
AGTGATGCGCGCTCGCGCTCGCCTCCGGCGCGCGT...
>chromosome2
GCCGCTCGCTCGCTCGCCGCGCGCGCGCGCGGCTC...

And so on...

In my excel file I can concatonate columns 1 and 2 to give something which is close to what I want..eg >chromosome1GATAGATGAGAGAGAGGA...

But then I have to copy this into word and MANUALLY (it takes me AGES! lol) hit enter after chromosome1 to give me the correct arrangement for the input file DNAsp needs.

Does anyone know of a program that easily makes fasta files for software like DNAsp? Or any other way I can do this without having to sit and do it manually until my fingers fall of?

I am not exactly a computer whizz so I am sure someone can help me?

Hope so - many thanks in advance!!

xx

fasta nucleotide sequence format conversion • 9.5k views
ADD COMMENT
0
Entering edit mode

For future reference, if you want ">" characters to display properly in your questions, you need to indent each line with 4 spaces; otherwise the text appears as a blockquote.

ADD REPLY
2
Entering edit mode
13.3 years ago
Neilfws 49k

Here is how a bioinformatician would deal with this problem.

First, they would roll their eyes in despair at the inappropriate use of Excel and Word for storing/manipulating sequence data.

Next, they would export the data from Excel to a portable, plain text format which could be used by many programs on multiple platforms. They would probably export to CSV (comma separated values) format, so each column would look like this:

chromosome1,AGTGATGCGCGCTCGCGCTCGCCTCCGGCGCGCGT...
chromosome2,GCCGCTCGCTCGCTCGCCGCGCGCGCGCGCGGCTC...

Finally, they would write a short script in the language of their choice to reformat the CSV file as FASTA. They might also do a web search for the term "CSV to fasta", to see if anyone else has already solved the problem.

Their script might look something like this, which is written in Awk:

awk 'BEGIN {FS=","} {print ">"$1"\n"$2}' sequences.csv > sequences.fasta

Which would not be perfect, since the sequence parts of the file will all be on one line, not a maximum of 80 characters long as they should be. But most programs ignore that, so the bioinformatician might decide this "quick and dirty" solution is acceptable. Or, assuming that the header line is never more than 80 characters, they might use another small text-processing tool named fold, to wrap the lines containing sequence:

awk 'BEGIN {FS=","} {print ">"$1"\n"$2}' sequences.csv | fold -w 80 > sequences.fasta

They may use another language: Python, Perl, Ruby... They may also use existing libraries from that language (Bioperl, BioPython, BioRuby) to handle sequence reading and writing.

Not being a "computer whizz", it may be that none of this means much to you, so here's the take-home message. When you stop using tools inappropriately because those are the only tools that you know; and you stop storing data in proprietary, locked-up formats and start using more open, versatile formats (simple plain text), then a whole world of simple solutions to your problems opens up to you. A lot of bioinformatics involves processing text in one way or another and processing text is a breeze for people who know how to use a command line in Linux.

So - find such a person, or become one :)

ADD COMMENT
1
Entering edit mode
13.3 years ago
Swbarnes2 ★ 1.6k

You could probably do it with a perl one-liner, which I'm not very good at making. Do you have access to a computer with Perl, or Python, or something?

Did you ask the person who gave you the file to format it for you? They probably didn't type it by hand.

If you really have to do it in Excel, make your file look like this, with a line between every entry

>chr1   atctcta

>chr2   tatcgat

>chr3   atatatt

>chr4   tttagcta

You can make dummy entries that have names that will fit between your entries, like chr1a. Then a couple of columns over, The top cell equals A1, and the next cell down equals A2, and copy and paste that pattern down the column.

ADD COMMENT
0
Entering edit mode
9.8 years ago
mikaellfl • 0

This a very old post, but I actually had this problem, and solved it by using Excel so I will give my answer to anyone else with the same problem with name and sequence in 2 seperate columns. This is not difficult but requires a basic understanding of Excel, the data is not broken into 80character lines as FASTA should be, but I have had no problems with it.

Example of the setup needed to be converted.

Column1 Column2
name1   ATCGCGG
name2   CGTGA...
name3   CGTATG...
name4   CGTATGG

Solution

you already have column A and B.

C is the number of the rows starting with 1

D is "a" and "b" repeated loop, so all names get "a" and all sequnces get "b". Drag this column Down so it is minimum twice as long as your data columns.

E The 2 first rows (name and sequence) both gets the value "1" inputted manually. from the 3th cell and downwards use the formula =If(D3="a";E2+1;E2) this ensures that all names and sequnces gets a paired number.Drag this column Down so it is min. twice as long as your data columns.

C-D is guide columns and are only needed for the final formula to Work.

F this is the important column that puts your data in a fasta like format use the formula: =if(D1="a";">"&INDIRECT("A"&E1);INDIRECT("B"&E1)) in F1, Drag this column Down so it is Exactly twice as long as your data columns - you can adjust this when you reach the end.

The final Excel document looks like below. Finally you copy column F to column A in a new sheet and save the new sheet as Text MS-DOS, the final product is in fasta format and can be used as you please

  A         B        C   D    E      F
name1   sequence1    1   a    1    >name1
name2   sequence2    2   b    1    sequence1    
name3   sequence3    3   a    2    >name2
name4   sequence4    4   b    2    sequence2    
                     5   a    3    >name3
                     6   b    3    sequence3
                     7   a    4    >name4
                     8   b    4    sequence4
ADD COMMENT

Login before adding your answer.

Traffic: 2294 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6