10.5 years ago
esterbuiate
I've been trying to merge separate fasta files into a single file. I use cat *.fasta > outputname
but every time I do it I lose some of the headers, which is puzzling.
Example:
File1
>scaffold001
AGTCATGAT
File2
>scaffold004
AGTATAAAA
after using cat, output is:
New file
>scaffold001
AGTCATGAT
AGTATAAAA
There is no pattern: some scaffold headers appear, some don't. I have no duplicate scaffolds to merge, so that's not the case. I double checked and basically all names have the same format, with changes only in the numbers; there are no spaces or anything.
I have no idea what could be going on or what else I can use to concatenate the files.
Thanks!
The command should work. Maybe one of your files is lacking a newline (\n) after the last sequence. Then, when you concatenate that file with the next one, the header of the next file gets attached to the last line of the previous file. Just guessing. Can you please show us the output of:
This is the output of
file *.fasta
I'm curious about the 'with very long lines' output. Apparently, a line has to be > 300 characters before that output is generated: http://superuser.com/questions/91660/how-long-is-long-for-the-unix-file-command
Do you really have lines longer than 300 characters in that file? Also, if you try to open it in emacs do you get weird characters like '^@', or '^M'?
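If opening the file in emacs is inconvenient, the same two checks can be done from the shell. This is only a rough sketch: the helper names count_cr_lines and longest_line are made up for illustration, and the 300-character threshold is just the figure from the linked superuser discussion.

```shell
# count_cr_lines FILE (hypothetical helper)
# Print how many lines contain a carriage return, i.e. what emacs shows as ^M.
count_cr_lines() {
    cr=$(printf '\r')      # a literal carriage-return character
    grep -c "$cr" "$1"     # grep -c prints 0 (and exits 1) when nothing matches
}

# longest_line FILE (hypothetical helper)
# Print the length of the longest line, to compare against the ~300 mark.
longest_line() {
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}
```

For example, `count_cr_lines scaffolds.fasta` and `longest_line scaffolds.fasta`, where scaffolds.fasta stands in for one of your own files. A non-zero count from the first would point to Windows-style line endings.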
Yes, as I have DNA sequencing data, my lines are huge! The problem was in the fasta file headers though. I finally saw the pattern yesterday. After any header containing a hyphen, the next header would go missing. Simple, but I just didn't catch it until I posted the question and started looking at it. Thanks for the input!
I found out the problem. Some of my fasta headers had a dash in the name ( - ) and that is what made cat behave weirdly. Thanks for the input though!
Probably the dash was surrounded by spaces. If you have filenames with spaces, enclose the name in single or double quotes, like
cat "Escherichia coli - genome.fa"
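For anyone landing here with the same symptom, here is a minimal sketch that covers both guesses from this thread: it quotes every filename, and it adds a newline after any file whose last line is not terminated, so the next header cannot get glued onto the previous sequence. The function name merge_fasta and the output name merged.fa are invented for the example, and it is not a claim about what actually went wrong for the original poster.

```shell
# merge_fasta OUTPUT INPUT... (hypothetical helper)
# Concatenate the INPUT files into OUTPUT, making sure each one ends with a newline.
# OUTPUT must not be one of the INPUT files.
merge_fasta() {
    out=$1
    shift
    for f in "$@"; do              # "$@" keeps names with spaces or dashes intact
        cat "$f"
        # $(tail -c 1 ...) is empty when the last byte is a newline (or the file
        # is empty), so a non-empty result means the final newline is missing.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo
        fi
    done > "$out"
}
```

Usage would be something like `merge_fasta merged.fa *.fasta`. Giving the output a different extension than the inputs keeps it from being matched by the *.fasta glob on a second run.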