Yeah, I alluded to this in a comment below. I used a cat-and-pipe solution, though, in case the OP makes a mistake in the command: editing in place with sed is risky. At least this way you get to see what it will do to your file before you commit to the change.
Linearizing the input is not necessary. You can simply awk for lines that are sequence headers — lines which start with > — and those lines which are not headers (sequences). You can then modify the header lines by whatever logic you want and print out the modified header. In the case of non-header lines — sequences — you just print out the line as-is.
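A minimal sketch of that header/sequence split. It assumes headers look like >gene_1|abc|xyz (a made-up layout) and that you want to keep only the text before the first |; the file names are throwaway placeholders:

```shell
# Hypothetical two-record input in a scratch directory; the header layout is an assumption.
tmp=$(mktemp -d)
printf '>gene_1|abc|xyz\nACGT\nTTGA\n>gene_2|def|uvw\nGGCC\n' > "$tmp/demo.fa"

# Header lines (starting with ">") are edited; every line, edited or not, is then printed.
awk '/^>/ { sub(/\|.*/, "") } { print }' "$tmp/demo.fa" > "$tmp/demo.out.fa"

cat "$tmp/demo.out.fa"
```

Sequence lines never match /^>/, so they pass through untouched no matter how many lines a record spans.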
As @genomax2 said, "Fasta format is not a strictly defined standard".
So you need to check FASTA format very carefully.
Here's a made-up but plausible FASTA file. Note that the last record, gene_6, is appended to
the sequence of gene_5; this often happens when you cat two FASTA files and the
first one does not end with a newline character (\n).
Similarly, concatenating files can leave a space in
front of >gene_5.
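Both problems can be checked for before editing anything. A rough sketch, using two made-up files (a.fa deliberately lacks a trailing newline); it flags suspicious lines and makes no attempt to repair them:

```shell
# Build two hypothetical inputs and concatenate them the risky way.
tmp=$(mktemp -d)
printf '>gene_5\nACGT' > "$tmp/a.fa"
printf '>gene_6\nTTGA\n' > "$tmp/b.fa"
cat "$tmp/a.fa" "$tmp/b.fa" > "$tmp/merged.fa"

# Check 1: does a.fa end with a newline? tail -c 1 yields the last byte,
# wc -l counts the newlines in it, so 0 means "no trailing newline".
last_nl=$(tail -c 1 "$tmp/a.fa" | wc -l | tr -d ' ')

# Check 2: report lines whose first ">" is not in column 1, i.e. a header
# glued onto a sequence or a header preceded by whitespace.
suspect=$(awk 'index($0, ">") > 1 { print NR ": " $0 }' "$tmp/merged.fa")

echo "trailing newlines at end of a.fa: $last_nl"
echo "suspect lines in merged.fa: $suspect"
```

On this input the second check reports line 2, where gene_6's header has been glued onto gene_5's sequence.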
You can just use awk directly. No need to use cat:
$ awk '{ if ($0 ~ /^>/) { n = split($0, a, "|"); printf("%s\n", a[1]); } else { print $0; } }' in.fa > out.fa
Really no need to use slow parsers, either, until you do something more complex. This is a very basic text-editing exercise that all informaticists should be able to do.
You don't need cat with the sed solutions either if you want to use in-place editing. My answer could be rewritten as sed -i 's/|.*//' filename, but I figured simply printing the output stops the OP from making mistakes and irreversibly overwriting their files until they are sure what it will do ;)
Using sed works too, definitely. I tend to shy away from sed because of differences between BSD and GNU versions (but that's solved on OS X with Homebrew, I guess).
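One way to hedge against both worries, sketched on the assumption that leaving a backup file next to the original is acceptable: preview the change with diff, then use -i with the suffix attached directly, a form that both GNU and BSD sed accept. in.fa is a throwaway demo file:

```shell
tmp=$(mktemp -d)
printf '>gene_1|abc\nACGT\n' > "$tmp/in.fa"

# Preview: show what would change without touching the file.
# diff exits non-zero when its inputs differ, hence the "|| true".
sed 's/|.*//' "$tmp/in.fa" | diff "$tmp/in.fa" - || true

# Commit: edit in place, keeping the original as in.fa.bak.
sed -i.bak 's/|.*//' "$tmp/in.fa"

cat "$tmp/in.fa"
```

If the result is wrong, moving in.fa.bak back over in.fa restores the original.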
I actually find a cat - sed - redirect, while maybe not computationally efficient because of the extra process, to be a bit more elegant because you don't have to create extra messy intermediate files and you can test/build the command as you go.
True. I'm usually building pipelines as I go, and I start off with cat file. Then I hit up to go to the previous command, and I append a sed to it. It is easier building that way.
But once I build it the first time, I optimize it for later re-use. At this point, all the unnecessary cats are thrown out.
I just define a function in bashrc that strings head and tail together to inspect both ends of a file in case something horrible has happened somewhere in the middle :p
Even simpler, just 2 calls one after the other, but your way would totally work too - I wrapped it in a slightly more complicated if so that I could pass a number of lines to print if I wanted more than default 10
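ends() {
    # Show both ends of a file; an optional second argument sets the line count.
    if [ -z "$1" ]; then
        echo "Usage: ends FILE [LINES]" >&2
        return 1
    elif [ -z "$2" ]; then
        # No count given: head and tail each default to 10 lines.
        head "$1"
        tail "$1"
    else
        head -n "$2" "$1"
        tail -n "$2" "$1"
    fi
}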
** FYI: it's not written as a one-liner in my bashrc and I've just copied it as is, so you'd need to add some ";" to use it as a one-liner.
Please just use a proper FASTA parser. awk/grep/sed cannot parse FASTA reliably. Why do people even bother writing FASTA/FASTQ parsers if no one is going to use them? Why do people bother making custom bioinformatic file formats if people are going to treat them like a weird CSV? Convert all your data to CSV if you want CSV. But if you want to stick with FASTA - for whatever reason - use a FASTA parsing API like BioPython/Perl/Ruby/etc.
Everything else is loaded with assumptions about the rest of your data which we cannot possibly know.
I cleaned up the unpleasant off-topic discussion which originated from this reply.
A summary of the useful bits of replies on this is below. I apologise for any context that has been lost by this operation.
jrj.healey
For complex input data, or situations where you don't know your own
files that well, I very much agree. However, for working with files on
the command line without having to install Biopython
(which, for example, can be a right pain in the ass to install), I think
knowing the necessary one-liners with sed/cut/grep/awk and so on is an
absolute requirement...
the point he's getting at is that you shouldn't necessarily treat a
FASTA file as something that can be coerced into a simple
delimited text file when parsers already exist, which is what a
couple of the suggestions in this thread do.
genomax2
By using a standard file parser one is able to account for even edge
cases and get the desired output. You are not able to post your entire
file (due to space constraints) and the command line solutions people
suggest consider only those snippets. Fasta format is not a strictly
defined standard so if your file does not follow the same exact
structure throughout then you could end up with silent corruption of
data that you may not discover until some other downstream tool throws
an error.
Ram
I'd use cut or awk or sed only after ensuring all known edge cases,
and even then I'm risking unknown/unpredictable edge cases. An
actively maintained tool would bypass that.
Why one would use cut or awk with CSV files (without running a few
checks beforehand) when they are insensitive to quoted commas is
beyond me :) Also, encoding, quoted new lines, vertical tabs, etc.
Takeaway: Not even CSV files are as simple as they seem.
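The quoted-comma problem is easy to reproduce. A small illustration with one made-up CSV row (the data is purely hypothetical):

```shell
# One made-up CSV row whose first field contains a quoted comma.
row='"Smith, J",42'

# cut splits on every comma, quoted or not, so its "field 2" is a fragment of the name.
second=$(printf '%s\n' "$row" | cut -d, -f2)

printf 'cut thinks field 2 is: [%s]\n' "$second"
```

Here cut returns the tail end of the quoted name rather than 42, and it does so without any warning.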
John
That's OK Ram - i'm not high value i'm just chatty
Thank you for keeping Biostars a nice place for all users!
Hi ####
It's perfectly possible that the discussion in this thread didn't make you happy, but deleting the thread isn't the right way to cope with that. I've reversed that now. There are many interesting ideas and solutions, so there is no reason to wipe this all away "forever".
Cheers, Wouter
Second ID is not gene_1. Do you want
abc
instead?
No, in this case I want the first ID, i.e. gene_1
Why'd you say "second id" in the question then?