5.3 years ago
Bioinfonext
Hi,
I need to generate a taxonomy text file in which the taxonomic ranks are separated by semicolons instead of spaces, while keeping the first space after the gene ID. My input looks like this:
AJQY01000137.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus
AJRA01000005.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus
AJRA01000158.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus
I need to have output like this:
AJQY01000137.1 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
AJRA01000005.1 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
AJRA01000158.1 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
This Perl one-liner could work:
perl -lane '{printf ((shift @F)." ");print join(";",@F) }' your_input_file
or with sed (GNU sed), skipping the first match and replacing from the second one onward:
sed -s 's/\s\+/;/2g' your_input_file
Thanks, but this also removed the blank lines between entries.
I do not want to remove those blank lines.
Use awk and, based on the value of NF, treat $1 and the rest of the fields differently. This way, you can retain blank lines while custom formatting other lines. Figuring out the awk code for yourself will be a good learning exercise.
Same question as in my earlier post: why is this question considered within scope when it deals with simple manipulation of text columns? Is it that the biological content of the text makes it relevant to bioinformatics?
By the way, consider this command:
awk '{print $1, $2";"$3";"$4";"$5";"$6";"$7}' input_file > output_file
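The command above assumes every line has exactly seven fields. A minimal sketch of the NF-based approach suggested earlier, which should cope with any number of ranks and pass blank lines through unchanged (format_taxonomy is a made-up helper name, not something from this thread):

```shell
# Hypothetical helper: reads the named files, or stdin if none are given.
format_taxonomy() {
    awk '
        NF == 0 { print; next }        # blank or whitespace-only line: keep as is
        {
            out = $1                   # gene ID stays separated by a space
            for (i = 2; i <= NF; i++)  # join the remaining fields with ";"
                out = out (i == 2 ? " " : ";") $i
            print out
        }
    ' "$@"
}
```

Usage would be something like format_taxonomy input_file > output_file. Note that this only looks at whitespace-separated fields, so a rank name containing a space would be split into two ranks.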
Thanks. I tried the same command after Ram's suggestion, but it outputs only the first line. Is there an issue with my input file, or does the command need to be modified to run over all the lines?
Thanks Bioinfonext
When I save the Excel sheet in tab-delimited format, the result looks odd and ^M characters have been inserted; the rows of the Excel sheet are not saved on separate lines.
The Excel sheet looks like this:
Hi @bioinfonext, as Mensur Dlakic suggested, this is not a problem specific to bioinformatics; it's a classic sorting problem and now, in addition, a classic Windows/Unix newline problem. Whatever editor you use to visualise the tab-delimited file doesn't seem to handle the Windows carriage return very well - see this post on Stack Overflow. It's also likely you stripped the newline character at some stage in the process, hence you see everything on one line.
These are classic beginner's errors that all of us have made. You can solve them easily with your own web search, and I guarantee this will prepare you for the future.
Thanks for all your help.
This command works for me:
Thanks again
Bioinfonext
We are not looking at your computer, @Bioinfonext. Please use a package to read this data into R if you're having difficulties working on the content - these comments are just asking us for a lot of handholding.
You'd be better off copy-pasting from Excel to a plain text application (TextWrangler/Sublime Text/Atom/Notepad++) than using Excel to save the document.
You can use one of the above tools to open the document and try and change line endings, invalid characters, etc.
We try to be as lenient as possible, but we are aware that drawing a well-defined line is a problem. You are welcome to discuss and offer solutions on our Slack channel.