Question

How to replace all spaces except the first with semicolons

5.5 years ago

Bioinfonext ▴ 470

Hi,

I need to generate a taxonomy text file in which the taxonomic ranks are separated by semicolons instead of spaces, while keeping the first run of spaces after the gene ID.

AJQY01000137.1  Bacteria     Firmicutes  Bacilli     Lactobacillales     Streptococcaceae    Streptococcus   
AJRA01000005.1  Bacteria     Firmicutes  Bacilli     Lactobacillales     Streptococcaceae    Streptococcus   
AJRA01000158.1  Bacteria     Firmicutes  Bacilli     Lactobacillales     Streptococcaceae    Streptococcus

I need to have output like this:

AJQY01000137.1  Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus   
AJRA01000005.1  Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus   
AJRA01000158.1  Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus

bash linux • 1.5k views

ADD COMMENT • link 5.5 years ago by Bioinfonext ▴ 470

This Perl one-liner could work:

perl -lane '{printf ((shift @F)." ");print join(";",@F) }' your_input_file

Or with GNU sed, skipping the first match:

sed -s 's/\s\+/;/2g' your_input_file

ADD REPLY • link 5.5 years ago by microfuge ★ 2.0k

Thanks,

but this also removed the blank lines between the lines:

perl -lane '{printf ((shift @F)." ");print join(";",@F) }'

I do not want to remove the blank lines.

ADD REPLY • link 5.5 years ago by Bioinfonext ▴ 470

Use awk and based on the value of NF, treat $1 and the rest of the fields differently. This way, you can retain blank lines while custom formatting other lines. Figuring out the awk code for yourself will be a good learning exercise.
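One possible shape of the NF-based approach is sketched below. It is only an illustration, not the code Ram had in mind: the function name format_taxonomy is hypothetical, fields are assumed to be separated by runs of spaces or tabs, and the ID is followed by a single space rather than the two shown in the question.

```shell
# Hypothetical helper: reads taxonomy lines on stdin, writes formatted lines to stdout.
format_taxonomy() {
  awk '
    # Blank (or whitespace-only) line: pass it through untouched.
    NF == 0 { print; next }
    {
      # Keep the first field, then join the remaining fields with ";".
      out = $1 " "
      for (i = 2; i <= NF; i++) out = out $i (i < NF ? ";" : "")
      print out
    }'
}
```

It could be run as format_taxonomy < input_file > output_file. Because awk splits on any whitespace by default, trailing spaces at the end of a line do not produce a trailing semicolon.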

ADD REPLY • link 5.5 years ago by Ram 44k

Same question as in my earlier post: why is this question considered within scope when it deals with simple manipulation of text columns? Is it that the biological content of the text makes it relevant to bioinformatics?

By the way, consider this command:

awk '{print $1, $2";"$3";"$4";"$5";"$6";"$7}' input_file > output_file

ADD REPLY • link 5.5 years ago by Mensur Dlakic ★ 28k

Thanks. I was trying to type the same command after Ram's suggestion, but it gives only the first line as output. Is there an issue with my input file, or does the command need to be modified to run over all the lines?

ABYV02000002.1 Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobrevibacter

Thanks Bioinfonext

ADD REPLY • link 5.5 years ago by Bioinfonext ▴ 470

When I save the Excel sheet in tab-delimited format, the result looks odd and a ^M character has somehow been inserted: the rows of the sheet are not saved on separate lines.

ABYV02000002.1  Archaea  Euryarchaeota   Methanobacteria         Methanobacteriales      Methanobacteriaceae     Methanobrevibacter      Methanobrevibacter smithii DSM 2374^MABYV02000006.1    Archaea  Euryarchaeota   Methanobacteria         Methanobacteriales      Methanobacteriaceae     Methanobrevibacter      Methanobrevibacter smithii DSM 2374^MABYW01000005.1    Archaea  Euryarchaeota   Methanobacteria         Methanobacteriales      Methanobacteriaceae     Methanobrevibacter      Methanobrevibacter smithii DSM 2375^MABYW01000007.1

ADD REPLY • link 5.5 years ago by Bioinfonext ▴ 470

The Excel sheet looks like this:

ABYV02000002.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2374
ABYV02000006.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2374
ABYW01000005.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2375
ABYW01000007.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2375
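For a file that really is tab-delimited, as the Excel export suggests, splitting on tabs instead of on any whitespace would keep a multi-word last column such as the species name in one piece. The sketch below assumes exactly that layout; the function name format_taxonomy_tsv is hypothetical, and the choice to keep a tab after the ID and to skip empty cells is an assumption rather than something stated in the thread.

```shell
# Hypothetical helper: reads tab-delimited taxonomy lines on stdin.
format_taxonomy_tsv() {
  awk -F '\t' '
    # Empty line: pass it through untouched.
    NF == 0 { print; next }
    {
      out = $1 "\t"   # keep the ID and one tab after it
      sep = ""
      for (i = 2; i <= NF; i++) {
        if ($i == "") continue   # skip empty cells (assumption)
        out = out sep $i
        sep = ";"
      }
      print out
    }'
}
```

It could be run as format_taxonomy_tsv < input_file > output_file, after the line endings have been converted to Unix newlines.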

ADD REPLY • link 5.5 years ago by Bioinfonext ▴ 470

Hi @bioinfonext, as Mensur Dlakic suggested, this is not a problem specific to bioinformatics. It's a classic text manipulation problem and now, in addition, a classic Windows/Unix newline problem. Whatever editor you use to view the tab-delimited file doesn't seem to handle the carriage return very well - see this post on Stack Overflow. It's also likely that you stripped the newline character at some stage in the process, which is why you see everything on one line.

These are classic beginner's errors that all of us have made. You can solve them easily with a web search of your own, and I guarantee this will prepare you for the future.
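As a rough sketch of one way to normalise such line endings from the shell: the function below strips a carriage return at the end of each line (the Windows CR+LF case) and turns any remaining lone carriage returns into newlines (the case shown above, where ^M is the only line separator). The name normalize_newlines is hypothetical.

```shell
# Hypothetical helper: converts CR+LF and lone CR line endings on stdin to LF.
normalize_newlines() {
  awk '{
    sub(/\r$/, "")      # CR+LF: drop the CR left at the end of the line
    gsub(/\r/, "\n")    # lone CR: treat each one as a line break
    print
  }'
}
```

It could be run as normalize_newlines < input_file > output_file. Handling CR+LF first matters: blindly replacing every carriage return with a newline in a CR+LF file would insert an empty line after every record.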

ADD REPLY • link 5.5 years ago by Carambakaracho ★ 3.3k

Thanks for all your help.

This command works for me:

cat final.taxonomy.txt | tr "\r" "\n" > final.taxonomy2.txt

Thanks again
Bioinfonext

ADD REPLY • link 5.5 years ago by Bioinfonext ▴ 470

We are not looking at your computer, @Bioinfonext. Please use a package to read this data into R if you're having difficulties working on the content - these comments are just asking us for a lot of handholding.

ADD REPLY • link 5.5 years ago by Ram 44k

You'd be better off copy-pasting from Excel to a plain text application (TextWrangler/Sublime Text/Atom/Notepad++) than using Excel to save the document.

You can use one of the above tools to open the document and try and change line endings, invalid characters, etc.

ADD REPLY • link 5.5 years ago by Ram 44k

We try to be as lenient as possible, but we are aware that drawing a well-defined line is a problem. You are welcome to discuss and offer solutions on our Slack channel.

ADD REPLY • link 5.5 years ago by Ram 44k