help in converting a table
5
0
Entering edit mode
6.6 years ago
Chris ▴ 30

Hi all,

I have a table as the example below which I need to convert to a table where the first column is the name of genes followed by the expression values for that particular gene.

hsa-mir-1229    1.197748
hsa-mir-3163    0.000000
hsa-mir-3175    0.000000
hsa-mir-3189    0.149719
hsa-mir-3619    3.144089
hsa-mir-3646    0.149719
hsa-mir-3658    0.000000
hsa-mir-3662    1.497185
hsa-mir-3928    1.347467
hsa-mir-4262    0.000000
hsa-mir-1229    1.000000
hsa-mir-3163    0.800000
hsa-mir-3175    0.030000
hsa-mir-3189    0.149719
hsa-mir-3619    3.144089
hsa-mir-3646    0.449719
hsa-mir-3658    0.490719
hsa-mir-3662    0.497185
hsa-mir-4262    0.700000
----------
hsa-mir-1229    1.197748   1.000000
hsa-mir-3163    0.000000   0.800000
hsa-mir-3175    0.000000   0.030000
hsa-mir-3189    0.149719   0.149719
hsa-mir-3619    3.144089   3.144089
hsa-mir-3646    0.149719   0.449719
hsa-mir-3658    0.000000   0.490719
hsa-mir-3662    1.497185   0.497185
hsa-mir-3928    1.347467     NA
hsa-mir-4262    0.000000   0.700000

Any help is highly appreciated.

Thank you

RNA-seq • 2.3k views
ADD COMMENT
0
Entering edit mode

Thank you for your answer, I think I did not explain myself properly.

This is what I get:

hsa-mir-1229 1.197748hsa-mir-3163 0.000000hsa-mir-3175 0.000000hsa-mir-3189 0.149719hsa-mir-3619 3.144089hsa-mir-3646 0.149719hsa-mir-3658 0.000000hsa-mir-3662 1.497185hsa-mir-3928 1.347467

as an output rather to what I describe earlier.

Again thank you so much for your help

ADD REPLY
0
Entering edit mode

I 've tried excel and still I am getting the same format. All genes and expressions are in one line.

ADD REPLY
0
Entering edit mode

Please do not add answers unless you're answering your own top level question. These are responses to other people's comments, which should be added as comment replies. If appropriate, your post should be edited. See: How to add a comment reply and How to edit your post

ADD REPLY
1
Entering edit mode
6.6 years ago
st.ph.n ★ 2.7k

Here's a quick python solution. I'm assuming your table is tab-delimited, and it will output tab-delimited.

#!/usr/bin/env python
import sys
from collections import defaultdict

with open(sys.argv[1], 'r') as f:
    genes = defaultdict(list)
    for line in f:
        genes[line.strip().split('\t')[0]].append(line.strip().split('\t')[1])

with open(sys.argv[2], 'w') as out:
    for i in genes:
        out.write(i + '\t' + '\t'.join(genes[i]))

Save code as format_table.py, run as python format_table.py input.txt output.txt

ADD COMMENT
2
Entering edit mode

Missing a line break in the out.write line. Change the last line to:

out.write(i + '\t' + '\t'.join(genes[i]) + '\n')
ADD REPLY
0
Entering edit mode

Yep, using write() in Python requires adding line terminators. Easy to miss.

ADD REPLY
0
Entering edit mode

I'm more use to 2.X syntax so I wasn't aware. Thanks.

ADD REPLY
0
Entering edit mode

thank you so much for your quick answer. I don't have any experience in python and I am willing to learn it. When I run your code somehow all the data in the output are place in one line like that:

hsa-mir-4649    0.449156    0.000000    0.808519    0.000000    0.543301    0.000000    0.000000    0.000000    0.000000    0.252087    0.000000    0.000000    0.000000    0.000000    0.245496    0.000000    0.000000    0.090737    0.000000    0.082297    0.000000    0.000000    0.000000    0.671861    0.207298    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.389510    0.428801    0.000000    0.000000    0.000000    0.000000    0.264927    0.000000    0.000000    0.000000    0.376946    0.316828    0.086942    0.396281    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.259527    0.107388    0.271481    0.000000    0.000000    0.295719    0.000000    0.177120    0.000000    0.175829    0.000000    0.477968    0.201986    0.482697    0.000000    0.810934    0.000000    0.000000    0.152771    0.349035    0.216139    0.000000    0.000000    0.310665    0.000000    0.000000    0.000000    0.997040    0.000000    0.000000    0.000000    0.000000    0.000000    0.140049    0.000000    0.000000    0.000000    0.000000    0.085340    0.000000    0.000000    0.000000    0.000000    0.202206    0.000000    0.000000    0.426247    0.111463    0.136478    0.103880    0.000000    0.518896    0.000000    0.000000    0.000000    0.219701    0.187424    0.000000    0.282888    0.190998    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.773226    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.392962    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.231889    0.000000    0.233245    0.000000    0.432839    0.363025    0.291168    0.000000    0.000000    0.000000    0.000000    0.155571    0.231324    0.000000    0.112256    0.212441    0.000000    0.000000    0.000000    0.000000    0.526724    0.000000    0.000000    0.494783    0.000000    0.335871    0.646818    0.000000    0.881312    0.000000    0.000000    0.157905    0.000000    0.186378    0.189747    0.513107    0.304384    0.450149    0.000000    0.000000    0.147183    0.000000    0.000000    0.402776    0.000000    0.123698    0.293994    0.000000    0.000000    0.000000    1.520795    0.000000    0.150698    0.349743    0.000000    0.000000    0.412048    0.000000    0.000000    0.000000    0.126304    0.000000    0.146505    0.000000    0.000000    0.000000    0.590510    0.000000    0.000000    0.692906    0.000000    0.000000    0.080122    0.000000    0.339625    0.135295    0.000000    0.000000    0.000000    0.000000    0.226271    0.238614    0.675249    0.000000    0.000000    0.109565    1.215094    0.122594    0.000000    0.754284    0.177904    0.000000    0.389059    0.170698    0.085960    0.000000    0.000000    0.000000    0.136473    0.299894    0.166363    0.135930    0.369595    0.617371    0.273669    0.000000    0.000000    0.000000    0.278574    0.000000    0.000000    0.000000    0.467559    0.052991    0.000000    0.073191    0.000000    0.000000    0.755511    0.000000    0.393010    0.000000    0.000000    0.123346    0.182877    0.305522    0.310795    1.063833    0.396389    0.447259    0.000000    0.114615    0.199650    0.000000    0.141586    0.362416    0.000000    0.000000    0.919735    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.360965    0.000000    0.165587    0.000000    0.360406    1.108463    0.175386    0.000000    0.565773    0.370455    0.000000    0.000000    0.173638    0.000000    0.000000    0.000000    0.244640    0.000000    0.000000    0.000000    0.178119    0.139975    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.161680    0.000000    0.000000    0.387075    0.367204    0.000000    0.528403    0.000000    0.000000    0.000000    0.231294    0.000000    0.112177    0.421011    0.000000    0.083285    0.522824    0.000000    0.000000    0.187261    0.581464    0.756509    0.000000    0.000000    1.517789    0.000000    1.024637    0.000000    0.201494    0.198940    0.000000    0.000000    0.061106    0.237552    0.233458    0.000000    0.000000    0.000000    0.000000    0.478040    0.000000    0.000000    0.166597    0.552751    0.400058    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.533696    0.000000    0.370002    0.000000    0.000000    0.000000    0.000000    0.255239    0.000000    0.000000    0.000000    0.275042    0.000000    0.000000    0.235675    0.000000    0.000000    0.578777    0.000000    0.429569    0.357804    0.414087    0.375918    0.000000    0.000000    0.170525    0.000000    0.000000    0.397967    0.000000    0.234764    0.199694    0.214717    0.000000    0.226096    0.000000    0.000000    0.000000    0.000000    0.000000    0.342193    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.136683    0.000000    0.000000    0.382529    0.000000    0.000000    0.159683    0.214213    0.000000    0.160232    0.000000hsa-mir-4500    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
ADD REPLY
0
Entering edit mode

How many times does hsa-mir-4649 appear in your input file? This should equal (+1) the number of columns in the output for that gene id.

grep -e 'hsa-mir-4649' | wc -l

How many genes in your input file? How many lines in your output file? These should be identical.

cut -f 1 input.txt | sort | uniq | wc -l 

wc -l output.txt
ADD REPLY
0
Entering edit mode

Sometimes new line characters aren't interpreted properly, depending on how you're opening your file, e.g. notepad. Trying using the command-line, less output.txt, or open in Excel if it is not too large.

ADD REPLY
1
Entering edit mode
6.6 years ago
avek ▴ 10

try

#!/bin/bash

awk '{arr[$1]=arr[$1]"\t"$2} END {for(g in arr) printf "%s\t%s\n", g, arr[g]}' ~/Desktop/myfile.txt

where myfile.txt is your file actually. You will get something like:

hsa-mir-3646        0.149719    0.449719
hsa-mir-3619        3.144089    3.144089
hsa-mir-3163        0.000000    0.800000
hsa-mir-3928        1.347467
hsa-mir-3658        0.000000    0.490719
hsa-mir-3175        0.000000    0.030000
hsa-mir-4262        0.000000    0.700000
hsa-mir-3189        0.149719    0.149719
hsa-mir-3662        1.497185    0.497185
hsa-mir-1229        1.197748    1.000000
ADD COMMENT
2
Entering edit mode
6.6 years ago
$ datamash -s  -g 1 collapse 2 < test.txt | awk -F "," -v OFS="," '{ if(!$2) $2 = "NA" }; 1'| sed 's/,/\t/g'
hsa-mir-1229    1.197748    1.000000
hsa-mir-3163    0.000000    0.800000
hsa-mir-3175    0.000000    0.030000
hsa-mir-3189    0.149719    0.149719
hsa-mir-3619    3.144089    3.144089
hsa-mir-3646    0.149719    0.449719
hsa-mir-3658    0.000000    0.490719
hsa-mir-3662    1.497185    0.497185
hsa-mir-3928    1.347467    NA
hsa-mir-4262    0.000000    0.700000
ADD COMMENT
0
Entering edit mode
6.6 years ago

The other answers to this question do not add NA fields where there are missing values. If you want NA values in your table to "fill in the gaps", the script will have to do a little more work.

Here's an example of how that might be done. There are other ways to do this, of course. This assumes that your data are tab-delimited.

#!/usr/bin/env python

import sys

d = {}
m = 0

for line in sys.stdin:
    (k,v) = line.strip().split('\t')
    if k not in d:
        d[k] = []
    d[k].append(v)
    l = len(d[k])
    if l > m:
        m = l

for k in d.keys():
    for i in range(len(d[k]), m):
        d[k].append('NA')
    sys.stdout.write("%s\t%s\n" % (k, '\t'.join(d[k])))

Here's an example of how you could use this script. Let's say this script is called collapse.py:

$ ./collapse.py < expression.txt
hsa-mir-1229    1.197748        1.000000
hsa-mir-3163    0.000000        0.800000
hsa-mir-3175    0.000000        0.030000
hsa-mir-3189    0.149719        0.149719
hsa-mir-3619    3.144089        3.144089
hsa-mir-3646    0.149719        0.449719
hsa-mir-3658    0.000000        0.490719
hsa-mir-3662    1.497185        0.497185
hsa-mir-3928    1.347467        NA
hsa-mir-4262    0.000000        0.700000

This script assumes that you don't need the hsa-mir-* rows printed in any particular order. If you need ordering, you can either sort the output with Unix sort or use an ordered dictionary in Python. Feel free to clarify your question, if so.

ADD COMMENT
0
Entering edit mode
6.6 years ago
Chris ▴ 30

Hi all, thank you so much for your valuable answers.

ADD COMMENT
0
Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode

Also, this is not an answer. After you provide feedback on the other answers, please delete this post:

  • Click on moderate (link next to the Add Comment button)
  • Choose Delete Post
  • Click on the blue Submit button.
ADD REPLY

Login before adding your answer.

Traffic: 1837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6