Converting 1D arrays to 3D arrays
2
1
Entering edit mode
10.4 years ago
ruchiksy ▴ 50

I have generated a BED like tab delimited file with intersections with the Gencode v.19 annotated gtf and splice junction files from the Human BodyMap 2.0. I am in the process of completing a script that will allow me to create weighted venn diagrams of this dataset. My problem is that I have 1D arrays and I need to convert them to 3D arrays. I am quite new to this array and plotting business. I have pasted my code below, please advice me on how to proceed. I have also included the traceback call.

import matplotlib.pyplot as plt
import numpy as np
from matplotlib_venn import venn3, venn3_circles
import os
import sys

input_file = "/home/ruchik/bodyMap_Data/bodyMap_Files/sample.txt"
print len(sys.argv)
if len(sys.argv) < 2:
        file_name = raw_input('Enter file path: ')
else:
        print sys.argv[1]
col1_idx = [7]
col2_idx = [8]

print >> sys.stderr,'File is {file} and used columns are {col1} and {col2}.'.format (file=input_file, col1=col1_idx, col2=col2_idx)

## Open the  file and read each line
## The 1st line: (col 7th to ith) is stored into a array or list named 'header_all'
## The other lines: 7th (aka col0_idx) to ith column into an array named 'data_all'

f = open("/home/ruchik/bodyMap_Data/bodyMap_Files/sample.txt")
fh = open(input_file, "r")
header = fh.readlines()
data = fh.readlines()

print "Header: "
#print header

print "Data: "
for j in range(len(data)):
        data[j] = data[j].split("\t")
        for i in range(len(data[j])):
                data[j][i] = int(data[j][i])
fh.close()

data = np.array(data, dtype=np.int32)

d = np.reshape([data[:, 0], data[:, 1], np.reshape(np.all(data[:, 2:5], 1))], dtype=np.int32).transpose()

op = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
count = [0, ] * len(op)
for i in range(len(op)):
        for j in range(len(d)):
                if list(d[j]) == op[i]:
                        count[i] += 1

for k in range(len(op)):
        print str(op[k]) + ': ' + str(count[k])

## Pipe the data and create the venn diagrams

plt.figure(figsize=(4,4))
v = venn3(subsets=count, set_labels = ('Introns', 'Heart', 'Kidney'))
v.get_patch_by_id('100').set_alpha(1.0)
v.get_patch_by_id('100').set_color('white')
v.get_label_by_id('100').set_text('Unknown')
v.get_label_by_id('A').set_text('Set "A"')
plt.show()

Traceback:

Traceback (most recent call last):
  File "make_venn.py", line 39, in <module>
    d = np.reshape([data[:, 0], data[:, 1], np.reshape(np.all(data[:, 2:5], 1))], dtype=np.int32).transpose()
IndexError: too many indices

Thank you for your time!

EDIT

Input File:

chrom    start    end    name    score    strand    intron    Heart    Kidney    Liver  Lung
chr10    100008748    100010821    .    .    -    1    1    1    1    1
chr10    100010933    100011322    .    .    -    1    0    1    0    1
chr10    100011459    100012109    .    .    -    1    1    1    1    1
chr10    100011959    100015344    .    .    +    1    0    0    0    0
chr10    100012225    100013309    .    .    -    1    0    1    0    1
chr10    100013553    100015333    .    .    -    1    0    1    1    1
chr10    100015496    100016536    .    .    -    1    1    1    1    1
chr10    100016704    100017406    .    .    -    1    0    1    0    1
chr10    100017561    100017737    .    .    -    1    0    1    1    1

This file was generated by intersecting the Gencode V.19 GTF file with the splice junctions of each tissue. The goal is to create a script which will take this file as input, keep Col. 7 constant take another tissue and the union of other three tissues to create a three way venn diagram since the python library can only take three files as input unlike R where multiple inputs can be used. This file has close to 420k records which is why I thought that arrays would come useful but now that I think about it, I can use dictionaries for this purpose.

The Venn library I am using is Matplotlib-venn.

matplotlib python arrays • 10k views
ADD COMMENT
2
Entering edit mode

I read as far as "file_name = raw_input('Enter file path: ')" and got bored.

Can't see what you mean by converting a 1-D array to a 3d array. A BED file is one-dimensional, you could make it 2d by flagging intersections (rows are BED, columns are GTF), but I don't understand where the 3d array is expected.

Please explain what you're trying to do, and we can suggest a solution.

ADD REPLY
0
Entering edit mode

I'm thinking OP means a 2D array with 3 elements each. Not sure if OP is savvy with array basics.

ADD REPLY
0
Entering edit mode

I added the example of my input file and a brief description of what I am trying to achieve.

ADD REPLY
0
Entering edit mode

Ruchik, this info is purely FYI.

1D array looks like this: [1,2,3,4,5,6]

2D array looks like a matrix:

 [ [1,2],
   [3,4],
   [5,6] ]

3D array is a list of matrices. Imagine 2 layers of the 2D matrix shown above. Another example here: http://www.mathworks.com/help/matlab/math/ch_data_struct4.gif

ADD REPLY
0
Entering edit mode

Hello Ram,

I am a wet lab person transitioning to Bioinformatics and so I don't have a full understanding of Arrays. I also thought about my problem a little bit and it looks like I don't need to have arrays to do my job. I will update on what I am trying to achieve when I get to the lab and so you can have a better understanding of what I am trying to do.

ADD REPLY
1
Entering edit mode

I understand, Ruchik. That's why I gave you some basic info on arrays :-) Welcome to the wonderful world of Bioinformatics!

Also, just noticed a missing ] here:

d = np.reshape([data[:, 0], data[:, 1], ...

It should be so for proper syntax:

d = np.reshape([data[:, 0], data[:, 1]], ...

I'm assuming you're trying to slice the first two columns out and reshape it. Also, shouldn't you be working with the data_all array in the code above?

ADD REPLY
0
Entering edit mode

Hello Ram,

I am a wet lab person transitioning to Bioinformatics and so I don't have a full understanding of Arrays. I also thought about my problem a little bit and it looks like I don't need to have arrays to do my job. I will update on what I am trying to achieve when I get to the lab and so you can have a better understanding of what I am trying to do.

ADD REPLY
0
Entering edit mode

In the future, please make an effort to format the text in your question. This makes it easier to distinguish question text from source code or input. This makes it easier for people to help you.

ADD REPLY
1
Entering edit mode
10.4 years ago

Don't overthink it. Here's a fully worked solution in R. Assuming the data file is tab-delimited.

library(limma)
D<-read.table("data.txt", header=TRUE)
vennDiagram(vennCounts(D[,8:10]))

Three lines of code, after you get the limma package installed.

ADD COMMENT
1
Entering edit mode

Cool solution, but IMO OP is looking to learn Python. OP being a beginner, while I support you that R is THE way to go about the task, I would not discourage Python usage.

ADD REPLY
0
Entering edit mode

Thanks for your Answer Karl,

However, I will also be doing some differential gene expression analysis and that will add more data to my input and I would like to have a script where I can go in and change things to get desired Venn's. Also this would help me in understanding how the code is going to be structured and will allow me to be able to become an independent coder going further. Thanks for your time!

ADD REPLY
0
Entering edit mode
10.4 years ago
João Rodrigues ★ 2.5k

Hi,

You have some issues with your code that might be causing the problem. First, your header and data variables hold the same information; you have a typo in the class method: readline() vs readlines(). Second, the for loop to convert the values to integers will fail with the example input you provided (dots, first column, etc, cannot be converted). There are also issues with op (list of lists?), etc...

Here's my suggestion:

1. Load the data to a numpy array directly using the loadtxt function!! Read more about it here.

 data = np.loadtxt('data.txt', dtype=np.int, usecols=(6,7,8), skiprows=1)

You need the header for the labels of the diagram so just parse it once like you did:

 fh = open('data.txt')
 header = fh.readline().strip().split() # This way you already have a list
 fh.close()

​2. Then just do your counting. You can simplify it by using list comprehensions and the count method of lists. This does imply however converting the numpy array to a list, so it might be useless just to have that first step anyway.

 counts = [data.tolist().count(o) for o in op]

3. Plot

  v = venn3(subsets=counts, set_labels=header[6:9])
  plt.show()

Also, have a look at this article to learn some tricks of Python (i.e. for loops without range!)

ADD COMMENT

Login before adding your answer.

Traffic: 2768 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6