Convert hg38 to a nested array for binary search
4.0 years ago

Hi all, I am trying to convert the hg38 GTF file into a nested array so that I can run a binary search on it. I want to nest the array by position, with the first array holding the chromosomes in sorted order:

chromosomes = list(range(1, 23))  # autosomes 1..22

The second array would be the strand (+ or -):

strands = ['+', '-']

for i in range(0, len(chromosomes)):
    # copy the list so each chromosome gets its own strands array
    chromosomes[i] = list(strands)

The third array would hold the start and end positions, and the final array would be a list of attributes such as transcript_id and gene_id.

I am not sure of the best way to iterate through the GTF file that I loaded and fill in my array of arrays. I have this so far, but I can't tell whether it is working or just taking a long time because of the file size:

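# one pass over the rows: chromosome -> strand -> list of (start, end)
# assumes row['chr'] is written like '1'..'22' (no 'chr' prefix)
nested = {str(c): {'+': [], '-': []} for c in range(1, 23)}
for index, row in sorted_df.iterrows():
    chrom, strand = str(row['chr']), row['strand']
    if chrom in nested and strand in nested[chrom]:
        # append() takes a single argument, so store the pair as a tuple
        nested[chrom][strand].append((row['start'], row['end']))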

Is this the right way of thinking about this problem or is there a better way to approach it? Any help would be appreciated.
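
For illustration, here is a minimal sketch of the layout described above, using a dictionary keyed by (chromosome, strand) instead of nested lists, with Python's standard `bisect` module for the binary search. The function names `build_index` and `find_containing` and the sample records are hypothetical; the sketch assumes standard tab-separated GTF lines with 1-based inclusive coordinates. It is one possible approach, not a definitive implementation.

```python
import bisect
from collections import defaultdict


def build_index(gtf_lines):
    """Group GTF records by (chromosome, strand), sorted by start position."""
    index = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        index[(chrom, strand)].append((start, end, fields[8]))
    for records in index.values():
        records.sort()  # sorted by start, then end
    return dict(index)


def find_containing(index, chrom, strand, pos):
    """Return records whose [start, end] interval contains pos."""
    records = index.get((chrom, strand), [])
    # Binary search: every record from position `hi` onward starts after pos.
    hi = bisect.bisect_right(records, (pos, float("inf"), ""))
    # Linear scan of the remaining candidates, since intervals may overlap.
    return [rec for rec in records[:hi] if rec[1] >= pos]


# Made-up records, for illustration only.
example = [
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
    'chr1\tsrc\tgene\t150\t400\t.\t+\t.\tgene_id "G2";',
]
idx = build_index(example)
hits = find_containing(idx, "chr1", "+", 250)
```

Note that the binary search only bounds the start coordinate. Because genes and transcripts overlap, the candidates before that point still have to be scanned, which can be slow on a whole-genome annotation.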

python hg38 genome binary search • 680 views
4.0 years ago

I think you're re-inventing the wheel. You should have a look at htslib/tabix.

See also: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
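
To give a rough idea of what the bin indexing scheme does, here is a Python sketch of the standard UCSC binning calculation (five levels, smallest bin 128 kb, each level 8 times larger, covering ranges up to 512 Mb). The function name `bin_from_range` is my own, and the sketch assumes 0-based, half-open coordinates, so a GTF record would be passed as `(start - 1, end)`. Treat it as an illustration of the idea rather than a replacement for tabix.

```python
# Offset of the first bin at each level, from finest (128 kb) to coarsest (512 Mb).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 2**17 bases = 128 kb, the size of the smallest bin
BIN_NEXT_SHIFT = 3    # each level up is 2**3 = 8 times larger


def bin_from_range(start, end):
    """Return the smallest bin that fully contains the range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            # Both ends fall in the same bin at this level.
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is outside the 512 Mb binning scheme")
```

Features are stored with their bin number, so a query only needs to look at the handful of bins that can overlap the query range instead of scanning a whole chromosome.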
