Question

Issues with reading .bed files and compressing output in a specific format

0

Entering edit mode

4.2 years ago

dk0319 ▴ 70

import gzip
input_file = open("example.bed","rb")#compress existing file
data = input_file.read()
with gzip.open("example.bed.gz", "wb") as filez:
    filez.write(data)
    filez.close()

import pandas as pd#converts gz file to .txt 
df=pd.read_csv("example.bed.gz", delimiter='\t',header=1 )
df.to_csv('exampleziptotxt.bed', index=False) 



import gzip
import os
file_name = "exampleziptotxt.bed"
out_file_root = "example_by_chrom"
file_handle_dict = {}
with open(file_name, "rb") as file_reader:

   for line in file_reader:
      ff = line.split()
      chrom_name = ff[0].decode("utf-8")

      if not (chrom_name in file_handle_dict):
          out_file_chrom_name = out_file_root + "." + chrom_name + ".bed.gz"

          with gzip.open(out_file_chrom_name, "wb") as out_file_chrom_name_handle:
                file_handle_dict[chrom_name] = out_file_chrom_name_handle
               file_handle_dict[chrom_name].write(line)

          file_handle_dict[chrom_name].write(gzip.compress(line))
file_reader.close()

(Desired) program takes a .bed file compresses it, reconverts the gzipped file to a .txt file , and then reads the contents and produces individual gzipped .bed files for each chromosome containing each gene belonging to that chromosome. vs. (Reality) the current script produces gzipped files for every gene for each chromosome and then eventually throws an error

FileNotFoundError: [Errno 2] No such file or directory: example_by_chrom.chr12,11733136,11733137,Cyp3a23/3a1,1,-.bed.gz'

Any help with solving this problem will be greatly appreciated. I have been stuck for days with this issue.

bed python gzip • 2.9k views

ADD COMMENT • link updated 4.2 years ago by Jorge Amigo 14k • written 4.2 years ago by dk0319 ▴ 70

0

Entering edit mode

4.2 years ago

Alex Reynolds 36k

You could do this task much more easily and more quickly using the shell, instead of Python.

First, sort the starting file with sort-bed:

$ sort-bed in.unsorted.bed > in.bed

Then split the sorted file by chromosome with bedextract, writing each per-chromosome dataset to separate compressed files:

$ for chrom in `bedextract --list-chr in.bed`; do echo ${chrom}; bedextract ${chrom} in.bed | gzip -c > in.${chrom}.bed.gz; done

You could also write to Starch format, if you need to save more disk space:

$ for chrom in `bedextract --list-chr in.bed`; do echo ${chrom}; bedextract ${chrom} in.bed | starch --omit-signature - > in.${chrom}.bed.starch; done

References:

ADD COMMENT • link 4.2 years ago by Alex Reynolds 36k

0

Entering edit mode

Unfortunately, I am limited to jupyter-lab at the moment, though I see your point.

ADD REPLY • link 4.2 years ago by dk0319 ▴ 70

0

Entering edit mode

Maybe use subprocess if you absolutely have to use Python. It's just not the right tool for this particular job, though.

ADD REPLY • link 4.2 years ago by Alex Reynolds 36k

score 1 · Accepted Answer · 2020-09-24

Despite the fact I don't understand why anyone would want to compress a file and uncompress it right afterwards, if you have a bed file and you want to split it by chromosome there's a fairly simple awk way to do it:

awk '{print > FILENAME".split."$1}' input.bed

You can compress them all at the end if you want to have them gzipped:

gzip -f input.bed.split.*

Or you can do both splitting and compressing in a single step:

awk '{print | "gzip > input."$1".bed.gz"}' input.bed