Issues with reading .bed files and compressing output in a specific format
4.2 years ago · dk0319 ▴ 70
import gzip

# Compress the existing BED file.
with open("example.bed", "rb") as input_file:
    data = input_file.read()
with gzip.open("example.bed.gz", "wb") as filez:
    filez.write(data)

import pandas as pd

# Convert the gzipped file back to a plain-text BED file.
df = pd.read_csv("example.bed.gz", delimiter='\t', header=1)
df.to_csv('exampleziptotxt.bed', index=False)



import gzip
import os

# Split the text BED file into one gzipped .bed file per chromosome.
file_name = "exampleziptotxt.bed"
out_file_root = "example_by_chrom"
file_handle_dict = {}

with open(file_name, "rb") as file_reader:
    for line in file_reader:
        ff = line.split()
        chrom_name = ff[0].decode("utf-8")

        if chrom_name not in file_handle_dict:
            out_file_chrom_name = out_file_root + "." + chrom_name + ".bed.gz"

            with gzip.open(out_file_chrom_name, "wb") as out_file_chrom_name_handle:
                file_handle_dict[chrom_name] = out_file_chrom_name_handle
                file_handle_dict[chrom_name].write(line)

            file_handle_dict[chrom_name].write(gzip.compress(line))

Desired: the program takes a .bed file, compresses it, converts the gzipped file back to a .txt file, and then reads the contents and produces an individual gzipped .bed file for each chromosome, containing the genes belonging to that chromosome. Reality: the current script produces a gzipped file for every single gene on every chromosome and then eventually throws an error:

FileNotFoundError: [Errno 2] No such file or directory: 'example_by_chrom.chr12,11733136,11733137,Cyp3a23/3a1,1,-.bed.gz'

Any help with solving this problem would be greatly appreciated; I have been stuck on this issue for days.

bed python gzip
4.2 years ago

I don't understand why anyone would want to compress a file and then uncompress it right afterwards, but if you have a BED file that you want to split by chromosome, there is a fairly simple awk way to do it:

awk '{print > FILENAME".split."$1}' input.bed

You can compress them all at the end if you want to have them gzipped:

gzip -f input.bed.split.*

Or you can do both splitting and compressing in a single step:

awk '{print | "gzip > input."$1".bed.gz"}' input.bed
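
If you have to stay in Python (the OP later mentions being limited to jupyter-lab), a rough equivalent of the same idea is sketched below: read the original tab-separated BED directly and keep one gzip handle open per chromosome. File names are placeholders. Note that the FileNotFoundError in the question most likely comes from the intermediate file being written comma-separated by to_csv, so that line.split() returns the whole comma-joined line (slash in the gene name Cyp3a23/3a1 included) as the "chromosome" name.

import gzip

# Sketch: split a tab-separated BED by chromosome, compressing on the fly.
# "example.bed" is a placeholder for the original, uncompressed input.
handles = {}
try:
    with open("example.bed", "rt") as bed:
        for line in bed:
            chrom = line.split("\t", 1)[0]  # first column is the chromosome
            if chrom not in handles:
                # one gzipped output per chromosome, kept open until the end
                handles[chrom] = gzip.open(f"example_by_chrom.{chrom}.bed.gz", "wt")
            handles[chrom].write(line)
finally:
    for fh in handles.values():
        fh.close()

As written in the question, the with gzip.open(...) block closes each per-chromosome file as soon as the block exits, so any later write through the stored handle hits an already-closed file; keeping the handles in a dict and closing them only at the end avoids that.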
4.2 years ago

You could do this task much more easily and quickly using the shell instead of Python.

First, sort the starting file with sort-bed:

$ sort-bed in.unsorted.bed > in.bed

Then split the sorted file by chromosome with bedextract, writing each per-chromosome dataset to separate compressed files:

$ for chrom in `bedextract --list-chr in.bed`; do echo ${chrom}; bedextract ${chrom} in.bed | gzip -c > in.${chrom}.bed.gz; done

You could also write to Starch format, if you need to save more disk space:

$ for chrom in `bedextract --list-chr in.bed`; do echo ${chrom}; bedextract ${chrom} in.bed | starch --omit-signature - > in.${chrom}.bed.starch; done
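
If you want to sanity-check one of the per-chromosome files afterwards from a notebook, pandas can read the gzipped BED directly. A minimal sketch, assuming one of the outputs produced above is named in.chr1.bed.gz:

import pandas as pd

# Read a gzipped per-chromosome BED back in for a quick look.
# Compression is inferred from the .gz extension; BED files have no header row.
df = pd.read_csv("in.chr1.bed.gz", sep="\t", header=None)
print(df.head())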

References:

  1. https://bedops.readthedocs.io/en/latest/content/reference/file-management/sorting/sort-bed.html
  2. https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedextract.html
  3. https://bedops.readthedocs.io/en/latest/content/reference/file-management/compression/starch.html

Unfortunately, I am limited to jupyter-lab at the moment, though I see your point.


Maybe use subprocess if you absolutely have to use Python. It's just not the right tool for this particular job, though.
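
For example, a minimal sketch of shelling out to the awk one-liner from the other answer with subprocess (input.bed is a placeholder path):

import subprocess

# Run the awk split-and-gzip pipeline from a notebook cell.
# shell=True lets us pass the quoted one-liner exactly as it would be typed in
# a terminal; check=True raises an error if awk exits with a non-zero status.
cmd = """awk '{print | "gzip > input."$1".bed.gz"}' input.bed"""
subprocess.run(cmd, shell=True, check=True)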

