Python script random act
1
0
Entering edit mode
6.2 years ago

In this script I use genode GTF for mouse (last realase vM18)

At the beginning, I wanted to modify the mouse GTF annotation using a python script (modify some positions in chr12 annotation)

Of course I simplified my script to focus on my issue. I'll try to make it clear.

This is my script :

import os
import sys
import getopt
import os.path
import numpy as np
import pandas as pd

def usage():
    print('Usage:\n')
    print('\tpython ' +
          sys.argv[0] + ' -g <genome type> -y <gtf file>')
    print('\t\t-h or --help : display this help')
    print('\t\t-g or --genome : will add the genome to the cytoband file name')
    print('\t\t-y or --file_gtf : gtf file')


def main(argv):

    genome = ""
    file_gtf = ""

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'g:y:', ['genome=', 'file_gtf=', 'help'])
    except getopt.GetoptError:
        usage()
        sys.exit(2)

##############################OPTIONS##############################

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(2)
        elif opt in ('-g', '--genome'):
            genome = arg
        elif opt in ('-y', '--file_gtf'):
            file_gtf = arg
        else:
            print("Error : Bad option -> " + opt)
            usage()
            sys.exit(2)

##############################PROGRAM##############################

    new_vcf_file = open(genome + ".gtf", "a")
    with open(file_gtf, 'r') as vcf_f:
        for line in vcf_f:
            if line[0] != '#':
                array_line = line.split("\t")
                if array_line[0] == "chr12":
                    new_vcf_file.write(line)
                else:
                    new_vcf_file.write(line)
            else:
                new_vcf_file.write(line)
    new_vcf_file.close()

if __name__ == '__main__':
    main(sys.argv[1:])

I know it does nothing because I simplified it. Just read and write lines.

I run my script with this command line :

python test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf

At the end of this command a new GTF file is generated (testBiostar.gtf). To show my issue I count the number of line of this file and I delete it to re-run easily.

In this script I do not delete lines from the gencode GTF, so I expect to have the same number of lines

17:32:15 [ubuntu16@ubuntu16-VirtualBox Data]$ wc -l gencode.vM18.chr_patch_hapl_scaff.annotation.gtf 
1818071 gencode.vM18.chr_patch_hapl_scaff.annotation.gtf

Then I run my script multiple times :

17:31:21 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1754744 testBiostar.gtf
17:31:47 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1754735 testBiostar.gtf

Everytime I launch the script I got different results (sometimes I got a result I already got...)


EDIT : While typing I found the answer but I don't know why (computer science as its finest)

I opened the testBiostar.gtf result file and I found that some lines were cut and glue to others. My hints were on the read/write process... I do not know where exactly, but I replaced :

new_vcf_file = open(genome + ".gtf", "a")

To :

new_vcf_file = open(genome + ".gtf", "w")

17:54:10 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1818071 testBiostar.gtf
17:55:10 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1818071 testBiostar.gtf

My question is why it is working now ? I know the 'a' is for 'append to the end of the file' but after each run I delete this file

I bet it is a stupid answer... but I can not find it

Thanks

python • 2.1k views
ADD COMMENT
2
Entering edit mode

This have no relation to the answer, why you do not use python argparse? this will save you the usage() function and help more in processing input args.

ADD REPLY
0
Entering edit mode

Habits mostly, I learned python with getopt so I keep using it. I will give a try to argparse thanks

ADD REPLY
0
Entering edit mode

You will not regret :D

ADD REPLY
1
Entering edit mode

EDIT : While typing I found the answer [...]

I stopped counting how many time this happens to me ;)

ADD REPLY
2
Entering edit mode
6.2 years ago

What happens if you use open(genome + ".gtf", "w"), rather than open(genome + ".gtf", "a")? I wonder if there's some buffering going on.

ADD COMMENT
0
Entering edit mode

I forgot to put the result of "a" to "w", I edited my post. It works, I got all the results I was expected. I agree for the buffering in the script with the "a" but the script end successfully, I delete the file and I build a new one using another command line... How can it still buffer something if the script ends ?

If this could have a relation, I am in a virtual environment

ADD REPLY
1
Entering edit mode

My guess is that the file system inside the virtualbox is odd, in that it doesn't just delete the inode, but instead starts deleting the actual contents first. Do an immediate ls -l after the rm and see if the file still shows up.

ADD REPLY
0
Entering edit mode

After the rm command the file is gone. But the first run works well I got all the results, the other runs failed. I'll try this on my host machine

ADD REPLY
0
Entering edit mode

Works perfectly well on the host machine !

This is a dangerous behaviour from my virtual machine, took me 3 hours to find out the bug and I doubt about my results now... Did you see something similar on specialize forum ?

ADD REPLY
0
Entering edit mode

I'm not sure what sort of "specialize forum" it is, but no. This was a pure guess on my part aided by knowing what the "a" and "w" bits are really doing under the hood.

ADD REPLY
0
Entering edit mode

Mean StackOverflow discussion or python forum... How does a system manage the "a" option ?

ADD REPLY
0
Entering edit mode

a opens the file and, if it already exists. w opens a file and will truncate it if it already exists. As a general strategy, it's best to avoid using a unless you have a good reason to use it. My guess is that the system call used by ls and that used by the open() system call are seeing somewhat different things due to a shoddy file system implementation in virtual box.

ADD REPLY
0
Entering edit mode

I think you missed some words at the beginning. w truncate ? w overwrite !

ADD REPLY
0
Entering edit mode

It overwrites by truncating in my understanding.

ADD REPLY

Login before adding your answer.

Traffic: 1681 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6