trying to run python script to generate synthetic reads

1

Entering edit mode

8.8 years ago

lien ▴ 90

Hi all,

I've downloaded a Python script from the Supplementary data from a paper "Genome-wide copy number analysis of single cells', from Nature protocols 2012, Baslan et al. They provide several scripts to use to make your own pipeline to analyse copy numbers from sequencing data. I've ran some scripts from their supplementary files already, and these don't give any problems. However, I have some problems when I try to generate synthetic 50bp reads for each position in the hg19 genome. When I try to run this script 'hg19.generate.reads.k50.py' I don't get an output. Actually, I do get an output: one empty file, instead of several files. I have adapted the filepaths to correctly represent the paths for my case (as I did in the other scripts I ran). I don't know if I need to change some other parameters as well? (I'm not a pro with Python). I've copied the original script below, any help on how to run this would be very much appreciated.

#!/usr/bin/env python

import re
import sys
import random
import string

def main():

    CHROMLIST = open("/filepath/chromlist.txt", "r")

    chromlist = []
    for x in CHROMLIST:
        if x.rstrip().find("_") == -1:
            chromlist.append(x.rstrip())

    INFILE = False
    OUTFILE = open("/filepath/sequence.part.0.k50.txt", "w")
    outfilecount = 0
    counter = 0

    for chrFile in chromlist:
        a = chrFile.split("/")
        b = a[len(a) - 1]
        thisChrom = b.split(".")[0]

        if INFILE:
            INFILE.close()
        INFILE = open(chrFile, "r")
        chr = []
        x = ""
        y = ""
        INFILE.readline()
        for x in INFILE:
            chr.append(x.rstrip())
        x = "".join(chr)
        y = x.upper()
        print "after read " + thisChrom
        sys.stdout.flush()

        for i in range(len(y) - 50):
            counter += 1
            thisRead = y[i:(i+50)]

            if counter % 150000000 == 0:
                OUTFILE.close() 
                outfilecount += 1
                OUTFILE = open("/filepath/sequence.part." + str(outfilecount) + ".k50.txt", "w")

            OUTFILE.write("@" + thisChrom + "." + str(i))
            OUTFILE.write("\n")
            OUTFILE.write(thisRead)
            OUTFILE.write("\n")
            OUTFILE.write("+" + thisChrom + "." + str(i))
            OUTFILE.write("\n")
            OUTFILE.write("b" * len(thisRead))      
            OUTFILE.write("\n")

    INFILE.close()

    OUTFILE.close()

if __name__ == "__main__":
    main()

python synthetic reads • 2.6k views

ADD COMMENT • link updated 8.8 years ago by John 13k • written 8.8 years ago by lien ▴ 90

3

Entering edit mode

I've corrected the formatting as much as I could without actually going through the code. In general, it's often simpler to just link to a gist.

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

0

Entering edit mode

I agree to an extent. Defeats our markdown move though :(

ADD REPLY • link 8.8 years ago by Ram 44k

3

Entering edit mode

8.8 years ago

Devon Ryan 105k

What are you putting in for chromlist.txt? Looking at the code, it's expecting a file giving the path to single chromosome fasta files. No where in the path can a _ occur, which is presumably meant to get rid of *_random.fa contigs. Of course, if you happen to have a directory with a _ in the name then this will cause problems...

As an aside, the coding is terrible and rather inefficient, but that's a different post...

ADD COMMENT • link 8.8 years ago by Devon Ryan 105k

2

Entering edit mode

The chromlist.txt file indeed contains a list of paths to single chromosome fasta files.

I was working on a cluster where directories have fixed names, indeed containing a _. When I tried running it locally, after changing all names of directories so that they don't contain an _, everything works. Many thanks!

ADD REPLY • link 8.8 years ago by lien ▴ 90

1

Entering edit mode

Man, that was some seriously impressive detective work. How did you guys figure that out without even seeing the file-structure!? This only further illustrates how Devon is actually a Hidden Markov Learning Machine, and where we all see questions with insufficient information to answer, he sees emissions. Feedback only makes him stronger. When he releases Methyl-Skynet or TermSNPator-1000, we'll know its too late :P

ADD REPLY • link 8.8 years ago by John 13k

2

Entering edit mode

"Experience" means having already made all of the mistakes at least once :P

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

1

Entering edit mode

b = a[len(a) - 1] :P hehe. Makes me feel better about my awful code ;)

ADD REPLY • link 8.8 years ago by John 13k

1

Entering edit mode

I know, right? My favorite is the fact that y is essentially a duplicate of x and likely a couple hundred megs in memory :(

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

2

Entering edit mode

8.8 years ago

John 13k

This code is really oddly written...

I've edited it a bit to be simpler but still do the same thing. I've also added some print lines (which end in #############...) which you can delete once your issue is solved - but if run they should give us some more information about what's going wrong.

Gist link :

	#!/usr/bin/env python

	def main():
	counter = 0
	chromlist = []
	outfilecount = 0
	OUTFILE = open("/filepath/sequence.part.%s.k50.txt" % outfilecount, "w")

	with open("/filepath/chromlist.txt", "r") as CHROMLIST:
	for x in CHROMLIST:
	if '_' not in x: chromlist.append(x.rstrip())

	for chrFile in chromlist:
	chromosome = []
	thisChrom = chrFile.split("/")[-1].split(".")[0]
	print 'thisChrom:',thisChrom ##########################################
	with open(chrFile, "r") as INFILE:
	print 'skipping header:',INFILE.readline() ##########################################
	for x in INFILE: chromosome.append(x.rstrip())

	y = ''.join(chromosome).upper()
	print 'y:',y,len(y) ##########################################

	for i in range(len(y) - 50):
	counter += 1
	thisRead = y[i:(i+50)]

	if counter % 150000000 == 0:
	print 'yolo' ##########################################
	OUTFILE.close()
	outfilecount += 1
	OUTFILE = open("/filepath/sequence.part.%s.k50.txt" % outfilecount, "w")

	read = '@' + thisChrom + '.' + str(i) + '\n'
	read += thisRead + '\n'
	read += '+' + thisChrom + '.' + str(i) + '\n'
	read += str("b" * len(thisRead)) + '\n'
	OUTFILE.write(read)
	OUTFILE.close()

	if __name__ == "__main__": main()

view raw derp.py hosted with ❤ by GitHub

ADD COMMENT • link 8.8 years ago by John 13k

2

Entering edit mode

OK mister "I have lab meeting in a couple days so I don't have time to write the README.md file for pybam"...