Sorting sequences according header

Entering edit mode

8.9 years ago

Gian77 ▴ 70

Hello everyone,

I have an Illumina library back form the sequencing facility of around 18M PE reads. Sequences are in two huge files (R1.fastq, and R2.fastq), the index (barcodes) are in another file (I1.fastq). I have made some filtering on my reads before demultiplexing, so before doing that I need to sort my Index to be sure only the one correspond to the reads are present. I have a python code that aims to output that exactly the same Index that are contained in the reads .fastq file according the read headers. The code below is working but is incredibly slow (the files I am working are very big, like 18M for the Index file and 15M for the reads file) and I don't really know how to speed it up. Any suggestions? Thank you very much in advance, G.

#usage $: python match_index_to_reads_fastq.py reads.fastq index.fastq 

import sys
reads = sys.argv[1]
index = sys.argv[2]

input = open(reads, "r")
all_lines = input.readlines()
input.close()

all_readIDs = []
for i, line in enumerate(all_lines):
    if line.startswith("@HWI"):
        all_readIDs.append(line)

input2 = open(index, "r")
all_lines2 = input2.readlines()
input2.close()

output = open("filtered_index.fastq", "w")
for i, line in enumerate(all_lines2):
    if line.startswith("@HWI"):
        line in all_readIDs:
            output.write(line + "".join(all_lines2[i+1:i+4])

output.close()

software error sequence Assembly next-gen • 3.9k views

ADD COMMENT • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

What do you think line in all_readIDs: does?

This code is about a week of learning Python fundamentals away from working. While i'm sure people on Biostars could write the correct code for you, I suspect this wouldn't really be the sort of help you actually need. Most importantly, whomever gave you this task would think you understood concepts that you currently do not, and that could be dangerous to you and whomever asked you to do this.

I would suggest starting with learning about the 'with' operator, and how python stores stuff in memory (and why you should never be using readlines() on a big file!)

+1 for asking your question well though. Always good to see well thought-out questions :)

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

after what @John said take a look it this;

http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python
and
http://stackoverflow.com/questions/6335839/python-how-to-read-n-number-of-lines-at-a-time

ADD REPLY • link 8.9 years ago by Medhat 9.8k

Entering edit mode

As far as I can tell, you just want to make I1.fastq only contain index reads also present in your R1.fastq/R2.fastq files. Is that correct? And are you certain that your filtering did not reorder the reads or sometimes discard read 2 but not read 1?

If so, you can do this quickly with the BBMap package like this:

filterbyname.sh in=I1.fastq names=R1.fastq out=filtered_I1.fastq include=t

ADD REPLY • link 8.9 years ago by Brian Bushnell 20k

Entering edit mode

Hello again, I am facing with Python since not a long time... I am indeed currently learning. Thank you for the tips. By the way @John, none gave me this tasks, I had this problem with my own data and I needed to solve it so I tried to use my "weak" Python skills, that sometimes are helpful and some other times not :) G.

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

That's absolutely the best way to learn! :D If you're still having troubles by the end of tomorrow i'll post the code since this is a real problem that needs solving, but it's best to learn on your own to remember things well, I think. Good luck!!

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

Thank you @john, I think I will work on the code early tomorrow and if I will find a decent solution I will post it. G.

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

Dear all and dear @john, ...my coding skills sucks! (sorry for the bad word). I've tried to develop a new code based on your advices (attached), that works until the very end, but I am not able to store the selected data into a file, I tried to create and index [i=f3.index(line) ] for the lines but didn't work either. Help is greatly appreciated (as well as links, manuals, everything helpful to learn pyhton for bioinformatics), I really want to learn :) G.

import sys

reads = sys.argv[1]
index = sys.argv[2]

all_readIDs = []
with open(reads, "r") as f:
        for line in f:
                if line.startswith("@HWI"):
                        all_readIDs.append(line)
                        if not all_readIDs:
                                print "List is Empty"

with open("filtered_index.fastq", "w") as f2:
    with open(index, "r") as f3:
        for line in f3:
                        if line.startswith("@HWI") and line in all_readIDs:
                                f2.write(line + "".join(f3[????????

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

Please do not add answers unless you're answering your own top-level question. Instead, use "Add Comment" or "Add Reply" to add a comment or reply to someone's comment as appropriate.

ADD REPLY • link 8.9 years ago by Ram 45k

Entering edit mode

Check this:


	import sys
	reads = sys.argv[1]
	index = sys.argv[2]

	# Get all read ids:
	all_readIDs = set() # Using a set because it will be quicker to find things in later.
	# A list would require python to check every item in the list.
	# A set can be thought of as an always-sorted list (elements are added in their sorted order) with no duplication of elements.
	# The sorted order is based on the hash() value of the items however, which is essentially a random number.
	# A further optimization of this would to not even bother storing the values in our hash-sorted list of values, but just
	# the hashes themselves, which a friend just told me is called a "Bloom filter" and I don't get a prize for thinking it up :P

	with open(reads, "r") as f:
	# The following is the simplest code to do what you want to do:
	for line in f:
	if line.startswith("@HWI"): all_readIDs.add(line)

	# However a quicker method would to just be to add every 4th line to our set once we've found the first line that starts with @HWI.
	# This is because python wouldn't have to check for @HWI on every line. Furthermore, for some FASTQ files the third row, usually "+",
	# is actually the same as the first row, the id. Trying to add the same value to a set twice is not a problem, but it takes time to check that it's
	# already in there, and nothing will change, so it's better to just read every 4th line.

	if not all_readIDs: print "List is Empty" # Nice check. Like it.
	else:
	# but its always a good idea to look at a few items in the set to, just to be sure:
	for idx,item in enumerate(all_readIDs):
	print item
	if idx == 10: break

	with open(index, "r") as f2, open("filtered_index.fastq", "w") as f3: # using with in this way saves on some indentation. I also swapped f2 and f3 around.
	# Here's one way to do the reading every 4th line, as mentioned above:
	skip = 0
	for line in f2:
	if skip == 0:
	if line in all_readIDs: f3.write(line)
	skip = 3
	else:
	skip = skip - 1 # (can also be written as "skip -= 1")

	print "All done!"

view raw gistfile1.py hosted with ❤ by GitHub

I'd also throw in some checks for that last for loop to see that the 'every 4th line' code is working as expected, as header lines might throw it off.

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

Hello @John, sorry for the late reply, I am actually running your script. I will let you know how it goes ( the cluster is slow today). I would thank you for the comments on your code, very helpful. Have a great night, G.

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

That's OK, you're very welcome. But just so you know, i wrote that code mainly to help you with your Python not to solve your research questions. I'd use Brian's tried and tested BBMap package over my scribbles any day!

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

I agree with you @John, I would love to be autonomous at least with these bioinfo kind of problems and be able to use Python to solve them... Your commented code helped a lot. I know this is not the place to ask but if you some useful links to good tutorials I would greatly appreciate :)

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

It runs very well, its fast :D ...and if I wish to store in f3 also the other lines, not only the header but all the 4 lines of the fastq file containing the index?

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

I'm happy you're looking to learn more Python for all the right reasons :) I think the code below does what you're looking for, but I haven't tested it on any data! Think about building in some tests to make sure it's doing what we think it is :)

	import sys
	reads = sys.argv[1]
	index = sys.argv[2]
	all_readIDs = set()

	# Re-written the code below to only ever read the first row:
	with open(reads, "r") as read_file:
	while True:
	try:
	line1 = next(read_file)
	if line1.startswith("@HWI"): all_readIDs.add(line1)
	next(read_file) # We walk through the
	next(read_file) # next 3 lines but don't
	next(read_file) # save the data anywhere.
	except StopIteration:
	break
	# This hasn't changed:
	if not all_readIDs: print "List is Empty"
	else:
	for idx,item in enumerate(all_readIDs):
	print item
	if idx == 10: break

	# Rewritten to not skip any rows and write all 4 rows to a file if the ID line in the ID set:
	with open(index, "r") as index_file, open("filtered_index.fastq", "w") as filtered_file:
	while True:
	try:
	line1 = next(index_file)
	line2 = next(index_file)
	line3 = next(index_file)
	line4 = next(index_file)
	except StopIteration:
	break
	if line1 in all_readIDs:
	filtered_file.write(line1)
	filtered_file.write(line2)
	filtered_file.write(line3)
	filtered_file.write(line4)
	print "All done!"

view raw 209184.py hosted with ❤ by GitHub

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

Hi John, I am really not very familiar with while loops, I am getting the following error after running the code:

File "extract_Index_byR1R2combined_Biostar.py", line 17, in <module>
    line1 = next(f) # When we get to the end of the file, line1 == '', which is False, which will stop the while above, otherwise it will have some value, which is True, and the while keeps running.
StopIteration

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

Ah, no, that's actually my fault. Perhaps this is a good time to explain the importance of variable names. I used next(f) because I often use 'f' as the name for my open file handle. However 'f' is a pretty bad/meaningless name for something. f1, f2 and f3 are really bad names for something. I should have used a name that better reflects the thing that it is, in this case the index_file.

So I wrote next(f) when modifying the old code, but I should have named it next(f2). But that isn't the real problem. The real problem is that i should never have named if f-anything to begin with. If it had a name like index_file, I wouldn't have missed the mistake :) This is something all programmers have to get into the habit of doing. I know what f is now -- but will I remember in a day's time. Or a months time? Or even a year?

I have updated the gist to reflect the proper names. I'm still not saying it works, but it has 1 less dumb error :)

The trick with whiles is that they are basically:

if bool(x) == True: [DO ANOTHER LOOP]
else:               [CONTINUE ON]

bool() is a fundamental python function that always returns either True or False. You can play around with bool() in the python terminal if you like. bool(True) returns True. bool(False) returns False. bool(1), bool(2), bool(3)... etc, all return True, while bool(0) returns False. bool(-1), etc, returns True too, as do all the negative numbers. bool('hi') returns True, and so does bool('False') because that's just some string and it doesn't matter what the string is. bool('0') is also True. But bool(''), however, is False, because it is an empty string. bool(None) returns False. All these things are just python conventions though. In other languages, '' might be True. In other languages, all numbers less than 0 might be False. The above is just how Python's bool() works :) As a result, while True: will cause an infinite loop, which i've done really just to show you how we get out of it later.

I've also added a little try/except to make things a little clearer too. Basically, in the try: part nothing special happens, but if there is a StopIteration problem in that try block, then we go into the except block, which says break. If i had just "except:", then any error would put us in the except block (which is generally considered lazy programming). break is another special python command that lets us 'break out' of a while or for loop. Because i've changed the code to while True, the break is how we get out of it. Essentially, we keep grabbing 4 lines, and doing the comparison, until we hit the end of the file and a StopIteration is raised on line1 = next(....). The causes us to move to the except block, which triggers the break, and we move out of the while loop.

Exactly how errors are raised and caught is a lesson for another day. There's really no need to learn everything so quick :)

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

That's impressive John, thanks for your explanations and really useful lesson. I have run the new code and unfortunately I have a similar error, please see below:

  File "extract_Index_byR1R2combined_Biostar_new.py", line 17, in <module>
    line1 = next(read_file) # When we get to the end of the file, line1 == '', which is False, which will stop the while above, otherwise it will have some value, which is True, and the while keeps running.
StopIteration

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70

Entering edit mode

Ah, my bad - fixed now :)

ADD REPLY • link 8.9 years ago by John 13k

Entering edit mode

Oh yeah John, now it's perfectly working. Thank you so much for the help, I will study the code and try to make it flow inside my own veins :P

ADD REPLY • link 8.9 years ago by Gian77 ▴ 70