Hello everyone,
I have an Illumina library back form the sequencing facility of around 18M PE reads. Sequences are in two huge files (R1.fastq, and R2.fastq), the index (barcodes) are in another file (I1.fastq). I have made some filtering on my reads before demultiplexing, so before doing that I need to sort my Index to be sure only the one correspond to the reads are present. I have a python code that aims to output that exactly the same Index that are contained in the reads .fastq file according the read headers. The code below is working but is incredibly slow (the files I am working are very big, like 18M for the Index file and 15M for the reads file) and I don't really know how to speed it up. Any suggestions? Thank you very much in advance, G.
#usage $: python match_index_to_reads_fastq.py reads.fastq index.fastq
import sys
reads = sys.argv[1]
index = sys.argv[2]
input = open(reads, "r")
all_lines = input.readlines()
input.close()
all_readIDs = []
for i, line in enumerate(all_lines):
if line.startswith("@HWI"):
all_readIDs.append(line)
input2 = open(index, "r")
all_lines2 = input2.readlines()
input2.close()
output = open("filtered_index.fastq", "w")
for i, line in enumerate(all_lines2):
if line.startswith("@HWI"):
line in all_readIDs:
output.write(line + "".join(all_lines2[i+1:i+4])
output.close()
What do you think
line in all_readIDs:
does?This code is about a week of learning Python fundamentals away from working. While i'm sure people on Biostars could write the correct code for you, I suspect this wouldn't really be the sort of help you actually need. Most importantly, whomever gave you this task would think you understood concepts that you currently do not, and that could be dangerous to you and whomever asked you to do this.
I would suggest starting with learning about the 'with' operator, and how python stores stuff in memory (and why you should never be using readlines() on a big file!)
+1 for asking your question well though. Always good to see well thought-out questions :)
after what @John said take a look it this;
http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python
and
http://stackoverflow.com/questions/6335839/python-how-to-read-n-number-of-lines-at-a-time
As far as I can tell, you just want to make I1.fastq only contain index reads also present in your R1.fastq/R2.fastq files. Is that correct? And are you certain that your filtering did not reorder the reads or sometimes discard read 2 but not read 1?
If so, you can do this quickly with the BBMap package like this:
Hello again, I am facing with Python since not a long time... I am indeed currently learning. Thank you for the tips. By the way @John, none gave me this tasks, I had this problem with my own data and I needed to solve it so I tried to use my "weak" Python skills, that sometimes are helpful and some other times not :) G.
That's absolutely the best way to learn! :D If you're still having troubles by the end of tomorrow i'll post the code since this is a real problem that needs solving, but it's best to learn on your own to remember things well, I think. Good luck!!
Thank you @john, I think I will work on the code early tomorrow and if I will find a decent solution I will post it. G.
Dear all and dear @john, ...my coding skills sucks! (sorry for the bad word). I've tried to develop a new code based on your advices (attached), that works until the very end, but I am not able to store the selected data into a file, I tried to create and index [i=f3.index(line) ] for the lines but didn't work either. Help is greatly appreciated (as well as links, manuals, everything helpful to learn pyhton for bioinformatics), I really want to learn :) G.
Please do not add answers unless you're answering your own top-level question. Instead, use "Add Comment" or "Add Reply" to add a comment or reply to someone's comment as appropriate.
Check this:
I'd also throw in some checks for that last for loop to see that the 'every 4th line' code is working as expected, as header lines might throw it off.
Hello @John, sorry for the late reply, I am actually running your script. I will let you know how it goes ( the cluster is slow today). I would thank you for the comments on your code, very helpful. Have a great night, G.
That's OK, you're very welcome. But just so you know, i wrote that code mainly to help you with your Python not to solve your research questions. I'd use Brian's tried and tested BBMap package over my scribbles any day!
I agree with you @John, I would love to be autonomous at least with these bioinfo kind of problems and be able to use Python to solve them... Your commented code helped a lot. I know this is not the place to ask but if you some useful links to good tutorials I would greatly appreciate :)
It runs very well, its fast :D ...and if I wish to store in f3 also the other lines, not only the header but all the 4 lines of the fastq file containing the index?
I'm happy you're looking to learn more Python for all the right reasons :) I think the code below does what you're looking for, but I haven't tested it on any data! Think about building in some tests to make sure it's doing what we think it is :)
Hi John, I am really not very familiar with while loops, I am getting the following error after running the code:
Ah, no, that's actually my fault. Perhaps this is a good time to explain the importance of variable names. I used next(f) because I often use 'f' as the name for my open file handle. However 'f' is a pretty bad/meaningless name for something. f1, f2 and f3 are really bad names for something. I should have used a name that better reflects the thing that it is, in this case the index_file.
So I wrote next(f) when modifying the old code, but I should have named it next(f2). But that isn't the real problem. The real problem is that i should never have named if f-anything to begin with. If it had a name like index_file, I wouldn't have missed the mistake :) This is something all programmers have to get into the habit of doing. I know what f is now -- but will I remember in a day's time. Or a months time? Or even a year?
I have updated the gist to reflect the proper names. I'm still not saying it works, but it has 1 less dumb error :)
The trick with whiles is that they are basically:
bool() is a fundamental python function that always returns either True or False. You can play around with bool() in the python terminal if you like. bool(True) returns True. bool(False) returns False. bool(1), bool(2), bool(3)... etc, all return True, while bool(0) returns False. bool(-1), etc, returns True too, as do all the negative numbers. bool('hi') returns True, and so does bool('False') because that's just some string and it doesn't matter what the string is. bool('0') is also True. But bool(''), however, is False, because it is an empty string. bool(None) returns False. All these things are just python conventions though. In other languages, '' might be True. In other languages, all numbers less than 0 might be False. The above is just how Python's bool() works :) As a result,
while True:
will cause an infinite loop, which i've done really just to show you how we get out of it later.I've also added a little try/except to make things a little clearer too. Basically, in the try: part nothing special happens, but if there is a StopIteration problem in that try block, then we go into the except block, which says break. If i had just "except:", then any error would put us in the except block (which is generally considered lazy programming).
break
is another special python command that lets us 'break out' of a while or for loop. Because i've changed the code to while True, the break is how we get out of it. Essentially, we keep grabbing 4 lines, and doing the comparison, until we hit the end of the file and a StopIteration is raised on line1 = next(....). The causes us to move to the except block, which triggers the break, and we move out of the while loop.Exactly how errors are raised and caught is a lesson for another day. There's really no need to learn everything so quick :)
That's impressive John, thanks for your explanations and really useful lesson. I have run the new code and unfortunately I have a similar error, please see below:
Ah, my bad - fixed now :)
Oh yeah John, now it's perfectly working. Thank you so much for the help, I will study the code and try to make it flow inside my own veins :P