Sorting sequences according header
0
2
Entering edit mode
8.4 years ago
Gian77 ▴ 70

Hello everyone,

I have an Illumina library back form the sequencing facility of around 18M PE reads. Sequences are in two huge files (R1.fastq, and R2.fastq), the index (barcodes) are in another file (I1.fastq). I have made some filtering on my reads before demultiplexing, so before doing that I need to sort my Index to be sure only the one correspond to the reads are present. I have a python code that aims to output that exactly the same Index that are contained in the reads .fastq file according the read headers. The code below is working but is incredibly slow (the files I am working are very big, like 18M for the Index file and 15M for the reads file) and I don't really know how to speed it up. Any suggestions? Thank you very much in advance, G.

#usage $: python match_index_to_reads_fastq.py reads.fastq index.fastq 

import sys
reads = sys.argv[1]
index = sys.argv[2]

input = open(reads, "r")
all_lines = input.readlines()
input.close()

all_readIDs = []
for i, line in enumerate(all_lines):
    if line.startswith("@HWI"):
        all_readIDs.append(line)

input2 = open(index, "r")
all_lines2 = input2.readlines()
input2.close()

output = open("filtered_index.fastq", "w")
for i, line in enumerate(all_lines2):
    if line.startswith("@HWI"):
        line in all_readIDs:
            output.write(line + "".join(all_lines2[i+1:i+4])

output.close()
software error sequence Assembly next-gen • 3.2k views
ADD COMMENT
2
Entering edit mode

What do you think line in all_readIDs: does?

This code is about a week of learning Python fundamentals away from working. While i'm sure people on Biostars could write the correct code for you, I suspect this wouldn't really be the sort of help you actually need. Most importantly, whomever gave you this task would think you understood concepts that you currently do not, and that could be dangerous to you and whomever asked you to do this.

I would suggest starting with learning about the 'with' operator, and how python stores stuff in memory (and why you should never be using readlines() on a big file!)

+1 for asking your question well though. Always good to see well thought-out questions :)

ADD REPLY
2
Entering edit mode

As far as I can tell, you just want to make I1.fastq only contain index reads also present in your R1.fastq/R2.fastq files. Is that correct? And are you certain that your filtering did not reorder the reads or sometimes discard read 2 but not read 1?

If so, you can do this quickly with the BBMap package like this:

filterbyname.sh in=I1.fastq names=R1.fastq out=filtered_I1.fastq include=t
ADD REPLY
1
Entering edit mode

Hello again, I am facing with Python since not a long time... I am indeed currently learning. Thank you for the tips. By the way @John, none gave me this tasks, I had this problem with my own data and I needed to solve it so I tried to use my "weak" Python skills, that sometimes are helpful and some other times not :) G.

ADD REPLY
0
Entering edit mode

That's absolutely the best way to learn! :D If you're still having troubles by the end of tomorrow i'll post the code since this is a real problem that needs solving, but it's best to learn on your own to remember things well, I think. Good luck!!

ADD REPLY
1
Entering edit mode

Thank you @john, I think I will work on the code early tomorrow and if I will find a decent solution I will post it. G.

ADD REPLY
1
Entering edit mode

Dear all and dear @john, ...my coding skills sucks! (sorry for the bad word). I've tried to develop a new code based on your advices (attached), that works until the very end, but I am not able to store the selected data into a file, I tried to create and index [i=f3.index(line) ] for the lines but didn't work either. Help is greatly appreciated (as well as links, manuals, everything helpful to learn pyhton for bioinformatics), I really want to learn :) G.

import sys

reads = sys.argv[1] index = sys.argv[2]

all_readIDs = [] with open(reads, "r") as f: for line in f: if line.startswith("@HWI"): all_readIDs.append(line) if not all_readIDs: print "List is Empty"

with open("filtered_index.fastq", "w") as f2: with open(index, "r") as f3: for line in f3: if line.startswith("@HWI") and line in all_readIDs: f2.write(line + "".join(f3[????????

ADD REPLY
1
Entering edit mode

Please do not add answers unless you're answering your own top-level question. Instead, use "Add Comment" or "Add Reply" to add a comment or reply to someone's comment as appropriate.

ADD REPLY
1
Entering edit mode

Check this:

I'd also throw in some checks for that last for loop to see that the 'every 4th line' code is working as expected, as header lines might throw it off.

ADD REPLY
0
Entering edit mode

Hello @John, sorry for the late reply, I am actually running your script. I will let you know how it goes ( the cluster is slow today). I would thank you for the comments on your code, very helpful. Have a great night, G.

ADD REPLY
0
Entering edit mode

That's OK, you're very welcome. But just so you know, i wrote that code mainly to help you with your Python not to solve your research questions. I'd use Brian's tried and tested BBMap package over my scribbles any day!

ADD REPLY
1
Entering edit mode

I agree with you @John, I would love to be autonomous at least with these bioinfo kind of problems and be able to use Python to solve them... Your commented code helped a lot. I know this is not the place to ask but if you some useful links to good tutorials I would greatly appreciate :)

ADD REPLY
0
Entering edit mode

It runs very well, its fast :D ...and if I wish to store in f3 also the other lines, not only the header but all the 4 lines of the fastq file containing the index?

ADD REPLY
1
Entering edit mode

I'm happy you're looking to learn more Python for all the right reasons :) I think the code below does what you're looking for, but I haven't tested it on any data! Think about building in some tests to make sure it's doing what we think it is :)

ADD REPLY
0
Entering edit mode

Hi John, I am really not very familiar with while loops, I am getting the following error after running the code:

File "extract_Index_byR1R2combined_Biostar.py", line 17, in <module>
    line1 = next(f) # When we get to the end of the file, line1 == '', which is False, which will stop the while above, otherwise it will have some value, which is True, and the while keeps running.
StopIteration

ADD REPLY
1
Entering edit mode

Ah, no, that's actually my fault. Perhaps this is a good time to explain the importance of variable names. I used next(f) because I often use 'f' as the name for my open file handle. However 'f' is a pretty bad/meaningless name for something. f1, f2 and f3 are really bad names for something. I should have used a name that better reflects the thing that it is, in this case the index_file.

So I wrote next(f) when modifying the old code, but I should have named it next(f2). But that isn't the real problem. The real problem is that i should never have named if f-anything to begin with. If it had a name like index_file, I wouldn't have missed the mistake :) This is something all programmers have to get into the habit of doing. I know what f is now -- but will I remember in a day's time. Or a months time? Or even a year?

I have updated the gist to reflect the proper names. I'm still not saying it works, but it has 1 less dumb error :)

The trick with whiles is that they are basically:

if bool(x) == True: [DO ANOTHER LOOP]
else:               [CONTINUE ON]

bool() is a fundamental python function that always returns either True or False. You can play around with bool() in the python terminal if you like. bool(True) returns True. bool(False) returns False. bool(1), bool(2), bool(3)... etc, all return True, while bool(0) returns False. bool(-1), etc, returns True too, as do all the negative numbers. bool('hi') returns True, and so does bool('False') because that's just some string and it doesn't matter what the string is. bool('0') is also True. But bool(''), however, is False, because it is an empty string. bool(None) returns False. All these things are just python conventions though. In other languages, '' might be True. In other languages, all numbers less than 0 might be False. The above is just how Python's bool() works :) As a result, while True: will cause an infinite loop, which i've done really just to show you how we get out of it later.

I've also added a little try/except to make things a little clearer too. Basically, in the try: part nothing special happens, but if there is a StopIteration problem in that try block, then we go into the except block, which says break. If i had just "except:", then any error would put us in the except block (which is generally considered lazy programming). break is another special python command that lets us 'break out' of a while or for loop. Because i've changed the code to while True, the break is how we get out of it. Essentially, we keep grabbing 4 lines, and doing the comparison, until we hit the end of the file and a StopIteration is raised on line1 = next(....). The causes us to move to the except block, which triggers the break, and we move out of the while loop.

Exactly how errors are raised and caught is a lesson for another day. There's really no need to learn everything so quick :)

ADD REPLY
0
Entering edit mode

That's impressive John, thanks for your explanations and really useful lesson. I have run the new code and unfortunately I have a similar error, please see below:

  File "extract_Index_byR1R2combined_Biostar_new.py", line 17, in <module>
    line1 = next(read_file) # When we get to the end of the file, line1 == '', which is False, which will stop the while above, otherwise it will have some value, which is True, and the while keeps running.
StopIteration

ADD REPLY
1
Entering edit mode

Ah, my bad - fixed now :)

ADD REPLY
1
Entering edit mode

Oh yeah John, now it's perfectly working. Thank you so much for the help, I will study the code and try to make it flow inside my own veins :P

ADD REPLY

Login before adding your answer.

Traffic: 1065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6