Hi,
I want to parse a SAM file based on QNAME, however the python script I designed to do so takes too long.
To visualize my issue, here are the first 5 lines of my 4.25 GB SAM file:
HWI-D00372:292:C6U8JANXX:2:1310:9053:40752 99 chr1_gl000191_random 623 255 51M = 737 165 CTTTGGGAGGCTGAGGCAGGTGGATCACCTGAGGTCAGGAGTTCGAGACCA /<BBBFFFFFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFBFFFF XA:i:0 MD:Z:51 PG:Z:MarkDuplicates NM:i:0
HWI-D00372:292:C6U8JANXX:2:2108:14511:86999 99 chr1_gl000191_random 721 255 51M = 769 99 TTAGCTGGGGCTGGTGGCGGATGCCTGTAATCCCAGCTACTCAGGAGGCTG BBBBBFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFB<BFFFFBFFF XA:i:0 MD:Z:51 PG:Z:MarkDuplicates NM:i:0
HWI-D00372:292:C6U8JANXX:2:1310:9053:40752 147 chr1_gl000191_random 737 255 51M = 623 -165 GCGGATGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCTGGGGAATCACT FFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFF/<FFFFFFFFB/FFBBBBB XA:i:0 MD:Z:51 PG:Z:MarkDuplicates NM:i:0
HWI-D00372:292:C6U8JANXX:2:2108:14511:86999 147 chr1_gl000191_random 769 255 51M = 721 -99 CTGAGGCTGGGGAATCACTTGAACCTGGAAGGCGGAGGTTGCAGTGAGCTG F<FFFFBFFFFFBB<F/FFFFF<FFBFFFFFFFFFFFFFFFBFF<FBBBBB XA:i:0 MD:Z:51 PG:Z:MarkDuplicates NM:i:0
HWI-D00372:292:C6U8JANXX:2:1313:1300:84021 163 chr1_gl000191_random 833 255 51M = 877 95 GCACTCCAGCCTACGCAACAAGAGCGAAACTCTGTCTCAAAAAATAAAAAA BBBBBFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:51 PG:Z:MarkDuplicates NM:i:0
The QNAME is the HWI-D00372:292:C6U8JANXX:2:1310:9053:40752
part of the SAM file line.
I have a list of about 1 million QNAMEs that I want to iterate through, pick out each SAM file line that corresponds to each QNAME (each QNAME corresponds with 2 separate SAM file lines because this is paired-end sequencing alignment data), then write these lines out into a new SAM file.
The best python script I could design to do this task is the following:
parsedSamData = "" # initialize parsedSamData variable
# read SAM file in as a single string
file_handle = open(samFile, 'r')
sam_file_contents_string = file_handle.read()
file_handle.close()
for QNAME in QNAME_list: # iterate through each QNAME
regex = re.escape(QNAME) + r".*\n" # design a regular expression to match the SAM file lines with the QNAME
mate_pairs = re.finditer(regex, sam_file_contents_string) # match QNAME with the SAM file lines
for entry in mate_pairs: # access the match object
parsedSamData += entry.group() # append the 2 lines to the parsedSamData variable
# save the parsed SAM file data
fh = open(parsed_SAM_file_path, "w")
fh.write(parsedSamData)
fh.close()
This definitely does what I want, however the time it takes to go through each QNAME is about 5 seconds. This translates to my script running for about 58 days for this particular SAM file. I would like to perform this task in a few days or less, as I have multiple SAM files to perform this on that range in size from 4-12 GB.
I appreciate and thank you for any help!
I rarely use grep for anything, but this really does seem like a good use :)
Yeah well that's the thing about shell commands piped together - they seem like they work fine and they're a good use, but unless we walk through the logic ourselves in at least pseudocode, one tends to miss things. The problem is, they're impossible to unit test and write exceptions for.
This is a pretty good example actually. If you have a pre-filtered SAM file and only 1 of your pairs is in the file (or worse, more than 2!) a programming language would raise an exception while the grep would pass whatever it found through without incident.
I used the grep method based on another post and, surprisingly, it was taking as long if not longer than my original python script. I then tried your python script and it took ~30 seconds! Also, it worked perfectly. Thank you very much!