Hi,
I have a file containing information about SNPs (contig number, position, variant counts per individual) and would like to parse that file using Python. Here is an example of the file format (tab separated text file):
Contig_nb Pos tag_name A C G T N * -
Contig_3 365 LD_Tag04 1 3 0 0 0 0 8
Contig_3 365 LD_Tag05 0 0 0 5 0 0 8
Contig_3 365 LD_Tag07 0 0 0 1 0 0 9
Contig_3 365 LD_Tag08 1 0 0 4 0 0 10
Contig_128 1432 DR_Tag09 0 2 1 0 0 0 11
Contig_128 1432 DR_Tag11 0 4 1 0 0 0 16
Contig_128 1432 DR_Tag15 0 0 3 0 0 0 9
Contig_128 1432 DR_Tag16 0 2 0 0 0 0 10
Contig_128 1432 LD_Tag01 0 4 8 0 0 0 18
The crux of the problem is to regroup lines that share the same contig_nb AND position and then treat them as a block. I thus want a generator that permits iteration over all the groups (identical contig_nb AND pos) present in the file and returns each of them as a list of lists (each line in a group being a list within the 'group' list).
I am trying to avoid writing a custom while loop in which I collect lines as long as they share the same contig_nb and pos and then start a new group when they don't.
How would you do it? Could 'groupby' be used in such an instance?
Many thanks!
Looks good. You're just missing the first argument to groupby (should be your line_gen), and then your key function can be ...
That's what you get for not using unit tests ;) ... I actually prefer lambda expressions over most operator expressions. To me they're just a little easier to read, and the speed hit isn't that bad unless you're doing terabytes' worth of info.
@Will, I would be glad to give you the answer if you add the missing parts so that it can be readily used. Thanks!
@Eric ... Edited the code to make more sense :)
@Will, Many thanks :)
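For anyone landing here later, a minimal sketch of the groupby approach discussed above could look like the following. It is not the code from the original answer, which is not reproduced in this thread. The names line_gen and snp_groups, the use of the csv module, and the shortened sample data are assumptions made for illustration. Note that itertools.groupby only merges consecutive lines, so the file has to keep all lines of a given contig/position together, as in the example above.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def line_gen(handle):
    """Yield each data row as a list of fields, skipping the header line.

    Hypothetical helper: assumes a tab-separated file with one header row.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if row:  # ignore blank lines
            yield row


def snp_groups(handle, key=lambda row: (row[0], row[1])):
    """Yield one list of rows per consecutive run sharing (contig, pos).

    The first argument to groupby is the line generator; the key function
    returns the (Contig_nb, Pos) pair used to decide where a group ends.
    """
    for _, rows in groupby(line_gen(handle), key=key):
        yield list(rows)


# A shortened copy of the example file, kept in memory for the demo.
sample = (
    "Contig_nb\tPos\ttag_name\tA\tC\tG\tT\tN\t*\t-\n"
    "Contig_3\t365\tLD_Tag04\t1\t3\t0\t0\t0\t0\t8\n"
    "Contig_3\t365\tLD_Tag05\t0\t0\t0\t5\t0\t0\t8\n"
    "Contig_128\t1432\tDR_Tag09\t0\t2\t1\t0\t0\t0\t11\n"
    "Contig_128\t1432\tDR_Tag11\t0\t4\t1\t0\t0\t0\t16\n"
    "Contig_128\t1432\tDR_Tag15\t0\t0\t3\t0\t0\t0\t9\n"
)

# Key written as a lambda (the default above).
groups = list(snp_groups(io.StringIO(sample)))

# Same grouping with operator.itemgetter as the key instead of a lambda.
groups_op = list(snp_groups(io.StringIO(sample), key=itemgetter(0, 1)))

for group in groups:
    print(group[0][0], group[0][1], len(group), "lines")
```

With a real file, the io.StringIO object would be replaced by an open file handle, e.g. `with open(path) as handle:`. The lambda and itemgetter keys give the same groups; which one to use is mostly a matter of taste.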