I want to compare the 2nd and 4th columns of every pair of lines in a file. In detail: line 1 against lines 2,3,4...N, then line 2 against lines 3,4,5...N, and so on.
I have written a bash script. It works, but it runs very slowly: over 30 minutes.
The file has 1733 lines including the header. My code is:
for line1 in $(seq 2 1733); do
  for line2 in $(seq $((line1 + 1)) 1733); do
    # count differing bytes between the i7 indices (column 2) of the two lines
    i7_diff=$(cmp -bl \
      <(sed -n "${line1}p" Lab_primers.tsv | cut -f 2) \
      <(sed -n "${line2}p" Lab_primers.tsv | cut -f 2) | wc -l)
    # same for the i5 indices (column 4)
    i5_diff=$(cmp -bl \
      <(sed -n "${line1}p" Lab_primers.tsv | cut -f 4) \
      <(sed -n "${line2}p" Lab_primers.tsv | cut -f 4) | wc -l)
    if [ "$i7_diff" -lt 3 ] && [ "$i5_diff" -lt 3 ]; then
      sed -n "${line1}p; ${line2}p" Lab_primers.tsv >> primer_collision.txt
    fi
  done
done
I used nested for loops, then sed to print exactly line ${line1} (or ${line2}), then cut to extract the desired column. Finally, cmp and wc count the number of differences between the two columns of a pair of lines.
If a pair meets the barcode-collision condition (fewer than 3 differences in both columns), the code appends both lines to the output file.
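For reference, the check being computed for each pair is a per-character mismatch count (a Hamming distance) on each index column, with "collision" meaning fewer than 3 mismatches on both. A minimal illustration (in Python; the function name and sample inputs are just for demonstration):

```python
# Count positions where two equal-length strings differ.
def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

# Two i7 indices from the excerpt below:
print(mismatches("CGCTCATT", "CGCTAATC"))  # 2 -> fewer than 3, a collision
```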
Here is the excerpt of the tab-separated input (it has 1733 lines):
I7_Index_ID index I5_Index_ID index2 primer
D703 CGCTCATT D507 ACGTCCTG 27
D704 GAGATTCC D507 ACGTCCTG 28
D701 ATTACTCG D508 GTCAGTAC 29
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
D705 ATTCAGAA D504 TCAGAGCC 45
D706 GAATTCGT D504 TCAGAGCC 46
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
D715 GGTGATGA D510 AACCTCTC 103
D716 AACCTACG D510 AACCTCTC 104
i787 TGCTTCCA i593 ATCGTCTC R35
Then the expected output is:
D703 CGCTCATT D507 ACGTCCTG 27
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
i787 TGCTTCCA i593 ATCGTCTC R35
My question is: what is a better approach here, and how can I reduce the running time?
Thank you for your help!
Nice one, jesse. Another alternative could be functionality built into Seurat 4.
vl
Good to know (I'm really overdue to have a look at Seurat). bobia9193, Seurat might be handy if you're working with single-cell data, like from 10x. Does it have utilities for this sort of thing?
https://satijalab.org/seurat/reference/read10x
Yes, there is a specific function called read10x aimed at this sort of thing :-D
Thank you for your solution, which worked perfectly. With my actual dataset, it took about 3 seconds to finish.
I'm a newbie in Python, so could you please explain your code?
Thank you!
Sure, here goes:
I used Python's built-in csv module to read and write tab-separated files (giving `delimiter="\t"` to use tab instead of comma), so that's what the DictReader and DictWriter are doing (those classes use dictionaries (key=value pairs) for each row, for input and output, based on column names). Often it's a good idea to go row-by-row without storing everything at once, but in this case you do need everything at once to compare rows, so I use `list(reader)` to get a list of all rows from the input in one go.

`compare` is a little function (very compact thanks to lambda) that pairs up characters from two input text strings and sums up a total of how many character pairs were not the same. That relies on a couple of things:

- iterating over a string (`for x in "abc"`) gives you the individual characters of the string
- `False` converts to 0 and `True` to 1 (so `sum(a != b ...)` is how many characters didn't match)
- `(... for ... in ...)` is a generator expression that gets passed to the `sum` function, which in turn can gather the values it needs to sum up without needing to see them all at once. (That sort of thing would matter if you wanted to sum, say, millions of numbers you were calculating on the fly, but not so much when it's just 8 numbers.)

The rest is very similar to what you had in bash, except that it uses the already-parsed lists of dictionaries for working with rows, the compare function above, and then the csv DictWriter to write the rows back out based on the same column names seen in the input.
`rows` is a Python list, so it can be indexed by integers starting at zero, while each `rows[someindex]` is a dictionary, so you can use the named keys that were taken from the header row of the input CSV. I based things on the number of rows (via `len(rows)`) so it would still work with different files. `range(len(rows))` kind of smells, and you could probably be more clever with enumerate and maybe other things, but going with pairs of line numbers seems straightforward enough.

Thank you for your help! Sincerely