I want to compare the 2nd and 4th columns of every pair of lines in a file. In detail: line 1 against lines 2,3,4...N, then line 2 against lines 3,4,5...N, and so on.
I have written a bash script. It works, but it runs very slowly: over 30 minutes.
The file has 1733 lines including the header. My code is:
for line1 in $(seq 2 1733); do
  for line2 in $(seq $((line1 + 1)) 1733); do
    # count differing bytes between the i7 indices (column 2) of the two lines
    i7_diff=$(cmp -bl \
      <(sed -n "${line1}p" Lab_primers.tsv | cut -f 2) \
      <(sed -n "${line2}p" Lab_primers.tsv | cut -f 2) | wc -l)
    # same for the i5 indices (column 4)
    i5_diff=$(cmp -bl \
      <(sed -n "${line1}p" Lab_primers.tsv | cut -f 4) \
      <(sed -n "${line2}p" Lab_primers.tsv | cut -f 4) | wc -l)
    if [ "$i7_diff" -lt 3 ] && [ "$i5_diff" -lt 3 ]; then
      sed -n "${line1}p; ${line2}p" Lab_primers.tsv >> primer_collision.txt
    fi
  done
done
I used nested for loops, then sed to print exactly line ${line1} (or ${line2}), then cut to extract the desired column. Finally, cmp and wc count the number of differences between the two columns of a pair of lines.
If a pair meets the barcode-collision condition (fewer than 3 differences in both columns), the code appends both lines to the output file.
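For reference, the check being computed for each pair is a per-character mismatch count (a Hamming distance) on each index column, with "collision" meaning fewer than 3 mismatches on both. A minimal illustration (in Python; the function name and sample inputs are just for demonstration):

```python
# Count positions where two equal-length strings differ.
def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

# Two i7 indices from the excerpt below:
print(mismatches("CGCTCATT", "CGCTAATC"))  # 2 -> fewer than 3, a collision
```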
Here is the excerpt of the tab-separated input (it has 1733 lines):
I7_Index_ID index I5_Index_ID index2 primer
D703 CGCTCATT D507 ACGTCCTG 27
D704 GAGATTCC D507 ACGTCCTG 28
D701 ATTACTCG D508 GTCAGTAC 29
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
D705 ATTCAGAA D504 TCAGAGCC 45
D706 GAATTCGT D504 TCAGAGCC 46
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
D715 GGTGATGA D510 AACCTCTC 103
D716 AACCTACG D510 AACCTCTC 104
i787 TGCTTCCA i593 ATCGTCTC R35
Then the expected output is:
D703 CGCTCATT D507 ACGTCCTG 27
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
i787 TGCTTCCA i593 ATCGTCTC R35
My question is: what is a better approach here, and how can I reduce the running time?
Thank you for your help!
Nice one, jesse. Another alternative could be functionality built into Seurat 4.
vl
Good to know (I'm really overdue to have a look at Seurat). bobia9193, Seurat might be handy if you're working with single-cell data, like from 10x. Does it have utilities for this sort of thing?
https://satijalab.org/seurat/reference/read10x
Yes, there is a specific function called read10x aimed at this sort of thing :-D
Thank you for your solution, which worked perfectly. With my actual dataset, it took about 3 seconds to finish.
I'm a newbie in Python, so could you please explain your code?
Thank you!
Sure, here goes:
I used Python's built-in csv module to read and write tab-separated files (giving `delimiter="\t"` to use tab instead of comma), so that's what the DictReader and DictWriter are doing (those classes use dictionaries (key=value pairs) for each row, for input and output, based on column names). Often it's a good idea to go row-by-row without storing everything at once, but in this case you do need everything at once to compare rows, so I use `list(reader)` to get a list of all rows from the input in one go.

`compare` is a little function (very compact thanks to lambda) that pairs up characters from two input text strings and sums up a total of how many character pairs were not the same. That relies on a couple of things:

- iterating over a string (`for x in "abc"`) gives you the individual characters of the string
- `False` converts to 0 and `True` to 1 (so `sum(a != b ...)` is how many characters didn't match)
- `(... for ... in ...)` is a generator expression that gets passed to the `sum` function, which in turn can gather the values it needs to sum up without needing to see them all at once. (That sort of thing would matter if you wanted to sum, say, millions of numbers you were calculating on the fly, but not so much when it's just 8 numbers.)

The rest is very similar to what you had in bash, except that it uses the already-parsed lists of dictionaries for working with rows, the compare function above, and then the csv DictWriter to write the rows back out based on the same column names seen in the input.
`rows` is a Python list, so it can be indexed by integers starting at zero, while each `rows[someindex]` is a dictionary, so you can use the named keys that were taken from the header row of the input CSV. I based things on the number of rows (via `len(rows)`) so it would still work with different files. `range(len(rows))` kind of smells, and you could probably be more clever with enumerate and maybe other things, but going with pairs of line numbers seems straightforward enough.

Thank you for your help! Sincerely