We used a new single-cell sequencing method, and we have now run into the following problem when analyzing the data. We designed three sequencing barcodes to be linked step by step. Specifically, we first linked the first barcode (containing 30 different base sequences), then linked the second barcode (containing 185 different base sequences) on this basis, and then linked the third barcode (containing 56 different base sequences) on this basis.
The data obtained in an ideal state should be:
barcode1 sequence + barcode2 sequence + barcode3 sequence + DNA fragment + barcode3 reverse complementary sequence + barcode2 reverse complementary sequence + barcode1 reverse complementary sequence
There will be 30 × 185 × 56 = 310,800 barcode combinations.
But the actual sequencing data we analyzed is:
- barcode1 sequence + barcode2 sequence + barcode3 sequence + DNA fragment + barcode3 reverse complementary sequence
- barcode1 sequence + barcode2 sequence + barcode3 sequence + DNA fragment + barcode3 reverse complementary sequence + barcode2 reverse complementary sequence
- barcode1 sequence + DNA fragment + barcode3 reverse complementary sequence + barcode2 reverse complementary sequence + barcode1 reverse complementary sequence
- Sequence from unknown source + barcode1 sequence + barcode2 sequence + barcode3 sequence + DNA fragment + barcode3 reverse complementary sequence
- barcode1 sequence + DNA fragment + sequence from unknown source
- Sequence with one base difference from barcode1 sequence (maybe caused by mutation?) + DNA fragment + barcode3 reverse complementary sequence + barcode2 reverse complementary sequence + barcode1 reverse complementary sequence
The above are just some examples; there are actually many more combinations. In addition, all sequences must contain the barcode1 sequence or the reverse complementary sequence of barcode1.
We want to know the number and proportion of the different barcode combinations. This problem has troubled me for a long time. Could you give me some ideas for code to solve it? Thank you very much!
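As a rough starting point for the counting asked about here, one option is to label each read by the ordered barcode elements found in it and then tally the labels. The sketch below is only an illustration under stated assumptions: the toy barcode sequences, the common barcode length `k = 8`, and the function names are all hypothetical; only exact matches are detected, and any stretch that matches no barcode (DNA fragment or sequence of unknown origin) is lumped together as `gap`.

```python
from collections import Counter

# Hypothetical toy whitelists; the real ones would hold 30, 185 and 56 sequences.
BARCODES = {
    "bc1": ["AACCGTGA", "GTTACAGC"],
    "bc2": ["CTGAATCG", "TGCATTAC"],
    "bc3": ["AGGTCTAC", "CCATAGGA"],
}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(barcodes):
    """Map each barcode and its reverse complement to a (label, id) pair."""
    index = {}
    for name, seqs in barcodes.items():
        for i, seq in enumerate(seqs):
            index[seq] = (name, i)
            index[revcomp(seq)] = (name + "_rc", i)
    return index


def read_structure(read, index, k):
    """Return (structure label, barcode hits) for one read.

    Assumes every barcode has length k and only exact matches count.
    """
    parts, hits = [], []
    pos, gap_open = 0, False
    while pos + k <= len(read):
        hit = index.get(read[pos:pos + k])
        if hit is None:
            if not gap_open:  # start of a non-barcode stretch
                parts.append("gap")
                gap_open = True
            pos += 1
        else:
            parts.append(hit[0])
            hits.append(hit)
            gap_open = False
            pos += k
    if pos < len(read) and not gap_open:  # trailing bases shorter than k
        parts.append("gap")
    return "+".join(parts), tuple(hits)


def summarize(reads, index, k):
    """Count structures and barcode combinations, plus structure proportions."""
    structures, combos = Counter(), Counter()
    for read in reads:
        structure, hits = read_structure(read, index, k)
        structures[structure] += 1
        combos[hits] += 1
    total = sum(structures.values())
    proportions = {s: n / total for s, n in structures.items()}
    return structures, combos, proportions


# Toy usage with made-up reads.
index = build_index(BARCODES)
fragment = "T" * 12
reads = [
    "AACCGTGA" + "CTGAATCG" + "AGGTCTAC" + fragment + revcomp("AGGTCTAC"),
    "GTTACAGC" + fragment + "GGGGG",
]
structures, combos, proportions = summarize(reads, index, 8)
for structure, n in structures.most_common():
    print(structure, n, round(proportions[structure], 3))
```

Reads that differ from a barcode by one base would show up as `gap` here; tolerating mismatches needs an additional correction step, and real data would be streamed from FASTQ rather than held in a list.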
Show us a few examples of reads. Since you said barcode, I assume these are in-line in the sequencing read. What is the length of these barcodes and the total length of the read? This may not be answerable via a forum like biostars, since this seems fairly complex and access to the actual data may be needed.
I have no idea what you’re asking. “185 base sequences”? I have no idea what that means; do you mean there are 185 possible barcodes? When you use phrases like “on this basis”, I have no clue what that is supposed to mean. What is “DNA fragment” supposed to mean?
Honestly, this just looks to me like a form of a SPLiT-seq assay. You extract three sequences from fixed positions (i.e. the positions where the barcodes should be) in your reads and then error-correct each of them to “whitelists”.
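The fixed-position extraction and whitelist correction described above could be sketched roughly as follows. This is not taken from any SPLiT-seq pipeline; the slot coordinates, the 8 bp barcode length, the toy whitelists and the function names are all assumptions for illustration, since the thread does not give the real barcode lengths or positions. A barcode is corrected only when exactly one whitelist entry lies within one mismatch; ambiguous or unmatched reads are left unassigned.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_to_whitelist(observed, whitelist, max_mismatches=1):
    """Return the unique whitelist barcode within max_mismatches, else None."""
    if observed in whitelist:
        return observed
    close = [
        bc for bc in whitelist
        if len(bc) == len(observed) and hamming(bc, observed) <= max_mismatches
    ]
    return close[0] if len(close) == 1 else None


def assign_combo(read, slots, whitelists):
    """Extract one barcode per (start, end) slot and correct each of them.

    Returns a tuple of corrected barcodes, or None if any slot fails.
    """
    combo = []
    for (start, end), whitelist in zip(slots, whitelists):
        barcode = correct_to_whitelist(read[start:end], whitelist)
        if barcode is None:
            return None
        combo.append(barcode)
    return tuple(combo)


def count_combos(reads, slots, whitelists):
    """Tally corrected barcode combinations; failures go under 'unassigned'."""
    counts = Counter()
    for read in reads:
        combo = assign_combo(read, slots, whitelists)
        counts[combo if combo is not None else "unassigned"] += 1
    return counts


# Hypothetical layout: three 8 bp barcodes at the start of the read.
SLOTS = [(0, 8), (8, 16), (16, 24)]
WHITELISTS = [
    {"AACCGTGA", "GTTACAGC"},
    {"CTGAATCG", "TGCATTAC"},
    {"AGGTCTAC", "CCATAGGA"},
]

reads = [
    "AACCGTGT" + "CTGAATCG" + "AGGTCTAC" + "TTTT",  # one mismatch in barcode 1
    "N" * 28,                                        # nothing recoverable
]
counts = count_combos(reads, SLOTS, WHITELISTS)
total = sum(counts.values())
for combo, n in counts.items():
    print(combo, n, n / total)
```

Whether one mismatch is a safe tolerance depends on how far apart the whitelist barcodes are from each other, which would need checking on the actual barcode sets.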
Cross-posted on bioinfo SE: https://bioinformatics.stackexchange.com/questions/22944/how-to-count-the-number-and-proportion-of-different-barcode-combinations-in-dna
tulip, please keep in mind that posting the same question to multiple sites can be perceived as bad etiquette, because people may put effort into addressing a problem that has already been solved elsewhere in the meantime.
The helpful thing to do if you do decide to post on multiple forums is to add a link to the other forum posts on each post so people will look at the other posts before investing their effort.