Question

How to align library of highly similar sequences

0

Entering edit mode

6 weeks ago

rustyshackleford • 0

I have an MPRA-like experiment in which I designed a library of approximately 2000 different promoters expressing a barcoded transcript. The way this library was constructed, the barcodes were introduced randomly such that I don't know a priori which barcodes correspond to which promoter and there should be 100+ different barcodes per promoter. I have two sets of data. The first is of the library with reads spanning the entire promoter and barcoded transcript (paired end with some overlap in the middle). With this, I need to identify which barcodes go with which promoter and get some QC metrics. The second is reads of only the transcript from RNA and DNA after the library was put into cells. With it, I need to extract just the barcode and match it to its previously identified promoter so I can compare counts.

I thought that I had the pipeline worked out for both steps but I realized a big problem with the first part where I associate barcodes and promoters. I created a reference fasta with a list of all my expected promoters and aligned the reads to it with BWA. The vast majority of the promoters are very similar though, mostly a saturated mutagenesis of the same promoter with just SNVs or 5 to 25 bp deletions at each position plus a handful of entirely different promoters. So when I aligned to this list, very few reads are uniquely mapped and I can't actually tell which barcode is paired with which promoter.

My question is, how can I solve this mapping problem where most of my library members have high sequence similarity? Should I instead align to a single reference sequence (or much smaller set) and use a variant caller like GATK afterwards? And if so can I input a list of expected variants, or can I match the vcf up with my reference after variant calling?

I'm pretty novice when it comes to working with sequence data, so this may be an obvious thing I'm misunderstand. I'm comfortable with command line using existing tools and good to go once I get things in a format that I can make plots and such in R, but working through this part of bioinformatics is still rusty.

MPRA variant alignment BWA SNV • 589 views

ADD COMMENT • link 6 weeks ago by rustyshackleford • 0

score 2 · Answer 1 · 2025-03-27

Hey Rusty,

Ah, the old MPRA barcode to promoter mapping problem, eh? We got you.

First let's frame the problem for educational purposes:

MPRA stands for massively parallel reporter assay. The promoters in such a "MPRA-like" library tend to differ only by small number of SNVs or short deletions. The problem is that the promoter sequences are so similar that many reads will match equally well to multiple references ...

When an aligner sees this, it may pick several options depending on its defaults, the settings used, etc. For instance, it could flag them as multimapped, or it could pick one arbitrarily, and so on ... In short, standard aligners (e.g., BWA/Bowtie) can’t always pick a single best alignment when so many near-identical references exist. As a result, it’s difficult to determine which read—and hence which barcode—corresponds to which specific promoter design ...

Outline of the Fix

Labs deal with this in a couple different ways. I'm going to outline just one in the interest of time. If you need more explanation let me know.

Treat promoters as variants of one (or a few) reference sequences (rather than mapping to thousands of highly similar references)
OK, now align reads to these "references" --> avoids multi-mapping issues.
Identify each read’s variants (SNVs, small deletions) relative to the reference
Next, match those variants back to your known library of promoter designs (pinpoint which construct the read came from)
Finally, associate the barcodes from those reads with their definitive promoter assignments ..

TL; DR: consolidate all the constructs into a small number of references, then call variants, then link each barcode to the promoter --> disambiguation of highly similar info w/o loss of accuracy.

score 1 · Answer 2 · 2025-03-28

You need to deviate from the defaults of the aligner to enforce that only perfect matches are counted. Basically, and I use bowtie2 for purposes like this, you enforce ed-to-end mapping (not local!) and give high penalties to the alignment score if there is any mismatch or gap. Then you select only those with highest mapping quality, which effectively means only perfect matches.

Since you do not know beforehand which promoter associate with which barcode I would make a greedy reference. Meaning, you pair all promoters with all possible barcodes and then simply let the aligner (with its perfect matching) decide. Inspired by single-cell you could also make some sort of a knee plot, to later visually check that you get a good separation between what looks like real existing promoter-barcode associations and background. Since you do the greedy reference I would intuituvely expect that many combinations don't even exist, so a proper knee should be possible I guess.

Hope that makes sense.