I have an MPRA-like experiment in which I designed a library of approximately 2000 different promoters, each expressing a barcoded transcript. Because of how the library was constructed, the barcodes were introduced randomly, so I don't know a priori which barcodes correspond to which promoter; there should be 100+ different barcodes per promoter. I have two sets of data. The first is from the library itself, with reads spanning the entire promoter and barcoded transcript (paired end with some overlap in the middle). With this, I need to identify which barcodes go with which promoter and get some QC metrics. The second is reads of only the transcript, from RNA and DNA collected after the library was put into cells. With it, I need to extract just the barcode and match it to its previously identified promoter so I can compare counts.
I thought I had the pipeline worked out for both steps, but I realized there is a big problem with the first part, where I associate barcodes with promoters. I created a reference FASTA listing all my expected promoters and aligned the reads to it with BWA. The vast majority of the promoters are very similar, though: mostly a saturation mutagenesis of the same promoter, with just SNVs or 5 to 25 bp deletions at each position, plus a handful of entirely different promoters. So when I align to this list, very few reads map uniquely, and I can't actually tell which barcode is paired with which promoter.
My question is: how can I solve this mapping problem when most of my library members have high sequence similarity? Should I instead align to a single reference sequence (or a much smaller set) and use a variant caller like GATK afterwards? And if so, can I input a list of expected variants, or can I match the VCF back up with my list of expected promoters after variant calling?
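To make the second idea concrete, this is roughly the matching step I have in mind once each read has its variants called against a single reference. It is only a sketch with made-up names: `build_lookup`, `assign_promoter`, and the promoter names in the usage note are hypothetical, and it assumes each mutant promoter differs from the reference by exactly one variant, written the same way as the POS/REF/ALT columns of a VCF record. I assume deletions would have to be represented identically (for example left-aligned) in both the expected list and the calls for this to work.

```python
from typing import Dict, List, Optional, Tuple

# A variant is (1-based position on the single reference, REF allele, ALT allele),
# mirroring the POS/REF/ALT columns of a VCF record.
Variant = Tuple[int, str, str]


def build_lookup(expected: Dict[str, Variant]) -> Dict[Variant, str]:
    """Invert {promoter_name: variant} into {variant: promoter_name}.

    Raises ValueError if two promoters are defined by the same variant,
    since they could not be told apart by this approach.
    """
    lookup: Dict[Variant, str] = {}
    for name, variant in expected.items():
        if variant in lookup:
            raise ValueError(f"{name} and {lookup[variant]} share the same variant")
        lookup[variant] = name
    return lookup


def assign_promoter(observed: List[Variant], lookup: Dict[Variant, str]) -> Optional[str]:
    """Return the promoter name for one read, or None if it cannot be assigned.

    Assumption: a read is only assignable when it carries exactly one variant
    and that variant is in the expected list. Reads with no variants, several
    variants, or an unexpected variant return None.
    """
    if len(observed) != 1:
        return None
    return lookup.get(observed[0])
```

For example, with `lookup = build_lookup({"snv_10_A>G": (10, "A", "G"), "del_20_5bp": (20, "ACGTAC", "A")})`, a read whose only call is `(10, "A", "G")` would be assigned to `snv_10_A>G`, and a read with an unexpected call would be left unassigned. I don't know whether this is a sensible way to do it or whether existing tools already handle this.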
I'm pretty much a novice when it comes to working with sequence data, so this may be an obvious thing I'm misunderstanding. I'm comfortable using existing tools on the command line, and I'm good to go once I get things into a format that I can plot in R, but I'm still shaky on this part of bioinformatics.
Thanks so much!