In short: is it possible to count my reads, grouping them by the combination of cellular barcode and molecular barcode (UMI) from one read and a specific ID tag from its mate?
This is related to my previous question.
As stated there, I have a data set of paired-end FASTQ files. Read 2 contains my barcodes and UMIs, which I can find using a regular expression, so I can use the umi_tools extract command
to attach the barcode_UMI combination to the header of my FASTQ files:
==> 3 barcodes concatenated (30 nt) _ UMI (8 nt)
@A01878:119:H53Y3DRX5:2:2101:11469:1344_CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT 2:N:0:CGTCTCAT+AGCTACTA
TCCAGCTACTGCACCACTGCTTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGCTACT
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,
The three cellular barcodes of 10 nt each are concatenated first, and after the underscore comes the 8 nt UMI (for the structure, see the previous post).
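To make that header layout concrete, here is a minimal Python sketch that splits the example header into its read ID, the three 10 nt barcodes, and the UMI. The function name `split_header` is hypothetical, and the fixed 10 nt barcode length is taken from the description above rather than detected from the data.

```python
def split_header(header_line):
    """Split a FASTQ header of the form @<read id>_<cell barcode>_<UMI> <comment>."""
    # The first whitespace-separated field holds the read name plus the appended tags.
    name = header_line.lstrip("@").split()[0]
    # Split from the right so underscores inside the read ID would not interfere.
    read_id, cell_barcode, umi = name.rsplit("_", 2)
    # Assumption from the post: three barcodes of 10 nt each, concatenated.
    barcodes = [cell_barcode[i:i + 10] for i in range(0, len(cell_barcode), 10)]
    return read_id, barcodes, umi


header = ("@A01878:119:H53Y3DRX5:2:2101:11469:1344"
          "_CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT 2:N:0:CGTCTCAT+AGCTACTA")
read_id, barcodes, umi = split_header(header)
print(read_id)
print(barcodes)
print(umi)
```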
My problem is that I can't really map these files to a genome. Read 1, where the genomic sequence would normally be, contains only a tag ID for a specific antibody protein, so it can't be mapped to a genome (see image below). The ID is only 8 nt long and is always at the beginning of Read 1.
Is there a way to still count such combinations of IDs and barcodes?
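Since no alignment is involved, one possible approach is to count directly from the FASTQ records. The sketch below is only an illustration under stated assumptions: the Read 1 headers carry the same `_barcode_UMI` suffix as shown above, the tag ID is the first 8 nt of Read 1, and UMIs are compared by exact match with no error correction. The function name `count_tags` and the example records are made up for the demonstration.

```python
from collections import defaultdict
import io

TAG_LENGTH = 8  # assumption from the post: the tag ID is the first 8 nt of Read 1


def count_tags(fastq_lines):
    """Return {(cell_barcode, tag_id): number of distinct UMIs}."""
    umis = defaultdict(set)
    lines = iter(fastq_lines)
    for header in lines:
        if not header.strip():
            continue  # skip stray blank lines, e.g. at the end of the file
        sequence = next(lines).strip()
        next(lines)  # '+' separator line
        next(lines)  # quality line, not needed for counting
        name = header.split()[0]
        _, cell_barcode, umi = name.rsplit("_", 2)
        tag_id = sequence[:TAG_LENGTH]
        # A set collapses reads that share cell barcode, tag ID and UMI.
        umis[(cell_barcode, tag_id)].add(umi)
    return {key: len(value) for key, value in umis.items()}


# Made-up Read 1 records: r1 and r2 share a UMI, r3 has a different one.
example = (
    "@r1_AAAAAAAAAACCCCCCCCCCGGGGGGGGGG_ACGTACGT 1:N:0\n"
    "TTGACCAGTTTT\n+\nFFFFFFFFFFFF\n"
    "@r2_AAAAAAAAAACCCCCCCCCCGGGGGGGGGG_ACGTACGT 1:N:0\n"
    "TTGACCAGAAAA\n+\nFFFFFFFFFFFF\n"
    "@r3_AAAAAAAAAACCCCCCCCCCGGGGGGGGGG_TTTTGGGG 1:N:0\n"
    "TTGACCAGCCCC\n+\nFFFFFFFFFFFF\n"
)
counts = count_tags(io.StringIO(example))
for (cell_barcode, tag_id), n_umis in counts.items():
    print(cell_barcode, tag_id, n_umis)
```

For real data the same function could be fed an open file handle (for example from `gzip.open(path, "rt")`). Exact UMI matching will overcount when UMIs contain sequencing errors, so this is a rough count rather than a substitute for proper UMI deduplication.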
You can extract the UMI by using.
Then use a combo of