Question

Can I count UMI-barcode combinations without mapping?


10 months ago

Assa Yeroslaviz ★ 1.9k

In short: is it possible to count my reads by summarizing them on the combination of cellular barcodes and molecular barcode (UMI) from one read and a specific ID tag from the other read of the pair?

This is related to my previous question.

As stated there, I have a data set of paired-end FASTQ files. Read 2 of this set contains my barcodes and UMIs, which I can find with a regular expression, so I can use the umi_tools extract command to attach the barcode_UMI combination to the headers of my FASTQ files:

 ==>                                    3 barcodes (30nt) put together _ UMI (8nt)
@A01878:119:H53Y3DRX5:2:2101:11469:1344_CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT 2:N:0:CGTCTCAT+AGCTACTA
TCCAGCTACTGCACCACTGCTTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGCTACT
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,

The three cellular barcodes of 10 nt each are concatenated first, and the 8 nt UMI follows the underscore (for the structure, see the previous post).

My problem is that I can't really map these files to a genome. Read 1, where the genomic sequence would normally be, contains only a tag ID for a specific antibody protein, so it can't be mapped to a genome (see image below). The ID is only 8 nt long and is always at the beginning of Read 1.

[Image: read pair structure]

Is there a way to still count such combinations of IDs and barcodes?

HyDrop protein UMI UMI-Tools

Reply

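You can extract the barcode_UMI combination from the read headers with the command below. Selecting every fourth line (NR % 4 == 1) rather than matching on a leading @ avoids picking up quality lines that happen to start with @.

$ awk -F "_| " 'NR % 4 == 1 {print $2"_"$3}' test.fq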
CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT
CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT
CTCTCCTGAACTAACACGCCTATTCACTCT_GCGTTGAT

Then pipe the result through sort and uniq -c to count how many reads carry each barcode_UMI combination:

$ awk -F "_| " 'NR % 4 == 1 {print $2"_"$3}' test.fq | sort | uniq -c
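
The pipeline above counts reads per barcode_UMI string in a single file. If you also want the 8 nt tag from Read 1, and distinct UMIs rather than raw reads, something along these lines might work. This is only a rough sketch under several assumptions: the Read 1 FASTQ has four lines per record, its headers already carry the same _barcode_UMI suffix (for example because umi_tools extract was run on both files of the pair), and the tag is exactly the first 8 nt of the Read 1 sequence. The function name count_tag_umis is made up for illustration.

```shell
# Hypothetical helper: count distinct UMIs per (cell barcode, antibody tag).
# Assumes a 4-line-per-record Read 1 FASTQ whose headers look like
#   @<read name>_<barcode>_<UMI> <comment>
# and an 8 nt tag at the very start of each Read 1 sequence.
count_tag_umis() {
    awk '
        NR % 4 == 1 {                      # header line of each record
            split($1, name, "_")           # $1 = read name up to the first space
            barcode = name[2]
            umi = name[3]
        }
        NR % 4 == 2 {                      # sequence line of each record
            tag = substr($0, 1, 8)         # first 8 nt = antibody tag
            key = barcode "\t" tag
            # count each barcode/tag/UMI triple only once (exact match)
            if (!seen[key "\t" umi]++) molecules[key]++
        }
        END {
            for (key in molecules) print key "\t" molecules[key]
        }
    ' "$1" | sort
}
```

Calling count_tag_umis on the extracted Read 1 file prints one tab-separated line per barcode and tag with the number of distinct UMIs. UMIs are compared by exact match only, so sequencing errors in the UMI or in the tag are not corrected and will inflate the counts somewhat.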

10 months ago by GenoMax 150k