Can I count UMI-barcode combinations without mapping?
5 months ago
Assa Yeroslaviz ★ 1.9k

My question in short: is it possible to count my reads by summarizing them on the combination of cellular barcode and molecular barcode (UMI) from one read and a specific ID tag from the other read of the pair?

This is also related to my previous question.

As stated there, I have a data set of paired-end FASTQ files. Read 2 of this set contains my barcodes and UMIs, which I can find using a regular expression. So I can use the umi_tools extract command to attach the barcode_UMI combination to the headers of my FASTQ files:

 ==>                                    3 barcode (30nt) put together _ UMI(8nt)
@A01878:119:H53Y3DRX5:2:2101:11469:1344_CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT 2:N:0:CGTCTCAT+AGCTACTA
TCCAGCTACTGCACCACTGCTTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGCTACT
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,

The three cellular barcodes of 10 nt each are concatenated first, and after the underscore comes the 8-nt UMI (for the structure, see the previous post).

My problem is that I can't really map these files to a genome. Read 1, where the genomic sequence would normally be, contains only a tag ID for a specific antibody protein, so it can't be mapped to a genome (see image below). The ID is also only 8 nt long and is always at the beginning of Read 1.

[Image: read pair structure]

Is there still a way to count such combinations of IDs and barcodes?

HyDrop protein UMI UMI-Tools • 333 views
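You can extract the barcode_UMI combination from each read header with awk. Selecting header lines by position (NR % 4 == 1) rather than by a leading @ avoids picking up quality lines that happen to start with @.

$ awk -F "_| " 'NR % 4 == 1 {print $2"_"$3}' test.fq
CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT
CTCTCCTGAACTAACACGCCCATTCACTCT_GCGTTGAT
CTCTCCTGAACTAACACGCCTATTCACTCT_GCGTTGAT

Then pipe the result through sort and uniq -c to count how often each combination occurs:

$ awk -F "_| " 'NR % 4 == 1 {print $2"_"$3}' test.fq | sort | uniq -c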
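
To also bring in the antibody tag from Read 1, the same idea can be extended. The following is only a rough sketch under a few assumptions that are not confirmed in the question: that the Read 1 FASTQ carries the same _BARCODE_UMI suffix in its headers (for example because both mates were passed through umi_tools extract together), that read IDs contain no other underscores, and that the tag is exactly the first 8 nt of the Read 1 sequence. The function name count_tag_umis and the file names are made up for illustration. It reports the number of distinct UMIs per (cell barcode, tag) pair using exact matching only, so barcodes or UMIs that differ by a single sequencing error are counted separately.

```shell
# Hypothetical helper: count distinct UMIs per (cell barcode, antibody tag).
# Assumes headers look like "@READID_BARCODE_UMI [comment]" and that the
# 8-nt tag sits at the very start of the Read 1 sequence.
count_tag_umis() {
    awk -F '_| ' '
        NR % 4 == 1 { bc = $2; umi = $3 }                         # header line
        NR % 4 == 2 { print bc "\t" substr($0, 1, 8) "\t" umi }   # sequence line
    ' "$1" |
    LC_ALL=C sort -u |     # keep each (barcode, tag, UMI) triple once
    cut -f 1,2 |           # drop the UMI column
    uniq -c |              # number of distinct UMIs per (barcode, tag)
    awk '{ print $2 "\t" $3 "\t" $1 }'
}

# Example call (file names are placeholders):
# count_tag_umis read1_extracted.fastq > tag_counts.tsv
```

The output has three tab-separated columns: barcode, tag, and UMI count. If you want a barcode-by-tag matrix, this long format can be reshaped afterwards in R or Python.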