Question

How can I count the number of UMIs and reads at each position after mapping with a reference sequence?

0

Entering edit mode

4.5 years ago

naeem40thju ▴ 10

Hi, I have collected my HTS data (single-end) of E.coli ribosome (full) using the Illumina platform. I found UMI-tools is very interesting and useful. I have used 18nt random barcode at 5'-end for avoiding the read duplication. I want to count the number of UMIs and reads at each position after mapping with a reference sequence. I have read the manual of UMI-tools, but couldn't figure out the solution: can you please suggest me how can I proceed. I'm providing an example showing what is my aim and how much I have understood:

Say, I have extracted the random barcode (18nt) from the 5'- end of each reads at the head ('_' seperated) like below using UMI-tools. Then I'll do mapping with the reference sequence using bowtie -2 . Now, I want to count the number of reads at each position of the reference and the barcodes which were unique to those reads from the SAM/BAM file. That means, I want to get the number of molecules at each position and their UMIs. For example, if I get 100 reads at 15th position and those 100 reads contained 75 types of unique barcodes, e.g., I want to get the number of reads (100) and unique barcodes (75) at each position (here 15th).

@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_CCAGCCCAAAGCCACCCG 1:N:0:NCCACGCG+NGATCTCG ACCGGATGGTAGACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG
+ 
7-<--7--7-7F-----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---

next-gen sequencing • 2.3k views

ADD COMMENT • link 4.5 years ago by naeem40thju ▴ 10

GenoMax · Answer 1 · 2020-06-18

1

Entering edit mode

4.5 years ago

i.sudbery 20k

After alignment, you should be able to deduplicate your BAM file using umi_tools dedup.

Can should then be able to use something like bedtools genomeCoverage -5 to calculate the mapping depth (and therefore the number of unique UMIs) at each position.

ADD COMMENT • link 4.5 years ago by i.sudbery 20k

0

Entering edit mode

Sorry, you need a -bga on that command as well.

(so bedtools genomecov -5 -bga) The output will then be:

Chromsome (as in E. coli I'm guess the chromosome labeled "genome" is the main E. coli circular genome"
Start Coordinate
End Coordinate (always start coordinate + 1)
Number of reads that start at this position.

because we have deduplicated the bam, so that only one copy of each read starting at a given location and with a particular UMI is kept, the column 4 - the number of reads that start at that position IS the number of UMIs at that position.

ADD REPLY • link 4.5 years ago by i.sudbery 20k

0

Entering edit mode

Thank you very much. I have understood. [Sorry, I have lost my password of naeem40thju and even couldn't retrieve it].

ADD REPLY • link 4.5 years ago by md.solayman • 0

0

Entering edit mode

Hi, after deduplicating I used

bedtools genomecov -5 -bga -ibam xxxx_dedupe.bam

command and I found the output as follows: 1. Chromosome; 2. Start coordinate 3. End coordinate 4. Number of reads/UMIs.

That means at 39th position (BB) there is no UMIs and at 40th position, there are 6 UMIs and so on, isn't it? I am a little confused: In case of all chr (BB, AA, DA), in the beginning, up to 39/38th position there is no aligned reads/UMIs, how is it possible? I have checked the dedupe.bam file, there are aligned reads in 1-37/38th positions. What's wrong here actually? Thanks.

BB  0   39  0
BB  39  40  6
BB  40  41  3
BB  41  42  11
...........................
...........................
AA  0   38  0
AA  38  39  16
AA  39  40  10
AA  40  41  44
............................
...........................
DA  0   38  0
DA  38  39  29
DA  39  40  3
DA  40  41  8
............................
............................

ADD REPLY • link updated 4.5 years ago by GenoMax 148k • written 4.5 years ago by md.solayman • 0

0

Entering edit mode

Thank you very much for your reply. I had taken 25K reads as a sample run. The bedtools genomecov -5 command gives the output as follows: According to bedtools manual:

1st column is chromosome (in case of me, BB: 5S, AA: 16S, DA: 23 ribosomal subunits). I am not sure what does genome means at the bottom. 2n column: depth of coverage (why 0?) 3rd column: number of bases on chromosome

Actually, I wanted to get the number of total UMIs and aligned reads at each position. I apologies, if my enquiry is very ordinary- I'm totally new in analysing high-throughput sequencing data.

BB      0       117     120     0.975
BB      1       2       120     0.0166667
BB      3       1       120     0.00833333
AA      0       1356    1538    0.881665
AA      1       138     1538    0.0897269
AA      2       30      1538    0.0195059
AA      3       7       1538    0.00455137
AA      4       1       1538    0.000650195
AA      5       2       1538    0.00130039
AA      6       1       1538    0.000650195
AA      12      1       1538    0.000650195
AA      15      1       1538    0.000650195
AA      234     1       1538    0.000650195
DA      0       2605    2914    0.89396
DA      1       226     2914    0.0775566
DA      2       61      2914    0.0209334
DA      3       12      2914    0.00411805
DA      4       3       2914    0.00102951
DA      5       1       2914    0.000343171
DA      6       1       2914    0.000343171
DA      7       1       2914    0.000343171
DA      29      1       2914    0.000343171
DA      50      1       2914    0.000343171
DA      56      1       2914    0.000343171
DA      63      1       2914    0.000343171
genome  0       4078    4572    0.891951
genome  1       366     4572    0.0800525
genome  2       91      4572    0.0199038
genome  3       20      4572    0.00437445
genome  4       4       4572    0.000874891
genome  5       3       4572    0.000656168
genome  6       2       4572    0.000437445
genome  7       1       4572    0.000218723
genome  12      1       4572    0.000218723

ADD REPLY • link 4.5 years ago by naeem40thju ▴ 10