Hi, I have collected single-end HTS data from the full E. coli ribosome on the Illumina platform. I find UMI-tools very interesting and useful. I used an 18 nt random barcode at the 5' end to avoid counting duplicated reads. I want to count the number of UMIs and reads at each position after mapping to a reference sequence. I have read the UMI-tools manual but couldn't work out how to do this. Could you please suggest how I should proceed? Below is an example showing my aim and how much I have understood so far:
Say I have used UMI-tools to extract the random barcode (18 nt) from the 5' end of each read and append it to the read header ('_' separated), as below. Then I'll map the reads to the reference sequence using Bowtie 2. Now, from the SAM/BAM file, I want to count the number of reads at each position of the reference and the number of unique barcodes among those reads. That is, I want the number of molecules at each position and their UMIs. For example, if 100 reads map at position 15 and those 100 reads contain 75 distinct barcodes, I want to get both the number of reads (100) and the number of unique barcodes (75) for that position, and likewise for every other position.
@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_CCAGCCCAAAGCCACCCG 1:N:0:NCCACGCG+NGATCTCG
ACCGGATGGTAGACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG
+
7-<--7--7-7F-----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---
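The counting described above (reads and distinct UMIs per position) can be sketched directly on SAM text. This is only a minimal illustration, not how UMI-tools itself works: it assumes the UMI is the part of the read name after the last '_', it counts exact distinct barcodes (UMI-tools dedup by default also merges barcodes that look like sequencing errors of each other, so its numbers can be lower), and the function names are made up for this example.

```python
import re
from collections import defaultdict

# Matches one CIGAR operation, e.g. "10M" -> ("10", "M").
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_pos(pos, cigar, is_reverse):
    """Return the 1-based reference position of the read's 5' end."""
    if not is_reverse:
        return pos
    # For reverse-strand reads the 5' end is the rightmost aligned base,
    # so add the reference length consumed by the alignment.
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos + ref_len - 1


def count_reads_and_umis(sam_lines):
    """Map (reference, strand, 5' position) -> (n_reads, n_distinct_umis)."""
    umis_at = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, ref, pos, cigar = f[0], int(f[1]), f[2], int(f[3]), f[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x904:
            continue
        is_reverse = bool(flag & 0x10)
        umi = name.rsplit("_", 1)[-1]  # assumption: UMI follows the last '_'
        key = (ref, "-" if is_reverse else "+", five_prime_pos(pos, cigar, is_reverse))
        umis_at[key].append(umi)
    return {k: (len(v), len(set(v))) for k, v in umis_at.items()}
```

The input can be any iterable of SAM lines, for example an open SAM file or the text output of samtools view piped in through sys.stdin.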
Sorry, you need a -bga on that command as well (so bedtools genomecov -5 -bga). The output will then be in BedGraph format (chromosome, start, end, count). Because we have deduplicated the BAM, so that only one copy of each read starting at a given location with a particular UMI is kept, column 4 (the number of reads that start at that position) IS the number of UMIs at that position.
Thank you very much, I understand now. [Sorry, I have lost the password for my naeem40thju account and couldn't even retrieve it.]
Hi, after deduplicating I used the command and got output with the following columns: 1. Chromosome; 2. Start coordinate; 3. End coordinate; 4. Number of reads/UMIs.
Does that mean that at position 39 (BB) there are no UMIs, at position 40 there are 6 UMIs, and so on? I am a little confused: for all chromosomes (BB, AA, DA), there are no aligned reads/UMIs at the beginning, up to position 38/39. How is that possible? I have checked the dedupe.bam file, and there are aligned reads at positions 1-37/38. What is actually wrong here? Thanks.
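One thing that helps when reading the four-column output: BedGraph intervals use a 0-based start and an exclusive end, so a line with start 39 and end 40 describes the single base at 1-based position 40. A small sketch for expanding such lines into one row per position follows. The function name and the toy lines in the usage note are invented for illustration (they are not the real output from this thread), and it assumes integer counts, i.e. no -scale option was used.

```python
def bedgraph_to_positions(lines):
    """Yield (chrom, 1-based position, count) for every base with count > 0."""
    for line in lines:
        if not line.strip():  # ignore blank lines
            continue
        chrom, start, end, count = line.split()[:4]
        start, end, count = int(start), int(end), int(count)
        if count == 0:
            continue
        # BedGraph start is 0-based and end is exclusive, so the interval
        # [start, end) covers 1-based positions start+1 .. end.
        for pos in range(start + 1, end + 1):
            yield chrom, pos, count
```

For example, feeding it the hypothetical lines "BB 0 39 0", "BB 39 40 6" and "BB 40 42 2" gives 6 at position 40 and 2 at each of positions 41 and 42, with nothing reported for positions 1 to 39.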
Thank you very much for your reply. I took 25K reads as a sample run. The bedtools genomecov -5 command gives output that, according to the bedtools manual, I read as follows:
1st column: chromosome (in my case, BB: 5S, AA: 16S, DA: 23S ribosomal subunits). I am not sure what "genome" means at the bottom.
2nd column: depth of coverage (why 0?)
3rd column: number of bases on chromosome
Actually, I wanted to get the total number of UMIs and aligned reads at each position. I apologize if my question is very basic; I'm totally new to analysing high-throughput sequencing data.