I am trying to count the number of UMIs associated with reads that contain a particular SNP. That is, I want to count the number of different UMIs associated with reads that do and do not contain the SNP.
I was thinking of first parsing the processed BAM files to extract all reads at a specific site:
samtools view -b -o output.bam input.bam "1:1000-1000"
Then convert output.bam to output.fastq so the reads are easy to parse.
I know there are packages like UMI-tools that can append cell and molecular barcodes to each read line in a FASTQ file. Counting the number of unique UMIs under a specific cell barcode could give me what I want. However, I feel like this is too convoluted. Any recommendations on how to more easily count the number of UMIs which contain a specific SNP? Pysam doesn't seem to have the functionality I'm looking for.
Where are your UMIs located? In a separate file?
UMIs are currently stored in the XM tag in my BAM file.
Hi there. It's not quite clear what you want to do:
You talk about which "UMI"s contain the mutation. Is the mutation you are looking for in the UMI sequence? Or do you mean count the number of UMIs associated with reads that contain a particular SNP?
Assuming the latter, do you want to count the number of different UMIs associated with reads that do and do not contain the SNP, or do you want to count the number of reads with and without the SNP for each UMI?
Sorry for being unclear. I am trying to count the number of UMIs associated with reads that contain a particular SNP. That is, I want to count the number of different UMIs associated with reads that do and do not contain the SNP.
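Given that clarification, one way to do this straight from the BAM, without the FASTQ round trip, might be a pysam pileup at the site. The following is only a minimal sketch, not a tested pipeline. It assumes pysam is installed, the BAM is coordinate-sorted and indexed, the UMI is in the XM tag as stated above, and the SNP is a single-base substitution. The function names, the file name and the alternate base "A" are hypothetical placeholders.

```python
def iter_base_umi(bam_path, chrom, pos, umi_tag="XM"):
    """Yield (base, umi) for every read covering 1-based position `pos`.

    Assumption: the BAM is indexed and UMIs live in `umi_tag`.
    """
    import pysam  # imported here so the counting helper below has no dependencies

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # pysam uses 0-based, half-open coordinates; truncate=True limits the
        # pileup to the requested column only.
        for column in bam.pileup(chrom, pos - 1, pos, truncate=True,
                                 min_base_quality=0, max_depth=1_000_000):
            for pread in column.pileups:
                if pread.is_del or pread.is_refskip:
                    continue  # read spans the site but has no base there
                aln = pread.alignment
                if not aln.has_tag(umi_tag):
                    continue  # skip reads without a UMI tag
                yield aln.query_sequence[pread.query_position], aln.get_tag(umi_tag)


def count_umis_by_allele(base_umi_pairs, alt_base):
    """Count distinct UMIs on reads that do ("alt") and do not ("other") carry alt_base."""
    umis = {"alt": set(), "other": set()}
    for base, umi in base_umi_pairs:
        key = "alt" if base.upper() == alt_base.upper() else "other"
        umis[key].add(umi)
    return {key: len(values) for key, values in umis.items()}


# Hypothetical usage for the site in the samtools command above:
# counts = count_umis_by_allele(iter_base_umi("input.bam", "1", 1000), alt_base="A")
# print(counts)
```

A UMI seen on both alt and non-alt reads is counted in both groups, so the two numbers need not add up to the total number of UMIs at the site. If you need counts per cell, you could key the sets on (cell barcode, UMI) instead, reading the cell barcode from whatever tag your pipeline uses. Note also that, as far as I know, pileup's default filtering skips reads flagged as duplicate, secondary, QC-fail or unmapped, so check that this matches how your BAM was processed.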