Finding # of UMIs with and without specific SNP
1
0
Entering edit mode
5.4 years ago
aalith ▴ 20

I am trying to count the number of UMIs associated with reads that contain a particular SNP. So, count the number of different UMIs associated with reads that do and do not contain the SNP

I was thinking of first parsing through the processed bam files to extract all reads at a specific site

samtools view -b -o output.bam input.bam "1:1000-1000"

Then convert output.bam --> output.fastq to easily parse reads

And I know there are packages like UMI tools that can append cell and molecular barcodes to each read line in a fastq file. Counting the number of unique UMIs under a specific cell barcode could give me what I want. However, I feel like this is too convoluted. Any recommendations of how to more easily count the number of UMIs which contain a specific SNP? Pysam doesn't seem like it has the functionality I'm looking for.

RNA-Seq SNP sequencing • 1.9k views
ADD COMMENT
1
Entering edit mode

Where are your UMI's located? In a separate file?

ADD REPLY
0
Entering edit mode

UMIs are currently denoted by the XM tag in my bam file

ADD REPLY
1
Entering edit mode

Hi there. Its not quite clear what you want to do:

  • You talk about which "UMI"s contain the mutation. Is the mutation you are looking for in the UMI sequence? Or do you mean count the number of UMIs associated with reads that contain a particular SNP?

  • Assuming the above, do you want to count the number different UMIs associated with reads that do and do not contain the SNP, or, do you want to count the number of reads with and without the SNP for each UMI?

ADD REPLY
0
Entering edit mode

Sorry for being unclear - I am trying to count the number of UMIs associated with reads that contain a particular SNP. So, count the number of different UMIs associated with reads that do and do not contain the SNP

ADD REPLY
1
Entering edit mode
5.4 years ago

There are several ways you might proceed here.

The most obvious way to proceed would be to extract the reads in a region around your position (i might be tempted to extract the whole chr, as you want both the read1s and the read2 for each pair even if only one overlaps the position in question).

You could then apply umi_tools dedup to the resulting BAM file, specificying the UMI location as being in the XM tag.

Then simply use samtools mpileup to get the read-depth of each allele at the location.

The problem with this approach is that it assumes that the sequence of all reads with the same UMI is the same: When you do dedup one read to represent each UMI is selected (the one with the highest MAPQ, so in this case it would probably be the one with the reference sequence).

The way around this would be to use group instead of dedup. Instead of selecting one read per UMI family, group adds a tag to the BAM containing the UMI family the read belongs to (affectively an error corrected UMI).

In pysam you could then create a set for each genotype, iterate over reads that overlap your site of interest, retrieve the genotype from the read sequence (probably with get_aligned_pairs(with_seq=True)), get the UMI group with get_tag and add the UMI to the appropriate set based on the genotype of the read. At the end you could count the length of each set to retrieve the UMI counts.

ADD COMMENT
0
Entering edit mode

this is unbelievably helpful, thank you!

ADD REPLY
0
Entering edit mode

So I'll have to use fetch instead of pileup so i can hold on to the tags?

ADD REPLY
0
Entering edit mode

Sorry for editing this so much, but having another bit of trouble..

To see the nucleotide at the specific point of interest, I did the following - subset to only chromosome 3 at position 50093871 (for reads in bamfile.fetch('3', 50093871, 50093872)) - Find the specific nucleotide by reads.seq[50093871 - reads.reference_start]

But a few reads, when subtracting their start site from my site of interest, are >70,000 positions away. So reads.seq runs into trouble when indexing. I've checked if they're unmapped, paired, unpaired, etc, but I can't find why they're different than the reads with a more reasonable start position

ADD REPLY

Login before adding your answer.

Traffic: 1987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6