Question

Quick questions about Alevin and AlevinQC

3.3 years ago

rohitsatyam102 ▴ 920

I have just started using Alevin and have a few basic questions. I am asking here because I couldn't find the answers in the documentation.

How can I tell how many reads map uniquely and how many are multi-mapping when quantifying with Alevin?
The alevin_meta_info.json file reports a number of mapped reads. Does this count uniquely mapping plus multi-mapping reads, or only uniquely mapping reads?
For all of my samples the library type could not be determined; an example log is shown below. I used ISR since the data was generated with Chromium 3' V3 chemistry.

    "expected_format": "ISR",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 11147644,
    "num_assigned_fragments": 11147644,
    "num_frags_with_concordant_consistent_mappings": 0,
    "num_frags_with_inconsistent_or_orphan_mappings": 15681780,
    "strand_mapping_bias": 0.0,
    "MSF": 0,
    "OSF": 0,
    "ISF": 0,
    "MSR": 0,
    "OSR": 0,
    "ISR": 0,
    "SF": 0,
    "SR": 0,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0

When I run AlevinQC, why is the number of used reads (which the Alevin documentation describes as the reads used for quantification) greater than the number of mapped reads?

I am using the following command

~/salmon-1.5.2_linux_x86_64/bin/salmon alevin -l ISR -1 R1_001.fastq.gz -2 R2_001.fastq.gz --chromiumV3 --dumpFeatures --dumpMtx --expectCells 200 -i salmon_index/ -p 20 -o sample1 --keepCBFraction 1 --numCellBootstraps 100 --tgMap tx2gene.txt &

When I set --keepCBFraction 1, does setting --expectCells 200 make sense? I am asking because I saw no change in the mapping percentage of reads or in the cell barcodes being thrown out when I use both.

alevinqc alevin • 1.5k views

ADD COMMENT • link updated 3.3 years ago by ATpoint 86k • written 3.3 years ago by rohitsatyam102 ▴ 920

When I run AlevinQC, why is the number of used reads (which the Alevin documentation describes as the reads used for quantification) greater than the number of mapped reads?

Is this surprising? The "used reads" are the total reads minus those whose CB/UMIs are noise and are therefore thrown away. Of these, only a fraction (19.62%) can be mapped. Are you sure you want to use an explicit estimate of the expected cells? 200 is very low for a 10X experiment. Alevin will try to estimate the number of cells in a data-driven fashion.

ADD REPLY • link 3.3 years ago by ATpoint 86k

Hi ATpoint, thanks for taking the time to answer. My major concern is this: if only 11,147,644 reads mapped successfully, how can Alevin be using 56,817,566 reads? That is more than the number of mapped reads; does it include multi-mapping reads? If so, are 56,817,566 - 11,147,644 = 45,669,922 reads multi-mapping? That would mean the "Number of mapped reads" statistic counts only uniquely mapping reads, right?

The reason I am using an explicit cutoff of cells (200) is that, for the dataset I am working on, it's hard to find the knee point in the waterfall plot. When I run the barcodeRanks function I get an error which is discussed here. Do you have any recommendations for dealing with this sort of data?

ADD REPLY • link 3.3 years ago by rohitsatyam102 ▴ 920

From my understanding, Alevin first throws away noisy barcodes and then uses all remaining reads for CB detection, that is, to identify cells. The mates of these reads (R2, which is the cDNA; R1 holds the CB/UMI) are then aligned to and quantified against the reference. So it uses R1 (all of it minus the noisy reads) for cell assignment, and then the mates in R2 for alignment. I do not think that Alevin/Salmon has the concept of "unique" mapping found in traditional alignment, because it assigns reads that map similarly well to multiple transcripts to all potential origins rather than to just one (you will have to check the salmon paper; I do not recall the exact procedure, so I will not guess here). The number that is reported is, to my understanding, all reads that contributed to the quantification results, so all reads minus the strictly unmapped ones.

No, sorry, I have not had datasets where the knee was hard to find. Yes, if you know how many cells were used for the library then using this parameter might make sense, but this is just a guess.
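
As a rough check of this reading, the two counts quoted in this thread can be compared directly. The sketch below is plain arithmetic in Python; the variable names are made up for illustration, and treating the difference as "used but not mapped" reads is only the interpretation given in this reply, not something confirmed from the Alevin documentation.

```python
# Counts quoted in this thread (variable names are illustrative only).
used_reads = 56_817_566    # reads Alevin reports as used, per the follow-up question
mapped_reads = 11_147_644  # reads reported as mapped (also num_assigned_fragments in the log)

# Fraction of used reads that could be mapped to the reference.
mapping_rate = mapped_reads / used_reads

# Under the interpretation in this reply, the remainder are used reads that
# did not map at all, rather than multi-mapping reads.
not_mapped = used_reads - mapped_reads

print(f"mapping rate: {mapping_rate:.2%}")
print(f"used but not mapped: {not_mapped:,}")
```

The ratio comes out at about 19.62%, which matches the mapping percentage mentioned earlier in the thread, so the two numbers are consistent with "mapped" being a subset of "used" rather than the other way round.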

ADD REPLY • link 3.3 years ago by ATpoint 86k