What's the best method to normalize TCR repertoire data (for comparison between samples) when raw counts and count frequencies are already available? I already tried counts per million (CPM), but it might exaggerate the count of a clone, so the real picture won't be evident. Count proportion (in %) is essentially the same as CPM (per 100 instead of per million). I was wondering whether there is something similar to, if not the same as, TPM (transcripts per million), but for T cell clonotypes. Any help appreciated.
One of the most commonly accepted methods is to normalize the data using UMIs. If you used the Takara SMARTer a/b TCR kit for human data with UMIs, you can do that. Otherwise, you can downsample every sample to the same number of randomly selected reads, or keep only the top most abundant clonotypes by weight (number of reads).
Also, it's pretty easy to use MiXCR for Takara kits; there are specific commands available for every Takara kit, e.g.:
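As a rough illustration of the two non-UMI options above, here is a minimal Python sketch. The function names `downsample_counts` and `top_clonotypes` and the read counts are made up for this example and are not part of any TCR tool. Random downsampling is modelled as drawing reads without replacement, which is one reasonable reading of "randomly selected reads".

```python
import numpy as np

def downsample_counts(counts, target_reads, seed=0):
    """Draw `target_reads` reads without replacement from a clonotype count vector."""
    counts = np.asarray(counts, dtype=np.int64)
    if target_reads > counts.sum():
        raise ValueError("target_reads exceeds the sample's total read count")
    rng = np.random.default_rng(seed)
    # Each read is equally likely to be drawn, so large clones keep more reads.
    return rng.multivariate_hypergeometric(counts, target_reads)

def top_clonotypes(counts, n_top):
    """Keep the n_top most abundant clonotypes (by read count) and zero the rest."""
    counts = np.asarray(counts, dtype=np.int64)
    keep = np.argsort(counts)[::-1][:n_top]
    out = np.zeros_like(counts)
    out[keep] = counts[keep]
    return out

# Hypothetical read counts per clonotype for two samples of different depth.
sample_a = [5000, 3000, 1500, 400, 100]
sample_b = [900, 50, 30, 15, 5]

# Use the smaller library as the common depth.
depth = min(sum(sample_a), sum(sample_b))
a_ds = downsample_counts(sample_a, depth)
b_ds = downsample_counts(sample_b, depth)
print(a_ds, b_ds)
```

After this step both samples contain the same number of reads, so clonotype counts and diversity estimates are no longer driven by sequencing depth alone. Rare clonotypes may drop out of the deeper sample, which is expected.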
What type of data did you obtain? Is this from single cells? Bulk? What do your counts represent - reads? Cells?
We have bulk sequencing data. Libraries were prepared with the SMARTer a/b TCR kit and sequenced on a MiSeq. Counts represent reads of a particular clonotype in the CDR3 region.
Great! And the goal of your analysis is to see whether a certain clonotype is more abundant in one sample than in the other?
Can you also elaborate on why you think CPMs won't do the job? It may help me understand the issue a bit better. :)
Yes, we want to know the relative abundance between samples.
CPM rescales the counts linearly. If a clone saturates after a certain number of reads, that won't be evident from CPM, so CPM normalization may exaggerate its count. Basically, we don't know how the counts of individual clonotypes grow with the total read count. If the relationship is highly non-linear, CPM won't work.
What do you mean by "saturation after certain reads"? Do you mean that one drastically expanding clonotype will scavenge reads away from the other, rarer clonotypes? That's definitely a possibility. You could add the number of clonotypes per sample as a denominator.
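To make the two points in this exchange concrete, here is a small Python sketch with made-up read counts (the helper names `cpm` and `percent` are just for illustration). It shows that CPM and percentage differ only by a constant factor, and that a single expanded clonotype lowers the CPM of every other clonotype even when their raw read counts are unchanged.

```python
def cpm(counts):
    """Counts per million: each count divided by the sample total, times 1e6."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]

def percent(counts):
    """Count proportion in percent: the same scaling with 100 instead of 1e6."""
    total = sum(counts)
    return [c * 100 / total for c in counts]

# Hypothetical sample: CPM is always 10,000 times the percentage.
counts = [600, 300, 100]
print(cpm(counts))
print(percent(counts))

# Hypothetical pair of samples: only the first clonotype differs in raw reads.
baseline = [100, 100, 100, 100]
expanded = [700, 100, 100, 100]
print(cpm(baseline)[1])  # CPM of an unchanged clonotype in the baseline sample
print(cpm(expanded)[1])  # lower CPM for the same raw count of 100 reads
```

Both measures are compositional: they describe shares of the sequenced reads, not absolute cell numbers, so a change in one clonotype shifts the values of all others.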
You can downsample and then calculate entropy or Gini-index diversity metrics.
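A minimal sketch of such metrics in Python, meant to be applied to counts that have already been downsampled to a common depth. "Gini index" is ambiguous here: it can mean the Gini-Simpson diversity index or the Gini inequality coefficient, so both are shown as assumptions about what was intended. The function names and example counts are hypothetical.

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (natural log) of the clonotype frequency distribution."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def gini_simpson(counts):
    """Gini-Simpson index: probability that two random reads differ in clonotype."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_coefficient(counts):
    """Gini inequality coefficient: 0 for a perfectly even repertoire."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)

# Hypothetical downsampled read counts (same total depth in both samples).
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

for name, sample in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, shannon_entropy(sample), gini_simpson(sample), gini_coefficient(sample))
```

The even sample gives higher entropy and Gini-Simpson values and a Gini coefficient of zero, while the sample dominated by one clonotype gives lower diversity and a higher Gini coefficient.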
Hi, I have a question regarding this too. I have samples from animals of different ages, so some have small organs and some have big organs. I find it difficult to decide on a normalization method that takes this into account in a fair manner.
Small/young animals might show 5 clonotypes while big animals can show 500 clonotypes in the same amount of RNA used (1 ug). I believe it is okay to use the observed diversity to compare between samples, since the starting material is the same, but then 1 ug from the small tissue might correspond to 100% of the tissue, while 1 ug from the big tissue might be only 10% of it.
How would you compare diversity accurately when there are such differences in total tissue size versus the portion used for sequencing?
Also, how would you best compare repertoires in this case (overlapping samples, etc.)?