What's the best method to normalize TCR repertoire data for comparison between samples? The data already has raw counts and count frequencies. I already tried counts per million (CPM), but it might exaggerate the count of a clone, so the real picture won't be evident. Count proportion (in %) is essentially the same as CPM (per 100 instead of per million). I was wondering whether there is something similar to, if not the same as, TPM (transcripts per million), but for T cell clonotypes. Any help appreciated.
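To make the point about CPM and percentages concrete, here is a minimal sketch with made-up clonotype read counts (the numbers are hypothetical, not from the data in question). It only shows that both are the same linear rescaling of the raw counts and differ by a constant factor of 10,000.

```python
import numpy as np

# Hypothetical read counts for five clonotypes in one sample
counts = np.array([500, 300, 150, 40, 10])
total = counts.sum()

cpm = counts / total * 1e6      # counts per million reads
percent = counts / total * 100  # count proportion in %

# Both are linear rescalings of the same counts, so their ratio is constant
ratio = cpm / percent
print(ratio)
```

Because the ratio is constant, switching between the two cannot change the relative ranking or the fold differences between clonotypes.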
One of the most commonly accepted methods is to normalize the data using UMIs. If you use the Takara SMARTer a/b TCR kit for human data with UMIs, you can do that. Otherwise you can downsample every sample to the same number of randomly selected reads, or to the top most abundant clonotypes by weight (number of reads).
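The read-level downsampling mentioned above could be sketched roughly as follows. This assumes each sample is just a vector of per-clonotype read counts; the function names `downsample_counts` and `top_clonotypes` and the example counts are hypothetical, and this is not what any particular tool does internally.

```python
import numpy as np

def downsample_counts(counts, depth, seed=0):
    """Draw `depth` reads at random, without replacement, from a count vector."""
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum():
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = np.random.default_rng(seed)
    # Multivariate hypergeometric sampling is equivalent to picking reads
    # one by one without replacement and recounting them per clonotype.
    return rng.multivariate_hypergeometric(counts, depth)

def top_clonotypes(counts, n):
    """Keep only the n most abundant clonotypes (by read count)."""
    return np.sort(np.asarray(counts))[::-1][:n]

# Hypothetical samples with very different sequencing depth
sample_a = [500, 300, 150, 40, 10]
sample_b = [5000, 200, 100, 50, 25, 10, 5]

# Downsample both to the depth of the shallowest sample
depth = min(sum(sample_a), sum(sample_b))
down_a = downsample_counts(sample_a, depth)
down_b = downsample_counts(sample_b, depth)
top_b = top_clonotypes(sample_b, 3)
```

After this step both samples contain the same number of reads, so clonotype counts and diversity metrics are compared at equal depth. Rare clonotypes may drop to zero in the deeper sample, which is expected.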
Also, it's pretty easy to use MiXCR for Takara kits; there are specific commands available for every Takara kit, e.g.:
What type of data did you obtain? Is this from single cells? Bulk? What do your counts represent - reads? Cells?
We have bulk sequencing data. Libraries were prepared with the SMARTer a/b TCR kit and sequenced on a MiSeq. Counts represent the reads of a particular clonotype in the CDR3 region.
Great! And the goal of your analysis is to see whether a certain clonotype is more abundant in one sample than in the other?
Can you also elaborate on why you think CPMs won't do the job? It may help me understand the issue a bit better. :)
Yes, we want to know the relative abundance between samples.
CPM rescales the counts linearly. If a clone saturates after a certain number of reads, that won't be evident from CPM, so CPM normalization may exaggerate the count. Basically, we don't know how the counts of individual clonotypes grow with read depth. If the relationship is highly non-linear, CPM won't work.
What do you mean by "saturation after certain reads"? Do you mean that one drastically expanding clonotype will scavenge reads away from the other, rarer clonotypes? That's definitely a possibility. You could add the number of clonotypes per sample as a denominator.
You can downsample and then calculate diversity metrics such as entropy or the Gini index.
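As a rough sketch of those two metrics, assuming the counts have already been downsampled to the same depth: the function names and example counts below are hypothetical, and "Gini index" is taken here to mean the Gini coefficient of inequality rather than the Gini-Simpson index, which is an assumption.

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (natural log) of the clonotype frequency distribution."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # drop zeros: 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

def gini_coefficient(counts):
    """Gini coefficient: 0 = all clonotypes equally abundant, near 1 = one dominates."""
    x = np.sort(np.asarray(counts, dtype=float))  # ascending order
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)

# Hypothetical downsampled samples, both with 1000 reads
even_sample = [250, 250, 250, 250]
skewed_sample = [970, 10, 10, 10]

for name, sample in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, shannon_entropy(sample), gini_coefficient(sample))
```

A sample dominated by one expanded clone gives lower entropy and a higher Gini coefficient than an even one, so both metrics summarize clonality in a single number per sample.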