How to identify CDR regions in an antibody sequence
2.4 years ago
reany ▴ 50

I want to extract the CDR regions from an antibody sequence, either raw or already numbered.

SCALOP misses H-CDR3, so is there another tool that can identify the CDR regions? Alternatively, after numbering the antibody sequence with ANARCI, is it correct to extract the CDRs simply by taking the positions given in the CDR definitions?

I am looking for stand-alone tools, or the correct rules, for extracting CDRs after numbering an antibody sequence. I am not sure the position definitions alone are enough to do this.

Any other suggestions would be appreciated.

CDR antibody • 8.6k views
ADD COMMENT
0

What species are you analyzing? IMGT is a great resource.

ADD REPLY
1

Sorry that the problem was not clearly described. Yes, IMGT is useful, but I am looking for stand-alone tools, or the correct rules, for extracting CDRs after numbering an antibody sequence. I am not sure the definition is enough to do this.

ADD REPLY
4
15 months ago
jlparkinson1 ▴ 40

I'm a little late to this party, so this is probably something you already figured out a long time ago, but for anyone else who's trying to do this, hopefully this helps.

First of all, it's important to realize that what region of the sequence is considered CDR-H3 can vary slightly depending on what antibody numbering system you use, since each has a slightly different definition of the CDRs. There are (remarkably) at least six different numbering systems in the literature, although IMGT, Chothia and Kabat are probably the most common. This can be a little confusing -- this paper provides a really nice overview of the different systems, the differences between each and how to convert from one to the other. If in doubt, I would just use IMGT. Again, the differences will be slight. But it's still good to be aware that how you define CDR3 will be slightly different if you use say Chothia rather than IMGT.

With that said...MIXCR is a nice program, but it's designed for analysis of sequencing data and is overkill for a situation where you have what are already protein sequences and don't need to do all the preprocessing that raw sequencing data will require. You could use the ANARCI, AbNum or AbRSA online tools, but for a single sequence I would just use the tool on the IMGT website, which will give you both the sequence numbering and some other useful information.

If you have say thousands of protein sequences, you can use the ANARCI tool, but I've found it to be quite slow for any large number of sequences; it does some rather inefficient things like loading all of the sequences from a fasta file into memory all at once, then writing them back to disk to query against profiles for all species and chains in HMMer (even if the species & chain type are already known), then if they are single domain sequences writing them back to disk AGAIN to query HMMer again, then doing lots of corrections to the HMMer alignment, etc. (Shameless self-promotion here -- I've written my own tool which is 50 - 100x faster than ANARCI depending on settings. The strategy I've used is pretty similar to the one in the AbRSA paper, but with some tweaks. The AbRSA tool is also available and is another you could use if you have lots of sequences.)

Once you've numbered your sequence, e.g. using the IMGT website, look up which position numbers correspond to CDR H3 in the numbering scheme you used; the residues at those positions in your numbered sequence are the CDR. If you use IMGT, this page will give you the IMGT position numbers corresponding to CDR H3. If by contrast you use Chothia or Kabat, this page has a nice overview of which position numbers correspond to CDR H3 in those schemes.

Long story short: 1) choose a numbering scheme (I would suggest defaulting to IMGT if you're not sure); 2) number your sequence(s) (if you only have one as it sounds like, perhaps just use the IMGT tool); 3) look up the position numbers in the numbering scheme that you're using that correspond to CDR H3 and there you go.
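For anyone who wants step 3 as code, here is a minimal Python sketch, assuming the sequence has already been numbered in the IMGT scheme. The input layout, a list of ((position, insertion_code), residue) tuples with "-" for gaps, is an assumption chosen to resemble the kind of output ANARCI produces; adapt it to whatever your numbering tool gives you. The function name and the toy data are made up for illustration. The ranges are the standard IMGT CDR definitions (CDR1 27-38, CDR2 56-65, CDR3 105-117).

```python
# IMGT CDR boundaries as inclusive IMGT position numbers.
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def extract_cdrs(numbered, cdr_ranges=IMGT_CDRS):
    """Collect CDR residues from an IMGT-numbered sequence.

    numbered: list of ((position, insertion_code), residue) tuples in
    sequence order (assumed layout); gap positions carry "-".
    """
    cdrs = {}
    for name, (start, end) in cdr_ranges.items():
        cdrs[name] = "".join(
            aa
            for (pos, _ins), aa in numbered
            # Insertions such as 111A share the integer position 111,
            # so they are kept automatically; gaps are dropped.
            if start <= pos <= end and aa != "-"
        )
    return cdrs


# Toy fragment around CDR3 only (hypothetical data, not a real antibody).
toy = [
    ((104, " "), "C"), ((105, " "), "A"), ((106, " "), "R"),
    ((107, " "), "G"), ((108, " "), "-"), ((111, " "), "Y"),
    ((111, "A"), "S"), ((112, "A"), "T"), ((112, " "), "F"),
    ((116, " "), "D"), ((117, " "), "Y"), ((118, " "), "W"),
]
print(extract_cdrs(toy)["CDR3"])
```

Because the lookup is by position number rather than by index in the raw sequence, it does not care how long the CDR is. For Chothia or Kabat you would pass a different `cdr_ranges` dictionary taken from the scheme's own definition.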

ADD COMMENT
1
2.4 years ago
Jeremy ▴ 930

I would try MIXCR. You can look for other tools on b-t.cr under the Software category. If your main focus is antibodies, you could also consider joining the AIRR Slack channel and posting a question there.

CDR H3 starts one amino acid after the conserved cysteine at the end of the VH region and ends 1 amino acid before the conserved tryptophan at the beginning of the JH region. (See Figure 5 in the paper below.)

Antibody Diversity Paper
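The cysteine/tryptophan rule above can be sketched in Python roughly as follows. This is only a heuristic, not how any particular tool works: it assumes the heavy-chain FR4 starts with the common W-G-x-G motif, and it takes the last cysteine before that tryptophan as the conserved one, which will be wrong if the CDR H3 itself contains a cysteine. The function name and the toy sequence are hypothetical.

```python
import re

# Assumption: heavy-chain FR4 begins with W-G-x-G (the W is the
# conserved tryptophan that follows CDR H3).
FR4_MOTIF = re.compile(r"WG.G")


def find_cdrh3(vh_seq):
    """Return (start, end, cdr3) using 0-based, end-exclusive indices.

    Returns None when no W-G-x-G motif or no preceding cysteine is found.
    """
    seq = vh_seq.upper()
    matches = list(FR4_MOTIF.finditer(seq))
    if not matches:
        return None
    # Use the last match so a W-G-x-G-like stretch earlier in the
    # sequence is not mistaken for FR4.
    w_idx = matches[-1].start()
    # Nearest cysteine upstream of the tryptophan (heuristic).
    c_idx = seq.rfind("C", 0, w_idx)
    if c_idx == -1:
        return None
    # CDR H3 starts one residue after the C and ends one before the W.
    return c_idx + 1, w_idx, seq[c_idx + 1:w_idx]


# Toy sequence assembled from labelled pieces (not a real antibody).
toy_vh = "EVQLSC" + "AAS" + "DTAVYYC" + "ARDYYGSSYFDY" + "WGQGTLVTVSS"
print(find_cdrh3(toy_vh))
```

Anchoring on the two conserved residues rather than on fixed offsets means the CDR length can vary freely, but for anything beyond a quick check a proper numbering tool is the safer route.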

ADD COMMENT
1

MIXCR seems suited to massively parallel sequencing data, while my input is relatively simple: for example, just the protein sequence of a light chain. I'm still going to look into MIXCR and hopefully learn something from it. The other suggestions are also useful, thanks very much.

ADD REPLY
0

If you can provide an example sequence, I can probably write some R code to detect CDR H3.

ADD REPLY
0

Thanks so much. The method for identifying CDRs is my focus. Maybe I can get the CDR regions with some simple code like [residue for residue, index in numbered_residues if CDR_start_index <= index <= CDR_end_index]

ADD REPLY
0

Right. For the start, you'll want to count from the beginning, and for the end, you'll want to count backwards from the end.

ADD REPLY
0

I've made an app for extracting CDR H3 from an amino acid sequence:

CDRH3finder

ADD REPLY
1

Neat! What's your method under the hood? It looks like it starts a few AA too far in for a couple human sequences I tried, though it got the end position correct.

ADD REPLY
0

For human, I'm going from position 98 to -11 of the input sequence. It worked for some human sequences I found on NCBI, but my expertise is really cow antibodies. Do you know what germline VH gene your sequences were or if there were any insertions in CDR1 or CDR2? I might need to re-think the human option. Thanks for your feedback!

ADD REPLY
0

IMGT's IGHV4-34*02, and no insertions, but thanks to the randomness of VDJ recombination and nontemplated nucleotides you can end up with CDR3 positions all over the place even without later insertions/deletions, so you can't rely on position in the sequence alone.

It's a tricky problem, especially with just AA versus NT. igblastn requires a table of positions in each J gene to figure out the CDR3 end position (and from what reany said in another comment that's apparently not totally reliable in igblastp). I'm not sure what it's doing for the start position, but I'd bet it's looking for the conserved C late in the V gene that should come just before CDRH3. All my experience is with human and rhesus macaque so I can't say how cow might differ though.

ADD REPLY
0

Cows are a little simpler because in the germline sequence, the conserved cysteine is always in the same position with respect to the beginning of the VH gene and the conserved tryptophan is always in the same position with respect to the end of the JH gene, but indels could throw that off.

ADD REPLY
1

The AIRR folks are on slack? I had no idea. Thanks for mentioning that.

ADD REPLY
0
2.4 years ago
Jesse ▴ 850

I generally just use IgBLAST when I need antibody sequence annotations like CDR3. There's a command-line version and you can have arbitrary species and gene references (though in that case you need to jump through some hoops and create an "auxiliary data file" to get it to report the CDR3 info). It can also give AIRR-compatible TSV output so you can extract sequences directly from the specific columns you want, like cdr3, without worrying about a particular numbering scheme (e.g. kabat) and extracting subsequences yourself, but there are a ton of columns with position info as well. The web interface uses IMGT references by default, too. IMGT's own V-QUEST tool can give similar info, but is web-only and I don't believe supports custom references. There are a bunch of tools out there for bulk data but IgBLAST scales up and down nicely to even just one or a few sequences.

Also I'd reiterate what Jeremy said about the junction (where the sequence for the conserved amino acids at each end are included) versus the CDR3 (which leaves out those bits). Some texts jumble those definitions and that's tripped me up before.
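As a rough illustration of pulling CDR3 out of AIRR-format TSV output (which, as far as I know, igblastn writes with `-outfmt 19`), here is a small Python sketch. It assumes the file has the AIRR rearrangement columns `sequence_id`, `junction_aa` and `cdr3_aa`; the function name and the sample rows are made up. When `cdr3_aa` is empty it falls back to trimming the two conserved anchor residues off `junction_aa`, which is exactly the junction versus CDR3 difference described above.

```python
import csv
import io


def cdr3_from_airr(handle):
    """Yield (sequence_id, cdr3_aa) pairs from an AIRR rearrangement TSV."""
    for row in csv.DictReader(handle, delimiter="\t"):
        cdr3 = row.get("cdr3_aa") or ""
        if not cdr3:
            junction = row.get("junction_aa") or ""
            # The junction includes the conserved C at the start and the
            # conserved W/F at the end; CDR3 is what lies between them.
            cdr3 = junction[1:-1]
        yield row["sequence_id"], cdr3


# Hypothetical two-row table standing in for a real output file.
sample_tsv = (
    "sequence_id\tjunction_aa\tcdr3_aa\n"
    "seq1\tCARDYW\tARDY\n"
    "seq2\tCAKGW\t\n"
)
for seq_id, cdr3 in cdr3_from_airr(io.StringIO(sample_tsv)):
    print(seq_id, cdr3)
```

With a real file you would pass `open("igblast_output.tsv", newline="")` (a placeholder file name) instead of the `io.StringIO` object.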

ADD COMMENT
0

For igblastp, auxiliary_data is not supported for getting an accurate CDR3 alignment, although it will sometimes report H-CDR3. Is delimiting H-CDR3 a challenge for existing tools, or are there other reasons?

ADD REPLY
0

Oh, sorry, I haven't tried it for amino acid sequences so I don't know about igblastp's behavior. I'm a bit surprised any of these tools would have much trouble labeling CDR3 though I suppose there's less to go by with just the amino acid sequence. The IgBLAST docs say igblastp doesn't search D and J which matches what you're saying. Do you have nucleotide sequence you could use too, or just amino acid?

ADD REPLY
0

Only the protein sequence is available. By changing some settings, I finally got SCALOP to report H-CDR3, although the result may be inaccurate. Thanks anyway.

ADD REPLY
0

Could you share how you adjusted SCALOP settings to report H-CDR3 with only protein sequences available? I'm struggling to get beyond H1 and H2.

ADD REPLY
