Question

Tool:MIGEC: towards error-free profiling of immune repertoires

6

Entering edit mode

10.9 years ago

mikhail.shugay 3.5k

Dear fellow bioinformaticians,

I would like to announce our recently published protocol and a software pipeline that allows to greatly reduce NGS and PCR error rates in immune repertoire sequencing data, without loss of real repertoire diversity [ http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2960.html ].

One of the main applications of immune repertoire sequencing is to identify the sequences of CDR3 regions of immune genes (T-cell receptors and immunoglobulins) which are the key determinants of antigen specificity. Those regions are highly variable and are only partially encoded in germline sequence, making it extremely hard to distinguish an error from a de-novo variant in them. Moreover, CDR regions of immunoglobulin genes are the hotspots of somatic hypermutations (SHM), with the patterns of SHM and PCR errors being very similar. Therefore the issue of PCR/NGS error filtration is of a great importance for immune repertoire sequencing data analysis.

Briefly, our library preparation method and analysis relies on unique molecular identifier tags (UMIs, random 12-mers), introduced on the stage of cDNA synthesis that allows to trace individual molecules through all stages of library preparation. The analysis pipeline, written in Groovy (Java scripting language) and distributed as a jar executable allows for

De-multiplexing of sequencing data, read overlapping and adapter trimming
Assembly of consensuses for reads grouped by their molecular identifier (aka MIGs)
Extraction of CDR3 regions for immune receptor genes and annotation of Variable (V), Diversity (D) and Joining (J) segments
Advanced filtering of hot-spot errors that couldn't be eliminated by consensus assembly

While the pipeline was optimized for immune repertoire libraries with unique molecular identifiers, several tools embedded in it could be useful for a wider audience, e.g.

Fast de-multiplexing tool is suitable for any type of NGS data
Consensus assembly could be used for any library prepared using unique molecular identifiers, given sufficient over-sequencing (number of reads in MIG) is reached
Fast and accurate BLAST-based CDR3 extraction and clonotype assembly for TCR Gamma/Delta chains and Immunoglobulin Heavy/Light/Kappa chains for all kinds of immune repertoire libraries. To the best of our knowledge there is currently no software tool that is able to accurately process with large amounts of NGS data for those genes

More information and download links to the software: http://milaboratory.com/projects/migec-project/

GitHub repository: https://github.com/mikessh/migec

Update 18-02-15 (v1.2.0, latest);Major update: implemented the annotation of Diversity (D) segment of immune receptors, added support for minor alleles of V/D/J segments, significant speedup for CDR3 extraction algorithm, minor bug fixes. See https://github.com/mikessh/migec/releases/tag/v1.2.0

Update 02-09-14 Significant pipeline improvements based on the experience from processing a dozen of ours and collaborators' repertoire sequencing datasets: improved read overlapping in Checkout, more flexible options for MIG consensus assembly, correct handling of multiple fastq files by CdrBlast, more extensive reporting in log files. Added duck and chicken TRB references kindly provided by ZeZhong Li (those are completely missing in IMGT).

Update 07-06-14 Added species support for Macaca mulatta, Oryctolagus cuniculus, Rattus norvegicus, Canis lupus familiaris, Sus scrofa, Bos taurus, Mus spretus based on IMGT. Corresponding reference files are here: https://github.com/mikessh/migec/tree/master/src/main/resources.

sequencing-error tcr antibody molecular-barcode • 12k views

ADD COMMENT • link updated 21 months ago by Ram 45k • written 10.9 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Update (from v1.0.2 to v1.0.6)

added some useful features, such as filtering erroneous UMIs tags (could be very useful for data with low over-sequencing), adapter trimming. V segments library was extended

ADD REPLY • link 10.8 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Major update (v1.1.0):

Batch operations, that allow executing the whole pipeline in several lines by utilizing provided sample metadata and the metadata generated on various stages of pipeline execution. Batch operations also provide flexible parameter setting, either through options or through metadata tables.

Several stages of pipeline, such as consensus assembly and hot-spot error filtering were further optimized by including heuristics.

Several minor bug fixes and performance improvements

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 10.8 years ago by mikhail.shugay 3.5k

1

Entering edit mode

Is this software limited to human and mouse like MiTCR? Can the source of sequencing material be something other than blood, say a spleen?

ADD REPLY • link 10.8 years ago by pld 5.1k

0

Entering edit mode

Right now yes, but here the segments library could be modified far easier than in MiTCR. Please tell me which species you're interested in. And of course I'll not be able to help if those species are not present in IMGT, unless you'll provide me V and J segment sequences with marked Cys and Try/Phe reference points.

I don't see a problem for using any cell source given you'll preparing your library by amplifying TCR/IG gene and your library is going to contain CDR3 regions. You can always collect cells, sort them and multiplex them using barcodes..

ADD REPLY • link 10.8 years ago by mikhail.shugay 3.5k

0

Entering edit mode

http://www.imgt.org/IMGTrepertoire/Taxonomy/Taxons.php

Assuming this is the correct place to look: Grivet, Macaca mulatta, Macaca fascicularis, any of the ruminants (alpaca, goat, camel, etc).

ADD REPLY • link 10.8 years ago by pld 5.1k

0

Entering edit mode

Actually no, the site is quite confusing, and the correct place is Table 7 here. For your gene of interest, say TRA, you need that both V and J segments (TRAV and TRAJ) be present for your species of interest. So for example I could add IGH for Llama and TRA+TRB for nonhuman primates.

I would appreciate it if you'll provide a little bit more concrete gene+species list based on this table. And also try going down by the links and check if there are actually any number of segments for your concrete species of interest.

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 10.8 years ago by mikhail.shugay 3.5k

0

Entering edit mode

I agree, their website is a mess. If that is the case then just the non-human primates.

ADD REPLY • link 10.8 years ago by pld 5.1k

0

Entering edit mode

I've parsed all segments from IMGT. So far only Macaca mulatta seems feasible for V-(D)-J analysis. You can see the results here. Feasible species have "1" in the last column.

It will take me a couple of days to implement it to Migec.

Just of note, I haven't found any software that provides V-(D)-J analysis for species other than mouse, human, rabbit and rat.

Please note that Migec is focused on CDR3 extraction and determination of V/J segments. It doesn't provide CDR1,2 extraction and allele/hypermutation mapping. I'm now customizing IgBlast for this. However, species appear to be hard-coded there, so it seems I'll have to contact IgBlast developers. Also IgBlast is quite slow.

Please also note, that if you're using not targeted sequencing, but smth like RNA-Seq, you should first consider pre-filtering data by mapping them with e.g. bwa-mem to corresponding genomic loci. This will save you lots of time.

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 10.8 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Added support for all species-gene pairs where it is possible based on IMGT data. Binaries are available here.

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 10.8 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Hi, mikhail. I have built a library of monkey TCR and want to use MIGEC to analyse my fasq file. I have few questions.

I built the library by normal ligate P5, P7 adapter with paired barcode up to 96 samples, so no MIGs in my library. Can I use MIGEC analyse my data produced by MiSeq?
My data have been de-multiplexed by MiSeq, can I assemply the demultiplexed file directly? In the Consensus assembly section of the manual, it needs checkout_output_folder/ histogram_output_folder/ output_folder/. If I use demultplexed data, I have NO these folders.
If I want use MIGEC to demultiplex the fasq file of MiSeq (with 12 of i7 and 8 of i5 indices), how to set a barcade file? I don't understand the barcode file in the mauual. There is only one of i7 or i5 index either in read 1 or read 2 in my case. Could you please set an exsample of barcode file for me using i7 and i5 indices? And Could you explain the master and slave barcode sequence in detail? is that mean there are two barcode sequences in a read sequence?

I am a very new to bioinformatics. Many easy things for you experts are difficult for me. I am grateful for your help.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.7 years ago by mcchance ▴ 30

0

Entering edit mode

Hello!

If you don't have any unique molecular identifiers (UMIs), the only routine you need is CdrBlast which performs V-(D)-J mapping and assemble clonotypes.

You should not demultiplex Illumina indices (the instrument should do it for you), this option is for the case when you have a custom adapter sequence (in our protocol it is M1s) which carry sample barcodes. In that case, you just put a set of sample ids, followed by their adapter sequence with barcodes marked in upper-case letters into barcodes.txt. Slave barcode is used for case when barcodes are attached from both sides. It can contain additional sample barcode in order to filter cross-sample contamination and additional set of UMI (N) nucleotides.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.6 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Thank you for your quick response. Do I need overlap two read files before I run CdrBlast? Or CdrBlast can overlap two files into one, then map and assemble?

I ran the script

java -jar migec.jar CdrBlast -S MacacaMulatta -R TRB -N 1000 \
    --all-alleles --no-sort checkout/2008-12-02-D7_S93_L001_R1_001.fastq.gz \
    checkout/2008-12-02-D7_S93_L001_R2_001.fastq.gz cdrblast/2008-12-02-D7_raw.cdrblast.txt

It ran and completed quickly, and showed result

EVENTS (good mapped total): 0 ..., READS (good mapped total): 0...

Is my scripts right?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.6 years ago by mcchance ▴ 30

0

Entering edit mode

It appears that none of 1000 input sequences were mapped. Are you able to map any of those reads manually (http://www.ncbi.nlm.nih.gov/igblast/igblast.cgi?CMD=Web&SEARCH_TYPE=TCR&LINK_LOC=igtab)? Note that reads should contain both V and J segments in the same read.

Overlap can be performed, yet it is meaningful when the expected overlap is high. It can boost the quality of TCR sequences when both reads cover CDR3 region.

ADD REPLY • link 9.6 years ago by mikhail.shugay 3.5k

0

Entering edit mode

You said "overlap can be performed". Do you mean CdrBlast can do it?

Do I need other parameters to tell CdrBlast to performed the overlap?

When overlapping, how the software decide which one of two mismatched nt to be remained? The one with higher quality score?

I have mapped the 1000 input sequences on ncbi link you gave to me. There are at least 328 sequences mapped to the TCR of human (my sequences are monkey's). So I list the complete result below after I ran the CdrBlast for your reference. Could you analyse them for me and find what is wrong with it? Thank you.

E:\2015-08 TCR analysis>java -jar migec.jar CdrBlast -S MacacaMulatta -R TRB out.extendedFrags_1000.fastq cdrblast/out.extendedFrags_1000.fastq.txt
Executing com.milaboratory.migec.CdrBlast -S MacacaMulatta -R TRB out.extendedFrags_1000.fastq cdrblast/out.extendedFrags_1000.fastq.txt
[Wed Aug 19 11:36:44 CST 2015 com.milaboratory.migec.CdrBlast] Loading TRB Variable and Joining segment data
[Wed Aug 19 11:36:44 CST 2015 com.milaboratory.migec.CdrBlast] Generating BLAST database for TRB V
USAGE
  convert2blastmask.exe [-h] [-help] [-in input_file_name]
    [-out output_file_name] [-outfmt output_format] [-parse_seqids]
    -masking_algorithm mask_program_name -masking_options mask_program_options
    [-version]

DESCRIPTION
   Convert masking information in lower-case masked FASTA input to file
   formats suitable for makeblastdb

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value: TCR
USAGE
  makeblastdb.exe [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
    [-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.2.31+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value: TCR
[Wed Aug 19 11:36:45 CST 2015 com.milaboratory.migec.CdrBlast] Generating BLAST database for TRB J
USAGE
  convert2blastmask.exe [-h] [-help] [-in input_file_name]
    [-out output_file_name] [-outfmt output_format] [-parse_seqids]
    -masking_algorithm mask_program_name -masking_options mask_program_options
    [-version]

DESCRIPTION
   Convert masking information in lower-case masked FASTA input to file
   formats suitable for makeblastdb

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value: TCR
USAGE
  makeblastdb.exe [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
    [-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.2.31+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value: TCR
[Wed Aug 19 11:36:45 CST 2015 com.milaboratory.migec.CdrBlast] Reading out.extendedFrags_1000.fastq and generating list of unique sequences to map V and J genes
[Wed Aug 19 11:36:46 CST 2015 com.milaboratory.migec.CdrBlast] Making a temporary FASTA file for BLAST queries
[Wed Aug 19 11:36:46 CST 2015 com.milaboratory.migec.CdrBlast] Pre-aligning TRB V segment with BLAST <4 threads>
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
[Wed Aug 19 11:36:48 CST 2015 com.milaboratory.migec.CdrBlast] Done
[Wed Aug 19 11:36:48 CST 2015 com.milaboratory.migec.CdrBlast] Pre-aligning TRB J segment with BLAST <4 threads>
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
BLAST Database error: No alias or index file found for nucleotide database [E:\2015-08] in search path [E:\2015-08 TCR analysis;;]
[Wed Aug 19 11:36:49 CST 2015 com.milaboratory.migec.CdrBlast] Done
[Wed Aug 19 11:36:49 CST 2015 com.milaboratory.migec.CdrBlast] Reading in BLAST results
[Wed Aug 19 11:36:49 CST 2015 com.milaboratory.migec.CdrBlast] Done 0 V and 0 J mapped, out of total 1000 sequence variants
[Wed Aug 19 11:36:49 CST 2015 com.milaboratory.migec.CdrBlast] Extracting CDRs
[Wed Aug 19 11:36:50 CST 2015 com.milaboratory.migec.CdrBlast] Done. 0 CDRs mapped, out of total 1000 sequence variants
[Wed Aug 19 11:36:50 CST 2015 com.milaboratory.migec.CdrBlast] Done. Mapping Diversity segments
[Wed Aug 19 11:36:50 CST 2015 com.milaboratory.migec.CdrBlast] Appending read data from out.extendedFrags_1000.fastq
[Wed Aug 19 11:36:50 CST 2015 com.milaboratory.migec.CdrBlast] Done
EVENTS (good mapped total):    0    0    1000    |    0%    0%    100%
READS (good mapped total):    0    0    1000    |    0%    0%    100%
[Wed Aug 19 11:36:50 CST 2015 com.milaboratory.migec.CdrBlast] Writing output

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.6 years ago by mcchance ▴ 30

0

Entering edit mode

Hello,

It appears to be an issue with makeblastdb running under windows. From the

Too many positional arguments (1), the offending value: TCR

and

E:\2015-08 TCR analysis

I can assume that the problem is caused by spaces in folder name. A general suggestion will be of course to try running the pipeline in UNIX environment :)

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.6 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Hi Mikhail,

Very nice tool. If we don't have MIGs identifiers in data, could we still use the CdrBlast and the sequencing/PCR error correction algorithm just like miTCR?

I tried to run CdrBlast and it is successfully done. Now to run FilterCdrBlastResults, I need clustered and raw CDR3. In my case I just have normal raw CDR3. Can you suggest a right step? Or the filtering step only works for the data from MIGs data?

Thank you :-)

ADD REPLY • link 9.5 years ago by Deepali Vasoya ▴ 10

0

Entering edit mode

Hi, sorry for late reply (it seems that I'm not getting any notifications from biostars), the raw CDR3 data should be generated with MIGEC (as it uses a bit different aligner than MITCR) without any frequency-based error correction, moreover the output format should be the same as for MIGEC, so the answer is no, it should be performed with CdrBlast.

ADD REPLY • link 9.3 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Hi Mikhail,

Can I use MIGEC to analyse TCR sequencing data from single cells?

Many thanks

ADD REPLY • link 7.3 years ago by margaridarei • 0

0

Entering edit mode

Hello,

should I use tool like Trimmomatic for adapter and quality trimming before using migec?

Thank you.

ADD REPLY • link 6.9 years ago by jirkanov ▴ 10

score 0 · Answer 1 · 2015-10-22

0

Entering edit mode

9.5 years ago

Deepali Vasoya ▴ 10

Hi Mikhail,

Very nice tool. If we don't have MIGs identifiers in data, could we still use the CdrBlast and the sequencing/PCR error correction algorithm just like miTCR?

I tried to run CdrBlast and it is successfully done. Now to run FilterCdrBlastResults, I need clustered and raw CDR3. In my case I just have normal raw CDR3. Can you suggest a right step? Or the filtering step only works for the data from MIGs data?

Thank you :-)

ADD COMMENT • link 9.5 years ago by Deepali Vasoya ▴ 10

0

Entering edit mode

Hi! Indeed this software relies solely on unique molecular identifiers (UMIs) for error correction. In case of a library prep that doesn't utilize UMIs, only Checkout and CdrBlast routines are of use, for de-multiplexing and V-D-J mapping respectively.

ADD REPLY • link 9.5 years ago by mikhail.shugay 3.5k

score 0 · Answer 2 · 2015-12-23

0

Entering edit mode

9.3 years ago

m338102001 ▴ 10

Dear Shugay,

I'm trying to use the "checkout" function to perform demultiplexing of Miseq 300+300 paired-end (all.R1.fastq and all.R2.fastq) read files [without -overlap].

However, i found that the reads in demultiplexed R2 file is in the reverse direction, but i can't find relevant information from the user guide. So i want to ask whether the reads in demultiplexed R2 are reversed or reverse-complemented?

In addition, in the downstream analysis, can i directly use MIXCR the demultiplexed R1 and R2 (reverse/reverse-complement) to perform analysis, or i need to convert the R2 back to the original direction as in all.R2.fastq?

Thank you~

ADD COMMENT • link 9.3 years ago by m338102001 ▴ 10

0

Entering edit mode

Hi,

Those are reverse-complemented, so that they are in the same strand as read#1. In case you have cDNA data and barcode#1 is the 5' one this gives you the correct strand of transcript. For downstream analysis you can use MITCR and MIGEC itself for CDR3 extraction, MIXCR and MIGMAP for full-length analysis (in this case you should overlap reads with any suitable software after you've assembled consensuses) or any other tool from omictools/repseq that you like. If you don't have UMIs I would rather trim low-quality bases. As far as I remember the R2 should be reverse-complemented back to the original positioning for MIXCR.

ADD REPLY • link 9.3 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Thank you!

Merry christmas :)

ADD REPLY • link 9.3 years ago by m338102001 ▴ 10

0

Entering edit mode

Glad to help, merry christmas to you too :)

ADD REPLY • link 9.3 years ago by mikhail.shugay 3.5k