Dear fellow bioinformaticians,
I would like to announce our recently published protocol and a software pipeline that allows to greatly reduce NGS and PCR error rates in immune repertoire sequencing data, without loss of real repertoire diversity [ http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2960.html ].
One of the main applications of immune repertoire sequencing is to identify the sequences of CDR3 regions of immune genes (T-cell receptors and immunoglobulins) which are the key determinants of antigen specificity. Those regions are highly variable and are only partially encoded in germline sequence, making it extremely hard to distinguish an error from a de-novo variant in them. Moreover, CDR regions of immunoglobulin genes are the hotspots of somatic hypermutations (SHM), with the patterns of SHM and PCR errors being very similar. Therefore the issue of PCR/NGS error filtration is of a great importance for immune repertoire sequencing data analysis.
Briefly, our library preparation method and analysis relies on unique molecular identifier tags (UMIs, random 12-mers), introduced on the stage of cDNA synthesis that allows to trace individual molecules through all stages of library preparation. The analysis pipeline, written in Groovy (Java scripting language) and distributed as a jar executable allows for
- De-multiplexing of sequencing data, read overlapping and adapter trimming
- Assembly of consensuses for reads grouped by their molecular identifier (aka MIGs)
- Extraction of CDR3 regions for immune receptor genes and annotation of Variable (V), Diversity (D) and Joining (J) segments
- Advanced filtering of hot-spot errors that couldn't be eliminated by consensus assembly
While the pipeline was optimized for immune repertoire libraries with unique molecular identifiers, several tools embedded in it could be useful for a wider audience, e.g.
- Fast de-multiplexing tool is suitable for any type of NGS data
- Consensus assembly could be used for any library prepared using unique molecular identifiers, given sufficient over-sequencing (number of reads in MIG) is reached
- Fast and accurate BLAST-based CDR3 extraction and clonotype assembly for TCR Gamma/Delta chains and Immunoglobulin Heavy/Light/Kappa chains for all kinds of immune repertoire libraries. To the best of our knowledge there is currently no software tool that is able to accurately process with large amounts of NGS data for those genes
More information and download links to the software: http://milaboratory.com/projects/migec-project/
GitHub repository: https://github.com/mikessh/migec
Update 18-02-15 (v1.2.0, latest);Major update: implemented the annotation of Diversity (D) segment of immune receptors, added support for minor alleles of V/D/J segments, significant speedup for CDR3 extraction algorithm, minor bug fixes. See https://github.com/mikessh/migec/releases/tag/v1.2.0
Update 02-09-14 Significant pipeline improvements based on the experience from processing a dozen of ours and collaborators' repertoire sequencing datasets: improved read overlapping in Checkout, more flexible options for MIG consensus assembly, correct handling of multiple fastq files by CdrBlast, more extensive reporting in log files. Added duck and chicken TRB references kindly provided by ZeZhong Li (those are completely missing in IMGT).
Update 07-06-14 Added species support for Macaca mulatta, Oryctolagus cuniculus, Rattus norvegicus, Canis lupus familiaris, Sus scrofa, Bos taurus, Mus spretus based on IMGT. Corresponding reference files are here: https://github.com/mikessh/migec/tree/master/src/main/resources.
Update (from v1.0.2 to v1.0.6)
added some useful features, such as filtering erroneous UMIs tags (could be very useful for data with low over-sequencing), adapter trimming. V segments library was extended
Major update (v1.1.0):
Batch operations, that allow executing the whole pipeline in several lines by utilizing provided sample metadata and the metadata generated on various stages of pipeline execution. Batch operations also provide flexible parameter setting, either through options or through metadata tables.
Several stages of pipeline, such as consensus assembly and hot-spot error filtering were further optimized by including heuristics.
Several minor bug fixes and performance improvements
Is this software limited to human and mouse like MiTCR? Can the source of sequencing material be something other than blood, say a spleen?
Right now yes, but here the segments library could be modified far easier than in MiTCR. Please tell me which species you're interested in. And of course I'll not be able to help if those species are not present in IMGT, unless you'll provide me V and J segment sequences with marked Cys and Try/Phe reference points.
I don't see a problem for using any cell source given you'll preparing your library by amplifying TCR/IG gene and your library is going to contain CDR3 regions. You can always collect cells, sort them and multiplex them using barcodes..
http://www.imgt.org/IMGTrepertoire/Taxonomy/Taxons.php
Assuming this is the correct place to look: Grivet, Macaca mulatta, Macaca fascicularis, any of the ruminants (alpaca, goat, camel, etc).
Actually no, the site is quite confusing, and the correct place is Table 7 here. For your gene of interest, say TRA, you need that both V and J segments (TRAV and TRAJ) be present for your species of interest. So for example I could add IGH for Llama and TRA+TRB for nonhuman primates.
I would appreciate it if you'll provide a little bit more concrete gene+species list based on this table. And also try going down by the links and check if there are actually any number of segments for your concrete species of interest.
I agree, their website is a mess. If that is the case then just the non-human primates.
I've parsed all segments from IMGT. So far only Macaca mulatta seems feasible for V-(D)-J analysis. You can see the results here. Feasible species have "1" in the last column.
It will take me a couple of days to implement it to Migec.
Just of note, I haven't found any software that provides V-(D)-J analysis for species other than mouse, human, rabbit and rat.
Please note that Migec is focused on CDR3 extraction and determination of V/J segments. It doesn't provide CDR1,2 extraction and allele/hypermutation mapping. I'm now customizing IgBlast for this. However, species appear to be hard-coded there, so it seems I'll have to contact IgBlast developers. Also IgBlast is quite slow.
Please also note, that if you're using not targeted sequencing, but smth like RNA-Seq, you should first consider pre-filtering data by mapping them with e.g. bwa-mem to corresponding genomic loci. This will save you lots of time.
Added support for all species-gene pairs where it is possible based on IMGT data. Binaries are available here.
Hi, mikhail. I have built a library of monkey TCR and want to use MIGEC to analyse my fasq file. I have few questions.
I built the library by normal ligate P5, P7 adapter with paired barcode up to 96 samples, so no MIGs in my library. Can I use MIGEC analyse my data produced by MiSeq?
My data have been de-multiplexed by MiSeq, can I assemply the demultiplexed file directly? In the Consensus assembly section of the manual, it needs
checkout_output_folder/ histogram_output_folder/ output_folder/
. If I use demultplexed data, I have NO these folders.If I want use MIGEC to demultiplex the fasq file of MiSeq (with 12 of i7 and 8 of i5 indices), how to set a barcade file? I don't understand the barcode file in the mauual. There is only one of i7 or i5 index either in read 1 or read 2 in my case. Could you please set an exsample of barcode file for me using i7 and i5 indices? And Could you explain the master and slave barcode sequence in detail? is that mean there are two barcode sequences in a read sequence?
I am a very new to bioinformatics. Many easy things for you experts are difficult for me. I am grateful for your help.
Hello!
If you don't have any unique molecular identifiers (UMIs), the only routine you need is CdrBlast which performs V-(D)-J mapping and assemble clonotypes.
You should not demultiplex Illumina indices (the instrument should do it for you), this option is for the case when you have a custom adapter sequence (in our protocol it is M1s) which carry sample barcodes. In that case, you just put a set of sample ids, followed by their adapter sequence with barcodes marked in upper-case letters into barcodes.txt. Slave barcode is used for case when barcodes are attached from both sides. It can contain additional sample barcode in order to filter cross-sample contamination and additional set of UMI (N) nucleotides.
Thank you for your quick response. Do I need overlap two read files before I run CdrBlast? Or CdrBlast can overlap two files into one, then map and assemble?
I ran the script
It ran and completed quickly, and showed result
Is my scripts right?
It appears that none of 1000 input sequences were mapped. Are you able to map any of those reads manually (http://www.ncbi.nlm.nih.gov/igblast/igblast.cgi?CMD=Web&SEARCH_TYPE=TCR&LINK_LOC=igtab)? Note that reads should contain both V and J segments in the same read.
Overlap can be performed, yet it is meaningful when the expected overlap is high. It can boost the quality of TCR sequences when both reads cover CDR3 region.
You said "overlap can be performed". Do you mean CdrBlast can do it?
Do I need other parameters to tell CdrBlast to performed the overlap?
When overlapping, how the software decide which one of two mismatched nt to be remained? The one with higher quality score?
I have mapped the 1000 input sequences on ncbi link you gave to me. There are at least 328 sequences mapped to the TCR of human (my sequences are monkey's). So I list the complete result below after I ran the CdrBlast for your reference. Could you analyse them for me and find what is wrong with it? Thank you.
Hello,
It appears to be an issue with makeblastdb running under windows. From the
and
I can assume that the problem is caused by spaces in folder name. A general suggestion will be of course to try running the pipeline in UNIX environment :)
Hi Mikhail,
Very nice tool. If we don't have MIGs identifiers in data, could we still use the CdrBlast and the sequencing/PCR error correction algorithm just like miTCR?
I tried to run CdrBlast and it is successfully done. Now to run FilterCdrBlastResults, I need clustered and raw CDR3. In my case I just have normal raw CDR3. Can you suggest a right step? Or the filtering step only works for the data from MIGs data?
Thank you :-)
Hi, sorry for late reply (it seems that I'm not getting any notifications from biostars), the raw CDR3 data should be generated with MIGEC (as it uses a bit different aligner than MITCR) without any frequency-based error correction, moreover the output format should be the same as for MIGEC, so the answer is no, it should be performed with CdrBlast.
Hi Mikhail,
Can I use MIGEC to analyse TCR sequencing data from single cells?
Many thanks
Hello,
should I use tool like Trimmomatic for adapter and quality trimming before using migec?
Thank you.