4.8 years ago
prasundutta87
Hi,
In Illumina sequencing with dual indexes (i5 and i7), the i7 index read contains a 9 bp molecular tag (UMI), written as 'N's, in addition to the unique 8 bp sample index. N can be any base. I am aware that when bcl2fastq is used for demultiplexing, both the i7 and i5 index sequences are given as parameters, and it is advised not to add the Ns after the i7 sequence. I know that N can be any base, so when the index/barcode is chopped off the read and put at the end of the read name, what happens to the Ns?
AFAIK UMIs are trimmed and transferred to the read header.
Not really, I just have 'i7 index'+'i5 index' at the end of the read header. The i7 index is just 8 bp long. I don't understand why sequencing protocols specify the UMI as Ns. Isn't the whole point that the UMI should be known once sequencing is done? Is there any document or website that explains whether there is a historical reason for this?
Please provide some reproducible examples, be it screenshots or a read example.
So, basically, the i7 index looks like this: TACTAGTANNNNNNNNN, and the i5 index looks like this: GATCGACA
My read header looks like this:
@______:__:_________-_____:_:____:_____:____ _:_:_:TACTAGTA+GATCGACA
Please let me know if this information is enough.
As I understand it, N can be any base. Why does any sequencing protocol have such Ns? What is the purpose of a UMI if its bases are not known?
Secondly, bcl2fastq does not allow Ns to be added along with the i7 index. Where does the NNNNNNNNN go? Is it something basic I should know about?
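To make the N question concrete: in an adapter design, N marks a position that is synthesised as a random base, so the sequencer still reads a real A/C/G/T at each of those positions. Below is a minimal Python sketch, assuming the full 17-cycle i7 read is available and is laid out as the 8 bp sample index followed by the 9 bp UMI. The function name `split_index_read` and the example reads are made up for illustration.

```python
def split_index_read(index_read, sample_index_len=8, umi_len=9):
    """Split an i7 index read into (sample_index, umi).

    Assumes the layout described in this thread: sample index first,
    then the UMI. Both lengths are assumptions, not read from any file.
    """
    expected = sample_index_len + umi_len
    if len(index_read) != expected:
        raise ValueError(f"expected {expected} bases, got {len(index_read)}")
    sample_index = index_read[:sample_index_len]
    umi = index_read[sample_index_len:]
    return sample_index, umi


# Hypothetical sequenced i7 reads from two molecules of the same sample:
# the sample index is identical, the 'N' positions hold different real bases.
for read in ("TACTAGTAACGTTGCAA", "TACTAGTAGGATCCTTA"):
    print(split_index_read(read))
```

Only the first 8 bases are compared against the sample sheet, which would explain why the header shows just TACTAGTA. The remaining 9 bases are what distinguish one original molecule from another.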
These are not standard Illumina i7 indexes, correct? Are these from a different provider? bcl2fastq can only deal with UMIs that are part of read 1 and 2 (not index reads).
I just made them up for understanding the concept. But, to answer your question, they are from a different provider, but the sequencing has been done on an Illumina machine. But still, shouldn't the UMIs be known? Should I ask the protocol provider directly?
Take a look at these adapters from IDT that do have UMIs. These would not be processed by bcl2fastq since it can only deal with UMIs that are in-line. See the answer in this thread on how data using IDT adapters would be processed: "bcl2fastq with xGen Dual Index UMI Adapters to produce 3 read and 2 index fastqs".
If you need them added to the fastq header then you will need to do some additional work: "How to append the cell barcode and UMI information to the fastq header in paired-end single-cell RNA-seq data?" (and a couple of others).
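As a rough idea of what that additional work might look like, here is a small Python sketch that takes a FASTQ header line plus the matching full-length i7 read and appends the UMI portion to the read ID. It assumes the 8 bp index + UMI layout from the question. The function name, the example header values, and the `:` separator are assumptions for illustration, and some downstream tools expect a different separator such as `_`.

```python
def append_umi_to_header(header, index_read, sample_index_len=8, sep=":"):
    """Return the FASTQ header with the UMI appended to the read ID.

    header      -- full header line, e.g. '@id 1:N:0:INDEX1+INDEX2'
    index_read  -- full i7 read: sample index followed by the UMI (assumed)
    sep         -- separator placed before the UMI (an assumption)
    """
    # The read ID is everything before the first space; the rest is the comment.
    read_id, _, comment = header.partition(" ")
    umi = index_read[sample_index_len:]
    new_id = f"{read_id}{sep}{umi}"
    return f"{new_id} {comment}" if comment else new_id


# Hypothetical header and index read, not taken from real data.
header = "@M01:1:FC:1:1101:1000:2000 1:N:0:TACTAGTA+GATCGACA"
print(append_umi_to_header(header, "TACTAGTAACGTTGCAA"))
```

In practice the header and the index read would come from paired records in the read and index FASTQ files, so the two files have to be iterated in the same order.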
Thanks a lot, let me go through these documents and pages. Will come back here in case of any doubts.