Is anyone familiar with demultiplexing (i.e. whitelisting and extracting UMIs and cell barcodes) single-cell RNA-seq data generated with the QIAGEN QIAseq UPX 3' Transcriptome kit?
The only information I have regarding the format of the fastq files generated with this kit can be found in Figure 2 of the kit handbook here.
I think it could be summarised as follows:
read_1: transcript sequence
read_2: cell_index | UMI | ACG | poly-T
I tried to use salmon alevin with the chromiumV3 flag, but it discards more than 97% of the reads due to "noisy cellular barcodes".
UMI-tools whitelist and extract seem to handle only droplet-based single-cell RNA-Seq.
The QIAGEN GeneGlobe Data Analysis Center pipeline does not explain in detail how the demultiplexing is done either.
Does anyone know of any other tools able to deal with this kind of fastq file format?
EDIT
Looking at the kit protocol here, it is said that "The UMI is a 12-base fully random sequence". But they do not mention the length of the cell barcode.
However, as genomax mentioned, my R2 reads are indeed 27 bp long (without the poly-T or the ACG triplet, though):
$ zcat my_R2.fastq.gz | head -16
@NB551406:25:HKGLJBGX7:1:11101:14400:1057 2:N:0:ATCACG
TATGGAGAACATGGCGCGTTACAAGCN
+
AAAAAEEEEEEEAAEAEEEEEEEE//#
@NB551406:25:HKGLJBGX7:1:11101:17302:1058 2:N:0:ATCACG
TATGGAGAACTGACTTGAGTGCAACAN
+
AAA<AEAEEEEEEEEE6EAEEEEA/E#
@NB551406:25:HKGLJBGX7:1:11101:14122:1059 2:N:0:ATCACG
GCTCGACACATGCGAAGGCTGGAAGAN
+
AAAAAAEEE<EEEAEEEAEEAAAA/A#
@NB551406:25:HKGLJBGX7:1:11101:5220:1059 2:N:0:ATCACG
CTATCCGCTGGCTGTGCTTCGCAAGTT
+
AAAAAEEAE/EEEA/EEEAEEAEA/A/
After filtering out reads in which at least one base has a quality score < 30, I counted the number of unique k-mers starting from the beginning of the read (the problem is that I don't know how many cell IDs were used).
k=1, 4 unique bases
k=2, 16 unique sequences
k=3, 57 unique sequences
k=4, 136 unique sequences
k=5, 197 unique sequences
k=6, 257 unique sequences
k=7, 321 unique sequences
k=8, 372 unique sequences
k=9, 426 unique sequences
k=10, 649 unique sequences
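For anyone wanting to reproduce this kind of count, here is a minimal Python sketch of the procedure described above: drop reads with any base below Q30, then count the distinct read prefixes of length k. The function names and the Phred+33 offset are my assumptions, not something taken from the kit documentation.

```python
def parse_fastq(lines):
    """Yield (sequence, quality) pairs from an iterable of FASTQ lines."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def min_phred(quality, offset=33):
    """Lowest Phred score in a quality string (assumes Phred+33 encoding)."""
    return min(ord(c) - offset for c in quality)


def unique_prefix_counts(records, max_k=10, min_q=30):
    """Count distinct read prefixes of length 1..max_k.

    Reads with at least one base below min_q are skipped entirely.
    """
    prefixes = {k: set() for k in range(1, max_k + 1)}
    for seq, qual in records:
        if min_phred(qual) < min_q:
            continue
        for k in range(1, max_k + 1):
            prefixes[k].add(seq[:k])
    return {k: len(s) for k, s in prefixes.items()}


# Toy records: the third read has one low-quality base ('#') and is dropped.
toy = [("AACG", "IIII"), ("AATG", "IIII"), ("CCCC", "II#I")]
toy_counts = unique_prefix_counts(toy, max_k=3)
```

On real data the records would come from something like `parse_fastq(gzip.open("my_R2.fastq.gz", "rt"))`.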
EDIT 2
Qiagen sent me the cell_ID sequences (length=10 bases) and confirmed that UMI = 12 bases long.
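With those two lengths, read 2 can be split as in the sketch below. It assumes the order from my summary above (cell ID first, then UMI); the thread does not explain the 5 bases left over in a 27 bp read, so they are simply returned as a remainder. The pattern string uses the `C` (cell barcode) / `N` (UMI) convention of `umi_tools extract --bc-pattern`; whether that pattern gives sensible results on this kit is untested here.

```python
CELL_ID_LEN = 10  # cell ID length reported by Qiagen
UMI_LEN = 12      # UMI length reported by Qiagen


def split_read2(seq, cell_len=CELL_ID_LEN, umi_len=UMI_LEN):
    """Split a read-2 sequence into (cell_id, umi, remainder).

    Assumes the cell ID comes first and the UMI follows it directly.
    """
    if len(seq) < cell_len + umi_len:
        raise ValueError("read is shorter than cell ID + UMI")
    cell_id = seq[:cell_len]
    umi = seq[cell_len:cell_len + umi_len]
    remainder = seq[cell_len + umi_len:]
    return cell_id, umi, remainder


# Barcode pattern in the C/N notation: 10 cell-barcode bases, then 12 UMI bases.
bc_pattern = "C" * CELL_ID_LEN + "N" * UMI_LEN

# First read from the head output in the post.
cell_id, umi, rest = split_read2("TATGGAGAACATGGCGCGTTACAAGCN")
```

The first two reads in the head output share the same first 10 bases (TATGGAGAAC), which is at least consistent with a 10-base cell ID.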
So read 2 is 27 bp? Do you know what the actual lengths of cell_index and UMI are? I am not sure where you are getting ACG from, but I suppose it is present in your reads? Can you post the output of zcat read2.fq.gz | head -16 so we can see what your read 2 looks like?
UMI-Tools should be able to handle these reads. I am going to let @Ian Sudbury (author of UMI-tools) know. He is active here.
UMIs are supposed to be 12 bases long. I don't know either where the 'ACG' comes from (source). How did you know the R2 reads were 27 bp? Is that some kind of standard?
Can you try this using reformat.sh from the BBMap suite:
When I checked with a sample of 10x data I was able to see (a bit of cheating, since we know that 10x has 16-bp cell barcodes) the common pattern of cell barcodes. Since you have that ACG on the other end, that may help anchor things.
I guess, if your UMI is 12, and your read is 27, then that leaves 15 for the cell barcode. I don't think looking for unique k-mers is much help because of sequencing errors: you might have 1000 reads with k-mer 1 and one read with k-mer 2, which arose as a sequencing error from k-mer 1 but would still be counted as an extra k-mer.
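The point about sequencing errors inflating unique k-mer counts can be illustrated with a small Python sketch that counts how often each candidate barcode occurs and keeps only the abundant ones. The 1% cut-off and the function name are arbitrary assumptions for illustration; this is a simplified stand-in, not the method any particular tool uses to build its whitelist.

```python
from collections import Counter


def abundant_barcodes(barcodes, min_fraction=0.01):
    """Return {barcode: count} for barcodes seen in >= min_fraction of reads.

    min_fraction is an arbitrary illustrative threshold.
    """
    counts = Counter(barcodes)
    total = sum(counts.values())
    return {bc: n for bc, n in counts.items() if n / total >= min_fraction}


# 1000 reads carrying the true barcode, plus one read with a single-base error.
observed = ["AAAAAAAAAA"] * 1000 + ["AAAAAAAAAT"]

n_unique = len(set(observed))      # counts the error as an extra barcode
kept = abundant_barcodes(observed)  # the singleton falls below the cut-off
```

Here the naive unique count is 2, while only one barcode survives the abundance filter, which is why counting distinct k-mers overestimates the number of cell IDs.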
You SHOULD be able to find out how many cells there are, because in this protocol you put one cell in each well, so the wet lab people should know how many wells they used. I'd try it with a 15-mer cell barcode and --expect-cells=96 and see if umi_tools whitelist finds a reasonable number of obvious sequences.
user31888: Just make sure it is OK for you to post the info about cell barcodes publicly. Companies can be sensitive about this sort of thing and you don't want to get in trouble.
But now you have enough info to get your task accomplished.
Ok, I did not know. I removed the sequences from my original post (although it could have been useful for other people).
If the information was not publicly available in the first place, then you would be right not to post it in a public forum unless you received permission from the company to do so.
UMI-tools needed "cell_ID sequences (length=10 bases) and confirmed that UMI = 12 bases long", which is now known.