Question

Demultiplexing single cell QIAseq UPX 3 Transcriptome Kits fastq files

0

Entering edit mode

5.7 years ago

user31888 ▴ 150

Is someone familiar with demultiplexing (i.e. whitelisting and extracting UMI and cell barcodes) single cell RNA seq data generated with the QIAGEN QIAseq UPX 3' Transcriptome kit?

The only information I have regarding the format of the fastq files generated with this kit can be found in Figure 2 of the kit handbook here.

I think it could be summarise as follows:

read_1: transcript sequence

read_2: cell_index | UMI | ACG | poly-T

I tried to use salmon alevin with the chromiumV3 flag, but it discards more than 97% of the reads dues to "noisy cellular barcodes".

UMI-tools whitelist and extract seem to handle only droplet-based single cell RNA-Seq.

The QIAGEN GeneGlobe Data Analysis Center pipeline does not explain in details how the demultiplexing is done neither.

Does someone would know any other tools able to deal with this kind of fastq file format?

EDIT

Looking at the kit protocol here, it is said that "The UMI is a 12-base fully random sequence". But they do not mention the length of the cell barcode.

However, as genomax mentioned, my R2 reads are indeed 27 bp long (without poly-T nor ACG triplet though):

 $ zcat my_R2.fastq.gz | head -16
    @NB551406:25:HKGLJBGX7:1:11101:14400:1057 2:N:0:ATCACG
    TATGGAGAACATGGCGCGTTACAAGCN
    +
    AAAAAEEEEEEEAAEAEEEEEEEE//#
    @NB551406:25:HKGLJBGX7:1:11101:17302:1058 2:N:0:ATCACG
    TATGGAGAACTGACTTGAGTGCAACAN
    +
    AAA<AEAEEEEEEEEE6EAEEEEA/E#
    @NB551406:25:HKGLJBGX7:1:11101:14122:1059 2:N:0:ATCACG
    GCTCGACACATGCGAAGGCTGGAAGAN
    +
    AAAAAAEEE<EEEAEEEAEEAAAA/A#
    @NB551406:25:HKGLJBGX7:1:11101:5220:1059 2:N:0:ATCACG
    CTATCCGCTGGCTGTGCTTCGCAAGTT
    +
    AAAAAEEAE/EEEA/EEEAEEAEA/A/

After filtering out reads for which at least one base have a quality score < 30, I checked the number of unique k-mers starting from the beginning of the read (the problem is that I don't know how many cell IDs have been used).

k=1, 4 unique bases

k=2, 16 unique sequences

k=3, 57 unique sequences

k=4, 136 unique sequences

k=5, 197 unique sequences

k=6, 257 unique sequences

k=7, 321 unique sequences

k=8, 372 unique sequences

k=9, 426 unique sequences

k=10, 649 unique sequences

EDIT 2

Qiagen sent me the cell_ID sequences (length=10 bases) and confirmed that UMI = 12 bases long.

single cell demultiplexing • 3.6k views

ADD COMMENT • link 5.7 years ago by user31888 ▴ 150

1

Entering edit mode

So read 2 is 27 bp? Do you know what is the length of actual cell_index, UMI? I am not sure where you are getting ACG from but I suppose that is present in your reads? Can you post output of zcat read2.fq.gz | head -16 so we can see what your read 2 looks like?

UMI-Tools should be able to handle these reads. I am going to let @Ian Sudbury (author of UMI-tools) know. He is active here.

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

UMIs are supposed to be 12 base long. I don't know neither where the 'ACG' comes from (source). How did you know reads 2 were 27 bp? Is it kind of standard?

ADD REPLY • link 5.7 years ago by user31888 ▴ 150

0

Entering edit mode

Can you try this using reformat.sh from BBMap suite:

 reformat.sh in=my_fastq.R2.gz out=stdout.fa | grep -v "^>" | sort | uniq -c > file_to_save_indexes

When I checked with a sample of 10x data I am able to see (a bit of cheating since we know that 10x has 16-bp cell barcodes) the common pattern of cell barcodes. Since you have that ACG on other end that may help anchor things.

   1 AAACGGGGTAGGCTGATGTGAAACTG
      1 AAACGGGGTCAACTGTCCAGTATCTA
      1 AAACGGGGTCAACTGTCCGTGCCCTG
      1 AAACGGGGTCAACTGTCTCACGCTTC
      1 AAACGGGGTCAACTGTGCGTCTCCGC
      1 AAACGGGGTCAACTGTTCTATTGCCT
      1 AAACGGGGTTCTGAACTTTCTTTGTC
      1 AAACGGGTCAGGCCCATATATCTCAT
      1 AAACGGGTCCCAACGGCAACGGGTTC
      1 AAACGTGACGATGAGGGACATAAAAA
      1 AAACTGGTTCTGAGGGAATGTTAGTA
      1 AAAGAGATTGCTGGCATTCAGTCGGC
      1 AAAGATATCCGTACAATCCGAATGTG
      1 AAAGATGAGAAGCCCAAAAACTTAAA
      1 AAAGATGAGAAGCCCAAATCCTTTTA
      1 AAAGATGAGAAGCCCAATCCCAGAGG
      1 AAAGATGAGAAGCCCACAGCAATTGC
      1 AAAGATGAGAAGCCCACCACAACAAA
      1 AAAGATGAGAAGCCCACCCCCCGGTG
      1 AAAGATGAGAAGCCCACCCGAATATA
      1 AAAGATGAGAAGCCCACCTCGACGAG
      1 AAAGATGAGAAGCCCACTCGTCCTAA
      1 AAAGATGAGAAGCCCATATCACTGGC
      1 AAAGATGAGAAGCCCATCCTACTGCC
      1 AAAGATGAGAAGCCCATTAACACTAT
      1 AAAGATGAGATGTGGCAAGTTGTCCT
      1 AAAGATGAGATGTGGCAGGGATGGAA
      1 AAAGATGAGATGTGGCAGTGTCTGCT
      1 AAAGATGAGATGTGGCCATCATAGTG
      1 AAAGATGAGATGTGGCCCCGAGTGAA
      1 AAAGATGAGATGTGGCGACAACTTGC
      1 AAAGATGAGATGTGGCGCTTTAACCG
      1 AAAGATGAGATGTGGCTCTACTAGCT
      1 AAAGATGAGATGTGGCTGTGGCACCG
      1 AAAGATGAGATGTGGCTTACTTCCCA
      1 AAAGATGAGATGTGGCTTGCTGTGTA
      1 AAAGATGAGCAATATGACTGATAAGG
      1 AAAGATGAGCAATATGGTAAGGTCTC
      1 AAAGATGAGCAATATGGTTTCGATCT
      1 AAAGATGAGCAATATGTAGGAGCCTA
      1 AAAGATGAGCAATATGTCCCGTCACG
      1 AAAGATGAGCAATATGTCGCTTCCTA

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

I guess, if your UMI is 12, and your read is 27, then that leaves 15 for the cell-barcode. I don't think looking for unique kmers is much help because of sequencing errors - You might have 1000 reads with kmer 1 and one read with kmer 2, which arose as a sequencing error from kmer 1, but would still be counted as an extra kmer.

You SHOULD be able to find out how many cells, because in this protocol, you put one cell in each well, so the wet lab people should know how many wells they used. I'd try it with a 15-mer cell barcode and --expect-cells=96 and see if umi_tools whitelist finds a reasonable number of obvious sequences.

ADD REPLY • link 5.7 years ago by i.sudbery 20k

0

Entering edit mode

user31888 : Just make sure it is ok for you to post the info about cell barcodes publicly. Companies can be sensitive about this sort of thing and you don't want to get in trouble.

But now you have enough info to get your task accomplished.

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

Ok, I did not know. I removed the sequences from my original post (although it could have been useful for other people).

ADD REPLY • link 5.7 years ago by user31888 ▴ 150

0

Entering edit mode

If the information was not publicly available in first place then you would be right to not post in a public forum unless you received permission from the company to do so.

UMI-tools needed cell_ID sequences (length=10 bases) and confirmed that UMI = 12 bases long which is now known.

ADD REPLY • link 5.7 years ago by GenoMax 147k

score 1 · Accepted Answer · 2019-04-03

UMI-tools is not limited to droplet-based single cell RNA seq (it was originally designed for analysing iCLIP data), however alevin is, which is why you'll be having trouble with alevin.

UMI-tools should be able to analyse this data just fine as long as you work out how to provide your read layout correctly.

In most scRNAseq applications, the barcodes (Cell ID/UMI) are on read1 and the transcript on read2, and the opposite is the case here. But thats fine, just provide your read2 file as the primary input to umi_tools and your read1 to the --read2-in option. The process to follow will depend to a certain extent on whether you already know the sequences of the cell ids or not. Qiagen clear you 96 or 384 pre-designed cell-id sequences, but I don't know if they provide them to you. Its also not clear how long the cell-id barcodes are, which we need to process the data. I'm going to take a guess that its 12nt. If they do provide you with a list of cell-id sequences and they are 12nt then you can run extract like so:

  umi_tools extract --stdin=read2.fastq.gz \
                    --read2-in=read1.fastq.gz \
                    --bc-pattern=CCCCCCCCCCCCNNNNNNNNNNNN \
                    --read2-stdout
                    --whitelist=whitelist.txt \
                    --filter-cell-barcode \
                    --logfile=extract.log \
| gzip > extracted_reads.fastq.gz

Assuming that whitelist.txt contains the cell-id sequences, one per line. If you havn't been provided with them, then you'll need to generate a list using whitelist. You might run it something like (i'm assuming you know how many cells went into the analysis:

umi_tools whitelist --stdin read2.fastq.gz \
                    --bc-pattern=CCCCCCCCCCCCNNNNNNNNNNNN \
                    --set-cell-number=96 \
                    --log2stderr > whitelist.txt;

From there you just need to follow the rest of the single-cell RNA-seq tutorial on the documentation site. There you'll also find explainations of the options I've used above.