Hello,
I have 2x101 bp FASTQ data generated with the BD Rhapsody single-cell kit. I may be missing something, but with labels in both read_1 and read_2 I am struggling to find usable human mRNA sequence.
In read_1 I get 70+ bp of labels (the poly-T tract apparently varies in length):
GCCTTACAAactggcctgcgaAGATAGTTCggtagcggtgacaACACTCCGCCTCCCGCCGGCACTGTTTGTTTTTTTTTTTTTCCTGGACAACCCATCTG
Linker1 and Linker2 are shown in lowercase.
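To make the read_1 layout concrete, here is a minimal Python sketch that splits a read on the two linkers shown above. The linker sequences are copied from the example read. The 9 bp length of the third cell-label section and the 8 bp molecule label that follows it are my assumptions based on that read, and `split_read1` is just an illustrative name, not part of any BD tool.

```python
# Linker sequences as they appear (in lowercase) in the example read_1.
LINKER1 = "ACTGGCCTGCGA"
LINKER2 = "GGTAGCGGTGACA"


def split_read1(read, cls_len=9, umi_len=8):
    """Split a read_1 sequence into label sections around the two linkers.

    cls_len and umi_len are assumptions inferred from the example read.
    Returns None if a linker is missing or the read is too short.
    """
    seq = read.upper()
    i1 = seq.find(LINKER1)
    if i1 < 0:
        return None
    i2 = seq.find(LINKER2, i1 + len(LINKER1))
    if i2 < 0:
        return None
    cls3_start = i2 + len(LINKER2)
    umi_start = cls3_start + cls_len
    if len(seq) < umi_start + umi_len:
        return None
    return {
        "cls1": seq[:i1],                          # first cell-label section
        "cls2": seq[i1 + len(LINKER1):i2],         # second cell-label section
        "cls3": seq[cls3_start:umi_start],         # third cell-label section
        "umi": seq[umi_start:umi_start + umi_len], # molecule label (assumed length)
        "rest": seq[umi_start + umi_len:],         # poly-T and whatever follows
    }


example = ("GCCTTACAAactggcctgcgaAGATAGTTCggtagcggtgacaACACTCCGC"
           "CTCCCGCCGGCACTGTTTGTTTTTTTTTTTTTCCTGGACAACCCATCTG")
parts = split_read1(example)
print(parts)
```

Exact string matching is used for the linkers here, so reads with a sequencing error inside a linker are simply dropped; a real pipeline would probably tolerate mismatches.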
In read_2 the labels can reach 95 bp:
Label with sequence tag 10:
GTTGTCAAGATGCTACCGTTCAGAGGCAGCCGGCGTCGTACGAGGCACAGCGGAGACTAGATGAGGCCCCAAAAAAAAAAAAAAAAAAAAAAAAA
Example sequence:
GTTGTCAAGATGCTACCGTTCAGAGGCAGCCGGCGTCGTACGAGGCACAGCGGAGACTAGATGCGGCCCCAAAAAAAAAAAAAAAAAAAAAAGTTCCCGGT
Frankly, I doubt there is much left to map to the human transcriptome, but I would be happy to be proven wrong.
My questions:
- Is this a normal outcome of the BD Rhapsody single-cell RNA protocol, or was some "one-to-many" labeling applied? I am thinking of the 95 bp labels in read_2, which seem to serve only to distinguish between samples.
- Assuming such (IMHO over-the-top) labeling is the standard way of preparing the sample, would switching to, say, 2x150 bp make the data much better?
My idea would be to skip the labels in read_2 and use a simple 8 bp index instead, say the Illumina index2. However, I suspect that decoupling the cell and molecule labeling in read_1, which is performed downstream of the sequence tag labeling that ends up in read_2, may not be easy.
Many thanks for your help
DK
edit
Looks like it was extra labeling:
The picture is from page 11 of https://www.bdbiosciences.com/content/dam/bdb/marketing-documents/BD_Single_Cell_Genomics_Bioinformatics_Handbook.pdf
You could check the mapping rate of the R2 reads alone by aligning them with STAR or a similar aligner. These should be the mRNA reads; AFAIK they should not carry any technical labels, as per your diagram.
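Before running a full alignment, a rough pre-check is to count how many R2 reads start like the sample-tag reads quoted in the question. The sketch below is only that, a sketch: the 25-base prefix is simply what the two example reads have in common, the mismatch tolerance is arbitrary, and the function names are made up for illustration.

```python
# First 25 bases shared by both read_2 examples in the question.
TAG_PREFIX = "GTTGTCAAGATGCTACCGTTCAGAG"


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def looks_like_sample_tag(read, prefix=TAG_PREFIX, max_mismatches=2):
    """True if the start of the read matches the prefix within the tolerance."""
    head = read[:len(prefix)].upper()
    return len(head) == len(prefix) and hamming(head, prefix) <= max_mismatches


def tag_fraction(reads):
    """Fraction of reads that look like sample-tag reads (0.0 for no reads)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(looks_like_sample_tag(r) for r in reads) / len(reads)


r2_reads = [
    # the two read_2 sequences from the question
    "GTTGTCAAGATGCTACCGTTCAGAGGCAGCCGGCGTCGTACGAGGCACAGCGGAGACTAGATGAGGCCCCAAAAAAAAAAAAAAAAAAAAAAAAA",
    "GTTGTCAAGATGCTACCGTTCAGAGGCAGCCGGCGTCGTACGAGGCACAGCGGAGACTAGATGCGGCCCCAAAAAAAAAAAAAAAAAAAAAAGTTCCCGGT",
    # a made-up, non-tag sequence for contrast
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT",
]
fraction = tag_fraction(r2_reads)
print(fraction)
```

If nearly all R2 reads match the prefix, the file is most likely a sample-tag library rather than the mRNA library, and a low mapping rate would be expected.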
There is an S3 bucket with standalone BD Rhapsody scripts: http://bd-rhapsody-public.s3-website-us-east-1.amazonaws.com/Rhapsody-Install-Bundle/