How do I use the STARSolo aligner with MGI DNBelab C series HT scRNAseq libraries?
7 months ago
atowns21 ▴ 10

I am using bat (Rhinolophus sinicus) snRNA-seq brain samples located here. The associated paper is located here. The samples were prepped with the MGI DNBelab C4 scRNA Preparation Kit and were sequenced on the BGI DNBSEQ™ technology platform. I downloaded all the bat brain tissue samples with parallel-fastq-dump. Below is an example of downloading a single sample.

parallel-fastq-dump --tmpdir . --threads 8 --gzip --readids --split-files --sra-id SRR13528085

There are a total of 12 samples (6 biological samples and 2 technical replicates per biological sample). After running parallel-fastq-dump for each SRR ID, I have a total of 24 fastq files with _1 and _2 for read 1 and read 2, respectively. The barcode sequences are in read 1 and the cDNA sequences are in read 2. I have a function below which I usually use for alignment of scRNA-seq data.

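import subprocess

# star, genome, bc_whitelist and out_dir are assumed to be defined elsewhere:
# path to the STAR binary, genome index directory, whitelist file, output directory.
def sc_star_align(fastq1, fastq2, prefix):
    out_prefix = out_dir + prefix + '_'
    subprocess.run([star,
        '--runThreadN','16',
        '--genomeDir',genome,
        # flag and value must be separate list items, otherwise STAR
        # receives '--soloType CB_UMI_Simple' as one unknown argument
        '--soloType','CB_UMI_Simple',
        '--soloCBwhitelist',bc_whitelist,
        '--soloFeatures','GeneFull',
        # barcode layout below is the 10x-style 16 bp CB + 12 bp UMI
        '--soloCBlen','16',
        '--soloUMIstart','17',
        '--soloUMIlen','12',
        '--soloCBmatchWLtype','1MM_multi_Nbase_pseudocounts',
        '--soloBarcodeReadLength','0',
        '--clipAdapterType','CellRanger4',
        '--outFilterScoreMin','30',
        '--soloUMIdedup','1MM_CR',
        '--soloCellFilter','EmptyDrops_CR',
        '--soloUMIfiltering','MultiGeneUMI_CR',
        '--outFileNamePrefix',out_prefix,
        '--readFilesCommand','zcat',
        # cDNA read (read 2) first, barcode read (read 1) second
        '--readFilesIn',fastq2,fastq1,
        '--outSAMtype', 'BAM', 'SortedByCoordinate'])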

I have a few questions on how to adjust this function for these samples:

  1. How do I find the barcode whitelist for these samples?
  2. How do I determine the cell barcode and UMI lengths?
  3. How do I determine the UMI start site?
  4. How would I create the STAR index for bat (Rhinolophus sinicus)?

The only information listed in the paper is the read lengths, which are 30bp for read 1 and 100bp for read 2. I even looked at the code for the paper, but they did not include their alignment commands. Any help is appreciated very much.

STARSolo scRNA-seq STAR snRNA-seq MGI • 1.1k views
ADD COMMENT

You simply need to make a list of barcodes one on each line: https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist

So you will need to reformat the list of barcodes you found in the other thread about DNBSeq.

ADD REPLY

So I'd take the json file located here and essentially create the whitelist to be all possible combinations of the position 1-10 barcodes and position 11-20 barcodes. The UMI length is 10bp and the cell barcode length after creating the combinations is 20bp. The UMI start site is at 21bp and the index I can create with STAR. Is this correct?
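The plan above can be sketched in a few lines of Python. This is only an illustration under assumptions: the two input lists stand in for the position 1-10 and position 11-20 barcode lists in the JSON config, whose exact layout is not shown in this thread, and the function names are made up for the example.

```python
from itertools import product


def build_combined_whitelist(first_segment, second_segment):
    """Return every concatenation of a position 1-10 barcode with a position 11-20 barcode."""
    return [a + b for a, b in product(first_segment, second_segment)]


def write_whitelist(barcodes, path):
    """Write one barcode per line, i.e. a plain-text whitelist file."""
    with open(path, "w") as handle:
        for barcode in barcodes:
            handle.write(barcode + "\n")


# Toy 10 bp sequences standing in for the two lists from the config file.
toy_first = ["AAAAAAAAAA", "CCCCCCCCCC"]
toy_second = ["GGGGGGGGGG", "TTTTTTTTTT"]
toy_whitelist = build_combined_whitelist(toy_first, toy_second)
```

With the real lists, the resulting file would be passed to --soloCBwhitelist together with a 20 bp cell barcode length. STARsolo also documents a --soloType CB_UMI_Complex mode that accepts several whitelists with --soloCBposition; whether that fits this chemistry better has not been tested here.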

ADD REPLY

The barcodes are in location 1-10 and you have the actual list there. I don't know for certain what is in position 11-20 (perhaps spacer, check in data you have). 21-30 bp are UMI. Looks like RNA read is read 2.

ADD REPLY

read 1 is 30bp... which I think you meant to type. If you scroll approximately halfway down the json file I linked to, it has the following information.

[Screenshot of the relevant section of the json file; image not available]

I think this means the sequence from 11-20 is also cell barcodes. Which is why I was asking if I should just paste all possible combinations of position 1-10 barcodes with position 11-20 barcodes. Hope that makes more sense now. I'm still not sure if I am supposed to paste a list of all combinations.

ADD REPLY

Ah, I see that now. I am not sure why MGI has split the whitelist into two sections. You may need to see if you can dig up some info about that. All possible combinations sounds excessive. I looked around in the GitHub repository but did not see an immediate explanation for the split list.

ADD REPLY

Yeah, I've been digging around the web for a while now. Can't find any information about the barcodes. The total number of combinations is around 2.3 million, which doesn't seem too excessive in comparison to the 10x barcode whitelist files.

ADD REPLY

Only thing would be to try them out. See if you can detect them in the data you have.

You could also simply look for unique representatives (convert the sequences to plain text, use sort and uniq etc) and see what is present in a real dataset like the one you have.
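As a rough sketch of that kind of inspection in Python (instead of sort and uniq), the function below counts the substrings seen at chosen positions of read 1. The function name, the 0-based slice convention and the read cap are assumptions made for this example.

```python
import gzip
from collections import Counter
from itertools import islice


def count_read1_segments(fastq_gz, segments, max_reads=1_000_000):
    """Count substrings of read 1 at each 0-based (start, end) slice in `segments`."""
    counters = {seg: Counter() for seg in segments}
    with gzip.open(fastq_gz, "rt") as handle:
        # In a FASTQ file the sequence is the 2nd line of every 4-line record.
        seq_lines = islice(handle, 1, None, 4)
        for seq in islice(seq_lines, max_reads):
            seq = seq.strip()
            for start, end in segments:
                counters[(start, end)][seq[start:end]] += 1
    return counters


# Example call (file name taken from this thread, not run here):
# counts = count_read1_segments("SRR13528082_1.fastq.gz", [(0, 10), (10, 20), (20, 30)])
# print(counts[(0, 10)].most_common(10))
```

If a segment is a barcode, a small set of sequences should dominate its counter, whereas a UMI segment would be expected to look close to random. That expectation is a general one and has not been checked against this dataset.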

ADD REPLY

Yes, that is a very good idea. Thanks for the help!

ADD REPLY

Let us know when you find out. Would be a useful thing to know what the data looks like.

ADD REPLY

Hmm, okay I'm not sure what is going on, but here is what I did:

  1. Downloaded a single sample: parallel-fastq-dump --tmpdir . --threads 8 --gzip --readids --split-files --sra-id SRR13528082
  2. Took the sequence lines out of the read 1 file with: zcat SRR13528082_1.fastq.gz | sed -n '2~4p'| uniq -u | cut -c-20 | uniq -u > Brain2BC.txt
  3. Made all possible combinations for the barcode whitelist from here within R by running expand.grid on the first list of barcodes
  4. Compared the barcodes generated in step 3 to the first twenty characters of Brain2BC.txt

There are 15,530,488 unique barcodes for the single SRR ID (Brain2BC.txt), which seems like a lot. There are 2,359,296 barcodes in the whitelist generated in step 3. Only roughly 4% of the 15,530,488 unique barcodes from the reads overlap with that whitelist. So I'm not sure the whitelist that I am generating is correct, even though the config file on GitHub explicitly states the positions of the barcode sequences.
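One way to look at this comparison, sketched below in Python, is to weight the overlap by read count rather than by unique sequence, and to test for matches within one mismatch. Sequencing errors can spread across many low-count sequences, so the unique fraction may be low even when most reads match; that is an assumption about this data, not a confirmed explanation. Note also that uniq without a preceding sort only compares adjacent lines, and uniq -u drops lines that are repeated, so the counts from the shell pipeline may differ from true unique counts. The function names here are hypothetical.

```python
from collections import Counter


def whitelist_match_summary(observed_counts, whitelist):
    """Fraction of unique barcodes, and of reads, that match the whitelist exactly."""
    whitelist = set(whitelist)
    total_unique = len(observed_counts)
    total_reads = sum(observed_counts.values())
    exact_unique = sum(1 for bc in observed_counts if bc in whitelist)
    exact_reads = sum(n for bc, n in observed_counts.items() if bc in whitelist)
    return {
        "unique_fraction": exact_unique / total_unique if total_unique else 0.0,
        "read_fraction": exact_reads / total_reads if total_reads else 0.0,
    }


def within_one_mismatch(barcode, whitelist):
    """True if the barcode, or any single-base substitution of it, is in the whitelist."""
    if barcode in whitelist:
        return True
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt != base and barcode[:i] + alt + barcode[i + 1:] in whitelist:
                return True
    return False


# Toy data: one true barcode plus low-count error-like sequences.
toy_observed = Counter({"AAAA": 90, "AAAT": 5, "CCCC": 3, "GGTT": 2})
toy_wl = {"AAAA"}
toy_summary = whitelist_match_summary(toy_observed, toy_wl)
```

In the toy data only one of four unique sequences is whitelisted, yet it accounts for 90 of 100 reads, which is the pattern the read-weighted fraction is meant to expose.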

ADD REPLY

So I used the barcodes that I created (combinations of positions 1-10 and 11-20) and obtained alignment stats similar to those in the paper I pulled the samples from. I think many of the barcodes in the reads were off by only a single base. STARsolo allows a read barcode to differ from the pre-defined barcode whitelist by a single base.
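For reference, here is a sketch of how the argument lists might look with the layout settled on in this thread: a 20 bp cell barcode at positions 1-20, a 10 bp UMI starting at position 21, and cDNA in read 2. The function names and all path arguments are hypothetical, the 10x-specific options from the original function are left out as an assumption rather than a recommendation, and the exact command used for the result above was not posted.

```python
def star_index_args(star, genome_dir, fasta, gtf, read_length=100, threads=16):
    """Arguments for building a STAR index from a genome FASTA and a GTF annotation."""
    return [
        star,
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The STAR manual suggests read length minus 1 (read 2 is 100 bp here).
        "--sjdbOverhang", str(read_length - 1),
    ]


def starsolo_args_dnbelab(star, genome_dir, whitelist, read1, read2, out_prefix, threads=16):
    """Arguments for STARsolo with a 20 bp cell barcode followed by a 10 bp UMI in read 1."""
    return [
        star,
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", whitelist,
        "--soloCBstart", "1",
        "--soloCBlen", "20",
        "--soloUMIstart", "21",
        "--soloUMIlen", "10",
        # Allow one mismatch between a read barcode and the whitelist.
        "--soloCBmatchWLtype", "1MM",
        "--soloFeatures", "GeneFull",
        "--readFilesCommand", "zcat",
        # cDNA read (read 2) first, barcode read (read 1) second.
        "--readFilesIn", read2, read1,
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]
```

Either list can be handed to subprocess.run(args, check=True) once the paths point at real files.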

ADD REPLY
