How to split an scRNA-seq BAM or FASTQ file into a separate file for each cell by cell barcode?
2.4 years ago
MYousry ▴ 20

Hello everyone,

I have a sample of scRNA-seq data (A. thaliana) generated by 10X Genomics. The data is composed of R1 (cell barcodes and UMIs) and R2 (actual reads) in FASTQ format. The sample has around 7000 cells.

I ran the data through STARsolo to map the reads to the genome. The results are a BAM file (mapped reads) and count matrix files.

I need to run the mapped reads through an algorithm. However, if I run the produced BAM file as is, it will be treated as bulk RNA-seq. So, ideally, I want to split the BAM file into a separate file for each cell. Splitting the FASTQ files would be OK too; I would then run them separately through STAR (not sure about the efficiency of this though).

I am working in a Bash terminal environment.

Is there a method to do so? Any suggestions?

I am an undergraduate student and completely new to this. Any help would be appreciated. Thank you!!

10xgenomics RNAseq scRNA CB • 4.8k views

This is one case where you may need to use cellranger to create the BAM, unless STARsolo does this too. The cell barcode will be encoded in the alignments with the following tag:

CB  Z   Chromium cellular barcode sequence that is error-corrected and confirmed against a list of known-good barcode sequences. For multiplex Fixed RNA Profiling, the cellular barcode is a combination of the 10x GEM Barcode and Probe Barcode sequences.

You will need to use pysam or a similar program to parse the BAM tags and create files for individual cells. It is unclear what the utility of this application is, though; otherwise people would have written tools to do this already.
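As a rough sketch of that tag parsing without pysam (the thread is working in Bash), SAM text can be streamed through awk and written to one file per CB value. The function name split_sam_by_cb and the output directory are made up for illustration, and skipping reads whose barcode is "-" assumes that is how reads without a valid barcode are marked. Files are appended to, so the output directory should start out empty.

```shell
# Split SAM text (header + alignments) on stdin into one SAM file per CB tag.
# $1: output directory (hypothetical; should be new or empty, files are appended to)
split_sam_by_cb() {
  mkdir -p "$1" || return 1
  awk -F '\t' -v outdir="$1" '
    /^@/ { header = header $0 "\n"; next }        # remember the SAM header
    {
      cb = ""
      for (i = 12; i <= NF; i++)                   # optional tags start at field 12
        if ($i ~ /^CB:Z:/) { cb = substr($i, 6); break }
      if (cb == "" || cb == "-") next              # no usable barcode: skip the read
      file = outdir "/" cb ".sam"
      if (file != current) {                       # keep only one output file open
        if (current != "") close(current)
        current = file
      }
      if (!(cb in seen)) { printf "%s", header >> file; seen[cb] = 1 }
      print >> file
    }'
}

# Example use (assumes samtools is installed; file names are placeholders):
#   samtools view -h Aligned.sortedByCoord.out.bam | split_sam_by_cb per_cell
#   samtools view -b -o per_cell/ACGT.bam per_cell/ACGT.sam
```

With coordinate-sorted input the function reopens files very often. Sorting by barcode first (samtools sort -t CB) should let each file be written in one go. Expect one output file per barcode, so several thousand files for a sample like this.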


Thank you for your reply. Yes, STAR solo does that too. My struggle is with splitting the BAM file to individual cells. There are some tools but I am not sure which one to use and how and if they even do the job I need. Any clarification on how to use pysam or another tool would be appreciated.


10x Genomics provides Cell Ranger to easily process the data. Why are you not using it?


Does Cell Ranger have a tool to split the BAM file into individual cells? That's the particular step that I'm struggling with.

As far as I know, STARsolo does the same job as Cell Ranger for mapping the reads (producing the BAM file, which I need) and making the count matrix (which I don't need for now). The reason why I used STAR is that I'm working on the bash terminal for other processing and would like to keep all in the same place.

The later analysis that I want to do is not available in Cell Ranger. Its goal is to identify and classify RNA modifications in each cell.


"The reason why I used STAR is that I'm working on the bash terminal for other processing and would like to keep all in the same place."

You can run CellRanger from the command line so this shouldn't be a reason not to use it.


ATpoint, thank you for your reply! What I am trying to do is run scRNA-seq data through an algorithm that takes ordinary bulk RNA-seq data mapped to a reference genome and detects and classifies RNA modifications. The goal is to get the scRNA-seq data to run successfully through this algorithm while keeping the single-cell resolution. I could successfully run the BAM file of mapped scRNA-seq reads; however, the output is useless because it reports RNA modification locations as if the data were bulk. So I am trying to find a way to separate the data from individual cells. The idea I have in mind is to split the BAM file and run the algorithm in a loop over the resulting files. I hope that makes sense, and any help or guidance would be greatly appreciated. Thank you so much once more!


There will be very few reads mapped per individual cell. So while you may be able to run the tool you are referring to, the results may not be valid or accurate. Tools make certain assumptions, and if the data does not meet them you will need to take that into account.


Did you have any luck splitting the BAM file based on the 10x cell barcode? I would like to subset a BAM file to include only 5 specific cell barcodes and am not sure how to do it. Thanks


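Hey, if you already have the barcodes, you could use samtools. The tag filter -d (--tag) needs a recent samtools (1.12 or newer, as far as I know), and input.bam stands for your own BAM file:

samtools view -h -b -d CB:TAAGAGATCCTATGTT input.bam > TAAGAGATCCTATGTT.bam

Hopefully this is useful. It works well with STARsolo BAM files; I don't know how Cell Ranger handles its barcodes.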
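For a handful of barcodes at once, recent samtools versions also accept a file of tag values (samtools view -b -D CB:keep_barcodes.txt -o subset.bam input.bam). Where that option is not available, the same filter can be sketched with awk on SAM text. The function name filter_sam_by_barcodes and the file names are hypothetical, and the listed barcodes must match the CB values exactly, including any suffix such as "-1" if the BAM carries one.

```shell
# Keep SAM header lines plus alignments whose CB tag is listed in a barcode file.
# $1: text file with one barcode per line (hypothetical name: keep_barcodes.txt)
filter_sam_by_barcodes() {
  awk -F '\t' -v list="$1" '
    BEGIN {                                        # load the wanted barcodes
      while ((getline bc < list) > 0) if (bc != "") keep[bc] = 1
      close(list)
    }
    /^@/ { print; next }                           # pass the header through
    {
      for (i = 12; i <= NF; i++)                   # optional tags start at field 12
        if ($i ~ /^CB:Z:/) {
          if (substr($i, 6) in keep) print
          break
        }
    }'
}

# Example use (assumes samtools is installed; file names are placeholders):
#   samtools view -h input.bam | filter_sam_by_barcodes keep_barcodes.txt \
#     | samtools view -b -o subset.bam -
```

This writes all selected cells into a single BAM. The CB tag stays on every read, so the cells can still be told apart afterwards.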

18 months ago
biofalconch ★ 1.3k

Here is a command that should work, but just like everyone else in the comments I'm a little confused about why you would need to separate them:

bamtools split -in <Output.bam> -tag CB

WARNING: This might yield a LARGE number of files, so use at your own risk :)

You can download bamtools here: https://github.com/pezmaster31/bamtools

Maybe a combination of samtools, a workflow manager (e.g. Nextflow), and the barcodes.tsv file might yield fewer and more meaningful files.


I want to split my BAM files based on the CB tag as well. I want to use samtools to identify each cellular barcode. Currently, I am using this code:

#Extract unique cell-specific barcodes
samtools view possorted_genome_bam.bam | awk '{for(i=12;i<=NF;i++) if($i ~ /^CB:Z:/) print $i}' | sort | uniq > unique_barcodes.txt

Then, using these barcodes, I am trying to extract the reads for each barcode into its own BAM file in a loop on Linux.

Do you have any suggestions?

I want to identify the reads associated with each barcode, then align these fingerprinted reads to the spots that could have variants (based on a previous analysis). I am not sure I explained this very well, but I would appreciate any help.
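One possible shape for such a loop, as a sketch only: print one samtools command per barcode, inspect the list, then run it. The function name make_split_commands and the output directory are hypothetical. It assumes a samtools version that supports the -d tag filter, and a barcode file with lines such as "CB:Z:ACGT..." (bare barcodes also work).

```shell
# Print one samtools command per barcode; nothing is executed here.
# $1: BAM path, $2: barcode file (lines like "CB:Z:ACGT..." or bare barcodes),
# $3: output directory (hypothetical; create it before running the commands)
make_split_commands() {
  sed -e 's/^CB:Z://' "$2" |
    grep -v -e '^-$' -e '^$' |                     # drop "-" (no barcode) and blanks
    while IFS= read -r cb; do
      printf 'samtools view -b -d CB:%s -o %s/%s.bam %s\n' "$cb" "$3" "$cb" "$1"
    done
}

# Example use (file names are placeholders):
#   mkdir -p per_cell
#   make_split_commands possorted_genome_bam.bam unique_barcodes.txt per_cell > cmds.sh
#   head cmds.sh        # check a few commands first
#   bash cmds.sh
```

Each command scans the whole BAM again, so this is only practical for a modest number of barcodes. For thousands of cells a single-pass split is likely to be much faster.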
