Entering edit mode
2 days ago
connorjfausto
▴
30
Hello,
I have a list of barcodes generated from a snRNA-seq Seurat
object, and I'd like to go back to my original .fast.gz
R1 and R2 files to get the reads of my barcodes.
Right now, I am using bbmap
based off these two posts:
How To Extract A Subset Of Reads In Fastq Using An Id List?
Bbmap filterbyname.sh to extract sequences out of a fasta file - wrong output
Here below is head barcodes.csv
AAACAGCCATGCTATG-1 wk17M
AAACATGCAGGCTTCG-1 wk17M
AAACGTACACTTACAG-1 wk17M
AAAGCAAGTGAGGTGA-1 wk17M
AAAGCACCAATTAACC-1 wk17M
AAAGCACCACATAACT-1 wk17M
AAAGCACCAGAATGAC-1 wk17M
AAAGCCCGTGATCAGC-1 wk17M
AAAGCCGCAGCCTGCA-1 wk17M
AAAGCTTGTGTGTCCC-1 wk17M
and my simplified code below for running bbmap
filterbyname.sh in=fq_r1.fastq.gz in2=fq_r2.fastq.gz \
out=fq_out_r1.fastq.gz out2=fq_out_r2.fastq.gz \
names=barcodes.csv include=t substring=t
But I am not receiving any matching reads. Is it due to my barcodes.csv
being incorrectly formatted? Thank you!
Can you post an exact example of what your fastq headers look like?
names
file needs to have only one entry per line without columns.If you have the BAM files available it would be easier to do with a tool 10x makes available --> Separate single cell BAM file by the cell barcode
Thanks for the response! Here's an example of an out when i input
zcat fq_r1.fastq.gz | head -n 10
I did previously use the
sinto
to sort bam files, but when I tried converting thosebam
back tofastq
usingbamtofastq
from 10X and re-aligning withcellranger count
the alignment was very different (and Seurat object)I'm essentially trying to go from Subset Seurat Object to subset Fastq R1 and R2, and then back to subset Seurat Object.
filterbyname.sh
only works based on information that is found in the fastq header. As you can see the barcodes you have are not in those headers. They are still in the beginning 26 bp of R1. So the method you tried to use will not work.If you try to filter using
GTAACATGCG+NGGTAACACT
(which is in the fastq header) then it will work.I assume this comes from CellRanger or simlar? First, you need to remove the
-*
so the dash with the trailing number. This is not present in the fastq files but added by other software. Then also, note that barcodes get error-corrected during the processing, e.g. by CellRanger so the raw barcodes sequenced in the fastq files might not be perfect matches. What I would probably do is to scan the BAM file for the error-corrected barcodes which should be perfect matches, then extract the read name from the bam and filter the fastq based on that. Or convert the filtered BAM back to fastq. Something like this.Yes these barcodes came from a Seurat Object aligned from
cellranger count
, I'll change the.csv
file with your corrections and try. I actually did usesinto
as recommended by timmoast, then usedbamtofastq
from 10X to try and convert thebam
tofastq
. However, the resulting alignment with these filteredfastq
files usingcellranger count
gave very different results than the original.How would I use the filtered
bam
files to get the fastq? Convert tosam
to get the reads pairs? Thanks!If you have the BAM files you can simply use
samtools fastq your.bam
with appropriate options to write out paired fastq files. Check in-line help.Your original question asked about getting fastq per barcode. Is that actually your intention since there will be thousands of files if you do that.
What is the intent behind doing this?
From our original snRNA-seq Seurat object, we have a small subset (~5% of nuclei) that collaborators were given which is now in an accepted manuscript. However, the full analysis of the original snRNA-seq is still ongoing so we are trying to release only the relevant subset data. Therefore, we are trying to now subset the original
fq.gz
file to create asub.fq.gz
for GEO submission.This has been done before with Split-seq (Parse Biosciences)
fq
files before, since they have their own pipeline written in python to split by barcodes that label specific well numbers. The dataset I'm troubleshooting right now was made with a 10X kit, so I can't use the same methods and I don't know python well enough to edit a source script for my purposes.Use the 10x's BAM splitting tool (https://github.com/10XGenomics/subset-bam ). You should share those BAM's (instead of fastq) since they have all the relevant metadata that would be needed by others if they were to reproduce your analysis.