Filter snRNA-seq .fastq files based on barcodes
0
0
Entering edit mode
2 days ago

Hello,

I have a list of barcodes generated from a snRNA-seq Seurat object, and I'd like to go back to my original .fast.gz R1 and R2 files to get the reads of my barcodes.

Right now, I am using bbmap based off these two posts: How To Extract A Subset Of Reads In Fastq Using An Id List? Bbmap filterbyname.sh to extract sequences out of a fasta file - wrong output

Here below is head barcodes.csv

AAACAGCCATGCTATG-1  wk17M
AAACATGCAGGCTTCG-1  wk17M
AAACGTACACTTACAG-1  wk17M
AAAGCAAGTGAGGTGA-1  wk17M
AAAGCACCAATTAACC-1  wk17M
AAAGCACCACATAACT-1  wk17M
AAAGCACCAGAATGAC-1  wk17M
AAAGCCCGTGATCAGC-1  wk17M
AAAGCCGCAGCCTGCA-1  wk17M
AAAGCTTGTGTGTCCC-1  wk17M

and my simplified code below for running bbmap

filterbyname.sh in=fq_r1.fastq.gz in2=fq_r2.fastq.gz \
out=fq_out_r1.fastq.gz out2=fq_out_r2.fastq.gz \
names=barcodes.csv include=t substring=t

But I am not receiving any matching reads. Is it due to my barcodes.csv being incorrectly formatted? Thank you!

snRNA-seq scRNA-seq FASTQ bbmap • 290 views
ADD COMMENT
1
Entering edit mode

Can you post an exact example of what your fastq headers look like?

names file needs to have only one entry per line without columns.

If you have the BAM files available it would be easier to do with a tool 10x makes available --> Separate single cell BAM file by the cell barcode

ADD REPLY
0
Entering edit mode

Thanks for the response! Here's an example of an out when i input zcat fq_r1.fastq.gz | head -n 10

@A00738:690:HCNMKDSX7:3:1101:21025:1016 1:N:0:GTAACATGCG+NGGTAACACT
CNCTATGAGTTTGCGGTTCAATACCAAGTAATGTAATTTCCAGGGTGGGGGGGTTTGGTAATTTTTGGAAGAACATCAGCCTTATTCTCACTAGTGGCCCTGACAGCACTCAGCCACAAACCATAAGGACACCCTACATCGTCTCACCCC
+
F#FFF:FFFFFFFFFFFFFFFFFFFFFFF,:,,F,,,,,,,,,,,,:,,:,:,,,FF,F,FF:FFF:F,,F,FF,,FF:F,:F,FFFFF,F,,F:FF,,F:FF:F,F,F,F,,F,FF:F,,F,,FF:FFF,FF,,,F,,FF,::FFFF:F
@A00738:690:HCNMKDSX7:3:1101:21043:1016 1:N:0:GTAACATGCG+NGGTAACACT
TNCATCAAGAGGAGTCCCAGTAGATAGCTAAATAATTTGGTTTTTTTTTTTTTTTTTACTCTGGACACTTTTTTAATAATGATTATCTTAAAACAAAAACACAATTTATAGCTGTACCAAAAAATTTTTTTTATATACTTATTCAATAAT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFF,:,,,F,,,,,,,F,,::,::,FFFFFFFFF,FFFFFF,FFFFF,FF:,:FF:,F:F:FFFF,FF,F,F,::FFF,F,,:,FF:F:FFF::FFF,FFFFF,F:F,:FFFF:F::,F,,FF:,

I did previously use the sinto to sort bam files, but when I tried converting those bam back to fastq using bamtofastq from 10X and re-aligning with cellranger count the alignment was very different (and Seurat object)

I'm essentially trying to go from Subset Seurat Object to subset Fastq R1 and R2, and then back to subset Seurat Object.

ADD REPLY
0
Entering edit mode

filterbyname.sh only works based on information that is found in the fastq header. As you can see the barcodes you have are not in those headers. They are still in the beginning 26 bp of R1. So the method you tried to use will not work.

If you try to filter usingGTAACATGCG+NGGTAACACT (which is in the fastq header) then it will work.

ADD REPLY
0
Entering edit mode

I assume this comes from CellRanger or simlar? First, you need to remove the -* so the dash with the trailing number. This is not present in the fastq files but added by other software. Then also, note that barcodes get error-corrected during the processing, e.g. by CellRanger so the raw barcodes sequenced in the fastq files might not be perfect matches. What I would probably do is to scan the BAM file for the error-corrected barcodes which should be perfect matches, then extract the read name from the bam and filter the fastq based on that. Or convert the filtered BAM back to fastq. Something like this.

ADD REPLY
0
Entering edit mode

Yes these barcodes came from a Seurat Object aligned from cellranger count, I'll change the .csv file with your corrections and try. I actually did use sinto as recommended by timmoast, then used bamtofastq from 10X to try and convert the bam to fastq. However, the resulting alignment with these filtered fastq files using cellranger count gave very different results than the original.

How would I use the filtered bam files to get the fastq? Convert to sam to get the reads pairs? Thanks!

ADD REPLY
1
Entering edit mode

How would I use the filtered bam files to get the fastq?

If you have the BAM files you can simply use samtools fastq your.bam with appropriate options to write out paired fastq files. Check in-line help.

Your original question asked about getting fastq per barcode. Is that actually your intention since there will be thousands of files if you do that.

However, the resulting alignment with these filtered fastq files using cellranger count gave very different results than the original.

What is the intent behind doing this?

ADD REPLY
0
Entering edit mode

What is the intent behind doing this?

From our original snRNA-seq Seurat object, we have a small subset (~5% of nuclei) that collaborators were given which is now in an accepted manuscript. However, the full analysis of the original snRNA-seq is still ongoing so we are trying to release only the relevant subset data. Therefore, we are trying to now subset the original fq.gz file to create a sub.fq.gz for GEO submission.

This has been done before with Split-seq (Parse Biosciences) fq files before, since they have their own pipeline written in python to split by barcodes that label specific well numbers. The dataset I'm troubleshooting right now was made with a 10X kit, so I can't use the same methods and I don't know python well enough to edit a source script for my purposes.

ADD REPLY
1
Entering edit mode

Use the 10x's BAM splitting tool (https://github.com/10XGenomics/subset-bam ). You should share those BAM's (instead of fastq) since they have all the relevant metadata that would be needed by others if they were to reproduce your analysis.

ADD REPLY

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6