Hello, everyone:
I'm currently analyzing my scRNA-seq data. The first step is to split the FASTQ files according to my barcode file, which looks like this:
sc1 AACGTGAT
sc2 AAACATCG
sc3 ATGCCTAA
sc4 AGTGGTCA
sc5 ACCACTGT
sc6 ACATTGGC
sc7 CAGATCTG
sc8 CATCAAGT
sc9 CGCTGATC
sc10 ACAAGCTA
sc11 CTGTAGCC
sc12 AACGCTTA
My data are paired-end, and R1 and R2 look like this (I truncated the reads):
R2:
@ST-E00493:75:H33JKALXX:1:1101:10987:2206 2:N:0:ATACACAT
AACGCTTAAGGGTAATTTTTTGTGTTATGTATTTTTTTTTTAGGGGAAAAGGCATTTTTGGT...
+
AAFFFFJJ<A7JF<JF----AA--A--7----AAFJ-F<-FF-<<F-<-AFFA-7A7A-A-<...
R1:
@ST-E00493:75:H33JKALXX:1:1101:10987:2206 1:N:0:ATACACAT
GTTGTGAAGGGGAGGCTGGAGAGGCTTCGTCTGCTAAGAGCATTGGCCGTTCTTCCACTGTT...
+
AAAFFFJ-<JJJJJJJJFJJJF7JFFJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJFFJJJ...
The barcode is in the first 8 bp of R2 (here, AACGCTTA), so I want to split the FASTQ files by barcode and pair read 1 with read 2 using the header information. But after trying many programs and scripts, I can't find a suitable solution:
fastq-multx
fastq-multx -B barcode_sequence -b -m 0 R2.fastq.gz R1.fastq.gz -o %_R1.fq -o %_R2.fq
The result is definitely not what I want: only 7 lines are headed by their barcode.
fastx_barcode_splitter.pl: it doesn't seem to support PE reads.
BBmap
I also wrote a Python script, but it runs very slowly. So I wonder if somebody has a good suggestion. Thanks in advance!
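For reference, here is a minimal sketch of the split described above, assuming R1 and R2 list the same reads in the same order (as files straight off the sequencer normally do). Under that assumption both files can be read in lockstep in a single pass, so no header lookup is needed. The function names, the `out_dir` parameter, the output file names and the "unmatched" bin are all my own choices for illustration, and only exact barcode matches are accepted.

```python
import gzip
import itertools
import os
from contextlib import ExitStack


def load_barcodes(path):
    """Read 'name<whitespace>sequence' lines into {sequence: name}."""
    barcodes = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                barcodes[fields[1]] = fields[0]
    return barcodes


def read_fastq(handle):
    """Yield one 4-line FASTQ record at a time as a tuple of lines."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def demultiplex(r1_path, r2_path, barcode_path, out_dir=".", bc_len=8):
    """Split paired reads by the first bc_len bases of R2 (exact match only)."""
    barcodes = load_barcodes(barcode_path)
    counts = dict.fromkeys(barcodes.values(), 0)
    counts["unmatched"] = 0  # hypothetical bin for reads with no known barcode
    with ExitStack() as stack:
        r1 = stack.enter_context(gzip.open(r1_path, "rt"))
        r2 = stack.enter_context(gzip.open(r2_path, "rt"))
        # One R1/R2 output pair per sample, all closed when the block exits.
        out = {
            name: (
                stack.enter_context(open(os.path.join(out_dir, f"{name}_R1.fq"), "w")),
                stack.enter_context(open(os.path.join(out_dir, f"{name}_R2.fq"), "w")),
            )
            for name in counts
        }
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            # The read IDs (header text before the first space) must agree.
            if rec1[0].split()[0] != rec2[0].split()[0]:
                raise ValueError(f"R1/R2 out of sync at {rec1[0].strip()}")
            name = barcodes.get(rec2[1][:bc_len], "unmatched")
            out[name][0].writelines(rec1)
            out[name][1].writelines(rec2)
            counts[name] += 1
    return counts
```

Calling `demultiplex("R1.fastq.gz", "R2.fastq.gz", "barcode_sequence")` would write one `*_R1.fq`/`*_R2.fq` pair per sample into the current directory and return the number of read pairs per sample. The barcode is left on R2 in this sketch; trimming it would mean slicing both the sequence and the quality line.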
Isn't the barcode the last part of the header, in your example ATACACAC and ATACACAT? If so, demuxbyname.sh should help.
fin swimmer
No, it's in the first 8 bp of read 2.
Index sequences are in both R1/R2 headers (see the example you posted above). As suggested by @finswimmer, demuxbyname.sh should indeed work in this case.
Thanks for your reply, genomax, but this protocol is modified, so the barcode is in the first 8 bp of read 2. At the very beginning, when I received the data, I also thought the barcode was in the header of both reads. Such a weird protocol :(
Use the FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage
Thanks cpad0112 for your reply. I have tried this method before, but it can only handle SE data, so I would still need a script to pair read 1 and read 2.
You may want to look at this Python library (I haven't used it): https://pypi.org/project/barcode-splitter/
show us the code
The first version I wrote was too slow, so I found a Python script, paired-end_barcode_splitter, on GitHub. It uses the Perl script fastx_barcode_splitter.pl from the FASTX-Toolkit to split my read 2 data by barcode, and then uses the FASTQ headers to match the paired reads. But it's only a little faster: it takes about 10 h to run on my 40 GB of gzipped PE data. So I wonder if anyone has a better solution. Thanks a lot.
Thanks for the clarification, @RamRS
Hi, it looks like this post is the right place to find a solution to my problem. I am trying to do the same thing Houyu needed, with the difference that in my FASTQ files the barcodes are in the first 16 nt of the second line of each record in the R1 files. So I swapped the order of the R1/R2 files in fin swimmer's script and changed 8 to 16. However, only the last cell's R1/R2 files are saved (i.e. there are 1503 cells and only the SC1503_1/2.fastq files are saved). Any help, please?
Thanks a lot in advance!
SV
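One thing that may be worth keeping in mind with this many cells: 1503 cells with two output files each is about 3000 files, and a script that keeps them all open at once can exceed the per-process open-file limit, whose default is 1024 on many Linux systems. I can't tell from the description whether that is what goes wrong here, so treat the following only as a sketch of one way around it: buffer records per cell in memory and append them to disk in batches, so that only one output file is open at a time. The class name, its parameters and the `max_records` default are hypothetical choices of mine.

```python
import os
from collections import defaultdict


class BufferedCellWriter:
    """Collect FASTQ records per cell and write them to disk in batches."""

    def __init__(self, out_dir, suffix, max_records=100_000):
        self.out_dir = out_dir
        self.suffix = suffix            # e.g. "_1.fastq" or "_2.fastq"
        self.max_records = max_records  # flush once this many records are buffered
        self.buffers = defaultdict(list)
        self.buffered = 0
        self.seen = set()               # cells already written during this run

    def write(self, cell, record):
        """Queue one record (an iterable of its four lines) for a cell."""
        self.buffers[cell].append("".join(record))
        self.buffered += 1
        if self.buffered >= self.max_records:
            self.flush()

    def flush(self):
        """Write all buffered records, opening one file at a time."""
        for cell, chunks in self.buffers.items():
            path = os.path.join(self.out_dir, f"{cell}{self.suffix}")
            # Truncate on the first write of this run, append afterwards.
            mode = "a" if cell in self.seen else "w"
            with open(path, mode) as handle:
                handle.writelines(chunks)
            self.seen.add(cell)
        self.buffers.clear()
        self.buffered = 0
```

One writer per read file (for example one with suffix `_1.fastq` and one with `_2.fastq`) would be fed from the demultiplexing loop, with a final `flush()` call once the input is exhausted.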
Hi, did you solve the problem? I have the same problem: the output only contains the last barcode.
Have you solved your problem? I ran into the same problem and would like to ask how you solved it.
Please see the solution by finswimmer below.