Let me show you an example: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR16093385&display=metadata
This dataset contains two reads, R1 and R2, and both have the same read length of 150 bp.
However, this experiment was performed following the 10x 3' library protocol. The methods section describes it as follows:
The scRNA-seq libraries were generated using the 10x Genomics Chromium Controller Instrument and Chromium Single Cell 3' V3 Reagent Kits (10x Genomics). Briefly, cells were concentrated to 1,000 cells/mL and approximately 8,000–10,000 cells were loaded into each channel to generate single-cell gel bead-in-emulsions (GEM), which resulted in the expected mRNA barcoding of 3,000–8,000 single cells for each sample. After the reverse transcription step, GEMs were broken and barcoded cDNA was purified and amplified. The amplified barcoded cDNA was fragmented, A-tailed, ligated with adaptors and index PCR amplified. The final libraries were quantified using a Qubit High Sensitivity DNA assay (Thermo Fisher Scientific) and the size distribution of these libraries was determined by a High Sensitivity DNA chip on a Bioanalyzer 2200 (Agilent). All libraries were then sequenced by an Illumina sequencer (Illumina) on a 150 bp paired-end run.
Generally, the fastq files from a 10x 3' library should be I1, R1 and R2. R1 contains only the UMI and barcode, so R1 is far shorter than R2. According to this paper, they generated double-stranded cDNA in which both strands carry the UMI and barcode (I think?). That would make it reasonable to produce two fastq files of equal read length, like paired-end sequencing data.
Whenever I download such a file from either SRA or ENA, I always get these two fastq files. I think the index, UMI and barcode should be in the reads, but I don't know how to extract them and split the SRA or fastq file into the default format of 10x scRNA-seq fastq files.
When I look up the original data stored on AWS, the filenames are not in the normal format for 10x 3' library fastq files:
s3://sra-pub-src-10/SRR16093385/OC17-1_BKDL192531646-1a-AK1647_1.fq.gz.1
and
s3://sra-pub-src-9/SRR16093385/OC17-1_BKDL192531646-1a-AK1647_2.fq.gz.1
BTW, the example I provided here is not the only case. I have found this issue in another dataset as well. It's so strange and confusing.
Very clear. Thank you very much. May I ask one more question? If I want to extract the UMI and barcode from R1 and keep the R2 insert reads in a separate fastq, is there any tool that can do this? Is it possible to "re-create" the R1 and R2 files? I know UMI_tools has a function to extract the UMI and barcode sequence, but it seems to expect the standard 10x R1.fastq as input.
The links you provided above are for the original fastq data submitted to NCBI (normally found under the Data Access tab, in the Original format section). In some instances people also submit cellranger BAM files. You can then use the bamtofastq utility provided by 10x to recreate the original fastq files. It is now included in the cellranger package.
You're correct. In some instances the BAM file is an option, and data stored on the AWS cloud is sometimes publicly accessible. But in many other cases only _R1.fq and _R2.fq were provided. That is why I was confused. ATpoint has explained why I always get two 150 bp reads, but I still don't know whether there is any method to recreate those original fastq files. I don't use CellRanger because I'm trying to use these raw data for another special purpose rather than quantifying gene expression. I've read the CellRanger manual, and it does not seem to mention any function for splitting reads in this SRA-archived format into I1, R1 and R2.
Would you mind enlightening me if there is any tool that can achieve this task? Thank you.
The files you linked to are the original reads for one sample. I1 files, if they exist, simply contain Illumina index sequences. The sequence in that file is identical for every read in one sample.
The I1 fastq is actually not my concern. Many datasets in SRA do not provide an I1 fastq either. What I need are the R1 (which contains only the UMI and barcode, so its length should be 26/28 bp) and R2 (which contains only the insert sequence, varying from 98 to 150 bp) fastq files. The sequences of these two fastq files should already be included in the files downloaded from SRA (in which the read length of both files is 150 bp).
That is what ATpoint explained above. Tools like cellranger will automatically use the parts of each read that they need, e.g. 26-28 bp from Read 1 to get the UMI/cell barcodes.
These submitters sequenced the samples much longer than recommended/necessary for both Read 1 and Read 2 and submitted the sequences as is. You can manually trim the reads down if you want to make them the recommended length.
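For anyone who wants to do that manual trimming without a dedicated tool, here is a minimal Python sketch. It assumes plain four-line FASTQ records, and the function names `trim_fastq` and `trim_gz` are made up for this example, not part of any package. The lengths to keep (28 for R1, 98 for R2) are just the values discussed in this thread; adjust them to whatever your downstream software expects.

```python
import gzip
import io


def trim_fastq(src, dst, keep):
    """Copy 4-line FASTQ records from src to dst, keeping only the
    first `keep` bases (and quality values) of every read.
    Returns the number of records written."""
    n = 0
    while True:
        header = src.readline()
        if not header:          # end of file
            break
        seq = src.readline().rstrip("\n")
        plus = src.readline()
        qual = src.readline().rstrip("\n")
        dst.write(header)
        dst.write(seq[:keep] + "\n")
        dst.write(plus)
        dst.write(qual[:keep] + "\n")
        n += 1
    return n


def trim_gz(in_path, out_path, keep):
    """Same as trim_fastq, but for gzip-compressed files on disk."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        return trim_fastq(src, dst, keep)


# Small in-memory demonstration with two toy reads.
demo_in = io.StringIO("@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
                      "@r2\nTTTTGGGGCC\n+\nFFFFFFFFFF\n")
demo_out = io.StringIO()
trim_fastq(demo_in, demo_out, keep=4)
print(demo_out.getvalue(), end="")

# Hypothetical file names, shown only to illustrate the call:
# trim_gz("sample_1.fastq.gz", "sample_R1_trimmed.fastq.gz", keep=28)
# trim_gz("sample_2.fastq.gz", "sample_R2_trimmed.fastq.gz", keep=98)
```

Because every record is written back in the same order and none are dropped, R1 and R2 stay in sync as long as both files are processed in full.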
Hello, I have some questions, if you can help me please.
Thank you,
No, you do not need trimmomatic, since you will be following the cellranger, alevin, etc. scRNA-seq pipeline.
It's fine to be new to a field, but what is not fine is to be resistant to advice. Two experienced users have already told you that you don't need trimmomatic or any trimming for your scRNA-seq data, so what is the point of insisting on it? Use CellRanger with your 10x data as advised in the CellRanger manual and be done with it.
Sorry, I got it: NO trimmomatic. Thank you.
Dear GenoMax,
Hello. I have carefully read your and ATpoint's answers. I am working on a project similar to Tomas4482's, where I use SAHMI to annotate microbial information from single-cell sequencing. However, SAHMI requires kraken2 to calculate k-mer values from the sequencing reads. As you suggested, I can use bamtofastq to obtain the official 10x fastq files, where R1 contains only the barcode and UMI, and R2 contains only the sequencing data. For the R1 and R2 fastq files that I downloaded from the internet, with a read length of 150, I want to extract the relevant information. As you mentioned, I need to do it manually. The barcode and UMI are the first 26 or 28 bases of the Fastq1 file. For the Fastq2 file, how can I locate the 91 or 98 bp? I can only extract them if I know their positions. I would be very grateful if you could help me with this. Thank you!
This is the structure of the 10x libraries: https://kb.10xgenomics.com/hc/en-us/articles/360035999892-What-is-the-structure-of-the-final-Visium-for-fresh-frozen-library-
If your software expects 91 or 98 bp then you can take the first 91/98 bp from the fastq2 file. You can trim the data using bbduk.sh or any other trimming program to keep that many bases.
Thank you! I have written a script to keep the first 91/98 bp. I then ran cellranger both before and after trimming to examine the results, and they were almost the same, with some minor differences. I followed ATpoint's suggestion and posted a new issue here: Extract the true single-cell RNA sequencing reads for running SAHMI. Thank you very much for your reply!
Please open a new question for this one.
Hi! I've posted a new issue here: Extract the true single-cell RNA sequencing reads for running SAHMI
Hi! Just so that I understand this correctly: for 10x v3 libraries sequenced on a NovaSeq, does it mean that if R1 is 150 bp and looks like:
GNAACATGTTATAGCCTGGAATATCAGATTATTGTATATCATAAGTAGTCTCTATTTTTTTTTTTTTAAATATTTATGCTGTGTTTTCCCCGGGTGTAGTACAAATGTGCGAGATCGTCGAACCACCACCACCCCCACCTCGCGAGACTC
Then we can effectively ignore everything from bp 29 onwards? I am asking this in relation to this biostars post on STARsolo. Thank you!
Yes.
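To make the "first 28 bp" concrete, here is a small Python sketch that splits the start of the R1 read quoted above into cell barcode and UMI. It assumes the usual 10x 3' layout of a 16 bp cell barcode followed by a 12 bp UMI for v3 (10 bp UMI for v2, which gives the 26 bp mentioned earlier in the thread). The `split_r1` helper and the `CHEMISTRY` table are names invented for this example.

```python
# (barcode length, UMI length) per chemistry; assumed layout, see lead-in.
CHEMISTRY = {"v2": (16, 10), "v3": (16, 12)}


def split_r1(seq, chemistry="v3"):
    """Return (cell_barcode, umi) taken from the start of an R1 read.
    Everything after barcode + UMI is ignored."""
    bc_len, umi_len = CHEMISTRY[chemistry]
    if len(seq) < bc_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    return seq[:bc_len], seq[bc_len:bc_len + umi_len]


# First 40 bases of the 150 bp R1 read quoted in the question.
r1 = "GNAACATGTTATAGCCTGGAATATCAGATTATTGTATATC"
barcode, umi = split_r1(r1, "v3")
print(barcode, umi)  # GNAACATGTTATAGCC TGGAATATCAGA
```

Bases 1-16 are the barcode and bases 17-28 are the UMI, so everything from bp 29 onwards is never looked at.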