Question

Only one biological read present in FASTQ from the NCBI database for paired-end sequencing

3.9 years ago

Matt ▴ 20

Hi, I would like to extract the 2 biological reads of a paired-end single-cell RNA-seq run. For run SRR11772847 I tried the sra-toolkit command ./fastq-dump --skip-technical --split-3 SRR11772847. I expected 2 .fastq files, but I only get one, with reads of 98 bp (an extract is below). The file has 496352056 lines.

@SRR11772847.1.3 NB502129:188:HY73HBGX9:1:11101:13593:1050 length=98
NCGATCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.1.3 NB502129:188:HY73HBGX9:1:11101:13593:1050 length=98
#AAAAEA###########################################################################################
@SRR11772847.2.3 NB502129:188:HY73HBGX9:1:11101:9270:1050 length=98
NCCATGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.2.3 NB502129:188:HY73HBGX9:1:11101:9270:1050 length=98
#A/AA//###########################################################################################
@SRR11772847.3.3 NB502129:188:HY73HBGX9:1:11101:16784:1053 length=98
NAAAAGAATATCTGTCCTANNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.3.3 NB502129:188:HY73HBGX9:1:11101:16784:1053 length=98
#A/AAEEEEEEAEEEEEAA##E############################################################################
@SRR11772847.4.3 NB502129:188:HY73HBGX9:1:11101:20118:1053 length=98
NAGGAGGATGAAGGCTTACNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.4.3 NB502129:188:HY73HBGX9:1:11101:20118:1053 length=98
#A6AAE/AE/E/E/EEA//##<############################################################################
@SRR11772847.5.3 NB502129:188:HY73HBGX9:1:11101:13559:1054 length=98
NTTTTAGTTGGTCTTCATCTNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.5.3 NB502129:188:HY73HBGX9:1:11101:13559:1054 length=98
#AAAA/<E//EEEE6A/AE/#<############################################################################

I'm quite a beginner: am I missing something, or is the data for this run incomplete? I have a similar problem with run SRR7049900.

I also tried ./fastq-dump -I --split-files SRR11772847 but got 3 FASTQ files with reads of 8 bp, 26 bp and 98 bp. I expected another FASTQ file with 98 bp reads (read 2 of the paired-end sequencing), so I don't understand.

Thank you in advance for your help,

Matt

RNA-Seq • 2.7k views

ADD COMMENT • link updated 2.8 years ago by ATpoint 86k • written 3.9 years ago by Matt ▴ 20

score 4 · Accepted Answer · 2021-01-19

3.9 years ago

ATpoint 86k

The output of --split-files is correct, see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11772847

This is 10X scRNA-seq. The 8bp file is the index read, the 26bp file is the cellular barcode + UMI read, and the 98bp file is the cDNA read. The authors did not upload 98bp for R1 because everything beyond the first 26bp is meaningless in this kind of assay. Technically speaking, this kind of assay is basically single-end: by design you only get one "biological" read for the cDNA, which is R2. The three files you obtained are the required input for CellRanger, which is the standard processing tool. Alternatives are STARsolo and lightweight quantifiers such as the alevin module of the salmon software.
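To make the mapping between the three files and their roles concrete, here is a minimal Python sketch. It assumes the read lengths reported in this thread (8, 26 and 98 bp) and the bcl2fastq-style file naming that CellRanger expects; the helper names (`first_read_length`, `cellranger_name`, `plan_renames`) are made up for illustration, and the sample and lane values are placeholders.

```python
import gzip

# Assumption: read lengths reported in this thread for SRR11772847.
# I1 = index read, R1 = cell barcode + UMI, R2 = cDNA.
ROLE_BY_LENGTH = {8: "I1", 26: "R1", 98: "R2"}


def first_read_length(path):
    """Return the length of the first sequence in a FASTQ file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        handle.readline()  # skip the header line that starts with "@"
        return len(handle.readline().strip())


def cellranger_name(sample, role, lane=1):
    """Build a file name following the Sample_S1_L00X_R1_001.fastq.gz pattern."""
    return f"{sample}_S1_L{lane:03d}_{role}_001.fastq.gz"


def plan_renames(fastq_paths, sample):
    """Map each input FASTQ to a proposed CellRanger-style name.

    Nothing is renamed or compressed here; the result is only a plan.
    """
    plan = {}
    for path in fastq_paths:
        length = first_read_length(path)
        role = ROLE_BY_LENGTH.get(length)
        if role is None:
            raise ValueError(f"{path}: unexpected read length {length}")
        plan[str(path)] = cellranger_name(sample, role)
    return plan
```

Calling `plan_renames` on the three files written by --split-files would propose one I1, one R1 and one R2 name. The proposed names end in .fastq.gz, so the files would also need to be gzip-compressed before being renamed that way.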

See this scheme for a 10X v2 library:

[Image: scheme of a 10X v2 library]
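As a rough illustration of what the 26bp read contains, the sketch below splits it into cell barcode and UMI. It assumes the usual 10X v2 layout of a 16bp cell barcode followed by a 10bp UMI; the example sequence is invented, not taken from SRR11772847, and `split_barcode_umi` is a hypothetical helper name.

```python
CB_LEN = 16   # assumption: cell barcode length in a 10X v2 library
UMI_LEN = 10  # assumption: UMI length in a 10X v2 library


def split_barcode_umi(read1):
    """Split a 10X v2 R1 sequence into (cell_barcode, umi)."""
    if len(read1) < CB_LEN + UMI_LEN:
        raise ValueError(f"R1 is too short: {len(read1)} bp")
    cell_barcode = read1[:CB_LEN]
    umi = read1[CB_LEN:CB_LEN + UMI_LEN]
    return cell_barcode, umi


# Invented 26bp read, for illustration only.
example_r1 = "AAACCTGAGAAACCAT" + "TTTGGGCCCA"
barcode, umi = split_barcode_umi(example_r1)
print(barcode, umi)
```

Anything sequenced past these first 26 bases of R1 would be ignored by this split, which matches the point above that the rest of R1 carries no useful information in this assay.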

If you need further clarification please feel free to comment.

ADD COMMENT • link 3.9 years ago by ATpoint 86k

Thank you very much for this answer, very clear and very complete. I successfully used CellRanger with that.

ADD REPLY • link 3.9 years ago by Matt ▴ 20

I don't want to take up too much of your time, but consider this run, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR7049900, which was created with a 10x v2 library. In the NCBI database there is only one read of 110bp, yet the related paper states "Read 1 -26 cycles, i7 index-8cycles, i5 index : 0 cycles, Read 2 : 110 cycles".

Given your previous explanation and the paper, do you think data are missing from the NCBI record? Without barcodes/UMIs we can't analyze it with CellRanger.

ADD REPLY • link 3.9 years ago by Matt ▴ 20

10x data in SRA is hit and miss. There is no standard protocol that submitters and/or SRA seem to follow. Your best bet is to look under the Data Access tab in the link you posted above and see whether the section on Original format has BAM files available. In this case it looks like there is one. You can use the bamtofastq utility (LINK) provided by 10x to recreate the reads from this BAM.
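A minimal sketch of how the bamtofastq call could be assembled from Python is shown below. It assumes the basic usage `bamtofastq [options] <bam> <output-path>` with the `--nthreads` option, and that the BAM was produced by a 10x pipeline; the BAM file name and output directory are placeholders, and `bamtofastq_command` is a hypothetical helper. The command is only built and printed here, not executed.

```python
import shlex


def bamtofastq_command(bam_path, out_dir, threads=4):
    """Build the argument list for a basic bamtofastq run."""
    return ["bamtofastq", f"--nthreads={threads}", str(bam_path), str(out_dir)]


# Placeholder paths: replace with the BAM downloaded from the Data Access tab.
cmd = bamtofastq_command("SRR7049900_original.bam", "SRR7049900_fastq")
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)`, provided bamtofastq is installed and on the PATH.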

ADD REPLY • link 3.9 years ago by GenoMax 148k

Thank you for the trick, it seems to work! Quite strange that the provided data is not verified during the publication process ^^

ADD REPLY • link 3.9 years ago by Matt ▴ 20

Thank you for the explanation! I'm also still a beginner, and I would like to know whether I can follow this procedure with Galaxy as well. So far, I have accessed the data like Matt did (with Faster Download and Extract Reads in FASTQ), but I only get one file. How can I make sure I get all the files? In addition, I would like to use STARsolo for mapping. Should I treat the input as single-end?

Thank you in advance for helping out :)

ADD REPLY • link 2.8 years ago by balou • 0

Without the UMI/Barcode file the data are basically useless. You may want to open a new question with some details.

ADD REPLY • link 2.8 years ago by ATpoint 86k