Dropseq, GSE124872 and structure of the reads in the FastQ files
3.0 years ago
tmms ▴ 10

Hello

I am confused by the structure of the reads in a FastQ file I downloaded via GEO. The data is stored under GEO accession GSE124872, sample GSM3557675. According to the SRA Run Selector, there is one FastQ file per sample, but the paper* and SRA mention that the data is paired-end, so I was expecting two FastQ files (r1 and r2).

When I downloaded and opened that FastQ file, the headers of the reads caught my attention. The headers contain the length of the sequences, and that length is approximately 200 bp ("@SRR8426358.1 1 length=202"). This suggests to me that r1 and r2 are concatenated. However, I couldn't find any source to confirm this. Additionally, I don't see a specific stretch of nucleotides between the reads; I would expect a fixed sequence between r1 and r2. For clarity, I added one read to this post:

@SRR8426358.1 1 length=202
ATCAATGATCGGTCGTGACTTTTTTTTTTTTTTTTTTTTTTTTTAGTGAAATAAATTCTTTNTTTTTGTTAGAAGACTGATTTTTAAATGTCTTTATCATTGCAAGAAAGTGATAACTGCCTTTAACGATGGACTGAATCACTTGGNAAGCNTCAAGGGCACCTTTGCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCAT
+SRR8426358.1 1 length=202
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJF--A<F--<----7F--A<J#F-AJA-7-7-<-A-----7<JF---<<-<7FFF-<-7-F<-<A-FJJF<-7<FJ7--77-<-AJJ-A77--7FAJJ7-A7-FJ-#-7--#7-7F7JFFFA-<JFJ-F7-AJ---AAAJ<FF7-7-7FAFJF7--7FJF-A

Do you think it is safe to simply split every read in two, so that r1 contains the first 100 bases and r2 the remaining bases? I tried this on a subsample of the FastQ file and quantified r1 and r2 separately with salmon. r1 had a very low mapping rate (1%), while r2 had a normal mapping rate (68%) when mapping against the mouse genome. This seems to support the idea that r1 contains the barcode and UMI, while r2 contains the actual sequences.
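For illustration, the split described above could be sketched roughly as follows. This is only a stopgap under stated assumptions: every record is exactly four lines, and both mates have the same length (for a 202-base record that would mean 101 + 101 rather than 100 + 102, which is a guess, not something the headers confirm). The function name split_concatenated_fastq and the file names are made up for this example.

```shell
# Split each concatenated read at its midpoint into an R1 and an R2 file.
# Assumption: 4-line FASTQ records and equal-length mates.
split_concatenated_fastq() {
  # $1: concatenated FASTQ, $2: output file for R1, $3: output file for R2
  awk -v r1="$2" -v r2="$3" '
    NR % 4 == 1 {                 # header line: keep only the read name
      print $1 > r1
      print $1 > r2
      next
    }
    NR % 4 == 3 {                 # separator line: reduce to a bare "+"
      print "+" > r1
      print "+" > r2
      next
    }
    {                             # sequence or quality line: cut in half
      half = int(length($0) / 2)
      print substr($0, 1, half) > r1
      print substr($0, half + 1) > r2
    }
  ' "$1"
}

# Usage (hypothetical file names):
# split_concatenated_fastq SRR8426358.fastq r1.fastq r2.fastq
```

If a record has an odd length, the extra base ends up in R2, so it is worth checking the read lengths before trusting the result.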

* "Single-cell libraries were sequenced in a 100 bp paired-end run on the Illumina HiSeq4000 using 0.2 nM denatured sample and 5% PhiX spike-in." ("An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics", Angelidis et al., 2019)

Thanks in advance and hoping someone can offer some insight into this data.

scRNA-seq Dropseq • 1.6k views

Based on that single read alone, I'd be inclined to think that the first 20 bases are the cell barcode and UMI, and the stuff after the T's is the real RNA sequence.

3.0 years ago
ATpoint 85k

I think the 202 bp read comes from how the data was downloaded. As you can see at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8426358, the run is indeed paired-end, with R1 and R2 stored separately. My assumption is that you ran fastq-dump without the --split-files flag? Without it, fastq-dump concatenates both mates into a single read; with it, you get two files, one for R1 and one for R2. And yes, the UMI/CB is probably in R1 and the cDNA in R2, at least this is how it goes with 10X Chromium data.

For quantification I recommend Alevin (https://salmon.readthedocs.io/en/latest/alevin.html), which has a dedicated --dropseq flag that parses the relevant CB/UMI and cDNA from the reads automatically. It is basically the single-cell module that builds on the Salmon selective alignment procedure, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/
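As a rough sketch of those two steps on the command line: the functions below wrap a fastq-dump call with --split-files and a salmon alevin call with --dropseq. The wrapper names (run, fetch_split, quantify_dropseq), the DRY_RUN switch, the thread count, and the index, transcript-to-gene map, and output paths are all placeholders invented for this example; the library type -l ISR and the --tgMap option follow the Alevin documentation, but please check them against the version you have installed.

```shell
# Print the command instead of running it when DRY_RUN=1 (handy for checking).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download one SRA run as two gzipped files: <run>_1.fastq.gz and <run>_2.fastq.gz
fetch_split() {
  # $1: SRA run accession
  run fastq-dump --split-files --gzip "$1"
}

# Quantify a Drop-seq library with Alevin.
quantify_dropseq() {
  # $1: R1 (CB/UMI), $2: R2 (cDNA), $3: salmon index directory,
  # $4: transcript-to-gene map (TSV), $5: output directory
  run salmon alevin -l ISR -1 "$1" -2 "$2" --dropseq \
    -i "$3" --tgMap "$4" -p 4 -o "$5"
}

# Usage (index and map paths are hypothetical):
# fetch_split SRR8426358
# quantify_dropseq SRR8426358_1.fastq.gz SRR8426358_2.fastq.gz \
#   salmon_index txp2gene.tsv alevin_out
```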

For downloading data, you can enter the accessions at sra-explorer.info to get direct fastq download links such as:

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/008/SRR8426358/SRR8426358_1.fastq.gz -o SRR8426358_GSM3557675_old_Dropseq_1_Mus_musculus_RNA-Seq_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/008/SRR8426358/SRR8426358_2.fastq.gz -o SRR8426358_GSM3557675_old_Dropseq_1_Mus_musculus_RNA-Seq_2.fastq.gz
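After downloading, a quick sanity check that the two files really are mates can save trouble later. The sketch below only compares record counts, assuming standard four-line FASTQ records; count_reads and check_pairs are helper names made up for this example.

```shell
# Count records in a gzipped FASTQ file (number of lines divided by 4).
count_reads() {
  # $1: gzipped FASTQ file
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Report whether the R1 and R2 files contain the same number of records.
check_pairs() {
  # $1: R1 file, $2: R2 file
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  if [ "$n1" = "$n2" ]; then
    echo "OK: $n1 read pairs"
  else
    echo "MISMATCH: $n1 vs $n2"
    return 1
  fi
}

# Usage with the files downloaded above:
# check_pairs SRR8426358_GSM3557675_old_Dropseq_1_Mus_musculus_RNA-Seq_1.fastq.gz \
#             SRR8426358_GSM3557675_old_Dropseq_1_Mus_musculus_RNA-Seq_2.fastq.gz
```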

Thank you very much for this answer.

Your assumption is correct, I indeed didn't add the --split-files flag to fastq-dump. I wasn't aware of this behaviour of fastq-dump until now.

Also thanks for the additional hints for the downstream analysis.


"I wasn't aware of this behaviour of fastq-dump until now."

Nobody is until the fastq-dump madness hits you for the first time. It is just a terrible design flaw of this tool to even output such a nonsense read rather than raising a warning or being smart and splitting paired-end data by default.
