Hello
I am confused by the structure of the reads in a FastQ file I downloaded via GEO. The data is stored in GEO under GSE124872, sample GSM3557675. According to the SRA Run Selector, there is one FastQ file per sample, but the paper* and SRA mention that the data is paired-end. So I was expecting two FastQ files (r1 and r2).
When I downloaded and opened that FastQ file, the headers of the reads caught my attention. The headers contain the length of the sequences, and that length is approximately 200 bp ("@SRR8426358.1 1 length=202"). This suggests to me that r1 and r2 are concatenated. However, I couldn't find any source to confirm this. Additionally, I don't see a specific stretch of nucleotides between the reads; I would expect a fixed sequence between r1 and r2. For clarity, I added one read to this post:
@SRR8426358.1 1 length=202
ATCAATGATCGGTCGTGACTTTTTTTTTTTTTTTTTTTTTTTTTAGTGAAATAAATTCTTTNTTTTTGTTAGAAGACTGATTTTTAAATGTCTTTATCATTGCAAGAAAGTGATAACTGCCTTTAACGATGGACTGAATCACTTGGNAAGCNTCAAGGGCACCTTTGCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCAT
+SRR8426358.1 1 length=202
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJF--A<F--<----7F--A<J#F-AJA-7-7-<-A-----7<JF---<<-<7FFF-<-7-F<-<A-FJJF<-7<FJ7--77-<-AJJ-A77--7FAJJ7-A7-FJ-#-7--#7-7F7JFFFA-<JFJ-F7-AJ---AAAJ<FF7-7-7FAFJF7--7FJF-A
Do you think it is safe to just split every read in two, so that r1 contains the first 100 bases and r2 the remaining bases? I tried this on a subsample of the FastQ file and quantified r1 and r2 separately with salmon. r1 had a very low mapping rate (1%), while r2 had a normal mapping rate (68%) when mapping to the mouse genome. This seems to support the idea that r1 contains the barcode and UMI, while r2 contains the actual sequences.
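For reference, here is a minimal Python sketch of the kind of split described above. It assumes that every record really is r1 followed directly by r2 with no spacer, and that both mates have the same length, so the default split point is the midpoint (101 bases for the 202-base read shown, not 100). The function names and the "/1" and "/2" suffixes are my own choices for illustration, not something taken from SRA or the paper.

```python
def split_record(header, seq, qual, split_at=None):
    """Split one concatenated FastQ record into two mates.

    Assumption: the read is mate 1 followed directly by mate 2.
    If split_at is None, the read is cut at its midpoint.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if split_at is None:
        split_at = len(seq) // 2
    name = header.split()[0]  # drop the "1 length=202" part of the header
    r1 = (name + "/1", seq[:split_at], qual[:split_at])
    r2 = (name + "/2", seq[split_at:], qual[split_at:])
    return r1, r2


def split_fastq(lines, split_at=None):
    """Yield (r1, r2) tuples from an iterable of FastQ lines (4 per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield split_record(header.rstrip("\n"), seq, qual, split_at)
```

Passing an open file handle to split_fastq and writing the two tuples of each pair to separate output files would give one r1 and one r2 FastQ file.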
* "Single-cell libraries were sequenced in a 100 bp paired-end run on the Illumina HiSeq4000 using 0.2 nM denatured sample and 5% PhiX spike-in." ("An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics", Angelidis et al., 2019)
Thanks in advance and hoping someone can offer some insight into this data.
Based on that single read alone, I'd be inclined to think that the first 20 bases are the cell barcode and UMI, and the stuff after the T's is the real RNA sequence.
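To check that guess on more than one read, something like the following rough Python sketch could be used. It assumes the first 20 bases split into a 12-base cell barcode and an 8-base UMI (an assumption on my part, not confirmed by the post or the paper), and it treats whatever follows the first long run of T's as the candidate RNA sequence. The function name, the 12/8 split, and the minimum run length of 10 are all illustrative.

```python
import re


def parse_read(seq, barcode_len=12, umi_len=8, min_t=10):
    """Return (barcode, umi, insert) for one read.

    Assumptions: barcode and UMI sit at the very start of the read,
    followed by a poly-T stretch of at least min_t bases.
    insert is None when no such poly-T stretch is found.
    """
    tag_len = barcode_len + umi_len
    barcode = seq[:barcode_len]
    umi = seq[barcode_len:tag_len]
    rest = seq[tag_len:]
    match = re.search("T{%d,}" % min_t, rest)
    insert = rest[match.end():] if match else None
    return barcode, umi, insert
```

Tabulating how many reads return a non-None insert, and whether the barcodes repeat across reads, would give a quick indication of whether this layout holds for the whole file.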