Dear community,
Something odd happened with my public dataset. It is clearly paired-end data: the 'metadata' section of the ENA database says so, and each sequencing run has two separate FASTQ files (R1 and R2) plus an index file (I1). After mapping the reads to the reference genome with cellranger count, I performed a typical scRNA-seq downstream analysis and applied stringtie and featureCounts to compare miRNA expression across cell types. The problem is that when I tried to determine the strandedness of my data, I ran infer_experiment.py, which resulted in
infer_experiment.py -r hg38_GENCODE_V42_Basic.bed -i my_bam_file.bam
This is SingleEnd Data
Fraction of reads failed to determine: 0.0670
Fraction of reads explained by "++,--": 0.8406
Fraction of reads explained by "+-,-+": 0.0924
So I double-checked whether this was really the case with
samtools view -c -f 1 my_bam_file.bam
which yielded 0 while
samtools view -c -F 1 my_bam_file.bam
yielded 97581274. This made me think that the aligned BAM files (all of the BAM files generated during the downstream analysis) are actually single-end data. The problem might have arisen from cellranger count, but there were no errors during mapping and no warnings in the summary.html output file (and I made sure to include all the R1 and R2 FASTQ files as input). I really can't understand why this is happening. Any help would be appreciated.
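For reference, samtools view -f 1 keeps only reads whose SAM FLAG has bit 0x1 ("read paired") set, and -F 1 keeps only reads where that bit is unset. Below is a minimal Python sketch of the same test on plain FLAG integers. The function name and the example FLAG values are made up for illustration and are not taken from my BAM file.

```python
# SAM FLAG bit 0x1: "template has multiple segments", i.e. the read is paired.
PAIRED_BIT = 0x1


def count_by_paired_bit(flags):
    """Return (paired, unpaired) counts for an iterable of SAM FLAG integers.

    The first number mirrors `samtools view -c -f 1`,
    the second mirrors `samtools view -c -F 1`.
    """
    flags = list(flags)
    paired = sum(1 for flag in flags if flag & PAIRED_BIT)
    return paired, len(flags) - paired


if __name__ == "__main__":
    # Hypothetical FLAG values typical of single-end alignments:
    # 0 = forward strand, 16 = reverse strand, 4 = unmapped.
    single_end_flags = [0, 16, 0, 4, 16]
    paired, unpaired = count_by_paired_bit(single_end_flags)
    print(f"paired: {paired}, unpaired: {unpaired}")
```

With FLAG values such as 99 or 147 (typical of properly paired alignments), the first count would be non-zero instead.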
Best,
So you mean data generated by 10x scRNA-seq is basically paired-end but technically single-end... It's a bit confusing, but it makes sense. Thank you.
Yes. Why is it confusing? If you look at 10x libraries (below), you see that the left-hand side of each fragment contains the CB and UMI and the right-hand side contains the cDNA. Hence, R1, which "comes from the left", picks up the CB/UMI, and R2, "from the right", picks up the cDNA. So technically it's paired-end because you use two reads on the same fragment, but there is only one read (R2) for gene expression, so it's single-end in that regard, and in the end the aligner only uses R2.
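To make the R1/R2 split concrete, here is a rough Python sketch of how one read pair collapses into a single-end record. The barcode and UMI lengths (16 and 12 bases) are an assumption that matches 10x 3' v3 chemistry (v2 uses a 10-base UMI), so check them against your own library. The function names and the toy sequences are hypothetical, and this is not how Cell Ranger is actually implemented.

```python
CB_LEN = 16   # assumption: cell barcode length for 10x 3' chemistry
UMI_LEN = 12  # assumption: UMI length for v3 chemistry (v2 uses 10)


def split_r1(r1_seq, cb_len=CB_LEN, umi_len=UMI_LEN):
    """Split an R1 sequence into (cell barcode, UMI); extra bases are ignored."""
    if len(r1_seq) < cb_len + umi_len:
        raise ValueError("R1 is shorter than barcode + UMI")
    return r1_seq[:cb_len], r1_seq[cb_len:cb_len + umi_len]


def to_single_end_record(r1_seq, r2_seq):
    """Build one single-end record: only R2 is kept as sequence to align.

    R1 survives only as annotation, here under CR/UR, the tag names
    Cell Ranger uses for the raw barcode and raw UMI in its BAM output.
    """
    barcode, umi = split_r1(r1_seq)
    return {"seq": r2_seq, "CR": barcode, "UR": umi}


if __name__ == "__main__":
    # Toy sequences, made up for illustration.
    r1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"
    r2 = "GATTACAGATTACAGATTACA"
    print(to_single_end_record(r1, r2))
```

Since only the R2 sequence reaches the aligner, each alignment in the BAM file is written as an unpaired read, which is why tools such as infer_experiment.py report single-end data.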