3.9 years ago
berry
Hi,
I have 3 single-cell RNA-seq datasets to integrate. They come from the same platform (10X), the same sample type, and the same condition, but from different labs. When I check the genes.tsv or features.tsv files, the vast majority of the IDs match, but I see some differences. For example, here "ENSG00000243485" corresponds to a different gene symbol in each dataset:
data1[data1$ENSEMBL == "ENSG00000243485", ]
>ENSG00000243485 MIR1302-2HG
data2[data2$ENSEMBL == "ENSG00000243485", ]
>ENSG00000243485 RP11-34P13.3
data3[data3$ENSEMBL == "ENSG00000243485", ]
>ENSG00000243485 MIR1302-10
Or here the gene symbol "AL627309.1" corresponds to a different Ensembl ID in data1 and data3, and is absent from data2:
data1[data1$GeneName == "AL627309.1", ]
>ENSG00000238009 AL627309.1
data2[data2$GeneName == "AL627309.1", ]
>0 rows
data3[data3$GeneName == "AL627309.1", ]
>ENSG00000237683 AL627309.1
How would you process these matrices?
Many thanks!
Can you find out which GTF file versions were used for the different samples? Presumably, they differ, and ideally, you should reprocess all samples with the same annotation file.
Hi Friederike, thank you for your reply. I only have access to CellRanger outputs unfortunately.
Is this from a paper or a collaborator?
From different papers. They all used GRCh38 but I don't know about the GTF files.
Their FASTQ files are likely uploaded to SRA or ENA. If they are, I would recommend rerunning them through Cell Ranger with the same annotation.
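Until the samples can be reprocessed, it may help to measure how far the three annotations actually disagree, using only the feature tables that ship with the Cell Ranger outputs. The following is a minimal sketch in base R, not a replacement for re-running with a common GTF. It assumes each table has the `ENSEMBL` and `GeneName` columns used in the question; the toy rows are copied from the examples above, and real tables would presumably be loaded with something like `read.delim(path, header = FALSE)`.

```r
# Toy feature tables built from the rows quoted in the question.
# Column names ENSEMBL / GeneName follow the question's snippets.
data1 <- data.frame(ENSEMBL  = c("ENSG00000243485", "ENSG00000238009"),
                    GeneName = c("MIR1302-2HG", "AL627309.1"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ENSEMBL  = "ENSG00000243485",
                    GeneName = "RP11-34P13.3",
                    stringsAsFactors = FALSE)
data3 <- data.frame(ENSEMBL  = c("ENSG00000243485", "ENSG00000237683"),
                    GeneName = c("MIR1302-10", "AL627309.1"),
                    stringsAsFactors = FALSE)

# Ensembl IDs present in all three datasets.
shared <- Reduce(intersect, list(data1$ENSEMBL, data2$ENSEMBL, data3$ENSEMBL))

# Full outer join on the Ensembl ID, one symbol column per dataset.
d3 <- data3
names(d3)[names(d3) == "GeneName"] <- "GeneName.d3"
m <- merge(data1, data2, by = "ENSEMBL", all = TRUE, suffixes = c(".d1", ".d2"))
m <- merge(m, d3, by = "ENSEMBL", all = TRUE)

# IDs found everywhere but carrying different symbols.
complete <- m[complete.cases(m), ]
conflict <- complete[complete$GeneName.d1 != complete$GeneName.d2 |
                     complete$GeneName.d1 != complete$GeneName.d3, ]

# IDs missing from at least one dataset.
partial <- m[!complete.cases(m), ]

print(conflict)
print(partial)
```

The join is keyed on the Ensembl ID rather than the symbol because symbols get renamed between annotation releases, as the `ENSG00000243485` example shows. IDs that land in `partial` cannot be reconciled by renaming alone, which is one more reason to reprocess with a single annotation when the raw reads are available.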