Hello,
Please bear with me, as I am relatively new to single-cell RNA-seq data and to downloading data from the SRA Run Selector.
I am interested in downloading the mutant EGFR and mutant KRAS patient data from this article, deposited under BioProject accession PRJNA591860. Ideally, I would download the data for these patients, convert them into Seurat objects, merge them by condition (here, mEGFR vs. mKRAS), integrate the samples, and visualize differential expression (DE) between the two groups.
Now, from my understanding, because I'm only interested in part of the dataset (the mutant EGFR/KRAS patients), I should download the read data for each of those samples individually and run it through the workflow to generate individual count matrices, and so on. I found the raw reads on SRA under accession SRP238929. Based on the patient demographics the authors uploaded, each sample name in the spreadsheet should correspond to an isolate number on the SRA download page.
Each isolate has around 50 files associated with its patient. So my question is: what does this mean? Does it mean there are 50 reads associated with that patient? Do I need to download all of these files? If I'm demultiplexing with bcl2fastq2 and aligning with STAR, how would that work?
Is there also an easier way to get this data into the workflow (i.e., downloading the count matrices themselves rather than regenerating them)? I couldn't find them anywhere.
Thanks.
Dear crx6xw,
I have the same problem with this dataset. Additionally, when I try to process the samples with cellranger, I get a series of errors.
Have you tried to contact the authors to obtain the data already processed?
Nope, I simply didn't use this dataset. I think it's probably better to download the count matrices directly instead of processing the raw reads first.
Hi, I hope this helps.
In the article (https://www.sciencedirect.com/science/article/pii/S0092867420308825?via%3Dihub), the authors state that all code used to generate the results of this study is available on GitHub at czbiohub/scell_lung_adenocarcinoma.
The instructions there say to clone the repo and then download the Data_input folder from the link below into the repo: https://drive.google.com/drive/folders/1sDzO0WOD4rnGC7QfTKwdcQTx3L36PFwX?usp=sharing
In this drive, there is a folder containing the single-cell counts and the metadata. The files are named S01_datafinal.csv (counts) and S01_metadata.csv (metadata) (link: https://drive.google.com/drive/folders/1VmPan5V19Hq--fMnFHOta67TIpO60srE).
You can then keep only the samples of interest using the information available in NCBI BioProject PRJNA591860. For example, I needed all the samples from patient TH226, so I looked up in SRA which samples belong to this patient, and then filtered S01_datafinal.csv down to those samples with an R script.
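I did the filtering in R, but the same idea can be sketched in Python with only the standard library. This is a rough sketch, not the authors' code, and it rests on assumptions I have not checked against the actual files: that S01_datafinal.csv has genes as rows and one column per cell (with the gene name in the first column), and that S01_metadata.csv has one row per cell with columns called cell_id and patient_id. Those two column names are hypothetical, so open the metadata file and substitute the real ones.

```python
import csv


def cells_for_patient(metadata_path, patient_id,
                      id_col="cell_id", patient_col="patient_id"):
    """Return the set of cell IDs whose metadata row matches patient_id.

    id_col and patient_col are assumed column names; adjust them to
    whatever the metadata file actually uses.
    """
    with open(metadata_path, newline="") as fh:
        return {row[id_col] for row in csv.DictReader(fh)
                if row[patient_col] == patient_id}


def filter_counts(counts_path, out_path, keep_cells):
    """Write a copy of the counts table restricted to keep_cells.

    Assumes genes are rows and cells are columns, with the gene name in
    the first column. Rows are streamed one at a time, so the whole
    matrix is never held in memory. Returns the number of cells kept.
    """
    with open(counts_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        # Always keep column 0 (gene names), plus every matching cell column.
        keep_idx = [0] + [i for i, name in enumerate(header)
                          if i > 0 and name in keep_cells]
        writer.writerow([header[i] for i in keep_idx])
        for row in reader:
            writer.writerow([row[i] for i in keep_idx])
    return len(keep_idx) - 1
```

For patient TH226 this would be called as `keep = cells_for_patient("S01_metadata.csv", "TH226")` followed by `filter_counts("S01_datafinal.csv", "TH226_counts.csv", keep)`, where TH226_counts.csv is just an example output name. If the function returns 0, the column names or the matrix orientation are probably different from what I assumed here.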
Let me know if anything is unclear.