Trouble finding datasets
19 months ago
SHXVRR ▴ 20

Hello,

I am trying to find datasets for a project on HNSCC. I have been using GEO as my main source but have not found anything suitable. I am looking for an RNA dataset on HNSCC, with tumor and control samples from the tonsil, that includes FASTA files. I find it hard to find GEO datasets that are also linked to SRA, which provides FASTA files, unlike typical GEO datasets that mostly contain only txt files. I have found numerous GEO datasets that fit the bill but contain no FASTA files. Do you know how I can search for SRA datasets more effectively, or of any other website that has datasets with FASTA files?

Thanks

GEO SRA • 1.3k views
19 months ago
GenoMax 151k

First of all, there are no FASTA files with next-generation sequencing datasets; you will have FASTQ files. Secondly, you will likely not be able to access the original FASTQ files unless you apply for access via dbGaP, because of participant privacy.

You can find these datasets here: https://portal.gdc.cancer.gov/projects/TCGA-HNSC

If you are able to use gene counts etc., then some of the files may be available via open access: https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22content%22%3A%5B%7B%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-HNSC%22%5D%7D%2C%22op%22%3A%22in%22%7D%5D%2C%22op%22%3A%22and%22%7D&searchTableTab=files
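The long link above is just the GDC repository page with a URL-encoded JSON filter that restricts files to the TCGA-HNSC project. As a minimal sketch (standard library only; the helper names `project_filter`, `repository_url`, and `filters_from_url` are made up for illustration), the filter can be decoded and rebuilt like this, which makes it easier to swap in another project ID:

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

PORTAL_REPOSITORY = "https://portal.gdc.cancer.gov/repository"


def project_filter(project_id):
    """Build the same filter structure that the link above encodes."""
    return {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            }
        ],
    }


def repository_url(filters):
    """URL-encode the filter as the 'filters' query parameter."""
    query = urlencode(
        {
            "facetTab": "files",
            # Compact separators avoid spaces in the encoded JSON.
            "filters": json.dumps(filters, separators=(",", ":")),
            "searchTableTab": "files",
        }
    )
    return f"{PORTAL_REPOSITORY}?{query}"


def filters_from_url(url):
    """Decode the JSON filter back out of a repository URL."""
    return json.loads(parse_qs(urlparse(url).query)["filters"][0])


if __name__ == "__main__":
    url = repository_url(project_filter("TCGA-HNSC"))
    print(url)
    print(json.dumps(filters_from_url(url), indent=2))
```

The rebuilt URL may order the JSON keys differently from the link above, but it decodes to the same filter.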


If I’m trying to do RNA-seq, how do you think the pipeline would look if I start with BAM files from TCGA? Normally I would run something like FastQC, Trimmomatic, STAR, and then featureCounts. With the BAM files, would I just go straight to featureCounts?


You will still need to apply for access to the BAM files; they are not publicly available. If you can use counts, those are publicly available. There are portals like cBioPortal and Xenabrowser that give you access to analyzed TCGA data.


Oh, that makes sense. My end goal is to find a set of genes associated with that dataset. Would I need to compare it to a control dataset, or could I run DESeq2 with the counts file and a sample data file? If so, how could I compare it to a control dataset using R?
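One possible way to get the "sample data file" for a tumor versus control comparison is to derive each sample's condition from its TCGA barcode. The fourth barcode field starts with a two-digit sample type code: 01 to 09 are tumor samples and 10 to 19 are normal samples. The sketch below assumes the count matrix columns are named by full TCGA barcodes; the barcodes in the example are invented placeholders, not real samples, and the function names are hypothetical. Whether TCGA-HNSC has enough normal samples from the relevant site would still need to be checked.

```python
import csv
import io


def sample_condition(barcode):
    """Classify a TCGA barcode as 'tumor', 'normal', or 'other'.

    The sample type code is the first two digits of the fourth
    dash-separated field (e.g. '01A' -> 01, '11A' -> 11).
    """
    fields = barcode.split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit() or len(fields[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"


def sample_table_csv(barcodes):
    """Return a two-column CSV (sample, condition) as text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "condition"])
    for barcode in barcodes:
        writer.writerow([barcode, sample_condition(barcode)])
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder barcodes for illustration only.
    example = [
        "TCGA-XX-0001-01A-11R-0000-07",
        "TCGA-XX-0001-11A-11R-0000-07",
    ]
    print(sample_table_csv(example))
```

A table like this can then be read into R and passed as `colData` to DESeq2's `DESeqDataSetFromMatrix` together with the count matrix and a design such as `~ condition`.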

15 days ago
ehaag ▴ 80

If you do need to find other datasets, you can check the NIAID Data Ecosystem: https://data.niaid.nih.gov/. They integrate GEO as well as about 40 other repositories, so you can search across all of them simultaneously.


Are you associated with this website/portal? As I noted on one of your prior posts, this portal seems to focus on infectious and immune disease datasets, going by its own description:

NIAID Data Ecosystem Discovery Portal you can find tips for searching infectious and immune disease datasets

Unless you know that it covers more than these topics, please do not post about it in unrelated threads. If you wish to publicize this resource, you could create a separate post in the Tools category.


It has more than strictly infectious disease datasets. They integrate general biomedical sources like Zenodo and bioinformatics tool repositories like BioTools. It's relatively new and not a lot of people seem to know about it so I thought I'd get the word out.
