7.9 years ago
oriolebaltimore
Dear group,
I am looking for raw FASTQ files for TCGA RNA-Seq data. The BAM files were made using reads that map only to known genes. I am looking for FASTQ files that were not filtered in any way, i.e. not restricted to reads mapping to known genes.
I have access to Level 1 data through an approved protocol.
Thanks Adrian.
Sorry, I forgot to add: is it possible to get raw FASTQ files from TCGA? Thanks.
Are you sure about that? You can see the command line used to map the reads in the BAM header. I don't see anything there that suggests only reads mapping to known genes were kept. Only known genes were used when quantifying, but that's different.
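To check this yourself, here is a minimal sketch of pulling the recorded command lines out of a header. It assumes you have already dumped the header as plain text (for example with `samtools view -H`); the function name `program_records` and the example header are made up for illustration and are not taken from a TCGA file.

```python
def program_records(header_text):
    """Return one dict of tag -> value per @PG line in a SAM header given as text."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        fields = {}
        # Fields after the record type look like TAG:value and are tab-separated.
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")  # split on the first colon only
            fields[tag] = value
        records.append(fields)
    return records


# Hypothetical header, for illustration only.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@PG\tID:aligner\tPN:aligner\tCL:aligner --threads 4 ref.fa reads.fq\n"
)

for record in program_records(example_header):
    # CL holds the command line, if the program recorded one.
    print(record.get("ID"), "->", record.get("CL"))
```

If a filtering step had been applied, you would expect it to show up as its own @PG entry or as an option in one of the CL fields, provided the tool recorded it.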
FASTQs only exist in the TCGA legacy archive, which I don't think contains everything.
The legacy archive does contain all FASTQ files for RNA-Seq data. They are in TARGZ format.
Link to GDC Legacy Archive
That's correct. You just need to convert the BAM files available through GDC back to raw reads.
Here's an example pipeline: https://github.com/mforde84/TCGA-BRCA-RNAseq-realignment-pipeline
Also, from experience, converting BAM to FASTQ is a bottleneck. Picard has an option for this, but it's really slow. The pipeline linked above uses a custom tool called fasty for the conversion, but I couldn't locate my source code for it. Instead you could use something like the following, which should be as fast: https://github.com/arq5x/bedtools2/blob/master/src/bamToFastq/bamToFastq.cpp
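For anyone wondering what the conversion actually involves, here is a rough sketch of the logic in Python. It is not fasty and not the bedtools code, just an illustration under a few assumptions: the input is SAM text (a BAM would have to be decoded first, e.g. by piping `samtools view -h`), the function names are made up, and mates are not re-paired or re-ordered, which a real paired-end workflow needs.

```python
# Translation table used to complement bases before reversing the read.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_record_to_fastq(sam_line):
    """Convert one SAM alignment line into a four-line FASTQ record."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:
        # Reverse-strand alignments store the reverse complement of the read,
        # so undo that to recover the sequence as it came off the sequencer.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        name += "/1"  # first read of a pair
    elif flag & 0x80:
        name += "/2"  # second read of a pair
    return f"@{name}\n{seq}\n+\n{qual}\n"


def sam_to_fastq(sam_lines):
    """Yield FASTQ records, skipping headers and non-primary alignments."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header line
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800): read already emitted
        yield sam_record_to_fastq(line)


# Hypothetical records, for illustration only.
example_sam = [
    "@HD\tVN:1.4\n",
    "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIJKL\n",
    "r1\t272\tchr2\t500\t0\t4M\t*\t0\t0\tAACG\tIJKL\n",
]
print("".join(sam_to_fastq(example_sam)), end="")
```

Unmapped reads are written out like any other record here, which is the point if you want everything that was sequenced. A pure Python loop like this will be slow on full TCGA files, so treat it as a description of the steps rather than a replacement for a compiled tool.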