I've looked really hard, and I can't find any documentation for their RNA-seq analysis pipeline other than a statement that they followed the ICGC STAR 2-pass RNA-seq SOP, which is not documented on the ICGC site at the moment as far as I can find. I've even gone as far as looking at the ICGC and GDC GitHub repositories to see if I can find the commands they used, but thus far, no luck.
I am not entirely clear what you want, but my interpretation is that you also want the normal samples from the FPKM files. The FPKM files cover both normal and tumor data. When you download samples from the GDC portal, there is also a metadata file under the download option. Download it: it links each sample to its filename and also gives a TCGA barcode, which lets you characterise the samples. If you split a barcode on '-', the fourth element has the form 01A, 01B, 11A, 07A, etc.
The first two digits of that element tell you whether the sample is normal or tumor: 01-09 stands for tumor and 10-19 for normal (20-29 are control samples).
E.g., if the barcode is TCGA-P4-A5E8-01A-11R-A28H-07, the fourth element is 01A and the sample is tumor, whereas for TCGA-P4-A5E8-11A-11R-A28H-07 the fourth element is 11A and the sample is normal. The first three elements of these two barcodes are the same, meaning the tumor and normal samples come from the same patient.
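The barcode split described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard TCGA sample-type ranges (01-09 tumor, 10-19 normal, 20-29 control); the function name `classify_tcga_barcode` is made up for illustration.

```python
from typing import Tuple


def classify_tcga_barcode(barcode: str) -> Tuple[str, str]:
    """Return (patient_id, sample_kind) for a TCGA barcode.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")

    # The fourth element looks like '01A': two-digit sample type plus a vial letter.
    sample_code = int(fields[3][:2])
    if 1 <= sample_code <= 9:
        kind = "tumor"
    elif 10 <= sample_code <= 19:
        kind = "normal"
    elif 20 <= sample_code <= 29:
        kind = "control"
    else:
        kind = "unknown"

    # The first three elements (project, tissue source site, participant) identify the patient.
    patient_id = "-".join(fields[:3])
    return patient_id, kind


tumor = classify_tcga_barcode("TCGA-P4-A5E8-01A-11R-A28H-07")
normal = classify_tcga_barcode("TCGA-P4-A5E8-11A-11R-A28H-07")
same_patient = tumor[0] == normal[0]
```

Grouping the rows of the metadata file by the returned patient ID is then enough to pair each tumor sample with its matched normal, where one exists.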
Maybe I haven't explained myself well. What I actually want is to convert SRA-format data obtained from another database (SRAdb) into gene expression FPKM values, following the same steps GDC used, and then compare the result with the data I downloaded from GDC.
For now, I have worked out a way to complete the procedure, but for some reason I can't go through the concrete steps here. Thank you for your advice, but I think you may have misunderstood my purpose. Thank you all the same. @noorpratap.singh
As far as I can tell, GDC does not yet contain the matched normals for all cancer samples as the old portal did. If you do the barcode translation step you recommend, you'll find that the matched normal barcode is often unrecognized.
The ICGC pipeline is explained in the OICR wiki, if you have access to it.
It's STAR 2-pass alignment, followed by HTSeq (htseq-count), assuming all libraries are unstranded.
GDC is working on making all pipelines public (not in weeks, more likely months), if you can wait.
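Until the official commands are published, the general shape of a STAR 2-pass plus htseq-count run can be sketched as below. This is a generic sketch, not the ICGC or GDC commands: it assumes a pre-built STAR index, uses on-the-fly insertion of first-pass junctions via `--sjdbFileChrStartEnd`, and every path, prefix, and the helper name `two_pass_commands` are placeholders.

```python
from typing import List, Sequence


def two_pass_commands(genome_dir: str, reads: Sequence[str], gtf: str,
                      prefix: str = "sample_", threads: int = 4) -> List[List[str]]:
    """Build argv lists for a generic STAR 2-pass + htseq-count run (hypothetical helper)."""
    pass1_prefix = f"{prefix}pass1_"
    pass2_prefix = f"{prefix}pass2_"

    # Pass 1: align only to discover splice junctions (written to <prefix>SJ.out.tab).
    pass1 = ["STAR", "--runThreadN", str(threads),
             "--genomeDir", genome_dir,
             "--readFilesIn", *reads,
             "--outFileNamePrefix", pass1_prefix,
             "--outSAMtype", "None"]

    # Pass 2: realign with the pass-1 junctions added to the index on the fly.
    pass2 = ["STAR", "--runThreadN", str(threads),
             "--genomeDir", genome_dir,
             "--readFilesIn", *reads,
             "--sjdbFileChrStartEnd", f"{pass1_prefix}SJ.out.tab",
             "--outFileNamePrefix", pass2_prefix,
             "--outSAMtype", "BAM", "SortedByCoordinate"]

    # Counting: '-s no' treats the library as unstranded; counts go to stdout.
    count = ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no",
             f"{pass2_prefix}Aligned.sortedByCoord.out.bam", gtf]

    return [pass1, pass2, count]


commands = two_pass_commands("star_index", ["r1.fastq", "r2.fastq"], "genes.gtf")
```

Each list can be passed to `subprocess.run(cmd, check=True)`, redirecting the stdout of the last command to a counts file. The alignment filters actually used by ICGC/GDC are not included here and would need to be added once known.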
"STAR 2-pass" could cover a multitude of sins. Fortunately, I found that the exact command IS contained in the header of the BAM file, at least for the second pass (but unfortunately not the first). But it's a start. Key points to note: they allow up to 10 mismatches, or up to a third of the aligned read length; up to 20 multi-mappings; a minimum overhang of 1 for a known splice junction; and they assign strand based on intron motif into the XS attribute. Here is an example:
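To pull the recorded command out of your own BAM, a minimal sketch follows. It assumes you have dumped the header to text first (for instance with `samtools view -H file.bam`) and that the aligner stored its command line in the `CL` tag of an `@PG` line. The function name `program_commands` is hypothetical, and the sample header is illustrative only, not copied from a GDC file; its STAR flags are the ones that appear to correspond to the multi-map, overhang, and strand settings listed above.

```python
from typing import Dict


def program_commands(header_text: str) -> Dict[str, str]:
    """Map each @PG program ID in a SAM/BAM header to its recorded command line."""
    commands = {}
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        tags = {}
        # Header fields are tab-separated TAG:VALUE pairs; split on the first ':' only
        # so that colons inside the command line are preserved.
        for field in line.split("\t")[1:]:
            key, _, value = field.partition(":")
            tags[key] = value
        if "CL" in tags:
            commands[tags.get("ID", "?")] = tags["CL"]
    return commands


# Illustrative header only (hypothetical paths and values).
sample_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@PG\tID:STAR\tPN:STAR\t"
    "CL:STAR --genomeDir /path/to/index --outFilterMultimapNmax 20 "
    "--alignSJDBoverhangMin 1 --outSAMstrandField intronMotif\n"
)
recorded = program_commands(sample_header)
```

Comparing the recovered command line against your own STAR invocation is a quick way to check that a re-analysis of SRA data uses the same alignment settings.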
You can ask the GDC Help Desk for help. Their e-mail address is support@nci-gdc.datacommons.io