while the HNSCC clinical file downloaded from the TCGA database (nationwidechildrens.org_clinical_patient_hnsc.txt) contains 529 samples, 41 of which are HPV (+), but most of them are tumor-free samples.
Does anyone know what explains this inconsistency and how to use these data for analysis?
The HPV data in the standard TCGA download for HNSC was based on Sequenom mass-spec analysis. That was found to be less useful than HPV calling based on RNA expression levels, the method ultimately used in the Nature paper (RNA reads matched to HPV sequence that exceeded a threshold, as noted in the methods or supplemental methods). The cases identified by RNA expression had characteristics better corresponding to clinically defined HPV-positive HNSC. My guess is that Sequenom sometimes picked up HPV sequence that was not genomically integrated in a way that allowed for HPV expression, so that those tumors would not behave clinically as HPV-positive (e.g., in terms of p16 expression and sensitivity to cytotoxic therapy).
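To make the RNA-based calling idea concrete, here is a minimal sketch in Python. It assumes you already have, for each sample, a count of RNA reads that aligned to HPV sequence. The threshold value, the function name, and the example counts are all hypothetical placeholders; the cutoff actually used is the one described in the paper's methods, which is not reproduced here.

```python
# Hypothetical cutoff for illustration only; substitute the threshold
# given in the paper's methods or supplemental methods.
HPV_READ_THRESHOLD = 1000


def call_hpv_status(hpv_read_counts, threshold=HPV_READ_THRESHOLD):
    """Label each sample by whether its HPV-aligned RNA read count exceeds the threshold."""
    return {
        sample: "positive" if reads > threshold else "negative"
        for sample, reads in hpv_read_counts.items()
    }


# Made-up sample IDs and read counts, not real TCGA values.
example_counts = {"sample_A": 25000, "sample_B": 12, "sample_C": 0}
hpv_calls = call_hpv_status(example_counts)

for sample, status in hpv_calls.items():
    print(sample, status)
```

A sample like sample_B, with only a handful of HPV reads, would be called negative here even if a DNA-based assay had detected HPV sequence, which is the kind of disagreement described above.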
Did you get the data from the TCGA publication freeze for head and neck ( https://tcga-data.nci.nih.gov/docs/publications/hnsc_2014/ )? TCGA continues to add sequenced samples after the marker paper comes out.
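If the clinical file turns out to be larger than the publication cohort, one way to reconcile the two is to keep only the patients listed in the freeze. The sketch below is a rough illustration, assuming the clinical file is tab-delimited and identifies patients in a column named bcr_patient_barcode, and that you have the freeze patient IDs as a set. The hpv_status column, the patient IDs, and the function name are invented for the example.

```python
import csv
import io


def restrict_to_freeze(clinical_rows, freeze_barcodes, id_column="bcr_patient_barcode"):
    """Keep only the clinical rows whose patient ID is in the publication-freeze set."""
    return [row for row in clinical_rows if row[id_column] in freeze_barcodes]


# Tiny in-memory stand-in for the clinical file (made-up IDs and values).
clinical_text = (
    "bcr_patient_barcode\thpv_status\n"
    "PATIENT-0001\tPositive\n"
    "PATIENT-0002\tNegative\n"
    "PATIENT-0003\tNegative\n"
)
rows = list(csv.DictReader(io.StringIO(clinical_text), delimiter="\t"))

# Hypothetical set of patient IDs taken from the publication freeze.
freeze_barcodes = {"PATIENT-0001", "PATIENT-0003"}

freeze_rows = restrict_to_freeze(rows, freeze_barcodes)
print(len(rows), "rows in clinical file;", len(freeze_rows), "in the freeze")
```

With the real file you would read from disk instead of a string, and you may need to skip any extra descriptive header rows before the data start.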
Yes, I got the data and the publication from this website. Maybe TCGA continued to add sequenced samples after the paper came out.
The numbers differ a lot (from 279 to over 500). Let me double-check this. Thanks.