Hi,
Where can I find the library type (poly-A or ribo-depleted) for each RNA-seq sample/study in TCGA? I have tried looking through various papers and GDC portal and couldn't find an exact answer.
Thanks!
If you add a custom filter in GDC portal of read_group.library_selection, you will see
The library prep information, if present, is found in read-group fields that are associated with aligned read files (bam files). These are controlled-access files but the metadata is freely available, and can be related to the read-count per gene (gene-expression matrix) which is probably what you have from your GDC searches.
So your path to obtain the information you need is:
gene expression matrix -> bam file -> read group -> library prep info
All the fields associated with files in GDC are here: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields
The following fields are sometimes available for bam files, if the submitter added them (they are empty for the gene expression files):
analysis.metadata.read_groups.library_preparation_kit_catalog_number
analysis.metadata.read_groups.library_preparation_kit_name
analysis.metadata.read_groups.library_preparation_kit_vendor
analysis.metadata.read_groups.library_preparation_kit_version
analysis.metadata.read_groups.library_selection
analysis.metadata.read_groups.library_strand
analysis.metadata.read_groups.library_strategy
For example, let's say you want library prep info from the gene expression file 39f97389-0f71-4942-a7f9-b2df25a8365d.rna_seq.augmented_star_gene_counts.tsv
Search for that file in the GDC portal, which takes you to the page for that file. Scroll down to "Analysis", where there is an entry "source files" with the number "1" next to it, and click on that.
Now you should be on the page for the associated "sequencing reads" file (8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam, a bam file). Under 'read groups' there is some information on the read lengths etc. However, it's still not enough detail.
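The portal click-through above (counts file -> source bam) can probably also be done through the API. Here is a minimal Python sketch, assuming the /files endpoint exposes the fields analysis.input_files.file_name and analysis.input_files.file_id (check them against the Appendix A field list before relying on them). The helper name build_source_file_payload is made up for illustration.

```python
import json


def build_source_file_payload(counts_file_name):
    """Build a /files query asking which input (source) files produced a counts file."""
    return {
        "filters": {
            "op": "=",
            "content": {"field": "files.file_name", "value": counts_file_name},
        },
        "format": "json",
        # Assumed field names: verify against Appendix A of the API user guide.
        "fields": "file_name,analysis.input_files.file_name,analysis.input_files.file_id",
        "size": "10",
    }


payload = build_source_file_payload(
    "39f97389-0f71-4942-a7f9-b2df25a8365d.rna_seq.augmented_star_gene_counts.tsv"
)

# Print the payload so it can be saved to a file and POSTed with curl.
print(json.dumps(payload, indent=2))
```

Saving the printed JSON to a file and POSTing it to https://api.gdc.cancer.gov/files should, if the assumed fields are valid, return the name and id of the bam file without going through the portal.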
Let's use the API with curl: make a plain text file called "payload" with the following JSON:
{
    "filters":{
        "op":"=",
        "content":{
            "field":"files.file_name",
            "value":"8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam"
        }
    },
    "format":"tsv",
    "fields":"analysis.metadata.read_groups.read_group_id,analysis.metadata.read_groups.library_selection,analysis.metadata.read_groups.library_strategy,analysis.metadata.read_groups.library_preparation_kit_name",
    "size":"100"
}
Change the bam name in the filter to any other bam file you might want to query, and change the fields to any other valid field names for the /files endpoint that you might be interested in (they are listed in the user guide's appendix).
Then use this payload to query the files endpoint at GDC:
curl --request POST --header "Content-Type: application/json" --data @payload 'https://api.gdc.cancer.gov/files' > response.txt
The 'response.txt' file in this example should contain:
analysis.metadata.read_groups.0.library_preparation_kit_name	analysis.metadata.read_groups.0.library_selection	analysis.metadata.read_groups.0.library_strategy	analysis.metadata.read_groups.0.read_group_id	id
TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Gold	rRNA Depletion	RNA-Seq	7f44fbe0-a4ef-4765-a47b-4869195559ce	80ca4e7a-e74f-4db0-a534-14d431537aa9
Warning - you could get a large file with lots of mostly empty columns if there are multiple entries for a field in one of the results. This is because we are coercing the result into a tsv table ("format":"tsv"). Whichever entry in your response has the highest number of read groups defines the number of columns; all the others are filled with empty strings for those columns. You can see which columns can potentially proliferate because they have a zero in them (e.g. "read_groups.0.").
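One way around the wide, sparse table is to request "format":"json" instead and flatten the read groups into one row each. Here is a minimal Python sketch, assuming the JSON response nests hits under data.hits and read groups under analysis.metadata.read_groups (mirroring the field names above). The function flatten_read_groups and the sample response are hypothetical, not real GDC output.

```python
def flatten_read_groups(response, keys):
    """Return one dict per read group, tagged with the id of the file it belongs to."""
    rows = []
    for hit in response.get("data", {}).get("hits", []):
        metadata = hit.get("analysis", {}).get("metadata", {})
        for read_group in metadata.get("read_groups", []):
            row = {"file_id": hit.get("id")}
            # Missing fields become None instead of padded empty columns.
            for key in keys:
                row[key] = read_group.get(key)
            rows.append(row)
    return rows


# Made-up response with the assumed shape; all values are placeholders.
sample_response = {
    "data": {
        "hits": [
            {
                "id": "file-1",
                "analysis": {"metadata": {"read_groups": [
                    {"read_group_id": "rg-a", "library_selection": "rRNA Depletion"},
                    {"read_group_id": "rg-b"},
                ]}},
            },
            {"id": "file-2"},  # a file without read-group metadata yields no rows
        ]
    }
}

rows = flatten_read_groups(sample_response, ["read_group_id", "library_selection"])
for row in rows:
    print(row)
```

This gives a long table (one row per read group) rather than a wide one, so a single file with many read groups no longer inflates the column count for every other file.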
I just noticed that you were asking about TCGA specifically, instead of GDC in general. It's pretty straightforward in TCGA.
If you understand the TCGA barcode (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/), you can simply parse out the 20th character of the aliquot barcode (aliquot.submitter_id in GDC): "R" means polyA and "T" means ribo-depletion (https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/portion-analyte-codes).
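The barcode rule can be sketched in Python as below. The mapping of "R" to poly-A and "T" to ribo-depletion is the interpretation given in this answer, not something read from the code table (which, as far as I know, simply labels R as RNA and T as total RNA). The function names are made up, and the second barcode is an illustrative placeholder rather than a real aliquot.

```python
# Interpretation from this thread, not an official GDC mapping.
LIBRARY_TYPE_BY_ANALYTE = {"R": "poly-A", "T": "ribo-depleted"}


def analyte_code(aliquot_barcode):
    """Return the analyte code: the 20th character of a full aliquot barcode."""
    if len(aliquot_barcode) < 20:
        raise ValueError("expected a full aliquot barcode, got: " + aliquot_barcode)
    return aliquot_barcode[19]  # 0-based index 19 is the 20th character


def library_type(aliquot_barcode):
    """Map the analyte code to a library type, or None for non-RNA analytes."""
    return LIBRARY_TYPE_BY_ANALYTE.get(analyte_code(aliquot_barcode))


# Example barcode with a DNA analyte ("D"), so no RNA library type applies.
print(analyte_code("TCGA-02-0001-01C-01D-0182-01"))
# Illustrative placeholder barcode with an "R" analyte.
print(library_type("TCGA-XX-XXXX-01A-11R-XXXX-07"))
```

Applied to the aliquot.submitter_id column of a metadata table, this gives a per-sample library type without touching the controlled-access bam files.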
+1 for this. Maybe downloading metadata using TCGAbiolinks and searching for "TotalRNASeqV2" might yield some results - I will try it later and let you know.
Any luck, Barry?
Nothing, unfortunately. GDCquery() only uses 'Illumina' for the platform argument, despite the man pages describing a wide array of options. I double-checked and downloaded all GDCqueries for each project beginning with TCGA and yep, Illumina is the only level in the platform column.
Attempting to filter by experimental.strategy = "Total RNA-Seq" returns NULL objects for all 'TCGA-' projects too.
In short, I don't think TCGAbiolinks is a valid option.
Perhaps ask the TCGA / GDC directly, @komal.rathi