I know this is kind of a long shot, but worth a try! Does anyone have experience working with the data dump for the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer:
ftp://ftp1.nci.nih.gov/pub/CGAP/mitelman.tar.gz
I have extracted and rebuild the database with sqlite3, and I am trying to retrieve gene fusion information (all of them), with there associated disease, sample counts, and PubMed references. No matter what I try things never seem to match up with what appears in the web portal.
Where are the two queries I am currently working with. The first one selects each fusion (I only want GeneA/GeneB fusions, and it lists single gene aberrations in the table anyways, also blank entries sometimes??). There are duplicate rows (one for each chromosome of the fusions) which I why I have the GROUP BY at the end.
This query seems to be working.
SELECT
rec.Gene,
rec.Total_Cases,
kod.Benamning
FROM RECCURENT_DATA as rec
INNER JOIN Koder as kod
ON rec.Code = kod.Kod
WHERE rec.Gene != ""
AND rec.Gene LIKE '%/%'
AND kod.KodTyp = 'MORPH'
GROUP BY rec.Gene, rec.Total_Cases, kod.Benamning
Then I have this one to TRY and get the associated PubMed ids via the MolClinGene table.
SELECT
mol.Gene,
GROUP_CONCAT(DISTINCT(ref.Pubmed))
FROM MolClinGene as mol
INNER JOIN REFERENCE as ref
ON mol.RefNo = ref.RefNo
WHERE mol.Gene IN (#LIST OF FUSION NAMES FROM ABOVE#)
GROUP BY mol.Gene
Note that I run the first, and keep a list of the fusions I need to lookup (in python) and then put that in the WHERE clause of the second one. This part fails terribly and I seem to get the right pubs sometimes, the wrong ones sometimes, extra ones sometimes and missing ones sometimes. Maybe someone has done something similar!
Thanks!
Whoops, sorry about that. I updated it.