Tool to Identify Gene, Regulatory Role, and Function at Integration Sites
12 months ago
kelpotus22 • 0

Is there a tool or website that can identify the gene at a specified integration site on a chromosome (e.g., 1:20746689), its regulatory role, and ideally also its function (e.g., DNA binding activity, nucleosome binding activity)?

chromosome regulatory integration • 831 views
12 months ago

Let's have fun with SPARQL. Here is the result for your example site, 1:20746689:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| build    | chrom | start                                              | end                                                | gene_id           | gene_name | gene_biotype     | go_id        | go_label                                      |
================================================================================================================================================================================================================================================
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0006334" | "nucleosome assembly"                         |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0000786" | "nucleosome"                                  |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0005634" | "nucleus"                                     |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0070828" | "heterochromatin organization"                |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0031491" | "nucleosome binding"                          |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0006355" | "regulation of DNA-templated transcription"   |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0003677" | "DNA binding"                                 |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0005694" | "chromosome"                                  |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0016607" | "nuclear speck"                               |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0042127" | "regulation of cell population proliferation" |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0005515" | "protein binding"                             |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0097298" | "regulation of nucleus size"                  |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0071456" | "cellular response to hypoxia"                |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1) Download a human GTF and convert it to RDF/XML with awk.
2) Download the Ensembl and GO mappings from NCBI, join both resources, and convert the result to RDF/XML with awk.
3) Concatenate 1 and 2 to create an RDF database.
4) Query the database with SPARQL using Jena/arq.
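Step 2 is essentially a join on the NCBI GeneID column. Here is a minimal Python sketch of that join, assuming the usual column layouts of gene2go (tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, ...) and gene2ensembl (tax_id, GeneID, Ensembl_gene_identifier, ...). The rows below are toy data: the GeneID values and the mouse Ensembl ID are placeholders, and unused columns are filled with ".". The function name is made up for illustration.

```python
from collections import defaultdict

# Toy rows shaped like gene2go (tab-separated). GeneIDs "1" and "2" are placeholders.
GENE2GO = [
    "9606\t1\tGO:0003677\t.\t.\tDNA binding\t.\t.",
    "9606\t1\tGO:0031491\t.\t.\tnucleosome binding\t.\t.",
    "10090\t2\tGO:0005634\t.\t.\tnucleus\t.\t.",
]
# Toy rows shaped like gene2ensembl; only the first three columns are used.
GENE2ENSEMBL = [
    "9606\t1\tENSG00000127483",
    "10090\t2\tENSMUSG_PLACEHOLDER",
]


def join_go_to_ensembl(gene2go_lines, gene2ensembl_lines, tax_id="9606"):
    """Return {ensembl_gene_id: [(go_id, go_label), ...]} for one taxon."""
    # GeneID -> set of Ensembl gene IDs, restricted to the requested taxon
    gene_to_ensembl = defaultdict(set)
    for line in gene2ensembl_lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == tax_id:
            gene_to_ensembl[cols[1]].add(cols[2])

    result = defaultdict(list)
    for line in gene2go_lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] != tax_id:
            continue
        # columns 3 and 6 of gene2go are GO_ID and GO_term
        pair = (cols[2], cols[5])
        for ensg in gene_to_ensembl.get(cols[1], ()):
            if pair not in result[ensg]:  # same effect as `sort | uniq`
                result[ensg].append(pair)
    return dict(result)


if __name__ == "__main__":
    for ensg, terms in join_go_to_ensembl(GENE2GO, GENE2ENSEMBL).items():
        for go_id, label in terms:
            print(ensg, go_id, label, sep="\t")
```

The Makefile does the same thing with sort and join on files, which scales better than in-memory dictionaries for the full NCBI dumps.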

gtf2rdf.awk:

# Convert the "gene" records of an Ensembl GTF into RDF/XML <bio:Gene> elements.
# BUILD (e.g. GRCh38) is passed on the command line with -v.
BEGIN {
    FS="\t";
}
($3=="gene") {
    gene_id="";
    gene_name="";
    gene_biotype="";
    # column 9 holds the attributes: key "value"; key "value"; ...
    N=split($9,a,/[ ]*[;][ ]*/);
    for(i=1;i<=N;++i) {
        N2 = split(a[i],b,/[ ]/);
        K = b[1];
        V = b[2];
        gsub(/"/,"",V);
        if(K=="gene_id") gene_id=V;
        else if(K=="gene_name") gene_name=V;
        else if(K=="gene_biotype") gene_biotype=V;
    }
    # skip records without a gene_id
    if(gene_id=="") next;
    printf("<bio:Gene rdf:about=\"%s\">\n",gene_id);
    printf("\t<bio:gene_id>%s</bio:gene_id>\n",gene_id);
    if(gene_name!="") printf("\t<bio:gene_name>%s</bio:gene_name>\n",gene_name);
    if(gene_biotype!="") printf("\t<bio:gene_biotype>%s</bio:gene_biotype>\n",gene_biotype);
    # start and end are typed as integers so the query can compare them numerically
    printf("\t<bio:location>\n");
    printf("\t\t<bio:Location>\n");
    printf("\t\t\t<bio:build>%s</bio:build>\n",BUILD);
    printf("\t\t\t<bio:chrom>%s</bio:chrom>\n",$1);
    printf("\t\t\t<bio:start rdf:datatype=\"http://www.w3.org/2001/XMLSchema#int\">%s</bio:start>\n",$4);
    printf("\t\t\t<bio:end rdf:datatype=\"http://www.w3.org/2001/XMLSchema#int\">%s</bio:end>\n",$5);
    printf("\t\t</bio:Location>\n");
    printf("\t</bio:location>\n");
    printf("</bio:Gene>\n");
}

Makefile:

SHELL=/bin/bash
OUTDIR=TMP
BUILD=GRCh38

# default target: build the RDF database, then run the query with Jena's arq
all: $(OUTDIR)/database.rdf query.01.sparql
	/path/to/apache-jena-4.8.0/bin/arq --data=$< --query=query.01.sparql

# wrap the two RDF fragments in a single rdf:RDF document
$(OUTDIR)/database.rdf: $(OUTDIR)/go.rdf $(OUTDIR)/gtf.rdf
	mkdir -p $(dir $@)
	echo '<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:bio="https://www.biostars.org/#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xml:base="https://www.biostars.org/">' > $@
	cat $^ >> $@
	echo "</rdf:RDF>" >> $@

# human (taxon 9606) rows of gene2go and gene2ensembl, joined on the NCBI GeneID
$(OUTDIR)/go.rdf:
	mkdir -p $(dir $@)
	wget -O - "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz" | gunzip -c |\
		awk -F '\t' '$$1==9606' | cut -f2,3,6 | sort -T $(dir $@) -t $$'\t' -k1,1 > $(addsuffix .tmp1,$@)
	wget -O - "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz" | gunzip -c |\
		awk -F '\t' '$$1==9606' | cut -f2,3 | sort -T $(dir $@) -t $$'\t' -k1,1 > $(addsuffix .tmp2,$@)
	join -t $$'\t' -1 1 -2 1 $(addsuffix .tmp1,$@) $(addsuffix .tmp2,$@) > $(addsuffix .tmp3,$@)
	# one bio:Term per GO id; bio:go_id and rdfs:label are the properties the query reads
	cut -f 2,3 $(addsuffix .tmp3,$@) | sort -T $(dir $@) | uniq |\
		awk -F '\t' '{GO=$$1;gsub(/:/,"",GO);printf("<bio:Term rdf:about=\"%s\"><bio:go_id>%s</bio:go_id><rdfs:label>%s</rdfs:label></bio:Term>\n",GO,$$1,$$2);}' >> $@
	# link each Ensembl gene to its GO terms with bio:has_go_term
	cut -f 2,4 $(addsuffix .tmp3,$@) | awk -F '\t' '{GO=$$1;gsub(/:/,"",GO);printf("<rdf:Description rdf:about=\"%s\"><bio:has_go_term rdf:resource=\"%s\"/></rdf:Description>\n",$$2,GO);}' >> $@
	rm $(addsuffix .tmp1,$@) $(addsuffix .tmp2,$@) $(addsuffix .tmp3,$@)

# genes of chromosome 1 only, converted to RDF by gtf2rdf.awk
$(OUTDIR)/gtf.rdf : gtf2rdf.awk
	mkdir -p $(dir $@)
	wget -O - "https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.$(BUILD).111.chr.gtf.gz" | gunzip -c |\
		awk '($$1=="1")' |\
		awk -vBUILD=$(BUILD) -f gtf2rdf.awk > $@

query.01.sparql:

PREFIX bio: <https://www.biostars.org/#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT
    ?build
    ?chrom
    ?start
    ?end
    ?gene_id
    ?gene_name
    ?gene_biotype
    ?go_id
    ?go_label
WHERE {
    ?gene bio:gene_name ?gene_name .
    ?gene bio:gene_biotype ?gene_biotype .
    ?gene bio:gene_id ?gene_id .
    ?gene bio:location ?loc .
    ?loc a bio:Location .
    ?loc bio:build ?build .
    ?loc bio:chrom ?chrom .
    ?loc bio:start ?start .
    ?loc bio:end ?end .
    OPTIONAL {
        ?gene bio:has_go_term ?go .
        ?go bio:go_id ?go_id .
        ?go rdfs:label ?go_label .
    }
    FILTER( ?start <= 20746689 ) .
    FILTER( ?end >= 20746689 ) .
    FILTER( ?chrom = "1" ) .
}
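
The two position FILTERs keep a gene when start <= position <= end (GTF coordinates are 1-based and inclusive), and the OPTIONAL block behaves like a left outer join: a gene overlapping the site is still returned when it has no GO term, just with unbound go_id and go_label. A small Python sketch of that logic on in-memory records, for anyone not used to SPARQL. The record layout, the function name, and the second, GO-less gene are hypothetical; only the HP1BP3 coordinates come from the output shown at the top.

```python
def query_site(genes, go_terms, chrom, pos):
    """Mimic the query: overlap filter plus an OPTIONAL (left outer) join to GO terms."""
    rows = []
    for gene in genes:
        # FILTER(?chrom = ...), FILTER(?start <= pos), FILTER(?end >= pos)
        if gene["chrom"] != chrom or not (gene["start"] <= pos <= gene["end"]):
            continue
        # OPTIONAL: keep the gene even if no GO term is attached
        terms = go_terms.get(gene["gene_id"]) or [(None, None)]
        for go_id, go_label in terms:
            rows.append((gene["gene_id"], gene["gene_name"], go_id, go_label))
    return rows


GENES = [
    {"gene_id": "ENSG00000127483", "gene_name": "HP1BP3",
     "chrom": "1", "start": 20740266, "end": 20787323},
    # hypothetical gene with no GO annotation, to show the OPTIONAL behaviour
    {"gene_id": "GENE_WITHOUT_GO", "gene_name": "hypothetical",
     "chrom": "1", "start": 20746000, "end": 20747000},
]
GO_TERMS = {"ENSG00000127483": [("GO:0003677", "DNA binding")]}

if __name__ == "__main__":
    for row in query_site(GENES, GO_TERMS, "1", 20746689):
        print(row)
```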

Thank you so much for the detailed guidance! Just a quick query - would I be running these scripts in a bash environment to replicate the results?

Also, regarding the integration site query, will this process be able to identify the regulatory role of the gene, such as whether the site falls within a promoter or enhancer region?


> would I be running these scripts in a bash environment to replicate the results?

Yeah. I used SPARQL for fun, but if you don't know it, you should use tools like bedtools intersect and join instead.
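
For the non-SPARQL route, the core of what an interval-intersection tool does here is an overlap test against the gene records of the GTF. A rough Python sketch of that idea, assuming an Ensembl-style GTF (tab-separated, with attributes such as gene_id "..."; in column 9). The function names are made up for illustration, and the sample lines are abridged: the coordinates and IDs of the gene line come from the HP1BP3 row in the output above, while the source and strand columns and the whole transcript line are placeholders.

```python
import re

# key "value" pairs in GTF column 9
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gene_line(line):
    """Return a dict for a GTF 'gene' record, or None for any other line."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) < 9 or cols[2] != "gene":
        return None
    attrs = dict(ATTR_RE.findall(cols[8]))
    return {
        "chrom": cols[0],
        "start": int(cols[3]),  # GTF coordinates are 1-based, inclusive
        "end": int(cols[4]),
        "gene_id": attrs.get("gene_id", ""),
        "gene_name": attrs.get("gene_name", ""),
        "gene_biotype": attrs.get("gene_biotype", ""),
    }


def genes_at(gtf_lines, chrom, pos):
    """Return every gene record whose interval contains chrom:pos."""
    hits = []
    for line in gtf_lines:
        gene = parse_gene_line(line)
        if gene and gene["chrom"] == chrom and gene["start"] <= pos <= gene["end"]:
            hits.append(gene)
    return hits


SAMPLE = [
    "#!placeholder header line",
    '1\texample\tgene\t20740266\t20787323\t.\t.\t.\t'
    'gene_id "ENSG00000127483"; gene_name "HP1BP3"; gene_biotype "protein_coding";',
    '1\texample\ttranscript\t20740266\t20787323\t.\t.\t.\t'
    'gene_id "ENSG00000127483"; transcript_id "PLACEHOLDER";',
]

if __name__ == "__main__":
    for gene in genes_at(SAMPLE, "1", 20746689):
        print(gene["gene_id"], gene["gene_name"], gene["gene_biotype"], sep="\t")
```

On a real file, the lines can come from `gzip.open(path, "rt")`. Joining the resulting Ensembl gene IDs to GO terms is then a plain lookup by gene ID.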


I have been using a Linux environment, but I'm fairly new to it. Still, I'm eager to give these a try. Just to confirm, should I run the awk command you provided first, like this:

awk -v BUILD=GRCh38 -f gtf2rdf.awk > output.rdf

Followed by executing the Makefile with:

make Makefile

I'm not quite sure how to proceed with executing the query.01.sparql afterward. Could you please provide guidance on this? Please correct me if I'm wrong. Appreciate your help.


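Just run

make

in the directory that contains Makefile, gtf2rdf.awk and query.01.sparql, after pointing the arq path in the Makefile at your Jena installation. The default target builds the RDF database and then runs the query, so there is no separate awk or SPARQL step.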