Tool to Identify Gene, Regulatory Role, and Function at Integration Sites

12 months ago

kelpotus22 • 0

Is there a tool or website that can identify the gene and its regulatory role at a specified integration site on a chromosome (e.g., 1:20746689), along with the gene's function (e.g., DNA binding activity, nucleosome binding activity)?

chromosome regulatory integration • 831 views

ADD COMMENT • link updated 12 months ago by Pierre Lindenbaum 166k • written 12 months ago by kelpotus22 • 0

12 months ago

Pierre Lindenbaum 166k

Let's have fun with SPARQL. Here is the result for 1:20746689:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| build    | chrom | start                                              | end                                                | gene_id           | gene_name | gene_biotype     | go_id        | go_label                                      |
================================================================================================================================================================================================================================================
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0006334" | "nucleosome assembly"                         |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0000786" | "nucleosome"                                  |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0005634" | "nucleus"                                     |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0070828" | "heterochromatin organization"                |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0031491" | "nucleosome binding"                          |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0006355" | "regulation of DNA-templated transcription"   |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0003677" | "DNA binding"                                 |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0005694" | "chromosome"                                  |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0016607" | "nuclear speck"                               |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0042127" | "regulation of cell population proliferation" |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0005515" | "protein binding"                             |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0097298" | "regulation of nucleus size"                  |
| "GRCh38" | "1"   | "20740266"^^<http://www.w3.org/2001/XMLSchema#int> | "20787323"^^<http://www.w3.org/2001/XMLSchema#int> | "ENSG00000127483" | "HP1BP3"  | "protein_coding" | "GO:0071456" | "cellular response to hypoxia"                |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1) download a human GTF and convert it to RDF/XML with awk
2) download the Ensembl and GO mappings from NCBI (gene2ensembl, gene2go), join the two and convert the result to RDF/XML with awk
3) concatenate 1 and 2 to create an RDF database
4) run the SPARQL query with jena/arq

	# Emit one bio:Gene RDF/XML element per "gene" line of a GTF.
	# BUILD is passed on the command line with -v (see the Makefile).
	BEGIN {
		FS="\t";
	}

	($3=="gene") {
		gene_id="";
		gene_name="";
		gene_biotype="";
		# column 9 holds 'key "value";' pairs: split on ';' with optional surrounding spaces
		N=split($9,a,/[ ]*[;][ ]*/);
		for(i=1;i<=N;++i) {
			N2 = split(a[i],b,/[ ]/);
			K = b[1];
			V = b[2];
			gsub(/"/,"",V);
			if(K=="gene_id") gene_id=V;
			else if(K=="gene_name") gene_name=V;
			else if(K=="gene_biotype") gene_biotype=V;
		}
		# skip records without a gene_id: there is nothing to name the resource with
		if(gene_id=="") next;
		printf("<bio:Gene rdf:about=\"%s\">\n",gene_id);
		printf("\t<bio:gene_id>%s</bio:gene_id>\n",gene_id);
		if(gene_name!="") printf("\t<bio:gene_name>%s</bio:gene_name>\n",gene_name);
		if(gene_biotype!="") printf("\t<bio:gene_biotype>%s</bio:gene_biotype>\n",gene_biotype);
		printf("\t<bio:location>\n");
		printf("\t\t<bio:Location>\n");
		printf("\t\t\t<bio:build>%s</bio:build>\n",BUILD);
		printf("\t\t\t<bio:chrom>%s</bio:chrom>\n",$1);
		printf("\t\t\t<bio:start rdf:datatype=\"http://www.w3.org/2001/XMLSchema#int\">%s</bio:start>\n",$4);
		printf("\t\t\t<bio:end rdf:datatype=\"http://www.w3.org/2001/XMLSchema#int\">%s</bio:end>\n",$5);
		printf("\t\t</bio:Location>\n");
		printf("\t</bio:location>\n");
		printf("</bio:Gene>\n");
	}

(file: gtf2rdf.awk)
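To see what the attribute parsing in the awk script is doing, here is a minimal sketch that runs the same idea on a single GTF line. The line itself is hand-written for illustration (only the gene ID, name, biotype and coordinates come from the output table above; the source and strand columns are placeholders), and it prints a tab-separated summary instead of RDF.

```shell
# One hand-written, tab-separated GTF "gene" line (source and strand are placeholders).
gtf_line=$(printf '1\texample\tgene\t20740266\t20787323\t.\t.\t.\tgene_id "ENSG00000127483"; gene_name "HP1BP3"; gene_biotype "protein_coding";')

# Split column 9 on ';', then each piece into key and quoted value.
parsed=$(printf '%s\n' "$gtf_line" | awk -F '\t' '
$3 == "gene" {
    n = split($9, attrs, /[ ]*;[ ]*/)
    for (i = 1; i <= n; i++) {
        # the trailing ";" leaves an empty last piece: skip it
        if (split(attrs[i], kv, / /) < 2) continue
        gsub(/"/, "", kv[2])
        val[kv[1]] = kv[2]
    }
    printf "%s\t%s\t%s\t%s\t%s\n", val["gene_id"], val["gene_name"], val["gene_biotype"], $4, $5
}')

echo "$parsed"
```

The real script does the same split and then wraps the values in `bio:Gene` / `bio:Location` elements.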

	SHELL=/bin/bash
	OUTDIR=TMP
	BUILD=GRCh38

	# default target: build the RDF database, then run the query with Jena's arq
	all: $(OUTDIR)/database.rdf query.01.sparql
		/path/to/apache-jena-4.8.0/bin/arq --data=$< --query=query.01.sparql


	# wrap both RDF fragments in a single rdf:RDF root element
	$(OUTDIR)/database.rdf: $(OUTDIR)/go.rdf $(OUTDIR)/gtf.rdf
		mkdir -p $(dir $@)
		echo '<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:bio="https://www.biostars.org/#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xml:base="https://www.biostars.org/">' > $@
		cat $^ >> $@
		echo "</rdf:RDF>" >> $@


	# GO annotations: join NCBI gene2go and gene2ensembl (human only) on the NCBI GeneID.
	# NOTE: the two printf format strings below were garbled in the post; they are
	# reconstructed (an assumption) from the predicates used in query.01.sparql.
	$(OUTDIR)/go.rdf:
		mkdir -p $(dir $@)
		wget -O - "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz" | gunzip -c |\
			awk -F '\t' '$$1 == 9606' | cut -f 2,3,6 | sort -T $(dir $@) -t $$'\t' -k1,1 > $(addsuffix .tmp1,$@)
		wget -O - "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz" | gunzip -c |\
			awk -F '\t' '$$1 == 9606' | cut -f 2,3 | sort -T $(dir $@) -t $$'\t' -k1,1 > $(addsuffix .tmp2,$@)
		join -t $$'\t' -1 1 -2 1 $(addsuffix .tmp1,$@) $(addsuffix .tmp2,$@) > $(addsuffix .tmp3,$@)
		cut -f 2,3 $(addsuffix .tmp3,$@) | sort -T $(dir $@) | uniq |\
			awk -F '\t' '{GO=$$1; gsub(/:/,"_",GO); printf("<bio:Term rdf:about=\"%s\"><bio:go_id>%s</bio:go_id><rdfs:label>%s</rdfs:label></bio:Term>\n",GO,$$1,$$2);}' >> $@
		cut -f 2,4 $(addsuffix .tmp3,$@) | awk -F '\t' '{GO=$$1; gsub(/:/,"_",GO); printf("<rdf:Description rdf:about=\"%s\"><bio:has_go_term rdf:resource=\"%s\"/></rdf:Description>\n",$$2,GO);}' >> $@
		rm $(addsuffix .tmp1,$@) $(addsuffix .tmp2,$@) $(addsuffix .tmp3,$@)


	# genes: keep chromosome 1 only, then convert the GTF to RDF with gtf2rdf.awk
	$(OUTDIR)/gtf.rdf : gtf2rdf.awk
		mkdir -p $(dir $@)
		wget -O - "https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.$(BUILD).111.chr.gtf.gz" | gunzip -c |\
			awk '($$1=="1")' |\
			awk -vBUILD=$(BUILD) -f gtf2rdf.awk > $@

(file: Makefile)
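The join in the go.rdf rule can be tried on toy data first. The sketch below mimics the column layout of gene2go (tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term) and gene2ensembl (tax_id, GeneID, Ensembl gene ID); the GeneIDs 1001 and 2002 and the mouse row are made up for illustration, and only the GO terms and the Ensembl ID are taken from the output above.

```shell
tmpdir=$(mktemp -d)
tab=$(printf '\t')

# toy gene2go rows (GeneIDs are hypothetical)
printf '9606\t1001\tGO:0003677\t-\t-\tDNA binding\n9606\t1001\tGO:0031491\t-\t-\tnucleosome binding\n10090\t2002\tGO:0005634\t-\t-\tnucleus\n' > "$tmpdir/gene2go"
# toy gene2ensembl rows
printf '9606\t1001\tENSG00000127483\n10090\t2002\tENSMUSG_TOY\n' > "$tmpdir/gene2ensembl"

# keep human rows (tax_id 9606), keep the needed columns, sort on GeneID for join
awk -F '\t' '$1 == 9606' "$tmpdir/gene2go" | cut -f 2,3,6 | sort -t "$tab" -k1,1 > "$tmpdir/go.tsv"
awk -F '\t' '$1 == 9606' "$tmpdir/gene2ensembl" | cut -f 2,3 | sort -t "$tab" -k1,1 > "$tmpdir/ens.tsv"

# result columns: GeneID, GO_ID, GO_term, Ensembl gene ID
joined=$(join -t "$tab" -1 1 -2 1 "$tmpdir/go.tsv" "$tmpdir/ens.tsv")
echo "$joined"

rm -r "$tmpdir"
```

Both inputs must be sorted on the join field, which is why the Makefile sorts each file with `-k1,1` before calling `join`.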

	PREFIX bio: <https://www.biostars.org/#>
	PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
	PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

	SELECT
	?build
	?chrom
	?start
	?end
	?gene_id
	?gene_name
	?gene_biotype
	?go_id
	?go_label
	WHERE {
	?gene bio:gene_name ?gene_name .
	?gene bio:gene_biotype ?gene_biotype .
	?gene bio:gene_id ?gene_id .
	?gene bio:location ?loc .
	?loc a bio:Location .
	?loc bio:build ?build .
	?loc bio:chrom ?chrom .
	?loc bio:start ?start .
	?loc bio:end ?end .
	OPTIONAL {
	?gene bio:has_go_term ?go .
	?go bio:go_id ?go_id .
	?go rdfs:label ?go_label .
	}

	FILTER( ?start <= 20746689 ) .
	FILTER( ?end >= 20746689 ) .
	FILTER( ?chrom = "1" ) .
	}

(file: query.01.sparql)


ADD COMMENT • link 12 months ago by Pierre Lindenbaum 166k

Thank you so much for the detailed guidance! Just a quick query - would I be running these scripts in a bash environment to replicate the results?

Also, regarding the integration site query, will this process be able to identify the regulatory role of the gene, such as whether the site falls within a promoter or enhancer region?

ADD REPLY • link 12 months ago by kelpotus22 • 0

would I be running these scripts in a bash environment to replicate the results?

Yeah. I used SPARQL for fun, but if you don't know it, you should use tools like bedtools intersect and join.

ADD REPLY • link 12 months ago by Pierre Lindenbaum 166k
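For the simpler route, the lookup boils down to finding the gene records whose interval contains the site. A minimal sketch with plain awk on a two-line toy GTF follows; the second gene (TOY0001) is made up, and the source and strand columns are placeholders. With bedtools installed, something like `bedtools intersect -a site.bed -b genes.gtf -wa -wb` should do the same on real files (both file names are hypothetical).

```shell
chrom=1
pos=20746689

# toy GTF: one real gene interval from the output above, one made-up gene
genes=$(printf '1\texample\tgene\t20740266\t20787323\t.\t.\t.\tgene_id "ENSG00000127483"; gene_name "HP1BP3";\n1\texample\tgene\t100\t200\t.\t.\t.\tgene_id "TOY0001"; gene_name "TOY1";\n')

# keep gene records on the right chromosome whose [start,end] contains the site
hits=$(printf '%s\n' "$genes" | awk -F '\t' -v c="$chrom" -v p="$pos" '
$3 == "gene" && $1 == c && $4 <= p && $5 >= p { print $9 }')

echo "$hits"
```

The gene IDs found this way can then be matched against a GO mapping table with `join`, as in the go.rdf rule of the Makefile.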

I have been using a Linux environment but am fairly new to it; still, I'm eager to give them a try. Just to confirm, should I first run the awk command you provided, like this:

awk -v BUILD=GRCh38 -f gtf2rdf.awk > output.rdf

Followed by executing the Makefile with:

make Makefile

I'm not quite sure how to proceed with executing the query.01.sparql afterward. Could you please provide guidance on this? Please correct me if I'm wrong. Appreciate your help.

ADD REPLY • link 12 months ago by kelpotus22 • 0

just

make

ADD REPLY • link 12 months ago by Pierre Lindenbaum 166k
