3.3 years ago
McClain
I am trying to replicate the mergeByOverlaps function from R Bioconductor in Python using the pyranges package. In R the code would be:
gr.snp <- with(gr.snp, GRanges(chr, IRanges(start, end),rsid=gr.snp$rsid))
snp.annotated <- data.frame(mergeByOverlaps(gr.snp, gencode, maxgap=2000, type="start"))
which returns:
nrow(snp.annotated)
[1] 34
colnames(snp.annotated)
[1] "gr.snp.seqnames" "gr.snp.start"
[3] "gr.snp.end" "gr.snp.width"
[5] "gr.snp.strand" "gr.snp.rsid"
[7] "rsid" "gencode.seqnames"
[9] "gencode.start" "gencode.end"
[11] "gencode.width" "gencode.strand"
[13] "gencode.source" "gencode.type"
[15] "gencode.score" "gencode.phase"
[17] "gencode.ID" "gencode.gene_id"
[19] "gencode.gene_type" "gencode.gene_name"
[21] "gencode.level" "gencode.hgnc_id"
[23] "gencode.havana_gene" "gencode.Parent"
[25] "gencode.transcript_id" "gencode.transcript_type"
[27] "gencode.transcript_name" "gencode.transcript_support_level"
[29] "gencode.tag" "gencode.havana_transcript"
[31] "gencode.exon_number" "gencode.exon_id"
[33] "gencode.ont" "gencode.protein_id"
Here gr.snp is my SNP file that I want the annotations for, and gencode is the annotation file.
The closest I've gotten in Python is with cluster or merge, but neither is exactly right:
>>> temp = gr_snp.cluster(genecode_genes, slack = 2000)
>>> len(temp)
469
>>> temp.columns
Index(['Chromosome', 'Start', 'End', 'rsid', 'Strand', 'Cluster'], dtype='object')
Merge does basically the same thing but preserves even less metadata. I need the metadata from both tables to be preserved. Does anyone know how to do this?
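To pin down the behaviour being asked for, here is a minimal pure-Python sketch of the matching rule. It assumes that `type="start"` with `maxgap=2000` means "same chromosome, and the two start positions differ by at most 2000", which is how I read the IRanges documentation. The function name `merge_by_start`, the column prefixes, and the toy records are all hypothetical, and strand is ignored. It is not pyranges code. pyranges does have a `join` method that keeps columns from both tables, but whether its `slack` argument reproduces this start-based rule is not something verified here.

```python
from collections import defaultdict


def merge_by_start(query, subject, maxgap=0, q_prefix="snp_", s_prefix="gencode_"):
    """Pair each query record with every subject record on the same
    chromosome whose start lies within `maxgap` of the query start.
    All fields from both records are kept, under prefixed column names."""
    # Index subject records by chromosome so we never compare across chromosomes.
    by_chrom = defaultdict(list)
    for rec in subject:
        by_chrom[rec["chr"]].append(rec)

    merged = []
    for q in query:
        for s in by_chrom.get(q["chr"], []):
            # Assumed meaning of type="start": compare starts only, with tolerance.
            if abs(q["start"] - s["start"]) <= maxgap:
                row = {q_prefix + k: v for k, v in q.items()}
                row.update({s_prefix + k: v for k, v in s.items()})
                merged.append(row)
    return merged


# Toy records, made up for illustration (not real SNPs or GENCODE entries).
snps = [
    {"chr": "chr1", "start": 10_500, "end": 10_500, "rsid": "rs_example_1"},
    {"chr": "chr2", "start": 50_000, "end": 50_000, "rsid": "rs_example_2"},
]
genes = [
    {"chr": "chr1", "start": 11_000, "end": 20_000, "gene_name": "GENE_A"},
    {"chr": "chr1", "start": 90_000, "end": 95_000, "gene_name": "GENE_B"},
    {"chr": "chr2", "start": 10_000, "end": 60_000, "gene_name": "GENE_C"},
]

annotated = merge_by_start(snps, genes, maxgap=2000)
for row in annotated:
    print(row)
```

Note that the second toy SNP sits inside GENE_C but is not reported, because only the start coordinates are compared. That is the difference from a plain overlap join, and it may explain why `cluster` returns far more rows than the 34 from R.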