So, I have a data file containing several hundred variants in the following format:
CHR # START POS END POS VARIANT ID
1 100 1000 rs1
1 1200 1400 rs2
I ran the latter through Annovar to get the gene each variant was in (or its nearest gene and its distance) as well as the region (intronic, exonic, etc) each variant was in. The output had the following columns
GENE/NEARESTGENE REGION CHR# STARTPOS ENDPOS REFALLELE ALTALLELE VARIANTID
SOMEGENE exonic 1 100 1000 A G rs1
SOMEGENE2 intergenic 1 1200 1400 G T rs2
I moved some columns around to make a file with the following columns, lets call this file.txt
CHR# STARTPOS ENDPOS GENE/NEARESTGENE REGION VARIANTID
1 100 1000 SOMEGENE exonic rs1
1 1200 1400 SOMEGENE2 intergenic rs2
Now, I have several database files - something around 10 - of promoters, TSS, enhancers, etc, all of which in .bed format looking like the following -> lets call these database1.txt ... database10.txt
database1.bed
CHR # STARTPOS ENDPOS LABEL--FOR-THE-REGION/NAME
1 500 600 promoter1
...
database10.bed
CHR # STARTPOS ENDPOS LABEL--FOR-THE-REGION/NAME
1 1100 1250 enhancer1
I want to annotate the "file.txt" using the files from databases1..10.bed. For each database, I would like to add a column to file.txt for that database, and then show the name for that annotation
desired output
CHR# STARTPOS ENDPOS GENE/NEARESTGENE REGION VARIANTID database1.txt ... database10.txt
1 100 1000 SOMEGENE exonic rs1 promoter1 -----
1 1200 1400 SOMEGENE2 intergenic rs2 --- enhancer1
The closest thing I can get to this is bedtools annotate, but it doesnt give the specific label (such as promoter1, or enhancer1) for each annotation - instead it just gives the number of intervals overlapped, or the proportion of intervals overlapped out of the total number in the database
Bedtools intersect does a better job, but it only shows the variants that intersected with the database regions. Since I'm working with multiple .bed databases, I'd have to run each one separately, separates the column, and concatenate all of them together to get my desired output. It also doesn't handle the region and gene columns well.
Annovar does the same thing bedtools intersect does, except with less functionality in that it doesn;t even show the specific label.
Thanks for the help.