I downloaded a BED file of human promoters from UCSC. I basically just took the 1000 bases upstream and downloaded the BED file. However, when I intersected my BED file with ClinVar variants, a lot of the positions in it overlapped variants in coding and intronic regions. Promoters should not fall in either of these. How can I make sure that my promoter file contains only promoters and does not extend into intronic or coding regions? Any insights will be appreciated.
Edit: Link to bed file: promoters bed file
So I got promoter regions from CAGE and UCSC. For the CAGE promoters I also took 1000 bases upstream or downstream, depending on the strand. The code I have:
cat nonOverlapping_ucscPromoters.bed nonOverlapping_cagePromoters.bed > concat_ucsc_cage_promoters.bed
Next I sorted the file:
sort -k1,1 -k2,2n -k3,3n concat_ucsc_cage_promoters.bed > sorted_ucsc_cage_promoters.bed
Next I used bedtools merge to join overlapping regions:
bedtools merge -i sorted_ucsc_cage_promoters.bed > nonOverlapping_ucsc_cage_promoters.bed
This is the BED file you see in the link I've shared. Then, to intersect it with the ClinVar file, I ran:
bedtools intersect -a clinvar.bed -b nonOverlapping_ucsc_cage_promoters.bed -wa > clinVar_promoters.bed
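One thing worth checking at this point is whether the merge step has produced intervals much longer than 1000 bases, since a chain of overlapping windows collapses into one long region that can run into exons and introns. A rough sketch of that check, using a toy file with made-up coordinates (merged_example.bed is a hypothetical stand-in for the merged promoter file):

```shell
# Toy stand-in for the merged promoter file (hypothetical coordinates).
printf 'chr1\t1000\t2000\nchr1\t5000\t8200\nchr2\t300\t1300\n' > merged_example.bed

# Keep every interval longer than the 1000-base window and
# append its length as a fourth column.
awk 'BEGIN { FS = OFS = "\t" } ($3 - $2) > 1000 { print $1, $2, $3, $3 - $2 }' \
    merged_example.bed > long_intervals.bed

cat long_intervals.bed
```

Any interval reported here is wider than a single promoter window, so it is a candidate for overlapping gene bodies.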
You should include the file you downloaded and the code you used.
Do you mean clinvar file or the promoters bed file?
Just a link to the promoter file and adding your code to the post would be fine for now.
I've added a link to the file and the code.
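For genes with alternative transcription start sites (TSSs), i.e. multiple isoforms or transcripts, isn't this problem unavoidable? A promoter window for a downstream TSS will sit inside the introns or exons of a longer isoform of the same gene. One way around it would be to make sure that, for any locus, you always choose the 5'-most TSS.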
How can I do that? Is there a file I can intersect with?
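A minimal sketch of the 5'-most TSS idea, assuming you have a BED6 file of transcripts with the gene name in column 4 (transcripts.bed and the gene names below are hypothetical toy data, not a real UCSC download). For each gene it keeps the smallest start on the + strand or the largest end on the - strand, then writes a 1000-base window upstream of that TSS:

```shell
# Toy BED6 input (hypothetical): chrom, txStart, txEnd, gene name, score, strand.
printf 'chr1\t5000\t9000\tGENE_A\t0\t+\n' >  transcripts.bed
printf 'chr1\t6000\t9000\tGENE_A\t0\t+\n' >> transcripts.bed
printf 'chr2\t2000\t7000\tGENE_B\t0\t-\n' >> transcripts.bed
printf 'chr2\t2000\t8000\tGENE_B\t0\t-\n' >> transcripts.bed

awk -v up=1000 'BEGIN { FS = OFS = "\t" }
{
    # One entry per chromosome + gene + strand.
    key = $1 FS $4 FS $6
    # TSS is the start on the + strand and the end on the - strand.
    tss = (($6 == "+") ? $2 : $3) + 0
    # Keep the 5prime-most TSS: smallest on +, largest on -.
    if (!(key in best) || ($6 == "+" && tss < best[key]) || ($6 == "-" && tss > best[key]))
        best[key] = tss
}
END {
    for (key in best) {
        split(key, f, FS)
        tss = best[key]
        if (f[3] == "+") { s = tss - up; if (s < 0) s = 0; e = tss }
        else             { s = tss; e = tss + up }  # not clipped to chromosome length
        print f[1], s, e, f[2], 0, f[3]
    }
}' transcripts.bed | sort -k1,1 -k2,2n > promoters_5prime_tss.bed

cat promoters_5prime_tss.bed
```

This only removes overlap with other isoforms of the same gene. A window can still overlap a neighbouring gene, so intersecting the result against an exon BED file is still a useful check.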