Question

Database containing the importance of a DNA sequence

7.0 years ago

Gene_MMP8 ▴ 240

Hi,
I have around 1 million sequences, each 20 bp long, taken from the neighbourhoods of somatic mutations. Is there a database that can tell me the relative importance of these sequences within the human genome? By importance I mean positional importance, i.e., whether a sequence lies in a promoter or some other functionally important region. For example, given a single 20 bp sequence, I want to know whether it falls within an important genomic region.

alignment sequencing • 1.0k views

updated 7.0 years ago by Alex Reynolds 36k • written 7.0 years ago by Gene_MMP8 ▴ 240

score 1 · Answer 1 · 2018-07-01

If you have the sequences in UCSC BED format (that is, as genomic coordinates on your reference assembly), you can use three BEDOPS tools: convert2bed to convert gene annotations to BED, bedops to make a file of gene promoters (say, the 500 nt upstream of each gene's TSS), and bedmap to associate your sequences with those promoters.
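If you only have mutation positions rather than BED intervals, a minimal sketch of the conversion could look like this. The file name mutations.tsv, its three-column layout (chromosome, 1-based position, ID) and the choice of window (10 bases before the mutated base, then the mutated base and the 9 after it) are all assumptions for illustration; adjust them to however your 20 bp neighbourhoods were actually defined.

```shell
# Toy input standing in for a hypothetical mutations.tsv:
# chromosome, 1-based mutation position, mutation ID (tab-delimited).
printf 'chr1\t1000\tmut1\nchr1\t5\tmut2\nchr2\t250\tmut3\n' > mutations.tsv

# BED coordinates are 0-based and half-open, so the mutated base at
# 1-based position p sits at 0-based index p - 1.
# FLANK=10 is an assumed window size: [p - 1 - 10, p - 1 + 10) spans 20 bp.
awk -F "\t" -v OFS="\t" -v FLANK=10 '{
    start = ($2 - 1) - FLANK;
    # Clamp windows that would run off the start of the chromosome.
    if (start < 0) {
        start = 0;
    }
    print $1, start, ($2 - 1) + FLANK, $3;
}' mutations.tsv > sequences.unsorted.bed
```

Windows near the start of a chromosome are clamped at zero and so come out shorter than 20 bp. The result still needs to go through sort-bed before it is used with bedmap.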

For example, to get some gene annotations and write them to a BED-formatted file:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.basic.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' \
    | convert2bed -i gff - \
    > genes.bed

You can replace this step with whatever annotation source you prefer, as long as it matches the reference genome build of your sequences (GENCODE release 27 is built on GRCh38).

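Make a file of promoter regions from the genes. The reverse-strand intervals are keyed on gene end coordinates, which are not guaranteed to be in sorted order, so they are passed through sort-bed before bedops, which expects sorted input:

$ awk -v OFS="\t" '($6 == "+") { print $1, $2, ($2 + 1), $4; }' genes.bed | bedops --range -500:0 --everything - > promoters.for.bed
$ awk -v OFS="\t" '($6 == "-") { print $1, ($3 - 1), $3, $4; }' genes.bed | sort-bed - | bedops --range 0:500 --everything - > promoters.rev.bed
$ bedops --everything promoters.for.bed promoters.rev.bed > promoters.bed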
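To sanity-check the strand logic, the same windows can be sketched in plain awk on a toy file. The two genes and the toy.* file names below are made up for illustration, and this only reproduces the coordinate arithmetic of the commands above; it is not a replacement for the bedops steps.

```shell
# Two made-up genes in six-column BED, one on each strand.
printf 'chr1\t1000\t5000\tGENE_FWD\t.\t+\n' > toy.genes.bed
printf 'chr1\t8000\t9000\tGENE_REV\t.\t-\n' >> toy.genes.bed

# Forward strand: the TSS is the start coordinate, so the promoter
# extends 500 nt to the left of it (clamped at zero).
# Reverse strand: the TSS is the last base (end - 1), so the promoter
# extends 500 nt to the right of it.
awk -F "\t" -v OFS="\t" -v UP=500 '
$6 == "+" {
    start = $2 - UP;
    if (start < 0) {
        start = 0;
    }
    print $1, start, $2 + 1, $4;
}
$6 == "-" {
    print $1, $3 - 1, $3 + UP, $4;
}
' toy.genes.bed > toy.promoters.bed

cat toy.promoters.bed
```

The forward-strand gene yields the window 500 to 1001 and the reverse-strand gene yields 8999 to 9500, each including the TSS base itself. On real data the awk output follows input order, so it would still need sort-bed before being used with bedmap.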

Sort your BED-formatted sequences:

$ sort-bed sequences.unsorted.bed > sequences.bed

Map sequences to gene promoters:

$ bedmap --echo --echo-map --delim '\t' promoters.bed sequences.bed > answer.bed

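Each line of answer.bed contains a promoter region, its associated gene ID, and any sequences that overlap that promoter, separated by semicolons. Promoters with no overlapping sequences are still listed, with an empty list of sequences; add --skip-unmapped to the bedmap call to omit them.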
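As a follow-up, here is a rough sketch of counting the overlapping sequences per promoter. It runs on a made-up stand-in for answer.bed (the toy.* file names and the IDs are hypothetical) and assumes that none of the IDs contain a semicolon, since bedmap joins multiple mapped elements with ";" by default.

```shell
# Toy stand-in for answer.bed: four promoter columns, a tab, then the
# overlapping sequences joined by ";" (the default bedmap --multidelim).
printf 'chr1\t500\t1001\tGENE_A\tchr1\t590\t610\tseq1;chr1\t995\t1015\tseq2\n' > toy.answer.bed
# A promoter with no overlapping sequences has an empty mapped field.
printf 'chr1\t8999\t9500\tGENE_B\t\n' >> toy.answer.bed

# Print each gene ID with the number of sequences overlapping its promoter.
awk -F "\t" -v OFS="\t" '{
    n = 0;
    if ($5 != "") {
        # Assumes ";" appears only as the separator between mapped elements.
        n = split($0, hits, ";");
    }
    print $4, n;
}' toy.answer.bed > toy.counts.tsv

cat toy.counts.tsv
```

The same pattern could be pointed at the real answer.bed to find promoters with unusually many mutation neighbourhoods.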