Question

File conversion from coordinates to genes

0

Entering edit mode

3.7 years ago

storm1907 ▴ 30

Hi, I need to convert chromosomal location to gene name (preferably, HGNC nomenclature) in a txt file with 6 columns: From

1   chr1:183189 0   183189  G   C
1   chr1:609407 0   609407  0   0
1   chr1:609434 0   609434  0   0
1   chr1:609435 0   609435  G   G

to

1   genename    0   183189  G   C
1   genename    0   609407  0   0
1   genename    0   609434  0   0
1   genename    0   609435  G   G

Files are made in GRCH38. I looked in this forum for similar issue, but could not find appropriate solution. One option was with bedtools, but unfortunately, only bed files are suitable in that case. One more option included some outdated R packages, which I could not install.

Thank you!

conversion • 2.6k views

ADD COMMENT • link 3.6 years ago by storm1907 ▴ 30

0

Entering edit mode

but unfortunately, only bed files are suitable in that case.

use awk to convert reformat your input. Hint: split($2,a,/[:]/);printf("%s\t%d\t%s\n",a[1],int(a[2])-1,a[2]);}'

ADD REPLY • link 3.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Sorry, I tried that syntax, but get error..

ADD REPLY • link 3.7 years ago by storm1907 ▴ 30

0

Entering edit mode

Also, where can I get bed files with HGNC gene names? From UCSC Table Browser I can only download table with transcripts

ADD REPLY • link 3.7 years ago by storm1907 ▴ 30

0

Entering edit mode

First part of Alex Reynolds answer below gives you that.

ADD REPLY • link 3.7 years ago by GenoMax 147k

0

Entering edit mode

My answer can provide multiple HGNC records for one gene that has multiple transcripts. More information would be needed to filter these for canonical transcripts. And one genomic position could overlap different genes. But bedmap --echo-map-id-uniq will remove duplicate gene names, in either case.

ADD REPLY • link 3.7 years ago by Alex Reynolds 36k

GenoMax · Answer 1 · 2021-03-30

0

Entering edit mode

3.7 years ago

Alex Reynolds 36k

You could use the BEDOPS kit to solve this.

$ awk -v FS="\t" -v OFS="\t" '{ n=split($2, a, ":"); print a[1], a[2]-1, a[2]; }' original_table.txt \
   | sort-bed - \
   > positions.bed

Get annotations:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz \
  | gunzip -c - \
  | awk -v FS="\t" -v OFS="\t" '{ if (!match($13, /.*-[0-9]+/)) { print $3, $5, $6, $13, ".", $4; } }' - \
  | sort-bed - \
  > refGene.hg38.bed

Then map:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --unmapped-val "." positions.bed refGene.hg38.bed > mapped.positions.bed

To put your mapped data into the desired format, you'll need to associate positions with your original file:

#!/usr/bin/env python

import sys

original_table_fn = sys.argv[1]
mapped_positions_fn = sys.argv[2]

''' build position table '''
t = {}
with open(original_table_fn, "r") as ifh:
    for line in ifh:
        elems = line.rstrip().split('\t')
        t[elems[1]] = elems 

''' map genes to position table '''
with open(mapped_positions_fn, "r") as ifh:
    for line in ifh:
        (chrom, start, stop, genes) = line.rstrip().split('\t')
        k = "{}:{}".format(chrom, stop)
        try:
            v = t[k]
            v[1] = genes # swap position key for mapped gene(s)
            sys.stdout.write('{}\n'.format('\t'.join(v)))
        except KeyError:
            sys.stderr.write('Warning: Position {} was not found\n'.format(k))

Usage:

$ ./reverse_map.py original_table.txt mapped.positions.bed > answer.txt

ADD COMMENT • link 3.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you! However, when running the last script, I got this error:

 Traceback (most recent call last):
  File "reverse_map.py", line 22, in <module>
    (chrom, start, stop, genes) = line.rstrip().split('\t')
ValueError: not enough values to unpack (expected 4, got 3)

mapped.positions.bed file start is this:

enter image description here

ADD REPLY • link 3.6 years ago by storm1907 ▴ 30

0

Entering edit mode

Because you have gene names missing from some of your rows you are getting that error.

ADD REPLY • link 3.6 years ago by GenoMax 147k

0

Entering edit mode

Yes, I try to convert empty space to string, but without success:

awk 'BEGIN {OFS=FS="\t"} {gsub(/" "/,"na",$4)}1'  $file > $outfile1

ADD REPLY • link updated 3.6 years ago by GenoMax 147k • written 3.6 years ago by storm1907 ▴ 30

0

Entering edit mode

At this moment, this command line helped to me:

awk -F'\t' 'BEGIN{OFS=FS} $4 == "" {$4 = "UNKNOWN"} 1' mapped.positions.bed > outfile

ADD REPLY • link updated 3.6 years ago by GenoMax 147k • written 3.6 years ago by storm1907 ▴ 30

0

Entering edit mode

I forgot about the --unmapped-val operator. Using it would save you the awk step in your comment.

So in the current version of bedmap, you could also do the following:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --unmapped-val "." positions.bed refGene.hg38.bed > mapped.positions.bed

You could replace . with UNKNOWN or na or whatever else is convenient.

See bedmap --help for a full listing of options.

ADD REPLY • link 3.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Hello, also I was wondering, if it possible to write gene names together with SNP position as well in the 2nd column?

ADD REPLY • link 3.6 years ago by storm1907 ▴ 30

0

Entering edit mode

You could pipe the output of bedmap to awk and rewrite columns there:

$ bedmap ... | awk -v FS="\t" -v OFS="\t" '{ ... }' > answer.bed

Replace ... as needed.

Or, modify the Python script above to print out desired columns. It just depends on what you want, where.

ADD REPLY • link 3.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Sorry, I am very new to programming. Is there any reference file also with NG positions, not only chromosomal positions? So that output could be:

1   genename:snp_postion_in_gene     0   183189  G   C 
1   genename:snp_postion_in_gene     0   609407  0   0
1   genename:snp_postion_in_gene     0   609434  0   0
1   genename:snp_postion_in_gene     0   609435  G   G

ADD REPLY • link 3.6 years ago by storm1907 ▴ 30