Genomic coordinates from GENCODE annotation file
2
0
Entering edit mode
6.1 years ago
seta ★ 1.9k

Hi all,

I’m going to extract all variants within some specific genes from 1K genome project and gnomAD, so to get their genomic coordinates, I used the main GTF annotation file for human (version 28, GRCh 38) from GENCODE and made a bed file for the genes of interest. Here is the bed file for one of my genes (ISG15), as you can see, there are multiple records for the same gene. I think this redundancy is related to different transcripts of the same gene, yes is it right? However, I’m confused which start and end positions should be used for variant extraction; could you please help me out?

chr1    1001137 1014541 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001137 1001281 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014541 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014475 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014007 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014478 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001137 1001281 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014004 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014541 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001144 1014435 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001144 1001263 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014435 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014435 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014007 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001144 1001263 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014004 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013422 1014540 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013422 1013576 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013573 1013576 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013573 1013576 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014540 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014475 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014478 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013422 1013573 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014540 ENSG00000187608.8    gene_name ISG15    .   +

Also, for including promoter region and getting variants in this region, please kindly tell me should I add about 200 or 500 bp (which one do you suggest?) to the starting position of genes? Please share me if there is any point for more consideration.

Thanks in advance

GENCODE genomic-coordinates GTF • 2.3k views
ADD COMMENT
0
Entering edit mode

although it is late, but I just wanted to point your attention to the genome build for the variants, make sure the gnomAD variants are not aligned to hg19/GRCh37 where you are using annotations from GRCh38.

ADD REPLY
1
Entering edit mode
6.1 years ago
PedroBarbosa ▴ 20

You can look in your 3rd column of the GTF file for gene records and create a bed out of it. Then, you extract the variants in those regions using bcftools/bedtools/etc

awk -v OFS="\t" '$3 == "gene" { print $1,$4-1,$5,".",".",$7 }'  gencode.gtf | bedtools slop -s -r 0 -l 5000 -i stdin -g genome.size.file.txt > genes_with_promotor.bed

bcftools view -Oz -R genes_with_promotor.bed in.vcf > out.vcf.gz

or

bedtools intersect -header -a in.vcf -b genes_with_promotor.bed | bgzip > out.vcf.gz
ADD COMMENT
0
Entering edit mode

Thank you. I made the bed file from gtf file as I posted. My problem was multiple records for a single gene, which as I found from your response, I should use all genomic coordinate from all records for getting variants, yes?

ADD REPLY
0
Entering edit mode

How did you make that bed file ? There should be only one gene record in the whole gtf pointing for ISG15.

It seems you have all the sub features of the gene in that bed. First line of your file seems to be the whole gene feature, which you would obtain with the command I referred.

Edit: Just saw now the promotor question. Updated the command.

ADD REPLY
0
Entering edit mode

Thanks Pedro, yes, I used all sub features of the genes as you mentioned that I corrected with your help. Just two questions, maybe silly one. In your awk command, why you put ".", "." ? and for promoter section, I think using -l 1000 -r 500, please kindly share me your idea about it?

ADD REPLY
1
Entering edit mode
6.1 years ago
Emily 24k

It might be easier to just use Ensembl BioMart or the Ensembl REST API to put in your list of gene IDs and find all overlapping variants from 1000 Genomes and gnomAD.

ADD COMMENT
0
Entering edit mode

Thanks Emily, it sounds easier, however, I tested BioMart with 10 gene names (and also with Entrez ID) and it returned many variants with no information about minor allele, minor allele count and frequency for most of them. Also, I could not find these variants came from 1000 genome (and which population) or genomAD. Could you please kindly advise me in this regard?

ADD REPLY
0
Entering edit mode

If you use the variation database rather than the gene database, you can add 1000 Genomes and/or gnomAD as an additional filter to get only those data. There are a lot of variants that don't have frequency data.

ADD REPLY

Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6