Question

annotate sites with gene info

4.0 years ago

Rdeveloop • 0

Hi all,

I need a suggestion for annotating a huge file with gene information. The file is gzipped (~120 Mb); after unzipping, it is about 30 GB and has ~40,000,000 rows. The file is plain text with several columns, e.g. chr, start, end, site ID and other information. Is there a tool that lets me easily annotate every site with gene information?

Thanks a lot. Best regards

annotation • 2.0k views

ADD COMMENT • link updated 4.0 years ago by Alex Reynolds 36k • written 4.0 years ago by Rdeveloop • 0

If you put your input into BED format, you can use bedmap to associate those intervals with genes converted to BED via gtf2bed or gff2bed.

Search biostars for those keywords and you'll find a number of answers that demonstrate this for Gencode and other annotation sets.

ADD REPLY • link 4.0 years ago by Alex Reynolds 36k

Some example code:

You can pipe things in via a process substitution, e.g.:

$ someTool <(gunzip -c genes.gz) | ...

ADD REPLY • link 4.0 years ago by Alex Reynolds 36k

score 0 · Answer 1 · 2021-07-10

4.0 years ago

Pierre Lindenbaum 166k

annotate every sites w

"every sites"? Do you mean a VCF file? sort + bgzip + tabix your text file and have a look at bcftools annotate.
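
For the sort step, a minimal sketch, assuming a tab-delimited table without a header line and with chr, start and end in columns 1 to 3. The helper name `sort_sites` and the file names are hypothetical; the bgzip and tabix calls are shown only as comments because they need htslib installed (tabix requires bgzip-compressed input, not plain gzip or bzip2).

```shell
# Hypothetical helper: sort a tab-delimited table by chromosome (column 1),
# then numerically by start position (column 2).
sort_sites() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n "$@"
}

# Possible usage (file names are placeholders):
#   gunzip -c sites.tsv.gz | sort_sites | bgzip > sites.sorted.tsv.gz
#   tabix -s 1 -b 2 -e 3 sites.sorted.tsv.gz
```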

ADD COMMENT • link 4.0 years ago by Pierre Lindenbaum 166k

score 0 · Answer 2 · 2021-07-11

4.0 years ago

Jorge Amigo 14k

If you have your gene information in a BED file (chr, 0-based start, 1-based end, gene) and your input file is a VCF file (it looks that way from your very limited description; please describe it in more detail if not), you can use bedtools intersect as follows:

bedtools intersect -loj -a input.vcf.gz -b genes.bed.gz

Note that both input files should be sorted beforehand. Note also that you don't need to uncompress the input files in order to feed them to tools like bedtools or bcftools.
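
If the gene annotation is only available as a GTF, a BED file of this shape could be built with awk along the following lines. This is a sketch under assumptions: a GENCODE/Ensembl-style GTF with `gene` records in column 3 and a `gene_name "..."` (or at least `gene_id "..."`) attribute in column 9. The helper name `gtf_genes_to_bed` and the file names are hypothetical.

```shell
# Hypothetical helper: turn GTF "gene" records into BED rows
# (chr, 0-based start, 1-based end, gene).
gtf_genes_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                        # skip GTF header comments
       $3 == "gene" {
         name = "."
         if (match($9, /gene_name "[^"]+"/))
           name = substr($9, RSTART + 11, RLENGTH - 12)
         else if (match($9, /gene_id "[^"]+"/))
           name = substr($9, RSTART + 9, RLENGTH - 10)
         print $1, $4 - 1, $5, name         # GTF start is 1-based, BED start is 0-based
       }' "$@"
}

# Possible usage (file names are placeholders):
#   gtf_genes_to_bed annotation.gtf | sort -k1,1 -k2,2n | gzip > genes.bed.gz
```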

ADD COMMENT • link 4.0 years ago by Jorge Amigo 14k

Thanks a lot. My file is not a VCF file; it is a file listing several sites. For each site it reports information such as the chromosome, the start and end positions, and the site ID. The format is tsv.gz. So I would like to map each site to the corresponding gene and add that information. Maybe I can use your command above, but with a GTF file instead of genes.bed.gz. What do you think? Could it work?

ADD REPLY • link 4.0 years ago by Rdeveloop • 0

If the input is not a VCF file but a tab-delimited table, you can try converting your input file on the fly to a BED-like format (if the first 3 columns are chr, start and end, you can convert the start in the second column to a 0-based position), and then convert the output back to the original format:

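# shift start to 0-based, intersect with the genes, then shift start back to 1-based
zcat input.tsv.gz \
| awk 'BEGIN{FS=OFS="\t"} {$2--; print}' \
| bedtools intersect -loj -a - -b genes.bed.gz \
| awk 'BEGIN{FS=OFS="\t"} {$2++; print}'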
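
To see what the two awk steps do to the coordinates, here is a small sketch on a made-up row. The helper names `to_bed0` and `to_one_based` and the sample row are illustrative assumptions; the bedtools intersect step would sit between the two calls.

```shell
# Shift a 1-based start (column 2) to the 0-based start that BED expects.
to_bed0() {
  awk 'BEGIN { FS = OFS = "\t" } { $2--; print }'
}

# Undo the shift after the intersect step so the output matches the original table.
to_one_based() {
  awk 'BEGIN { FS = OFS = "\t" } { $2++; print }'
}

# Toy row (chr, start, end, ID): the start goes 101 -> 100 -> 101.
printf 'chr1\t101\t200\tsite1\n' | to_bed0 | to_one_based
```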

ADD REPLY • link 4.0 years ago by Jorge Amigo 14k