Hi all,
I need a suggestion for annotating a huge file with gene information. The file is gzip-compressed (~120 MB); unzipped, it is about 30 GB and has ~40,000,000 rows. It is a plain-text file with several columns, e.g. chr, start, end, site ID and other information. Is there a tool that lets me easily annotate every site with gene information?
Thanks a lot. Best regards
If you put your input into BED format, you can use bedmap to associate those intervals with genes converted to BED via gtf2bed or gff2bed. Search Biostars for those keywords and you'll find a number of answers that demonstrate this for Gencode and other annotation sets.
Some example code:
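A minimal sketch of the steps, assuming BEDOPS is installed, your table is tab-delimited with no header line, chr/start/end/ID are in columns 1-4, and the coordinates are already 0-based and half-open as BED expects (shift the start by one first if they are 1-based). The function names and file names are placeholders I made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: function and file names are hypothetical placeholders.
# Assumes tab-delimited input, no header, chr/start/end/ID in columns 1-4.

# Keep gene-level records from gtf2bed output (feature type is in column 8).
genes_only() {
    awk -F'\t' '$8 == "gene"'
}

# Usage: annotate_sites sites.txt genes.gtf > sites.annotated.bed
annotate_sites() {
    local sites="$1" gtf="$2"
    local tmp
    tmp=$(mktemp -d) || return 1
    # 1. Keep the first four columns and sort them the way BEDOPS tools require.
    cut -f1-4 "$sites" | sort-bed - > "$tmp/sites.bed"
    # 2. Convert the annotation to sorted BED and keep gene records only.
    gtf2bed < "$gtf" | genes_only > "$tmp/genes.bed"
    # 3. Print each site followed by the unique IDs of the genes it overlaps.
    bedmap --echo --echo-map-id-uniq --delim '\t' "$tmp/sites.bed" "$tmp/genes.bed"
    rm -rf "$tmp"
}
```

Sites with no overlapping gene come out with an empty last field. If your annotation is GFF rather than GTF, swap in gff2bed and check which column holds the feature type before filtering.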
You can pipe things in via a process substitution, e.g.:
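A sketch of that, under the same assumptions about column layout and with made-up function and file names. It needs bash (process substitution is not plain sh), and it streams the gzipped table so the 30 GB uncompressed file never has to be written to disk:

```shell
#!/usr/bin/env bash
# Sketch only: function and file names are hypothetical placeholders.

# Decompress on the fly and keep chr, start, end, ID (columns 1-4).
stream_sites() {
    gunzip -c "$1" | cut -f1-4
}

# Usage: annotate_gz sites.txt.gz genes.gtf > sites.annotated.bed
annotate_gz() {
    # Each <( ... ) is handed to bedmap as if it were a sorted BED file.
    bedmap --echo --echo-map-id-uniq --delim '\t' \
        <(stream_sites "$1" | sort-bed -) \
        <(gtf2bed < "$2" | awk -F'\t' '$8 == "gene"')
}
```

With ~40 million rows, sort-bed may need a fair amount of memory; its --max-mem option can cap that if the machine is tight.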