4.5 years ago
the_cowa
I have a list of genes and I need coordinates of those genes from the gff file.
I tried with
grep -wFf gene_list sample.gff
but it takes too long to finish (the gff file is 20 GB). Is there any other way to extract the coordinates?
Try to make your regex as specific as possible. E.g.
grep GSBRNA2T00155995001 sample.gtf
will be slightly slower than
grep 'gene_id "GSBRNA2T00155995001' sample.gtf
How much improvement you can gain from this depends on the structure of your gtf file.
If @Pierre's answer worked for you in this thread ("Bed file grepping from the list"), have you tried to use it here? BTW, programs written in Python etc. are not likely to be faster than a system utility like grep for extracting data.
I tried with join, but that is also too slow.
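For what it's worth, the "make the pattern specific" advice above can be applied to the whole gene list at once. The sketch below is only an illustration: extract_genes is a made-up helper name, and it assumes GTF-style attributes of the form gene_id "ID"; (a GFF3 file would need a different prefix, such as ID=). It rewrites every ID into a fixed string and runs grep in the C locale, which often speeds up grep on plain ASCII data.

```shell
# Hypothetical helper: extract_genes GENE_LIST ANNOTATION
# Assumption: attributes look like  gene_id "GSBRNA2T00155995001";
extract_genes() {
    gene_list=$1
    annotation=$2
    patterns=$(mktemp) || return 1
    # Turn each ID into the fixed string  gene_id "ID"  so grep only matches
    # the ID inside that attribute; the closing quote rules out prefix hits.
    sed 's/.*/gene_id "&"/' "$gene_list" > "$patterns"
    # -F: fixed strings, -f: read patterns from file; LC_ALL=C avoids
    # multibyte locale handling.
    LC_ALL=C grep -F -f "$patterns" "$annotation"
    status=$?
    rm -f "$patterns"
    return "$status"
}
```

Usage would be something like extract_genes gene_list sample.gtf | cut -f1,4,5 to keep only the sequence name, start and end columns. Whether this beats the original grep -wFf depends on the file, so it is worth timing on a small piece first.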
Break your gff file into several pieces and then do the search.
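A rough sketch of that suggestion, in case it helps. split_and_grep and the default piece size are made up for illustration. It splits on line boundaries, searches every piece in a background job, and prints the hits in the original order. Note that it needs free disk space equal to the size of the gff file, starts one grep per piece, and only helps if the search is limited by CPU rather than by disk reads.

```shell
# Hypothetical helper: split_and_grep GENE_LIST ANNOTATION [LINES_PER_PIECE]
split_and_grep() {
    gene_list=$1
    annotation=$2
    lines=${3:-5000000}          # assumed default piece size, tune as needed
    workdir=$(mktemp -d) || return 1
    # Split by line count so that no record is cut in half.
    split -l "$lines" "$annotation" "$workdir/piece."
    # Search each piece in a background job (the glob is expanded once,
    # before any .hits file exists).
    for piece in "$workdir"/piece.*; do
        LC_ALL=C grep -wFf "$gene_list" "$piece" > "$piece.hits" &
    done
    wait
    # Pieces sort alphabetically, so this keeps the original line order.
    cat "$workdir"/piece.*.hits
    rm -rf "$workdir"
}
```

Example call: split_and_grep gene_list sample.gff > hits.gff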