Question

What is the best way to filter a BED file to get specific exons by their gene ID?

3.1 years ago

ManuelDB ▴ 110

I am working in the Genomics England research environment (which means the number of available tools is very limited). I have a pandas data frame in which one of the columns contains gene IDs, some of them repeated.

I have a large BED file with all human exons, and I want to extract the exons of the genes listed in that data frame. What is the best way to do this? I can only use bedtools, shell commands, and Python.

This is one step of an application I am developing.

The bed file looks like this

    #chr1  start  end  Gene_ID  Exon_ID
    1      1      10   IDA      ID1
    1      10     20   IDA      ID2
    1      20     30   IDA      ID3
    2      1      10   IDB      ID1
    2      20     20   IDB      ID2
    2      30     30   IDB      ID3

If my data frame contains the gene IDB, the result should be:

    2    1      10  IDB     ID1
    2    20     20  IDB     ID2
    2    30     30  IDB     ID3

I am thinking of extracting the unique gene IDs, writing them to a list, and then running the query with a shell script.

Something like this

grep -Fw -f words myfile

Copied from https://unix.stackexchange.com/questions/458431/extract-lines-that-match-a-list-of-words-in-another-file. Do you have a better idea?
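One caveat with grep -Fw is that it matches a whole word anywhere on the line, so a gene ID that also appears in another column (for example as an exon ID) would produce false hits. A minimal sketch of a column-restricted alternative with awk follows. It assumes the BED file is tab-separated with the gene ID in column 4, and that the unique gene IDs have been written one per line to a file; the names words, myfile, filtered.bed, and the temporary directory are placeholders, and the toy data simply mirrors the example above.

```shell
# Scratch directory for the toy files (placeholder location).
workdir=$(mktemp -d)

# Toy BED file from the question, written tab-separated.
# printf reuses the format string for every group of five arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '#chr1' start end Gene_ID Exon_ID \
  1 1 10 IDA ID1 \
  1 10 20 IDA ID2 \
  1 20 30 IDA ID3 \
  2 1 10 IDB ID1 \
  2 20 20 IDB ID2 \
  2 30 30 IDB ID3 \
  > "$workdir/myfile"

# Gene IDs exported from the data frame, one per line (duplicates are harmless).
printf '%s\n' IDB IDB > "$workdir/words"

# First pass (NR==FNR): remember every wanted gene ID.
# Second pass: print BED lines whose 4th column is exactly one of those IDs.
# Assumes the ID list is not empty; the header line is dropped because
# "Gene_ID" is not in the list.
awk -F'\t' 'NR==FNR { wanted[$1]; next } ($4 in wanted)' \
  "$workdir/words" "$workdir/myfile" > "$workdir/filtered.bed"

cat "$workdir/filtered.bed"
```

Because the lookup is an exact match on column 4 only, the input order of the BED file is preserved and neither file needs to be sorted.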

bed • 1.5k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 3.1 years ago by ManuelDB ▴ 110

If the Gene_ID values in words and myfile match exactly, I would use Unix join instead of grep -F; it should be much faster than grep, provided both files are sorted on the gene ID.

ADD REPLY • link 3.1 years ago by vkkodali_ncbi ★ 3.8k
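The join suggestion could be sketched as follows. This is only an illustration under stated assumptions: the BED file is tab-separated with the gene ID in column 4, the ID list has one gene ID per line, and all file names and the temporary directory are placeholders. join requires both inputs to be sorted on the join field in the same collation order, so the sketch sorts both under LC_ALL=C first; whether the whole pipeline ends up faster than grep will depend on the file sizes and the cost of sorting.

```shell
# Scratch directory for the toy files (placeholder location).
workdir=$(mktemp -d)
tab=$(printf '\t')

# Toy BED file from the question, written tab-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '#chr1' start end Gene_ID Exon_ID \
  1 1 10 IDA ID1 \
  1 10 20 IDA ID2 \
  1 20 30 IDA ID3 \
  2 1 10 IDB ID1 \
  2 20 20 IDB ID2 \
  2 30 30 IDB ID3 \
  > "$workdir/myfile"

# Gene IDs exported from the data frame, one per line, possibly repeated.
printf '%s\n' IDB IDB > "$workdir/words"

# Sort and de-duplicate the ID list.
LC_ALL=C sort -u "$workdir/words" > "$workdir/words.sorted"

# Drop the header line, then sort the BED file on column 4 (gene ID).
grep -v '^#' "$workdir/myfile" \
  | LC_ALL=C sort -t "$tab" -k4,4 > "$workdir/myfile.sorted"

# Join column 1 of the ID list with column 4 of the BED file and
# print the five BED columns in their original order.
LC_ALL=C join -t "$tab" -1 1 -2 4 -o 2.1,2.2,2.3,2.4,2.5 \
  "$workdir/words.sorted" "$workdir/myfile.sorted" > "$workdir/joined.bed"

cat "$workdir/joined.bed"
```

The result is ordered by gene ID rather than by genomic position, so it may need a final sort on chromosome and start (for example sort -k1,1 -k2,2n) before being passed to bedtools.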