What is the best way to filter a BED file to get specific exons by gene ID?
2.6 years ago
ManuelDB ▴ 110

I am working in the Genomics England research environment (where the set of available tools is very limited). I have a pandas data frame in which one column contains gene IDs, some of them repeated.

I have a long BED file with all human exons. I want to get the exons of the genes that appear in that data frame. What is the best way to do this? I can only use bedtools, shell commands and Python.

This is one step of an application I am developing.

The BED file looks like this:

 #chr1 start end Gene_ID Exon_ID
1    1      10  IDA     ID1
1    10     20  IDA     ID2
1    20     30  IDA     ID3
2    1      10  IDB     ID1
2    20     20  IDB     ID2
2    30     30  IDB     ID3

If my data frame contains the gene IDB, the result should be:

    2    1      10  IDB     ID1
    2    20     20  IDB     ID2
    2    30     30  IDB     ID3

My plan is to extract the unique gene IDs, write them to a list, and then query the BED file with a shell command.

Something like this

grep -Fw -f words myfile

(Copied from https://unix.stackexchange.com/questions/458431/extract-lines-that-match-a-list-of-words-in-another-file.) Do you have a better idea?


If the Gene_ID values in words and myfile match exactly, I would use Unix join instead of grep -F; on large files it is likely to be faster than grep. Note that join requires both inputs to be sorted on the join field.
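
Since Python is also available, the exact-match idea can be sketched with the standard library alone. This is only a minimal sketch under some assumptions: the function name `filter_bed_by_gene` is hypothetical, the gene ID is assumed to sit in the fourth whitespace-separated column as in the example above, and the ID list is assumed to have been pulled out of the data frame column beforehand. Comparing only that column avoids one pitfall of grep -Fw, which matches a word anywhere on the line (so an exon ID equal to a gene ID would also hit).

```python
import io


def filter_bed_by_gene(bed_lines, gene_ids, gene_col=3):
    """Yield BED lines whose gene column exactly matches one of gene_ids.

    gene_col is the 0-based index of the Gene_ID column (assumed to be 3,
    as in the example file from the question).
    """
    wanted = set(gene_ids)  # a set drops repeated IDs and gives fast lookups
    for line in bed_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and the header/comment line
        fields = stripped.split()  # splits on tabs or runs of spaces
        if len(fields) > gene_col and fields[gene_col] in wanted:
            yield line.rstrip("\n")


# Example data copied from the question; io.StringIO stands in for an open file.
bed_text = """ #chr1 start end Gene_ID Exon_ID
1    1      10  IDA     ID1
1    10     20  IDA     ID2
1    20     30  IDA     ID3
2    1      10  IDB     ID1
2    20     20  IDB     ID2
2    30     30  IDB     ID3
"""

# Repeated IDs, as they might come out of the data frame column.
gene_ids = ["IDB", "IDB"]

result = list(filter_bed_by_gene(io.StringIO(bed_text), gene_ids))
for row in result:
    print(row)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, so the BED file is streamed line by line and never loaded into memory in full.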
