I am working in the research environment of Genomics England (which means the number of tools is very limited). I have a pandas data frame in which one of the columns contains gene IDs, some of which are repeated.
I have a long bed file with all human exons. I want to get the exons of the genes that match that data frame. What is the best way to do this? I can use bedtools, shell commands and python commands only.
This is one step of an application I am developing.
The bed file looks like this
#chr1 start end Gene_ID Exon_ID
1 1 10 IDA ID1
1 10 20 IDA ID2
1 20 30 IDA ID3
2 1 10 IDB ID1
2 20 20 IDB ID2
2 30 30 IDB ID3
If my data frame contains the gene IDB, the result should be:
2 1 10 IDB ID1
2 20 20 IDB ID2
2 30 30 IDB ID3
I am thinking of getting the unique gene IDs, writing them to a list, and then doing the query with a shell script.
Something like this
grep -Fw -f words myfile
Copied from https://unix.stackexchange.com/questions/458431/extract-lines-that-match-a-list-of-words-in-another-file
Do you have a better idea?
If the Gene_ID in both words and myfile match exactly, I would use Unix join instead of grep -F; it will be much faster than grep.
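A minimal sketch of the join approach, assuming the gene IDs have already been written one per line to a file called words and the exons are in myfile, as in the grep command from the question. The sample data, the temporary directory and the .sorted file names are only for illustration. join expects both inputs to be sorted on the join field, so the sorting step is part of the cost.

```shell
# Work in a throwaway directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Illustrative inputs: 'words' holds the gene IDs (repeats allowed),
# 'myfile' is the exon file from the question.
printf 'IDB\nIDB\n' > words
printf '%s\n' \
  '#chr1 start end Gene_ID Exon_ID' \
  '1 1 10 IDA ID1' '1 10 20 IDA ID2' '1 20 30 IDA ID3' \
  '2 1 10 IDB ID1' '2 20 20 IDB ID2' '2 30 30 IDB ID3' > myfile

# Use one collation order for both sort and join.
export LC_ALL=C

# join needs both inputs sorted on the join field; -u also drops repeated IDs.
sort -u words > words.sorted
# Drop the header line, then sort on column 4 (Gene_ID);
# -s keeps the original exon order within each gene.
grep -v '^#' myfile | sort -s -k4b,4 > myfile.sorted

# Match field 1 of words.sorted against field 4 of myfile.sorted
# and print the five columns of the exon file in their original order.
result=$(join -1 1 -2 4 -o 2.1,2.2,2.3,2.4,2.5 words.sorted myfile.sorted)
printf '%s\n' "$result"
```

Note that the output comes back sorted by Gene_ID rather than by genomic position, and that the columns are separated by single spaces. A real BED file is usually tab-separated, in which case both sort and join would need a tab passed via -t.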