I have a few cancer reports in PDF format and I need to extract information from them. Can anyone suggest some Python tools that could be used for this? Since there are many reports (all of the same type), I don't want to do it manually each time and need to automate it. Any suggestions would be helpful.
Use an LLM-based solution directly, for example ragflow or LLMwhisperer. There are plenty of tutorials on the Internet.
I believe current LLMs are capable enough to give you structured output from cancer reports. You don't have to use basic Python libraries as before, unless your task is very complicated. Just give it a try.
Note though that depending on how you set this up, data can be sent to an LLM hosted by a third party. This may not be suitable for sensitive (e.g. medical) data. If this is a concern, you can run LLMs locally with e.g. ollama. For information extraction/summarization, small- to medium-size models should be enough depending on the length of the documents.
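To make the local option concrete, here is a minimal sketch that asks a locally running ollama server for structured output over its HTTP API, so nothing leaves your machine. The model name (llama3), the field names in FIELDS, and the prompt wording are my assumptions for illustration only; adapt them to what your reports actually contain. It also assumes ollama is listening on its default port and that you have already extracted the report text from the PDF.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint

# Hypothetical field names, chosen only for illustration.
FIELDS = ["diagnosis", "tumor_site", "stage"]


def build_request(report_text, model="llama3"):
    """Build the JSON payload asking the model for the chosen fields."""
    prompt = (
        "Extract the following fields from the report below and reply with a "
        "JSON object using exactly these keys: " + ", ".join(FIELDS) + ". "
        "Use null for anything the report does not state.\n\n" + report_text
    )
    # "format": "json" asks ollama to constrain the reply to valid JSON;
    # "stream": False returns one complete reply instead of chunks.
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def extract_fields(report_text, model="llama3"):
    """Send the report to the local ollama server and parse its JSON reply."""
    data = json.dumps(build_request(report_text, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # With stream=False the generated text is in the "response" field.
    return json.loads(body["response"])
```

Calling extract_fields(text) on each report's text should give you one dict per report, which you can then check by hand on a few samples before trusting it on the whole batch.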
Of course, thanks for the addition.
For medical purposes, I think the best approach right now is to deploy uncensored LLMs locally. You can try the dolphin series through ollama. I'm currently using dolphin-mixtral.
This depends on the model you use and the speed (in tokens/s) you want. You can run quantized 8-13b models at decent inference speed either without a GPU, given 24 GB of RAM and a fast CPU, or on a GPU with 8 GB of VRAM. For llama3:70b, for example, you'll want a GPU with at least 24 GB of VRAM and more than 32 GB of RAM, but it will run faster with two such GPUs because more of the model will sit in VRAM.
Don't go for quantization below 6 bits; in my experience there's too much degradation below that.
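As a rough sanity check on these numbers, you can estimate how much memory the weights alone take: parameter count times bits per weight, divided by 8. The sketch below does only that; it ignores the context (KV cache) and runtime overhead, so treat the results as lower bounds, not exact requirements. The function name is mine, not from any library.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare a few model sizes at common quantization levels.
for name, size_b in [("8b", 8), ("13b", 13), ("70b", 70)]:
    for bits in (4, 6, 8):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(size_b, bits):.1f} GB")
```

By this estimate a 70b model needs about 35 GB for its weights even at 4 bits, which is more than a single 24 GB GPU holds, so part of it ends up in system RAM unless you add a second GPU.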
You can use PyMuPDF (pymupdf) to extract tables from PDF files.
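A minimal sketch of what that could look like, assuming a reasonably recent PyMuPDF (table detection via page.find_tables() was added in version 1.23) and assuming the first row of each detected table is its header. The helper names are mine. Table detection is heuristic, so check the output against a few reports.

```python
def rows_to_records(rows):
    """Turn a table (header row + data rows) into a list of dicts."""
    if not rows:
        return []
    # Empty header cells come back as None; give them placeholder names.
    header = [
        str(h).strip() if h is not None else f"col{i}"
        for i, h in enumerate(rows[0])
    ]
    return [dict(zip(header, row)) for row in rows[1:]]


def extract_tables(pdf_path):
    """Return every table PyMuPDF detects, each as a list of row dicts."""
    import fitz  # PyMuPDF; imported here so rows_to_records works without it

    tables = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # find_tables() returns a finder object; .tables is the list of hits.
            for table in page.find_tables().tables:
                # extract() gives the cell text as a list of rows.
                tables.append(rows_to_records(table.extract()))
    return tables
```

For example, extract_tables("report.pdf") (a hypothetical file name) returns one list of row dicts per detected table, which you can loop over for all your reports.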