Python for PDF reports
12 weeks ago
prs ▴ 20

I have a few cancer reports in PDF format and need to extract information from them. Can anyone please suggest some Python tools that could be used for this purpose? Since there are many reports (of the same type), I don't want to process them manually each time and would like to automate it. Any suggestions would be helpful.

Cancer Python3 reports

You can use pymupdf to extract tables from PDF files.
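A rough sketch of what that could look like, looping over a folder of PDFs (the folder name and the way the table rows are handled are just placeholders; table detection needs a reasonably recent PyMuPDF):

    # Extract page text and detected tables from every PDF in a folder with pymupdf.
    import pathlib
    import fitz  # pip install pymupdf

    for pdf_path in pathlib.Path("reports").glob("*.pdf"):
        doc = fitz.open(pdf_path)
        for page in doc:
            text = page.get_text()              # plain text of the page
            tables = page.find_tables().tables  # table detection (PyMuPDF >= 1.23)
            for table in tables:
                rows = table.extract()          # list of rows, each a list of cell strings
                print(pdf_path.name, rows)
        doc.close()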

12 weeks ago
JustinZhang ▴ 120

The landscape of document parsers has changed a lot with the rise of LLMs.

  1. Parsers designed for LLM training or RAG, for example Unstructured and open-parse.

  2. LLM-based solutions used directly, for example ragflow and LLMwhisperer; tutorials are everywhere on the Internet.

I believe LLMs are capable enough to give you structured output from cancer reports. You don't have to rely on the basic Python libraries as before, unless your task is very complicated. Just give it a try.
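As a rough sketch of option 1, this is roughly how Unstructured can be used on a single report; the file name is a placeholder, and the element types you care about will depend on the layout of your reports:

    # Partition a PDF into typed elements (Title, NarrativeText, Table, ...) with unstructured.
    from unstructured.partition.pdf import partition_pdf  # pip install "unstructured[pdf]"

    elements = partition_pdf(filename="cancer_report.pdf")
    for el in elements:
        print(el.category, "->", el.text[:80])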


Note though that depending on how you set this up, data can be sent to an LLM hosted by a third party. This may not be suitable for sensitive (e.g. medical) data. If this is a concern, you can run LLMs locally with e.g. ollama. For information extraction/summarization, small- to medium-size models should be enough depending on the length of the documents.


Of course, thanks for the addition. For medical purposes, the best approach right now is to deploy uncensored LLMs locally. You can try the dolphin series through ollama; I'm currently using dolphin-mixtral.
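For what it's worth, a minimal local-only sketch with the ollama Python client and dolphin-mixtral could look like the following; the prompt and the fields to extract are only examples and should be tuned to your reports:

    # Ask a locally served model to pull a few fields out of the extracted report text.
    import ollama  # pip install ollama; needs a running ollama server with the model pulled

    report_text = open("cancer_report.txt").read()  # text previously extracted from the PDF

    prompt = (
        "Extract the patient ID, diagnosis and tumour stage from the report below "
        "and return them as JSON.\n\n" + report_text
    )

    response = ollama.chat(
        model="dolphin-mixtral",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["message"]["content"])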


Thank you. I guess ollama needs a good amount of memory and a GPU to run. I'll try it on my system.


This depends on the model you use and the (inference) speed in tokens/s you want. You can run quantized 8-13B models at decent speed with 24 GB of RAM and no GPU if you have a fast CPU, or with a GPU with 8 GB of VRAM. For something like llama3:70b you'll want a GPU with at least 24 GB of VRAM and more than 32 GB of RAM, and it will run faster with two such GPUs because more of the model fits in VRAM. Don't go for quantization below 6 bits; in my experience there's too much degradation below that.
