Python for PDF reports
3 months ago
prs ▴ 20

I have a few cancer reports in PDF format and I need to extract information from them. Can anyone please suggest some Python tools that could be used for this purpose? Since there are many reports (of the same type), I don't want to do it manually each time and need to automate it. Any suggestions would be helpful.

Cancer Python3 reports

You can use PyMuPDF to extract tables from PDF files.
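A minimal sketch of what that could look like, assuming a recent PyMuPDF release (the table finder was added around 1.23) and a placeholder file name:

    import fitz  # PyMuPDF

    doc = fitz.open("report.pdf")  # placeholder path
    for page in doc:
        text = page.get_text()                    # plain text of the page
        for table in page.find_tables().tables:   # detected tables on the page
            print(table.extract())                # rows as lists of cell strings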

3 months ago
JustinZhang ▴ 120

Document-parsing tools have changed a lot with the rise of LLMs.

  1. Parsers designed for LLM training or RAG, for example Unstructured and open-parse (a rough sketch follows this list).

  2. Use an LLM-based solution directly, for example ragflow or LLMwhisperer; tutorials for these are everywhere on the Internet.
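A rough sketch of option 1, assuming the unstructured package is installed with its PDF extras (pip install "unstructured[pdf]") and using a placeholder file name:

    from unstructured.partition.pdf import partition_pdf

    # Split the PDF into typed elements (Title, NarrativeText, Table, ...)
    elements = partition_pdf(filename="cancer_report.pdf")  # placeholder path

    for el in elements:
        # el.category is the element type, str(el) is its text
        print(el.category, "->", str(el)[:80])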

I believe LLMs are capable enough to get structured output from cancer reports. You don't have to use the basic Python libraries as before, unless your task is very complicated. Just give it a try.


Note though that depending on how you set this up, data can be sent to an LLM hosted by a third party. This may not be suitable for sensitive (e.g. medical) data. If this is a concern, you can run LLMs locally with e.g. ollama. For information extraction/summarization, small- to medium-size models should be enough depending on the length of the documents.


Of course, thanks for your addition. For medical purposes, the best approach right now is to deploy uncensored LLMs locally. You can try the dolphin series through ollama; I'm currently using dolphin-mixtral.
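In case it's useful, a minimal sketch with the ollama Python client, assuming the model has already been pulled locally and using an illustrative prompt, field names, and file path:

    import json
    import ollama  # pip install ollama; talks to the local Ollama server

    report_text = open("report.txt").read()  # text extracted from the PDF beforehand

    prompt = (
        "Extract the following fields from this cancer report and reply with JSON only: "
        "patient_id, diagnosis, stage, biomarkers.\n\n"  # illustrative field names
        + report_text
    )

    response = ollama.chat(
        model="dolphin-mixtral",
        messages=[{"role": "user", "content": prompt}],
    )

    # The reply is plain text; parse it as JSON (add error handling in practice)
    print(json.loads(response["message"]["content"]))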


Thank you. I guess ollama needs a good amount of memory and a GPU to run. I'll try it on my system.


This depends on the model you use and the inference speed (in tokens/s) you want. You can run 8-13B quantized models at decent speed with 24 GB of RAM and no GPU if you have a fast CPU, or with a GPU with 8 GB of VRAM. For e.g. llama3:70b you'll want a GPU with at least 24 GB of VRAM and more than 32 GB of RAM, and it will run faster with two such GPUs because more of the model fits in VRAM. Don't go below 6-bit quantization, as in my experience there's too much degradation below that.

