About the Project
A legal organization based in the Netherlands needed an AI-powered system to process large volumes of scanned legal documents. The objective was to accurately extract text, identify complex elements such as tables, signatures, and watermarks, and transform unstructured content into clear, structured legal summaries.
Challenges
One of the main challenges was language support, as all documents were written in Dutch. In addition, many files were old and physically degraded, containing stains, faded text, and reconstructed sections. These conditions made traditional OCR solutions unreliable and required a more advanced, flexible approach.
Development Process
During the initial phase, we implemented Microsoft Azure Document Intelligence as a proof of concept. This API-based solution enabled document uploads and returned structured JSON outputs, including text extraction with precise positional data. While the accuracy was high, the client found the operational costs unsuitable for long-term, large-scale usage.
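To make "structured JSON with positional data" concrete, here is a minimal sketch of consuming such output. The JSON shape, the field names (`pages`, `lines`, `box`), and the sample text are hypothetical and chosen for illustration only. They are not the Azure Document Intelligence schema, which should be taken from the service's own documentation.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical OCR response, invented for this example.
SAMPLE_RESPONSE = """
{
  "pages": [
    {
      "page_number": 1,
      "lines": [
        {"text": "Rechtbank Amsterdam", "box": [72, 40, 310, 62]},
        {"text": "Vonnis van 12 maart 1998", "box": [72, 70, 355, 92]}
      ]
    }
  ]
}
"""


@dataclass(frozen=True)
class TextLine:
    page: int
    text: str
    box: Tuple[int, ...]  # (left, top, right, bottom) in pixels


def parse_lines(raw: str) -> List[TextLine]:
    """Flatten the (assumed) page/line structure into a list of TextLine."""
    data = json.loads(raw)
    result = []
    for page in data["pages"]:
        for line in page["lines"]:
            result.append(
                TextLine(page["page_number"], line["text"], tuple(line["box"]))
            )
    return result


def lines_in_region(lines: List[TextLine], top: int, bottom: int) -> List[TextLine]:
    """Keep only lines whose box lies fully inside a vertical band."""
    return [ln for ln in lines if ln.box[1] >= top and ln.box[3] <= bottom]
```

Keeping the coordinates next to the text is what allows later steps to reason about layout, for example selecting only the lines that fall inside a table or signature area.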
To reduce costs, we developed a fully offline OCR solution built around OpenCV and running in a Dockerized Python environment. This on-premises system replicated the core capabilities of the Azure service without recurring API fees. It extracted both printed and handwritten text, recognized signatures through pattern analysis, and detected watermarks and other security elements.
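Degraded scans with stains and faded ink usually need to be binarized before text can be recognized. The sketch below shows one common technique, Otsu thresholding, written in plain Python so that it runs without dependencies. It is an assumption that a step like this belongs in such a pipeline; the source does not describe the actual preprocessing used. In OpenCV the same idea is available through `cv2.threshold` with the `cv2.THRESH_OTSU` flag.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t (0-255) that best separates dark and light pixels.

    Pixels <= t are treated as ink, pixels > t as paper. The threshold
    maximizes the between-class variance of the two groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map a grayscale image (rows of 0-255 values) to pure black and white."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]
```

A single global threshold is the simplest option. Unevenly stained pages often respond better to a locally adaptive threshold, which follows the same principle applied per region.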
After implementing OCR, the client requested automated legal case summaries. We first integrated Azure Text Summarizer, allowing control over summary length and the choice between extractive and AI-generated summaries. However, its performance with Dutch legal content was limited.
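The difference between extractive and generated summaries can be illustrated with a small extractive example: it selects existing sentences instead of writing new ones, and the caller controls the length. This is a generic word-frequency sketch under simple assumptions (naive sentence splitting, words of four or more letters count as content words). It is not how the Azure service works internally.

```python
import re
from collections import Counter
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text: str) -> List[str]:
    """Lowercased words longer than three letters (a crude stop-word filter)."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) > 3]


def extractive_summary(text: str, max_sentences: int) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

A splitter this simple breaks on abbreviations such as "art." that are common in legal text, which hints at why generic summarization tends to struggle with domain-specific documents.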
To overcome this limitation, we built a custom Python proof of concept using OpenAI's ChatGPT models, which delivered significantly better summarization accuracy in Dutch. The new component kept the existing API structure, so the backend could switch between Azure and OpenAI models without any changes to the front end.
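One way to keep an API stable while swapping summarization providers is to hide each provider behind the same request and response types. The sketch below is a minimal illustration of that idea. All names (`SummaryRequest`, `SummaryResponse`, `SummaryService`, and the two stub backends) are hypothetical, and the stubs stand in for real network calls to Azure or OpenAI.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SummaryRequest:
    text: str
    max_sentences: int = 3


@dataclass(frozen=True)
class SummaryResponse:
    summary: str
    backend: str


# A backend is any function that turns a request into summary text.
Backend = Callable[[SummaryRequest], str]


class SummaryService:
    """Single entry point; callers never see provider-specific details."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, name: str, backend: Backend) -> None:
        self._backends[name] = backend

    def summarize(self, request: SummaryRequest, backend: str) -> SummaryResponse:
        if backend not in self._backends:
            raise ValueError(f"unknown backend: {backend}")
        return SummaryResponse(self._backends[backend](request), backend)


# Stub backends for illustration; real ones would call the provider's API.
def fake_azure_backend(request: SummaryRequest) -> str:
    return "[azure] " + request.text[:20]


def fake_openai_backend(request: SummaryRequest) -> str:
    return "[openai] " + request.text[:20]
```

With this shape, switching providers is a matter of passing a different backend name, for example `service.summarize(request, "openai")`, while the front end continues to receive the same `SummaryResponse` structure.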
Technologies
Microsoft Azure Document Intelligence, Azure Text Summarizer, OpenAI ChatGPT, OpenCV, Python, Docker
Business Value
The final solution delivered a scalable and cost-efficient framework for automated legal document analysis. By offering both cloud-based and offline OCR options, the client could choose the best balance between cost and accuracy based on processing needs.
Integrated summarization tools significantly reduced the time required to review lengthy legal documents, enabling legal professionals to access key insights within seconds. The system was also designed so that GDPR-compliant anonymization can be added later if required.
This AI-driven solution now plays a key role in automating document review, improving productivity, and minimizing manual effort for legal teams across the Netherlands.