Images et Contenus (IC) team: HEMMER Arthur

PhD student

Information extraction; neuro-symbolic reasoning; transactional documents; OCR; constrained decoding

Published on 20 June 2025

HAL ID: arthurhem

Affiliated research team
Images et Contenus (IC)

Research topics: My research focuses on neuro-symbolic information extraction from transactional documents such as invoices, payslips, and quotes. I develop hybrid methods that combine deep learning with symbolic validation to improve accuracy, data efficiency, and robustness. These methods integrate domain-specific constraints, including syntactic, arithmetic, and semantic rules, to guide model predictions and support zero-shot generalization. I also work on OCR robustness for numerical texts, proposing metrics for denoising complexity and using confidence scores to improve error detection. The goal is to build reliable and explainable systems for document understanding in real-world conditions.

Research highlights: My work combines neural architectures with symbolic reasoning to address the structural and semantic challenges of transactional documents. I design schema-based validation frameworks that enforce layered constraints on model outputs. This improves extraction quality, particularly in low-resource or zero-shot settings, and supports effective knowledge distillation. I introduced Lazy-k, a constrained decoding method that ensures label consistency while offering a flexible trade-off between accuracy and decoding time. In parallel, I explore post-OCR processing by integrating OCR confidence scores into transformer models, which improves the detection of noisy or unreliable tokens. My research is closely tied to real industrial needs through a collaboration with Shift Technology, where these methods are tested on production-grade document workflows.

Personal page: https://arthurhemmer.com
