Post-OCR Correction on Complaint Processing

Gonçalo Batalhão Alves

doi:10.34626/qdty-bf04

Please use this identifier to cite or link to this item: https://hdl.handle.net/10216/151937

Author(s):	Gonçalo Batalhão Alves
Title:	Post-OCR Correction on Complaint Processing
Issue Date:	2023-07-20
Abstract:	Health regulation entities deal with a high number of customer complaints, which they receive through various channels such as online forms, emails, letters, or a physical complaint book. In an effort to automatise complaint screening and prioritisation, they have been using Natural Language Processing (NLP) models to help guide the decision process. Most of the complaints passed to these models come not from online forms but from digitised versions of printed documents. The objective of these entities is to promptly resolve any issue that a complaint raises about a health service. Analysing each complaint manually is inefficient, because the volume of incoming data outpaces the speed of the manual process. Applying NLP models aims to reduce complaint processing time and to improve decision quality compared with the manual process. Although NLP has improved considerably in recent years and its impact has grown, most models are trained on curated text. When documents need to be digitised, an optical character recognition (OCR) technique is applied. The extracted text contains errors that depend on the quality of the scan, so it may not be clean enough for an NLP complaint classification model to produce the correct output. Although recent approaches tackle this issue in the form of post-OCR correction, well-performing models are lacking, and there is little work for languages with less abundant linguistic resources, such as Portuguese. This thesis intends to present a novel approach to these poorly extracted texts through the use of NLP models applied to the post-OCR process. We believe that an approach similar to that of an intelligent smartphone keyboard will improve the quality of the extracted texts. By using a dictionary to predict each correct character in the text extracted by the OCR process, and then applying the same technique to words, we will reduce the number of errors in the extracted text. We also expect that the NLP models used to classify complaints later in the processing pipeline will perform better because of the increased text quality. To validate that our methodology produces good results, we intend to carry out both intrinsic and extrinsic evaluation. In an intrinsic evaluation, the quality of an NLP system's outputs is evaluated against a predetermined ground truth. For this, we intend to apply our approach to curated datasets from known competitions, such as the ICDAR 2019 Competition on Post-OCR Text Correction [71], and compare our results with the ground truth provided. In contrast, an extrinsic evaluation assesses a system's outputs by their impact on the performance of other NLP systems. In this case, we will compare the outputs of the NLP complaint classification system before and after applying our novel approach.
Description:	Health regulation entities deal with a high number of customer complaints, which they receive through various channels such as online forms, emails, letters, or a physical complaint book. In an effort to automate complaint screening and prioritisation, these entities have been using natural language processing (NLP) models to help with the decision process. Most of the complaints passed to these models come not from online forms but from digitised versions of printed documents. The objective of these entities is to promptly resolve any problem mentioned in a complaint about a health service. Analysing each complaint manually is therefore inefficient, given the volume of data received compared with the speed of the manual process. Applying NLP models aims to reduce complaint processing time and to improve decision quality compared with the manual process. Although the field of NLP has improved considerably in recent years and its impact has grown, most models are trained on curated text. When documents need to be digitised, an optical character recognition (OCR) technique is applied. The extracted text contains some errors caused by the quality of the scan and, as such, may not have the quality needed for a complaint classification model to produce a correct classification. Although there are recent approaches to this problem in the form of post-OCR correction, well-performing models are lacking, and there is also little work for languages with less abundant linguistic resources, such as Portuguese. This thesis intends to present a new approach to texts containing OCR-introduced errors, through the use of NLP models applied to the post-OCR process. We believe that an approach similar to that of an intelligent smartphone keyboard will improve the quality of these extracted texts. By using a dictionary to predict each correct character of the text extracted by the OCR process, and then applying the same technique to words, we will reduce the number of errors in the text.
Subject:	Electrical engineering, Electronic engineering, Information engineering
Scientific areas:	Engineering and technology::Electrical engineering, Electronic engineering, Information engineering
DOI:	10.34626/qdty-bf04
TID identifier:	203428412
URI:	https://hdl.handle.net/10216/151937
Document Type:	Dissertation
Rights:	embargoedAccess
Embargo End Date:	2026-07-19
Appears in Collections:	FEUP - Dissertation
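
The abstract describes dictionary-based correction of OCR output and an intrinsic evaluation against ground truth. As an illustration only, the word-level step and the evaluation might be sketched as below. This is not the thesis method, which is not described in detail in this record: the fuzzy matching via the standard library's `difflib.get_close_matches`, the character error rate metric, the function names, the toy vocabulary, and the `cutoff` value are all assumptions made for this sketch.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.75):
    """Return the closest vocabulary entry, or the word unchanged if none is close.

    The vocabulary and the cutoff are hypothetical; a real system would need a
    large lexicon for the target language and a tuned threshold.
    """
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary):
    """Correct a whitespace-tokenised text word by word."""
    return " ".join(correct_word(w, vocabulary) for w in text.split())


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis, reference):
    """Edit distance normalised by reference length (intrinsic metric)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    vocabulary = {"the", "patient", "waited", "three", "hours"}
    reference = "the patient waited three hours"
    ocr_output = "the pat1ent waitcd three h0urs"

    corrected = correct_text(ocr_output, vocabulary)
    print("CER before:", character_error_rate(ocr_output, reference))
    print("CER after: ", character_error_rate(corrected, reference))
```

An extrinsic evaluation, by contrast, would feed both `ocr_output` and `corrected` to the downstream complaint classifier and compare its predictions, which this sketch does not attempt.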

Files in This Item:

File	Description	Size	Format
636731.pdf Restricted Access	Post-OCR Correction on Complaint Processing	63.83 MB	Adobe PDF
