Automatic Data Extraction Utilizing Structural Similarity From A Set of Portable Document Format (PDF) Files

Hadipurnawan Satria, Anggina Primanita

Abstract

Instead of storing data in databases, office workers often keep work-related data in document or report files that they can conveniently open with popular off-the-shelf software, such as Portable Document Format (PDF) files. Their workplaces may use databases, but the workers usually have neither the privileges nor the proficiency to fully utilize them. Such workplaces likely have front-end systems, such as a Management Information System (MIS), from which workers obtain reports or documents containing their data. These documents are meant for immediate or presentational use, but workers often keep the files for the data inside, which may prove useful later. This way, they can manipulate and combine data from one or more report files to suit their work needs on occasions when the MIS cannot fulfill them. To do this, workers need to extract data from the report files. However, the files also contain formatting and other content such as organization banners, signature placeholders, and so on. Extracting data from these files is not easy, and workers are often forced to copy and paste repeatedly to get the data they want. This is tedious, time-consuming, and prone to errors. Automatic data extraction is not new; many solutions exist, but they typically require human guidance before the extraction can become truly automatic. They may also require certain expertise, which can make workers hesitant to use them in the first place. A particular function of an MIS can produce many report files, each containing distinct data but structurally similar to the others. In this paper, we demonstrate that, for a set of PDF files that come from the same source, exploiting this similarity makes it possible to create a fully automatic data extraction system that requires no human guidance.
First, a model is generated by analyzing a small sample of the PDF files, and then the model is used to extract data from all PDF files in the set. Our experiments show that the system can quickly achieve a 100% accuracy rate with very few sample files. On occasions where the data inside the PDFs are not sufficiently distinct from each other, accuracy falls below 100%, but this can be easily detected and fixed with slight human intervention. In these cases, eliminating human intervention entirely may not be possible, but the amount needed can be significantly reduced.
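
The two-step idea of building a model from a few samples and then applying it to the whole set can be illustrated with a minimal sketch. This is not the system described in the paper: it assumes the text of each PDF has already been extracted into plain lines, it aligns documents purely by line and token position, and every name in it (`tokenize`, `build_model`, `extract`, the sample reports) is hypothetical. Tokens that are identical across all samples are treated as template content (banners, labels, signature placeholders); positions where the samples differ are treated as data.

```python
from typing import List, Set, Tuple

Doc = List[List[str]]        # a document: list of lines, each a list of tokens
Model = Set[Tuple[int, int]]  # (line index, token index) positions holding data


def tokenize(text: str) -> Doc:
    """Split already-extracted report text into lines of whitespace tokens."""
    return [line.split() for line in text.splitlines()]


def build_model(samples: List[Doc]) -> Model:
    """Mark every position of the first sample whose token differs in,
    or is missing from, at least one other sample."""
    variable: Model = set()
    first = samples[0]
    for i, line in enumerate(first):
        for j, tok in enumerate(line):
            for other in samples[1:]:
                if i >= len(other) or j >= len(other[i]) or other[i][j] != tok:
                    variable.add((i, j))
                    break
    return variable


def extract(doc: Doc, model: Model) -> List[str]:
    """Read the tokens found at the model's data positions, in reading order."""
    return [doc[i][j] for (i, j) in sorted(model)
            if i < len(doc) and j < len(doc[i])]


# Hypothetical reports that share one layout (assumption: same source).
report_a = tokenize("ACME Monthly Report\nEmployee: Alice\nTotal: 120\nSignature: ____")
report_b = tokenize("ACME Monthly Report\nEmployee: Bob\nTotal: 95\nSignature: ____")
report_c = tokenize("ACME Monthly Report\nEmployee: Carol\nTotal: 120\nSignature: ____")

# Samples whose data differ everywhere yield a complete model.
good_model = build_model([report_a, report_b])
carol_fields = extract(report_c, good_model)

# Samples that happen to share a value ("120") hide that field from the model.
weak_model = build_model([report_a, report_c])
bob_fields = extract(report_b, weak_model)
```

With `report_a` and `report_b` as samples, both the name and the total are recognized as data. With `report_a` and `report_c`, the total is identical in both samples and is mistaken for template text, which mirrors the case above where sample data are not sufficiently distinct and a small amount of human correction is needed.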

