How CARAT efficiently transforms large amounts of data from PDF catalogs into structured data with Semadox

Semadox automates data extraction from PDF price lists for CARAT — a tailor-made framework reads out extensive sales catalogs.


Semadox has made a name for itself in recent years as one of the best technology providers for PDF extraction software in the business sector. For CARAT, a provider of kitchen planning software (MHK Group), the team has now developed a framework that automatically reads extensive product catalogs. The task for Semadox: convert a kitchen manufacturer's PDF sales manuals (price lists) into a structured data format, without errors and in an automated workflow.


Challenge: Making product catalogs digitally usable

Many companies receive product data from suppliers in the form of PDF catalogs or price lists. These contain hundreds of pages of articles, prices and features: valuable data that is, however, not readily machine-readable. CARAT likewise faced the task of automatically converting a manufacturer's PDF kitchen catalogs into structured data. The incoming data for the last three years were available only as PDF sales manuals (price lists), but a digital data set in a specific format was expected.


Until now: disproportionate manual effort

Until now, the data was processed in an outdated way: the information was typed into Excel by hand, a slow and error-prone process. Not to mention the costs, which can easily reach six figures in euros.

In addition, there is a standard format (DCC-IDM Kitchen/Bath 3.0.1) for product and price data in the kitchen industry. In the long term, the catalog data should be available in such a standardized format so that it can be seamlessly integrated into the existing CARAT systems and processed further.

The key question was therefore: How can PDF catalogs be read so that the result is structured data that is ready for further processing?


The solution: The Semadox framework for automatic catalog data extraction

The Semadox framework combines cutting-edge AI methods for document analysis with flexible parsers to correctly interpret even unstructured or varying layouts in PDFs.

In concrete terms, this means that the system reads every page of the catalog PDF, recognizes headings, item descriptions, table structures and prices. These are then assigned to the defined data fields (e.g. item number, description, price, category, etc.).
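As a rough illustration of the field-assignment step (not the actual Semadox implementation), the following Python sketch maps one text row of a price list to a typed record. The row layout, the regular expression, and the names `CatalogItem`, `parse_price` and `parse_row` are assumptions made for this example; a simple pattern match stands in for the AI-based layout analysis described above.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class CatalogItem:
    """One extracted catalog entry; the field names are illustrative."""
    item_number: str
    description: str
    price: Decimal
    category: str


# Hypothetical row layout: "<item number>  <description>  <price>",
# with German-style prices such as "1.234,50".
ROW_PATTERN = re.compile(
    r"^(?P<item_number>[A-Z0-9-]+)\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<price>\d{1,3}(?:\.\d{3})*,\d{2})$"
)


def parse_price(text: str) -> Decimal:
    """Convert a German-style price string like '1.234,50' to a Decimal."""
    return Decimal(text.replace(".", "").replace(",", "."))


def parse_row(line: str, category: str) -> Optional[CatalogItem]:
    """Return a CatalogItem for an article row, or None for any other line."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None  # e.g. a heading or free text, not an article row
    return CatalogItem(
        item_number=match["item_number"],
        description=match["description"],
        price=parse_price(match["price"]),
        category=category,
    )
```

In this sketch the category is passed in by the caller, on the assumption that it was taken from the most recently recognized heading on the page.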

Thanks to machine learning components, the solution adapts to different formats and to differences between catalog years. For example, we were able to distinguish between catalog volumes and identify new, changed or omitted products. The extracted product data is finally output in the desired target format: initially Excel files as specified by CARAT, and in future also JSON or the industry-standard XML mentioned above.
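To make the idea of interchangeable output formats concrete, here is a minimal Python sketch that serializes the same records to JSON and to a simple XML document using only the standard library. The record fields and the element names `catalog` and `item` are hypothetical; in particular, the XML shown here is not the DCC-IDM schema, which would need its own mapping.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical extracted records; the field names are illustrative.
items = [
    {
        "item_number": "UB-60",
        "description": "Base unit 60 cm",
        "price": "1234.50",
        "category": "Base units",
    },
]


def to_json(records):
    """Serialize the records as a JSON array, keeping non-ASCII characters."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_xml(records):
    """Serialize the records as a flat, made-up XML structure."""
    root = ET.Element("catalog")
    for record in records:
        item = ET.SubElement(root, "item", attrib={"number": record["item_number"]})
        for field in ("description", "price", "category"):
            ET.SubElement(item, field).text = record[field]
    return ET.tostring(root, encoding="unicode")
```

Because both functions take the same list of records, adding a further target format only means adding another serializer; the extraction step stays unchanged.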


With this project, we were able to demonstrate the added value that automated catalog reading with Semadox technology brings:

- Huge time savings: Instead of weeks of manual data entry, large PDF catalogs are now processed within hours or even minutes. The CARAT team was able to make the catalog data available much faster.

- High data quality: Automated extraction eliminates typos and omissions. All product information is recorded completely and consistently — a reliable, structured data set instead of confusing PDF pages.

- Format as required: The framework is flexible with regard to the output format. Whether it's an Excel file for immediate use or JSON/XML for integration into other systems, Semadox delivers the data in the required format.

- Difference analysis possible: An additional benefit is that different catalog versions can now be compared. Since all data is available digitally, new, changed or omitted articles between volumes can be identified at the push of a button.
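The difference analysis from the last point can be sketched in a few lines of Python once both catalog volumes exist as structured data. This is an illustrative example under the assumption that each volume is a dictionary keyed by item number; the function name `diff_catalogs` and the sample records are made up for this sketch.

```python
def diff_catalogs(old, new):
    """Compare two catalog volumes, each a dict of item number -> record."""
    old_keys, new_keys = set(old), set(new)
    return {
        # Items that appear only in the newer volume.
        "new": sorted(new_keys - old_keys),
        # Items present in both volumes whose record differs.
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
        # Items that appear only in the older volume.
        "omitted": sorted(old_keys - new_keys),
    }


# Made-up sample volumes for illustration.
volume_a = {
    "UB-60": {"description": "Base unit 60 cm", "price": "1234.50"},
    "UB-80": {"description": "Base unit 80 cm", "price": "1500.00"},
    "WS-30": {"description": "Wall shelf 30 cm", "price": "300.00"},
}
volume_b = {
    "UB-60": {"description": "Base unit 60 cm", "price": "1299.00"},
    "UB-80": {"description": "Base unit 80 cm", "price": "1500.00"},
    "HS-45": {"description": "Tall unit 45 cm", "price": "2100.00"},
}

result = diff_catalogs(volume_a, volume_b)
```

Here `result` lists `HS-45` as new, `UB-60` as changed (its price differs) and `WS-30` as omitted.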

This use case also opens new doors for Semadox. Many sectors, from furniture retailers to industrial companies, are struggling with mountains of PDF data such as product catalogs, technical data sheets or price lists. Our AI-based solution has shown that there is enormous potential for increasing efficiency here.

We are looking forward to meeting similar challenges with other customers and driving forward the digital transformation of document-centered processes.


Do you also have unstructured documents from which you would like to obtain valuable data? Talk to us — we will find a solution together!

Image source: CARAT
