Mathieu Isabel  

When you need more than layout-specific extraction

Over the last few years, I have had the opportunity to be involved in multiple projects that revolved around extracting structured information from unstructured or semi-structured content. Along the way, I experimented with different tools, each with its own advantages and drawbacks.

At the beginning of 2023, I embarked on a new personal adventure: developing a new website. To say it’s a challenge is an understatement! I went through two major versions, including a complete rewrite in Vue 3 (which I didn’t know when I started the project). I learned a lot along the way, and it also allowed me to dive back into Python. In other areas of the project, I had the opportunity to work extensively with generative AI for a wide variety of tasks, such as contextualized question answering and processing user chatbot input, to name only a couple.

In today’s post, I’d like to dive into one of the capabilities that was an absolute must-have to achieve the mission of the site.

To give some context, my new website requires the ingestion of unstructured content from a wide variety of sources. For instance, item specifications might appear in free-form text or in specification tables provided by the manufacturer, and each source might differ in layout and in the information it provides. Each type of item also comes with its own extraction subtleties. Sometimes the specifications are abbreviated; sometimes they are expressed in different units that need to be standardized.

To tackle this challenge, I had to develop a custom extraction pipeline that allows for the following:

  • Configurable extraction process on a per-item-type basis
    • The process can be fine-tuned on a per-field basis for a particular item type
      • For example, how you extract battery specifications for a smartphone is different from how you would extract them for a large battery used in a solar system installation
  • Layout-independent extraction
    • It doesn’t require specific text labels to be present, nor does it expect the information at a specific location in the source document
  • Language agnostic
    • For example, the output of the extraction can be in English while the source content is in Italian
  • Missing data handling
    • For when the information is not present in the source or cannot be extracted from it
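
To make the first and last requirements a bit more concrete, here is a minimal sketch of what per-item-type, per-field configuration and missing-data handling could look like. It is only an illustration: `FieldSpec`, `EXTRACTION_CONFIG`, `fields_for` and `with_missing_defaults` are hypothetical names, and the hints and units are assumptions rather than the pipeline's actual configuration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one field to extract."""
    name: str
    data_type: str
    hint: str = ""  # field-level guidance, tuned per item type


# Hypothetical registry: the same field name carries different
# guidance depending on the type of item being processed.
EXTRACTION_CONFIG: Dict[str, List[FieldSpec]] = {
    "smartphone": [
        FieldSpec("battery_capacity", "number", "Capacity in mAh"),
    ],
    "solar_battery": [
        FieldSpec("battery_capacity", "number", "Usable capacity in kWh"),
    ],
}


def fields_for(item_type: str) -> List[FieldSpec]:
    """Return the field specifications configured for an item type."""
    return EXTRACTION_CONFIG[item_type]


def with_missing_defaults(item_type: str, extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Keep every configured field in the result, using None when the
    source did not contain (or did not allow extracting) a value."""
    return {spec.name: extracted.get(spec.name) for spec in fields_for(item_type)}
```

With a structure like this, a field that could not be extracted shows up explicitly as `None` instead of silently disappearing from the result.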

By using the extraction API, one can go from this:

to this:

Out of curiosity, I tested the same extraction pipeline with document types not currently required by the website. One of the examples I tested was a flight boarding pass:

For which the extraction API returned:

As you can see in the response above, the extraction pipeline was able to adjust to certain quirks in the extraction request. For example, it successfully separated the first name and last name without requiring specific instructions. The same principle applied to extracting the flight origin and destination without the airport IATA code.

One crucial point to note about this extraction pipeline is that it doesn’t require any particular training. The above example was achieved on the first attempt. The process was the following:

  • In the API request
    • Provide the fields you want to extract
      • Name, data type, possible values when applicable
    • Provide the source content to extract from
  • Submit the API request and get the structured response
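
The steps above could be sketched roughly as follows. This is not the actual API: the payload keys (`fields`, `content`, `possible_values`), the helper names and the `cabin_class` field are assumptions made up for illustration, and the source content is left as a placeholder.

```python
import json
from typing import Any, Dict, List, Optional, Sequence


def make_field(name: str, data_type: str,
               possible_values: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Describe one field: name, data type, and possible values when applicable."""
    field: Dict[str, Any] = {"name": name, "type": data_type}
    if possible_values:
        field["possible_values"] = list(possible_values)
    return field


def build_extraction_request(fields: List[Dict[str, Any]],
                             source_content: str) -> Dict[str, Any]:
    """Assemble a (hypothetical) extraction request body."""
    return {"fields": fields, "content": source_content}


# Fields loosely modeled on the boarding pass example; the names are assumptions.
request = build_extraction_request(
    fields=[
        make_field("first_name", "string"),
        make_field("last_name", "string"),
        make_field("origin", "string"),
        make_field("destination", "string"),
        # Hypothetical field showing how possible values could be constrained.
        make_field("cabin_class", "string", ["economy", "business", "first"]),
    ],
    source_content="<text of the source document>",
)

# The serialized body is what would be submitted to the extraction endpoint.
print(json.dumps(request, indent=2))
```

Notice that nothing in the request describes where the values sit in the document; only what is wanted is specified.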

The extraction pipeline adapts more easily to a wide variety of extraction tasks because it understands both the semantics of the extraction task to perform and the meaning present in the source content.

If you have a use case where you think this document extraction API might be valuable, don’t hesitate to contact me in the comments section.
