Mathieu Isabel  

Improving Unstructured Data Extraction Capabilities

Alongside ongoing work on the LLM Planner and Executor agents, a few updates were made to the platform's extraction capabilities. They make data extraction from unstructured content more robust and detailed, streamline the process, and improve the accuracy of the extracted information.

Let’s go through some of the enhancements.

Support for Full JSON Object Definitions

Overview: In addition to supporting key/value pairs, our system now fully supports JSON object definitions. This enhancement allows for more complex and structured data representation, enabling users to define intricate data extraction schemas that mirror real-world data structures.

Example: Consider a boarding pass extraction scenario. The system can now handle full JSON object definitions, capturing detailed information such as passenger name, flight details, seat number, and more in a structured format.

Sample Extraction Request

[
    {
        "extractionMethodVersion": "v4o-fc",
        "text": "NATIONAL AIRLINES PASSENGER TICKET AND BAGGAGE CHECK\n\nBOARDING PASS FIRST CLASS\n\nNATIONAL AIRLINES FIRST CLASS\n\n5A6BCD78\n\nPassenger JAMES SMITH From: CHICAGO ORD\n\nFlight\n\nDate\n\nTime\n\nNA4321\n\n06 DEC 20\n\n11:40\n\nGate\n\nBoarding till:\n\nSeat\n\n03\n\n11:20\n\n09A\n\nElectronic ticket 629\n\nTo:\n\nNEW YORK JFK Arrival time: 13:30 Therminal:\n\n2\n\nPLEASE BE AT THE GATE AT BOARDING TIME\n\ndreamstime.com\n\nPassenger JAMES SMITH\n\nFrom: CHICAGO ORD To: NEW YORK JFK\n\nFlight\n\nNA4321\n\nDate\n\n06 DEC 20\n\nGate\n\n03\n\nBoarding till:\n\n11:20\n\n5A6BCD78\n\nTime\n\n11:40\n\nSeat 09A\n\nQUIATU\n\nID 186380802 @ Fenix84\n\nFlight|Date|Time|\n|---|---|---\n|||\n|||\n|||\n|||\n|||\n|||\n|||\n|||\n|||\n\n\n",
        "description": "boarding-pass",
        "instructions": "1. Locate the passenger's name on the boarding pass. This is typically found near the top of the document and may be labeled as \"Passenger Name\" or \"Name.\"\n\n2. Identify the flight number and departure date. The flight number is usually a combination of letters and numbers and can be found near the top of the boarding pass. The departure date is typically listed next to the flight number.\n\n3. Find the departure and arrival airports. These are usually listed under the flight number and departure date. The departure airport will be labeled as \"From\" or \"Departure,\" while the arrival airport will be labeled as \"To\" or \"Arrival.\"\n\n4. Locate the ticket number or booking reference number. This is a unique identifier for the booking and can usually be found near the bottom of the boarding pass.\n\n5. Check for any additional relevant information, such as the fare class, seat number, and any special services or requests noted on the boarding pass.\n\n6. Once all relevant information has been extracted, verify that it matches the information provided by the customer requesting the refund. If everything aligns, the refund request can be validated and processed accordingly.",
        "fields": [],
        "object": {
            "boardingPass":{
                "type": "object",
                "properties": {
                    "passengerName": {
                        "type": "string",
                        "description": "The name of the passenger.",
                        "example": "John Doe"
                    },
                    "flightNumber": {
                        "type": "string",
                        "description": "The flight number.",
                        "example": "AA1234"
                    },
                    "dateOfFlight": {
                        "type": "string",
                        "format": "date",
                        "description": "The date of the flight.",
                        "example": "2023-10-15"
                    },
                    "departureAirport": {
                        "type": "string",
                        "description": "The IATA code of the departure airport.",
                        "example": "JFK"
                    },
                    "arrivalAirport": {
                        "type": "string",
                        "description": "The IATA code of the arrival airport.",
                        "example": "LAX"
                    },
                    "seatNumber": {
                        "type": "string",
                        "description": "The seat number assigned to the passenger.",
                        "example": "12A"
                    },
                    "boardingTime": {
                        "type": "string",
                        "format": "time",
                        "description": "The boarding time.",
                        "example": "14:30"
                    },
                    "class": {
                        "type": "string",
                        "description": "The class of service.",
                        "enum": [
                            "Economy",
                            "Business",
                            "First"
                        ],
                        "example": "Economy"
                    },
                    "ticketNumber": {
                        "type": "string",
                        "description": "The ticket number.",
                        "example": "1234567890"
                    },
                    "bookingReference": {
                        "type": "string",
                        "description": "The booking reference code.",
                        "example": "ABC123"
                    },
                    "gateNumber": {
                        "type": "string",
                        "description": "The gate number for boarding.",
                        "example": "G12"
                    },
                    "status": {
                        "type": "string",
                        "description": "The status of the flight.",
                        "enum": [
                            "On-time",
                            "Delayed",
                            "Canceled"
                        ],
                        "example": "On-time"
                    }
                },
                "required": [
                    "passengerName"
                ]
            }
        }
    }
]

Extraction Response

[
    {
        "extractionRate": 0,
        "fields": [],
        "object": {
            "boardingPass": {
                "passengerName": "JAMES SMITH",
                "flightNumber": "NA4321",
                "dateOfFlight": "2020-12-06",
                "departureAirport": "ORD",
                "arrivalAirport": "JFK",
                "seatNumber": "09A",
                "boardingTime": "11:20",
                "class": "First",
                "ticketNumber": "629",
                "bookingReference": "5A6BCD78",
                "gateNumber": "03"
            }
        }
    }
]
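
One way to sanity-check a response like the one above is to compare it with the definition that was sent. The sketch below is a minimal, standard-library-only check and not the platform's own validation: `check_extraction` is a hypothetical helper, and the schema is trimmed to three properties for brevity. It reports required properties that are missing, values outside an `enum`, and optional properties that were simply not found in the text (such as `status` in the boarding pass response).

```python
from typing import Any


def check_extraction(schema: dict[str, Any], extracted: dict[str, Any]) -> dict[str, list[str]]:
    """Compare one extracted object with its definition (hypothetical helper)."""
    properties = schema.get("properties", {})
    required = schema.get("required", [])
    report: dict[str, list[str]] = {
        "missing_required": [],
        "not_extracted": [],
        "invalid_enum": [],
    }

    for name, definition in properties.items():
        if name not in extracted:
            # Absent values are only a problem when the property is required.
            key = "missing_required" if name in required else "not_extracted"
            report[key].append(name)
            continue
        allowed = definition.get("enum")
        if allowed is not None and extracted[name] not in allowed:
            report["invalid_enum"].append(name)
    return report


# Trimmed copy of the boardingPass definition from the sample request.
boarding_pass_schema = {
    "type": "object",
    "properties": {
        "passengerName": {"type": "string"},
        "class": {"type": "string", "enum": ["Economy", "Business", "First"]},
        "status": {"type": "string", "enum": ["On-time", "Delayed", "Canceled"]},
    },
    "required": ["passengerName"],
}

# Matching subset of the sample response.
boarding_pass = {"passengerName": "JAMES SMITH", "class": "First"}

report = check_extraction(boarding_pass_schema, boarding_pass)
print(report)
# {'missing_required': [], 'not_extracted': ['status'], 'invalid_enum': []}
```

Since only `passengerName` is listed under `required`, a missing `status` is reported as not extracted and not as an error.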

Now Supporting Extraction of Lists

As another benefit of that improvement, you can now define lists/arrays within those objects. A prime example is extracting the transactions from an invoice.

Let’s look at a more complex extraction example that leverages that.

Sample Request

In the extraction request below, you can see that the object definition is more complex: it specifies a list of passengers and a list of transactions to extract from the text of a travel booking invoice.

[
    {
        "subject": "Booking Receipt",
        "extractionMethodVersion": "v4o-fc",
        "text": "Passengers\nName: Mathieu Isabel\nname:Arthur Isabel\nDetails\n\n|Transaction Date|Description|Quantity|Unit Price|Sub-Total|\n|---|---|---|---|---|\n|2024-01-14|Flight Montreal to Moscow|2|7000|7000|\n|2024-01-14|Flight Moscow to Montreal|2|6400|6400|\n|2024-01-14|Deluxe Suite|5|500|2500|\n",
        "description": "Invoice Transactions Text:\n",
        "instructions": "",
        "object": {
            "passengers":{
                "type": "array",
                "items": {
                    "type":"object",
                    "properties": {
                        "firstName": {
                            "type": "string",
                            "description": "The passenger first name."
                        },
                        "lastName": {
                            "type": "string",
                            "description": "The passenger last name."
                        }
                    }
                }
            },
            "transactions": {
                "type": "array",
                "description": "An array of all the transactions details.",
                "items": {
                    "type": "object",
                    "properties": {
                        "transactionDate": {
                            "type": "string",
                            "description": "When the transaction occurred."
                        },
                        "description": {
                            "type": "string",
                            "description": "The description for the transaction."
                        },
                        "category": {
                            "type": "string",
                            "description": "The category assigned to the transaction based on its description.",
                            "enum":["Air Fare","Hotel","Other"]
                        },
                        "quantity": {
                            "type": "number",
                            "description": "How many items apply to the transaction."
                        },
                        "unitPrice": {
                            "type": "number",
                            "description": "The price for one item in the transaction."
                        },
                        "subTotal": {
                            "type": "number",
                            "description": "The sub-total for the transaction line."
                        }
                    },
                    "required": [
                        "transactionDate",
                        "description",
                        "category",
                        "quantity",
                        "unitPrice",
                        "subTotal"
                    ]
                }
            }
        }
    }
]

Response

As you can see below, the extraction process returned both passengers from the text as well as a list of transactions with the requested data fields.

[
    {
        "fields": [],
        "object": {
            "passengers": [
                {
                    "firstName": "Mathieu",
                    "lastName": "Isabel"
                },
                {
                    "firstName": "Arthur",
                    "lastName": "Isabel"
                }
            ],
            "transactions": [
                {
                    "transactionDate": "2024-01-14",
                    "description": "Flight Montreal to Moscow",
                    "category": "Air Fare",
                    "quantity": 2,
                    "unitPrice": 7000,
                    "subTotal": 7000
                },
                {
                    "transactionDate": "2024-01-14",
                    "description": "Flight Moscow to Montreal",
                    "category": "Air Fare",
                    "quantity": 2,
                    "unitPrice": 6400,
                    "subTotal": 6400
                },
                {
                    "transactionDate": "2024-01-14",
                    "description": "Deluxe Suite",
                    "category": "Hotel",
                    "quantity": 5,
                    "unitPrice": 500,
                    "subTotal": 2500
                }
            ]
        }
    }
]
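
Once lists come back as structured JSON, they are straightforward to consume downstream. The sketch below is one possible way to do that in Python and not part of the platform: the `Transaction` dataclass and the helper functions are hypothetical names, and the data is copied from the sample response above. It loads the transactions into typed records, totals the sub-totals per category, and flags lines where quantity times unit price does not equal the sub-total.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transaction:
    # Field names mirror the property names of the object definition.
    transactionDate: str
    description: str
    category: str
    quantity: float
    unitPrice: float
    subTotal: float


def load_transactions(response: list[dict]) -> list[Transaction]:
    """Turn the `transactions` list of the first result into typed records."""
    rows = response[0]["object"]["transactions"]
    return [Transaction(**row) for row in rows]


def totals_by_category(transactions: list[Transaction]) -> dict[str, float]:
    """Sum the sub-totals per category."""
    totals: dict[str, float] = defaultdict(float)
    for t in transactions:
        totals[t.category] += t.subTotal
    return dict(totals)


def inconsistent_lines(transactions: list[Transaction]) -> list[str]:
    """Descriptions of lines where quantity * unitPrice differs from subTotal."""
    return [
        t.description
        for t in transactions
        if not math.isclose(t.quantity * t.unitPrice, t.subTotal)
    ]


# Transactions copied from the sample response (passengers omitted for brevity).
response = [{"fields": [], "object": {"transactions": [
    {"transactionDate": "2024-01-14", "description": "Flight Montreal to Moscow",
     "category": "Air Fare", "quantity": 2, "unitPrice": 7000, "subTotal": 7000},
    {"transactionDate": "2024-01-14", "description": "Flight Moscow to Montreal",
     "category": "Air Fare", "quantity": 2, "unitPrice": 6400, "subTotal": 6400},
    {"transactionDate": "2024-01-14", "description": "Deluxe Suite",
     "category": "Hotel", "quantity": 5, "unitPrice": 500, "subTotal": 2500},
]}}]

transactions = load_transactions(response)
print(totals_by_category(transactions))  # {'Air Fare': 13400.0, 'Hotel': 2500.0}
print(inconsistent_lines(transactions))  # both flight lines
```

The two flight lines are flagged because the sample invoice text lists a sub-total equal to the unit price for a quantity of 2. The extraction reproduced the source values faithfully; whether those values are correct is a question about the invoice, which a check like this can only surface.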

Improved Data Inference Logic

Given the new extraction capabilities, the logic that infers data to be extracted based on an objective has been significantly enhanced to support these new complex object definitions. This improvement ensures the extraction process is more accurate and capable of handling intricate data structures effectively.

Example

Given the following API payload providing the objective for the extraction:

{
    "extractionReasoningLevel": "advanced",
    "subject":"boarding-pass",
    "content": "",
    "objective": "I need to extract relevant information out of a boarding pass in order to validate a refund request from an airline customer.",
    "outputType":"object"
}

The following extraction request was generated.

{
    "subject": "boarding-pass",
    "extractionMethodVersion": "v4o-fc",
    "text": "",
    "description": "boarding-pass",
    "instructions": "1. Locate the passenger's name on the boarding pass. This is typically found near the top of the document and may be labeled as \"Passenger Name\" or \"Name.\"\n\n2. Identify the flight number and departure date. The flight number is usually a combination of letters and numbers and can be found near the top of the boarding pass. The departure date is typically listed next to the flight number.\n\n3. Find the departure and arrival airports. These are usually listed under the flight number and departure date. The departure airport will be labeled as \"From\" or \"Departure,\" while the arrival airport will be labeled as \"To\" or \"Arrival.\"\n\n4. Locate the ticket number or booking reference number. This is a unique identifier for the booking and can usually be found near the bottom of the boarding pass.\n\n5. Check for any additional relevant information, such as the fare class, seat number, and any special services or requests noted on the boarding pass.\n\n6. Once all relevant information has been extracted, verify that it matches the information provided by the customer requesting the refund. If everything aligns, the refund request can be validated and processed accordingly.",
    "fields": [],
    "object": {
            "boardingPass":{
                "type": "object",
                "properties": {
                    "passengerName": {
                        "type": "string",
                        "description": "The name of the passenger.",
                        "example": "John Doe"
                    },
                    "flightNumber": {
                        "type": "string",
                        "description": "The flight number.",
                        "example": "AA1234"
                    },
                    "dateOfFlight": {
                        "type": "string",
                        "format": "date",
                        "description": "The date of the flight.",
                        "example": "2023-10-15"
                    },
                    "departureAirport": {
                        "type": "string",
                        "description": "The IATA code of the departure airport.",
                        "example": "JFK"
                    },
                    "arrivalAirport": {
                        "type": "string",
                        "description": "The IATA code of the arrival airport.",
                        "example": "LAX"
                    },
                    "seatNumber": {
                        "type": "string",
                        "description": "The seat number assigned to the passenger.",
                        "example": "12A"
                    },
                    "boardingTime": {
                        "type": "string",
                        "format": "time",
                        "description": "The boarding time.",
                        "example": "14:30"
                    },
                    "class": {
                        "type": "string",
                        "description": "The class of service.",
                        "enum": [
                            "Economy",
                            "Business",
                            "First"
                        ],
                        "example": "Economy"
                    },
                    "ticketNumber": {
                        "type": "string",
                        "description": "The ticket number.",
                        "example": "1234567890"
                    },
                    "bookingReference": {
                        "type": "string",
                        "description": "The booking reference code.",
                        "example": "ABC123"
                    },
                    "gateNumber": {
                        "type": "string",
                        "description": "The gate number for boarding.",
                        "example": "G12"
                    },
                    "status": {
                        "type": "string",
                        "description": "The status of the flight.",
                        "enum": [
                            "On-time",
                            "Delayed",
                            "Canceled"
                        ],
                        "example": "On-time"
                    }
                },
                "required": [
                    "passengerName"
                ]
            }
        }
}

As you can see above, a complex schema was generated automatically based on the objective provided.
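
The generated request has an empty `text` field, while the sample requests earlier in this post are arrays of such objects with `text` filled in. Assuming the generated request is meant to be reused that way, the hand-off could look like the sketch below. `build_extraction_request` is a hypothetical helper, the template is a trimmed stand-in for the generated request, and sending the payload to the platform is left out because the endpoint is not described here.

```python
import copy
import json


def build_extraction_request(generated: dict, text: str) -> list[dict]:
    """Return a one-item request array with `text` filled in (hypothetical helper)."""
    request = copy.deepcopy(generated)  # leave the generated template untouched
    request["text"] = text
    return [request]


# Trimmed stand-in for the generated request shown above.
generated_request = {
    "subject": "boarding-pass",
    "extractionMethodVersion": "v4o-fc",
    "text": "",
    "description": "boarding-pass",
    "instructions": "",
    "fields": [],
    "object": {
        "boardingPass": {
            "type": "object",
            "properties": {"passengerName": {"type": "string"}},
            "required": ["passengerName"],
        }
    },
}

payload = build_extraction_request(
    generated_request, "Passenger JAMES SMITH From: CHICAGO ORD"
)
print(json.dumps(payload, indent=4))
```

Copying the template before filling it in means the same generated request can be reused for every boarding pass that needs to be processed.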

Conclusion

With support for full JSON object definitions, lists, and improved data inference logic, users can now achieve more accurate and detailed data extraction, leading to better insights and decision-making.

The goal of these improvements is to broaden what can be achieved when executing automatically generated job plans, as explained in my previous posts, LLM-Based Job Planner and Executor Agents For Task Management and Execution and LLM Based Planner: Better Plans Through Better Objectives.

If you have any questions or comments regarding this, feel free to leave a comment below!
