When you need more than layout specific extraction (part 2)

Here are a few more technical details about the extraction pipeline covered in the introductory article, When you need more than layout specific extraction.

High Level Process

Extract Text Content and Layout

Before extraction can be performed on an image source, you first need to extract its text content. When doing so, it’s important to preserve the structural elements of the document as much as possible because they carry meaning.

Convert Into Intermediary Text Format

In this step, you take the output of the image-to-text conversion and convert it into a text format that’s denser and better optimized for the subsequent extraction steps, while still maintaining the structural elements of the document.
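
As a rough illustration of what such a conversion could look like, here is a minimal sketch that turns positioned words into plain text lines. The Word shape, the to_dense_text name and the line_tolerance value are all assumptions made for this example; the real intermediary format may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical shape of one OCR result: the text plus its top-left position."""
    text: str
    x: float
    y: float


def to_dense_text(words, line_tolerance=5.0):
    """Group words into lines by vertical position, then order each line left to right."""
    lines = []  # each entry is (reference_y, [words on that line])
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word.y, [word]))
    return "\n".join(
        " ".join(w.text for w in sorted(line_words, key=lambda w: w.x))
        for _, line_words in lines
    )


words = [
    Word("Weight:", 10, 100), Word("2.6", 80, 101), Word("kg", 110, 99),
    Word("Model", 10, 50), Word("V11", 70, 52),
]
print(to_dense_text(words))
```

Keeping labels and values on the same line is the point here: it preserves the pairing that the layout expressed visually, without carrying the coordinates into the prompt.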

Generate Extraction LLM Prompt

Depending on how you interact with the extraction API, I have a step where the LLM extraction prompt is generated based on the item type. For example, if I need to extract data for a stick vacuum, I retrieve the list of all the fields that need to be extracted for that item type, along with the specific extraction configuration for each field.
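
To make the idea concrete, here is a minimal sketch of prompt generation driven by a per-item-type field list. The FIELD_CONFIG structure, the field names and the prompt wording are hypothetical and only meant to show the shape of the step, not the actual configuration.

```python
# Hypothetical per-item-type field configuration; names and hints are illustrative.
FIELD_CONFIG = {
    "stick_vacuum": [
        {"name": "weight_kg", "type": "float", "hint": "Product weight in kilograms"},
        {"name": "runtime_min", "type": "int", "hint": "Maximum battery runtime in minutes"},
        {"name": "bagless", "type": "bool", "hint": "Whether the vacuum is bagless"},
    ],
}


def build_prompt(item_type, document_text):
    """Assemble an extraction prompt from the fields configured for an item type."""
    fields = FIELD_CONFIG[item_type]
    field_lines = "\n".join(
        f"- {field['name']} ({field['type']}): {field['hint']}" for field in fields
    )
    return (
        f"Extract the following fields for a {item_type.replace('_', ' ')}.\n"
        "Return a JSON object keyed by field name; use null when a value is absent.\n\n"
        f"Fields:\n{field_lines}\n\n"
        f"Document:\n{document_text}"
    )


print(build_prompt("stick_vacuum", "Model V11\nWeight: 2.6 kg"))
```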

Determine Appropriate LLM Model

Once you have a better idea of the extraction job at hand, you can make a better decision as to which LLM model is optimal for it. It’s also an opportunity to optimize costs when that makes sense.
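
One simple way to frame that decision is to pick the cheapest model whose context window fits the job. The sketch below assumes a made-up model catalog and a crude "four characters per token" estimate; real token counts depend on each model’s tokenizer, and the selection criteria could just as well include accuracy or latency.

```python
# Hypothetical catalog; model names, sizes and relative costs are placeholders.
MODEL_CATALOG = [
    {"name": "small-model", "context_tokens": 8_000, "relative_cost": 1},
    {"name": "large-model", "context_tokens": 128_000, "relative_cost": 10},
]


def estimate_tokens(text):
    """Very rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def choose_model(prompt_text, reserved_output_tokens=1_000):
    """Return the cheapest model that fits the prompt plus room for the answer."""
    needed = estimate_tokens(prompt_text) + reserved_output_tokens
    fitting = [m for m in MODEL_CATALOG if m["context_tokens"] >= needed]
    if fitting:
        return min(fitting, key=lambda m: m["relative_cost"])
    # Nothing fits: fall back to the largest window and rely on batching afterwards.
    return max(MODEL_CATALOG, key=lambda m: m["context_tokens"])


print(choose_model("x" * 400)["name"])
print(choose_model("x" * 40_000)["name"])
```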

Generate Extraction Batches

Once you have an idea of the scale/complexity of the extraction task and the targeted model, you can proceed to generating the actual extraction batches. This allows you to deal with the various subtleties of each LLM’s token limits.
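
A minimal sketch of batching, assuming the unit being batched is the list of field descriptions and that a greedy packing against a token budget is good enough. The make_batches name, the budget and the token estimate are all assumptions for illustration; batching the document text itself would follow the same pattern.

```python
def estimate_tokens(text):
    """Very rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def make_batches(field_descriptions, token_budget):
    """Greedily pack field descriptions into batches that stay within the token budget."""
    batches, current, used = [], [], 0
    for description in field_descriptions:
        cost = estimate_tokens(description)
        # Start a new batch when adding this field would exceed the budget.
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(description)
        used += cost
    if current:
        batches.append(current)
    return batches


fields = ["a" * 40, "b" * 40, "c" * 40]  # roughly 10 tokens each
print([len(batch) for batch in make_batches(fields, token_budget=20)])
```

Note that a single field larger than the budget still gets its own batch in this sketch rather than being dropped, so an oversized field would need separate handling.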

Process Extraction Batches

With everything good to go, you can now iterate through the extraction batches and process their outputs. It’s at this step that you can start structuring the extraction outputs.
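
The loop below is a minimal sketch of that iteration. It assumes each batch is a list of field names and that the model is asked to answer with a JSON object; call_llm is a hypothetical callable standing in for whichever client you use, not a real library function.

```python
import json


def process_batches(batches, call_llm):
    """Run each batch through the model and merge the answers into one dictionary."""
    results = {}
    for batch in batches:
        raw = call_llm(batch)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = {}
        if not isinstance(parsed, dict):
            parsed = {}
        # Keep only the requested fields; anything missing is recorded as None.
        for field_name in batch:
            results[field_name] = parsed.get(field_name)
    return results


def fake_llm(batch):
    """Stand-in for a real model call, used only to exercise the loop."""
    return "not json" if "bagless" in batch else '{"weight_kg": 2.6}'


print(process_batches([["weight_kg", "runtime_min"], ["bagless"]], fake_llm))
```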

Sanitize Extracted Values

Depending on the extraction task, you might need to further sanitize the extracted values. For example, you might want to ensure the extracted value is within a particular value domain. At a minimum, you would validate that the extracted value is of the expected data type. For example, if you need a float value and the extraction incorrectly returned a string, this would be your opportunity to deal with that.
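
Here is a minimal sketch of the float case, assuming you want to recover a number from a string when possible and reject anything outside an allowed range. The sanitize_float helper and its rules (for instance treating a comma as a decimal separator) are assumptions for illustration only.

```python
import re

# First number in the text, with an optional decimal part.
NUMBER_PATTERN = re.compile(r"-?\d+(?:[.,]\d+)?")


def sanitize_float(value, minimum=None, maximum=None):
    """Coerce an extracted value to float, or return None if it is unusable."""
    match = NUMBER_PATTERN.search(str(value))
    if match is None:
        return None
    # Assumption: a comma is a decimal separator, never a thousands separator.
    number = float(match.group().replace(",", "."))
    if minimum is not None and number < minimum:
        return None
    if maximum is not None and number > maximum:
        return None
    return number


print(sanitize_float("2.6 kg", minimum=0, maximum=20))
print(sanitize_float("n/a"))
```

Returning None instead of raising keeps the decision about missing values with the caller, which can then retry, flag the field, or leave it empty in the response.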

Send Response

With the extracted data values in hand, you can now simply send the response payload back to the API caller.

Conclusion

Hopefully this helped you understand a little better how the extraction pipeline presented in the first part works. If you’re interested in using the extraction API methods, don’t hesitate to contact me in the comments section.

Dretza