Content Analysis
mathieu.isabel  

From PDF to JSON: Extracting Credit‑Card Statements with the Content API

Financial documents are full of rich, structured information hiding in messy layouts.
In this walkthrough we’ll create an extraction analysis definition, create a content object, upload a sample credit‑card statement, OCR the raw text, and then run our extraction to get clean, machine‑readable data, all with five API calls.

Stack
Language: cURL (easy to translate to JS/Python SDK)
Document: statement.pdf (2‑page Visa summary)

In this blog post we’ll be using the following sample document:

1) Define What We Want to Extract

First, we describe the fields we care about (statement date, cardholder name, each transaction’s date / description / amount).

🗂 What Is an Analysis Definition?

An analysis definition is a JSON document that tells the Content API what to pull out of a piece of content and how to label it.
Think of it as the contract between your unstructured input (PDF, image, etc.) and the clean JSON you want back.

Below is a quick tour of the key top‑level properties, using the “Credit Card Statement Extraction” analysis definition.

| Field | Purpose | Example Values |
| --- | --- | --- |
| analysisType | Declares the overall job you’re doing. For raw data capture it’s "Content Extraction". | "Content Extraction" |
| title | Human‑readable name that shows up in dashboards. | "Credit Card Statement Extraction" |
| publishingStatus | Lifecycle flag: "Draft", "Published", or "Archived". Drafts can be edited; published defs are version‑locked. | "Draft" |
| extractionDefinitions | Heart of the schema: everything under this node describes the data you want back. | (see below) |

Let’s look in more detail at how you specify what to extract.

🔍 Inside extractionDefinitions

| Field | Purpose | Notes |
| --- | --- | --- |
| subject | Plain‑text label of what this doc represents. Helps metrics & auto‑routing. | "Credit Card Statement" |
| extractionMethodVersion | Which internal extractor pipeline to use (e.g., v5 for the latest regex + ML hybrid). | "v5" |
| instructions | Free‑form prompt telling the extractor how to behave (e.g., leave blanks if data missing; don’t mix bank vs. cardholder details). | |
| object | JSON Schema describing the exact structure you expect back: objects, arrays, enums, formats, required fields, etc. | |
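
To make the shape concrete, here is a minimal Python sketch that assembles a definition from the properties described above. `build_extraction_definition` is a hypothetical helper, not part of any SDK; only the property names come from this post. The full request shown later also carries `text`, `description`, `fields` and `summarizationDefinition`, which this sketch leaves out because the post does not say whether they are required.

```python
import json


def build_extraction_definition(title, subject, instructions, schema,
                                method_version="v5", status="Draft"):
    """Assemble the body of an extraction analysis definition.

    Hypothetical helper: only the property names are taken from the post.
    """
    return {
        "analysisType": "Content Extraction",
        "title": title,
        "publishingStatus": status,
        "extractionDefinitions": {
            "subject": subject,
            "extractionMethodVersion": method_version,
            "instructions": instructions,
            # The schema is wrapped under a named root, as in the example.
            "object": schema,
        },
    }


definition = build_extraction_definition(
    title="Credit Card Statement Extraction",
    subject="Credit Card Statement",
    instructions="If certain properties cannot be found, leave blank.",
    schema={"CreditCardStatement": {"type": "object", "properties": {}}},
)
print(json.dumps(definition, indent=2))
```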

Key points about the object block:

  • It follows the JSON Schema draft‑07 spec, so any draft‑07 validator will work.
  • type, properties, required, enum, and format are all honored by the engine.
  • Nested objects (e.g., accountHolder.address) let you capture hierarchies cleanly.
  • Arrays (transactions) define an items schema so every row is validated.
  • Examples are optional but improve the model’s accuracy during few‑shot learning.
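
To show what relying on required fields can look like downstream, here is a minimal sketch that walks a schema and reports missing required properties. It only inspects `type`, `properties`, `required` and `items`, so it is not a full JSON Schema validator; `missing_required` is a hypothetical helper and the tiny schema below is a cut-down illustration, not the real definition.

```python
def missing_required(schema, value, path=""):
    """Return the paths of required properties that are absent from value."""
    missing = []
    if schema.get("type") == "object" and isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                missing.append(f"{path}.{name}" if path else name)
        # Recurse into the properties that are present.
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                child = f"{path}.{name}" if path else name
                missing.extend(missing_required(sub_schema, value[name], child))
    elif schema.get("type") == "array" and isinstance(value, list):
        for index, item in enumerate(value):
            missing.extend(
                missing_required(schema.get("items", {}), item, f"{path}[{index}]")
            )
    return missing


schema = {
    "type": "object",
    "properties": {
        "statementDate": {"type": "string", "format": "date"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"amount": {"type": "number"}},
                "required": ["transactionDate", "amount"],
            },
        },
    },
    "required": ["statementDate", "transactions"],
}

print(missing_required(schema, {"transactions": [{"amount": 105}]}))
# → ['statementDate', 'transactions[0].transactionDate']
```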

⏱ Why It Matters

  • Single Source of Truth – Update the schema once; all future extractions conform.
  • Strong Contracts – Downstream code can rely on types & required fields.
  • Auditability – Each definition is versioned; you can reproduce past results.

With this definition in place you only need two IDs when you run an extraction:

  1. definitionId: returned when you POST /analysis/definitions
  2. documentId: the contentId returned when you POST /contents

Pass both to /contents/{documentId}/analysis/process?definition-id={definitionId} and you’ll receive JSON that exactly matches the schema above.
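
As a small illustration, the URL can be assembled like this in Python. The host in `BASE_URL` is a placeholder (the post does not give one) and `process_url` is a hypothetical helper; only the path and the `definition-id` query parameter come from the text above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://content-api.example.com"  # placeholder, not the real host


def process_url(document_id, definition_id, base_url=BASE_URL):
    """Build the URL of the extraction call from the two IDs."""
    path = f"/contents/{quote(document_id, safe='')}/analysis/process"
    query = urlencode({"definition-id": definition_id})
    return f"{base_url}{path}?{query}"


print(process_url("92b2c71a-84be-40fb-8f66-ac7caf0138bc",
                  "521292ac-d086-4071-883b-43b19bfc5e9d"))
```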

That’s the anatomy of an extraction analysis definition: concise, but expressive enough to handle complex documents.

Here’s a truncated version of the extraction analysis definition:

POST /analysis/definitions
Content-Type: application/json

{
    "analysisType": "Content Extraction",
    "title": "Credit Card Statement Extraction",
    "publishingStatus": "Draft",
    "extractionDefinitions": {
        "subject": "Credit Card Statement",
        "extractionMethodVersion": "v5",
        "text": "",
        "description": "Credit Card Statement",
        "instructions": "Extract relevant fields from Credit Card Statement. If certain properties cannot be found, leave blank. Ensure the information extracted belongs to the right party, i.e. financial institution information vs. the account/cardholder information.",
        "fields": [],
        "object": {
            "CreditCardStatement": {
                "type": "object",
                "properties": {
                    .... content truncated...
                    "statementDate": {
                        "type": "string",
                        "format": "date",
                        "description": "The date when the statement was issued.",
                        "example": "2023-10-01"
                    },
                    "issuingInstitutionName": {
                        "type": "string",
                        "description": "The name of the financial institution that issued the card.",
                        "example": "CIBC",
                        "enum": [
                            "AMEX",
                            "CIBC",
                            "Manulife",
                            "TD"
                        ]
                    },
                    "accountHolder": {
                        "type": "object",
                        "description": "The person the account belongs to.",
                        "properties": {
                            "firstName": {
                                "type": "string",
                                "description": "The first name of the account holder.",
                                "example": "John"
                            },
                            "lastName": {
                                "type": "string",
                                "description": "The last name of the account holder",
                                "example": "Smith"
                            },
                            "address": {
                                "type": "object",
                                "description": "The address of the account holder not to be confused with the financial institution address.",
                                "properties": {
                                    .... content truncated...
                                    "countryName": {
                                        "type": "string",
                                        "description": "The country of the account holder.",
                                        "example": "Canada"
                                    },
                                    "postalCode": {
                                        "type": "string",
                                        "description": "The postal code of the account holder",
                                        "example": "J1M 0C9"
                                    }
                                },
                                "required": [
                                    "lineAddress1",
                                    "cityName",
                                    "provinceName",
                                    "countryName",
                                    "postalCode"
                                ]
                            }
                        },
                        "required": [
                            "firstName",
                            "lastName",
                            "address"
                        ]
                    },
                    "cardInfo": {
                        "type": "object",
                        "description": "Details of the credit card the statement covers.",
                        "properties": {
                            .... content truncated...
                            "productName": {
                                "type": "string",
                                "description": "The product name of the credit card.",
                                "example": "CIBC Aventura Visa Infinite"
                            }
                        },
                        "required": [
                            "fistSixDigits",
                            "lastFourDigits",
                            "networkType",
                            "productName"
                        ]
                    },
                    "transactions": {
                        "type": "array",
                        "description": "List of transactions that appear on the statement.",
                        "items": {
                            "type": "object",
                            "properties": {
                             .... content truncated...
                                "transactionDate": {
                                    "type": "string",
                                    "format": "date",
                                    "description": "Date when the transaction occurred.",
                                    "example": "2023-10-15"
                                },
                                "amount": {
                                    "type": "number",
                                    "description": "Amount for the transaction",
                                    "example": 400
                                },
                                "currencyCode": {
                                    "type": "string",
                                    "description": "The ISO 4217 three-letter currency code associated with the transaction. Leave blank if not provided.",
                                    "example": "CAD"
                                }
                            },
                            "required": [
                                "vendor",
                                "vendorLocation",
                                "type",
                                "transactionDate",
                                "amount"
                            ]
                        }
                    },
                    "paymentInfo": {
                        "type": "object",
                        "description": "The payment information in the statement.",
                        "properties": {
                            "lastPaymentAmount": {
                                "type": "number",
                                "description": "The amount of the last payment made for the account.",
                                "example": 155.25
                            },
                            "minimumPaymentAmount": {
                                "type": "number",
                                "description": "The minimum amount to pay.",
                                "example": 125.78
                            }
                            
                        },
                        "required": [
                            "lastPaymentAmount",
                            "minimumPaymentAmount"
                        ]
                    },
                    "balanceInfo": {
                        "type": "object",
                        "description": "The balance information in the statement.",
                        "properties": {
                            .... content truncated...
                            "currentBalanceAmount": {
                                "type": "number",
                                "description": "The current account balance.",
                                "example": 125.78
                            }
                        },
                        "required": [
                            "lastBalanceAmount",
                            "currentBalanceAmount"
                        ]
                    }
                },
                "required": [
                    "statementIdentifier",
                    "accountIdentifier",
                    "statementDate",
                    "issuingInstitutionName",
                    "accountHolder",
                    "cardInfo",
                    "transactions",
                    "paymentInfo",
                    "balanceInfo"
                ]
            }
        }
    },
    "summarizationDefinition": {}
}

From the Analysis Definition creation call, we’ll receive the definitionId which we’ll need to process content in the next steps.

{
    "id": "0138a541-f4e8-4c53-8acc-360a9e2db3fd",
    "definitionId": "521292ac-d086-4071-883b-43b19bfc5e9d",
    "analysisType": "Content Extraction",
    "title": "Credit Card Statement Extraction",
....
Rest of the definition
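
For readers translating the call to Python, here is a minimal standard-library sketch that builds the request and reads `definitionId` from the response. The host is a placeholder, authentication headers are left out because the post does not describe them, and both function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://content-api.example.com"  # placeholder, not the real host


def definition_request(definition, base_url=BASE_URL):
    """Build (but do not send) the POST /analysis/definitions request."""
    return urllib.request.Request(
        url=f"{base_url}/analysis/definitions",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def definition_id_from(response_body):
    """Pull definitionId out of the JSON response text."""
    return json.loads(response_body)["definitionId"]


request = definition_request({"title": "Credit Card Statement Extraction"})
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and handing the decoded body to `definition_id_from`.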

2) Create a Content Object

To create the basic content object, you can simply call the content creation method with this minimal request body.

POST /contents
Content-Type: application/json

{
    "contentType": "Credit Card Statement"
}

From the creation of the Content object, we’ll receive our contentId, which is the value to use for {documentId} in subsequent calls:

{
    "id": "bc36b4a2-4be9-42f6-a7dc-585e0a5c422a",
    "contentId": "92b2c71a-84be-40fb-8f66-ac7caf0138bc",
    "contentType": "Credit Card Statement",
    "publishingStatus": "Draft",
    "scope": [],
    "url": "",
    "title": "",
    "subject": "",
    "publisherName": "",
    "overallRating": 0,
    "snippet": "",
    "languageCode": "",
    "defaultAudioLanguage": "",
    "dateLastCrawled": "",
    "datePublished": "",
    "statistics": {
        "viewCount": 0,
        "likeCount": 0,
        "commentCount": 0,
        "duration": ""
    },
    "texts": [],
    "segments": [],
    "branch": {
        "id": "7526210e-7217-4052-a2ae-0e7ac867fb03"
    },
    "version": {
        "id": "bc36b4a2-4be9-42f6-a7dc-585e0a5c422a",
        "parentVersionId": "bc36b4a2-4be9-42f6-a7dc-585e0a5c422a",
        "createdDateTime": "2025-04-23T16:41:25.533532Z"
    }
}
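
The response carries several identifiers. In the sample above the top-level `id` has the same value as `version.id`, which suggests that `id` names a version while `contentId` names the content object itself. That reading is an inference from this one sample, not documented behaviour. A minimal sketch for keeping them apart (`content_ids` is a hypothetical helper):

```python
import json


def content_ids(response_body):
    """Pick the identifiers out of a content-creation response."""
    body = json.loads(response_body)
    return {
        "contentId": body["contentId"],       # used in later /contents/... calls
        "versionId": body["version"]["id"],
        "branchId": body["branch"]["id"],
    }


sample = """{
    "id": "bc36b4a2-4be9-42f6-a7dc-585e0a5c422a",
    "contentId": "92b2c71a-84be-40fb-8f66-ac7caf0138bc",
    "branch": {"id": "7526210e-7217-4052-a2ae-0e7ac867fb03"},
    "version": {"id": "bc36b4a2-4be9-42f6-a7dc-585e0a5c422a"}
}"""
print(content_ids(sample)["contentId"])
```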

3) Upload the Document

Next, upload the document to the platform:

POST /contents/{documentId}/upload

For the upload, set the request body to multipart/form-data with two parts: a “file” part and a “fileName” part.
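
If your HTTP client does not build multipart bodies for you, the encoding can be done by hand. The sketch below is a minimal standard-library version; the part names `file` and `fileName` come from the post, while the `application/pdf` content type on the file part and the helper name `multipart_body` are assumptions.

```python
import uuid


def multipart_body(file_name, file_bytes, content_type="application/pdf"):
    """Return (body, content_type_header) for the two-part upload form."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        b'Content-Disposition: form-data; name="fileName"',
        b"",
        file_name.encode("utf-8"),
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"'.encode("utf-8"),
        f"Content-Type: {content_type}".encode("ascii"),
        b"",
        file_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


body, header = multipart_body("statement.pdf", b"%PDF-1.4 sample bytes")
print(header.split(";")[0], len(body))
```

The returned header value goes into the request's `Content-Type`, and `body` is sent as the request payload.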

4) Extract the Raw Text

Now that the content is uploaded to the platform, we can proceed with the text extraction. In this example, we’ll simply use the default behavior, which is to OCR the document and keep the full text that was extracted. Do note that you can define a “scraping” analysis definition, which gives you further control over the behavior by applying anonymization and other filters to the full text captured by the OCR process. More on that in another post!

POST /contents/{documentId}/scrape

5) Run the Extraction & Retrieve Results

The final step is to run the extraction by applying the extraction analysis definition created in the first step to a specific content object:

POST /contents/{documentId}/analysis/process?definition-id={definitionId}
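
Putting the calls of this walkthrough in order, the whole flow might be sketched as follows. `send(method, path, payload)` is a hypothetical transport hook supplied by the caller that returns the decoded JSON response, so the sketch itself never touches the network. It also assumes each call has finished before the next one starts; if the platform runs scraping or extraction as background tasks, you would need to wait for them, which is not shown here.

```python
def run_extraction(send, definition, content_type, file_name, file_bytes):
    """Run the calls in order and return the analysis result."""
    definition_id = send("POST", "/analysis/definitions", definition)["definitionId"]
    content_id = send("POST", "/contents", {"contentType": content_type})["contentId"]
    # The transport is expected to encode this payload as multipart/form-data.
    send("POST", f"/contents/{content_id}/upload",
         {"file": file_bytes, "fileName": file_name})
    send("POST", f"/contents/{content_id}/scrape", None)
    return send(
        "POST",
        f"/contents/{content_id}/analysis/process?definition-id={definition_id}",
        None,
    )


def fake_send(method, path, payload):
    """Stand-in transport that echoes canned IDs, for a dry run."""
    if path == "/analysis/definitions":
        return {"definitionId": "def-1"}
    if path == "/contents":
        return {"contentId": "doc-1"}
    return {"path": path}


print(run_extraction(fake_send, {}, "Credit Card Statement", "statement.pdf", b""))
```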

After the extraction task is completed, you will receive an Analysis Result object which contains the extracted properties:

{
    "id": "d617e480-6f60-4970-ac18-cdea71bcee09",
    "contentReferences": [
        {
            "id": "a21f6507-1df9-4d7d-88d0-defee70a4006",
            "contentType": "Credit Card Statement",
            "structureType": "full-text"
        }
    ],
    "definitionId": "813bd48a-26b7-4f28-8ae1-87887625cf9f",
    "definitionVersionId": "6b09fa95-6a86-4ed6-b814-46df1cc1f39f",
    "analysisType": "Content Extraction",
    "sourceTextType": "full-text",
    "extractionResults": {
        "fields": [],
        "object": {
            "CreditCardStatement": {
                "statementIdentifier": "TOSTM21000_169844_017 ED 34951",
                "accountIdentifier": "123456789",
                "statementDate": "2020-03-23",
                "issuingInstitutionName": "TD",
                "accountHolder": {
                    "firstName": "JOHN",
                    "lastName": "DOE",
                    "address": {
                        "lineAddress1": "408 Queen St W",
                        "cityName": "Toronto",
                        "provinceName": "ON",
                        "countryName": "Canada",
                        "postalCode": "M5V2A7"
                    }
                },
                "cardInfo": {
                    "fistSixDigits": 123412,
                    "lastFourDigits": 1234,
                    "networkType": "Other",
                    "productName": "TD CASH BACK CARD"
                },
                "transactions": [
                    {
                        "vendor": "MCDONALD'S #5772 004",
                        "vendorLocation": "ETOBICOKE",
                        "type": "Debit",
                        "transactionDate": "2020-02-23",
                        "amount": 105
                    },
                    {
                        "vendor": "TIME HORTONS #6977",
                        "vendorLocation": "ETOBICOKE",
                        "type": "Debit",
                        "transactionDate": "2020-02-23",
                        "amount": 105
                    },
                    {
                        "vendor": "RITUAL- \"TUALAL FRESHI",
                        "vendorLocation": "DOWNTOWN TOR",
                        "type": "Debit",
                        "transactionDate": "2020-02-23",
                        "amount": 8.53
                    },
                    .... content truncated...
                ],
                "paymentInfo": {
                    "lastPaymentAmount": 0,
                    "minimumPaymentAmount": 10
                },
                "balanceInfo": {
                    "lastBalanceAmount": 193.33,
                    "currentBalanceAmount": 458.29,
                    "feeAmount": 0,
                    "interestAmount": 0
                }
            }
        },
       .... content truncated...
    },
    "branch": {
        "id": "<your branch id>"
    },
    "version": {
        "id": "the current version id",
        "parentVersionId": "the parent version id",
        "createdDateTime": "2025-04-23 12:56:35"
    }
}

We now have structured JSON ready for application consumption, analytics or other processes.
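
As one example of downstream consumption, this minimal Python sketch pulls a few values out of an Analysis Result shaped like the one above. `summarize` is a hypothetical helper, and the sample passed to it is trimmed to the three transactions visible in the truncated response, so the total is illustrative rather than the real statement total. Amounts are summed as `Decimal` to avoid floating-point drift.

```python
from decimal import Decimal


def summarize(result):
    """Return holder name, transaction count and total from a result."""
    statement = result["extractionResults"]["object"]["CreditCardStatement"]
    holder = statement["accountHolder"]
    transactions = statement["transactions"]
    total = sum((Decimal(str(t["amount"])) for t in transactions), Decimal("0"))
    return {
        "holder": f'{holder["firstName"]} {holder["lastName"]}',
        "transactionCount": len(transactions),
        "transactionTotal": total,
    }


sample_result = {
    "extractionResults": {
        "object": {
            "CreditCardStatement": {
                "accountHolder": {"firstName": "JOHN", "lastName": "DOE"},
                "transactions": [
                    {"vendor": "MCDONALD'S #5772 004", "amount": 105},
                    {"vendor": "TIME HORTONS #6977", "amount": 105},
                    {"vendor": "RITUAL- \"TUALAL FRESHI", "amount": 8.53},
                ],
            }
        }
    }
}

print(summarize(sample_result))
```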

📈 What’s Next?

Go check out the Content API on RapidAPI and give the extraction capability a try!
