
🩺 Privacy‑First Content Analysis: Summarizing a Medical Note After Anonymization
Electronic health records are goldmines of clinical insight—but they’re also brimming with protected health information (PHI).
With the Content API you can strip identifiers, keep context, and still run downstream AI tasks in a single pipeline.
Goal:
- Upload a medical note.
- Anonymize PHI (names, dates, MRNs, addresses).
- Summarize the sanitized note into a clinical overview.
📄 Sample Note (Before)
Here’s a snapshot of a fake clinical note before we proceed with the analysis.

1) Define Text Extraction Process
As a first step, let's define how we want to anonymize the clinical note's content. To do so, we'll use a new Analysis Definition of type Content Scraping.
POST /analysis/definitions
{
  "analysisType": "Content Scraping",
  "title": "Clinical Note Scraping",
  "objective": "",
  "scrapingDefinition": {
    "textStorage": [
      "anonymized-filtered-text",
      "anonymized-filtered-segments"
    ],
    "knowledgeStorage": true,
    "filters": [
      {
        "type": "stopwords",
        "instructions": ""
      },
      {
        "type": "llm",
        "instructions": "Remove all medical record identifiers"
      }
    ],
    "redaction": {
      "entityTypes": ["PERSON"],
      "validationMethod": "none",
      "detectionProvider": "presidio"
    }
  }
}
This configuration sets up an automated “content scraping” job that pulls text from stored clinical notes, anonymizes it, and removes identifiers before the data are used elsewhere.
Section | What it does |
---|---|
analysisType | Declares the job is a Content Scraping workflow. |
title | Friendly name: “Clinical Note Scraping.” |
objective | Empty for now – you could add a human‑readable purpose later. |
scrapingDefinition | Core settings for how the text is gathered and de‑identified. |
• textStorage | Specifies two text stores, anonymized-filtered-text and anonymized-filtered-segments, as the destinations for the de-identified text produced by the pipeline. |
• knowledgeStorage | true means the pipeline will also save structured facts/features to a separate knowledge base, which we'll use for ad-hoc inquiries against the content later in the post. |
• filters | Two successive preprocessing steps: 1. a stop-words filter that removes words irrelevant from a processing point of view (this also reduces the cost of downstream LLM tasks by cutting the number of input tokens); 2. an LLM filter instructed to "Remove all medical record identifiers." |
• redaction | Additional PII scrubbing using Microsoft Presidio: entityTypes redacts all entities tagged as PERSON; validationMethod is none (no human or LLM QA step); detectionProvider is presidio, whose NLP model performs the entity detection. |
In short: the pipeline ingests the clinical note text, removes stop-words, passes the content through an LLM to strip any medical-record identifiers, and then runs Presidio to redact person names, producing fully anonymized text stored in the two specified repositories along with structured knowledge persisted for later inquiries.
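Although the creation response isn't shown here, the call presumably returns the new definition's identifier, which is the value we'll pass as scraping-definition-id when uploading content in the next step. A minimal sketch of that response, with an illustrative id and assumed shape:
{
  "id": "3d2f8c41-5b7a-4e9d-9c1a-2f6e8b0d4a17",
  "analysisType": "Content Scraping",
  "title": "Clinical Note Scraping"
}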
2) Create & Upload the Content
Now let's create the content object in the platform, then upload the document so it can go through the OCR process and be anonymized.
Creating the Content object
POST /contents
{
  "contentType": "Clinical Note"
}
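The response (not shown in full) includes the generated content identifier, which is the {contentId} we'll use in the calls that follow. A minimal sketch, assuming the response echoes the new object:
{
  "contentId": "ce5e3aa6-c04a-4519-846e-4eeb4143b905",
  "contentType": "Clinical Note",
  "publishingStatus": "Draft"
}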
Now it's time to provide the actual content so it can be anonymized and then used for our summarization analysis. Since in our case we don't want the sensitive original data to be stored in the platform, we need to apply the scraping definition at upload time. By doing it this way, the following happens:
- The document is uploaded and kept in memory
- From memory, it is passed to the OCR engine, which returns the extracted text; that text, too, is stored only in memory
- The in-memory text then goes through the anonymization and filtering processes within the service
- As a final step, the resulting anonymized text is persisted from memory into blob storage
POST /contents/{contentId}/upload?scraping-definition-id={scrapingDefinitionId}
file: <the file>
fileName: clinical_note_sample_1.pdf
If we inspect the response below (truncated to keep the article lighter), we can note the following:
- The URL of the original document is now blank: the original was deleted from the platform because the scraping definition specified it should not be stored
- Only anonymized versions of the text segments and text are now stored in blob storage for future usage
{
  "id": "abca8b6e-14d4-468a-b08c-0f035b7a44f1",
  "contentId": "ce5e3aa6-c04a-4519-846e-4eeb4143b905",
  "contentType": "Clinical Note",
  "publishingStatus": "Draft",
  "scope": [],
  "url": "",
  ....truncated....
  "texts": [
    {
      "title": "",
      "source": "",
      "fileUrl": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/7526210e-7217-4052-a2ae-0e7ac867fb03/ce5e3aa6-c04a-4519-846e-4eeb4143b905/6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized-segments.txt",
      "contentType": "Clinical Note",
      "structureType": "anonymized-filtered-segments",
      "filters": []
    },
    {
      "title": "",
      "source": "",
      "fileUrl": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/7526210e-7217-4052-a2ae-0e7ac867fb03/ce5e3aa6-c04a-4519-846e-4eeb4143b905/6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized.txt",
      "contentType": "Clinical Note",
      "structureType": "anonymized-filtered-text",
      "filters": []
    }
  ],
  "segments": [],
  "branch": {
    "id": "7526210e-7217-4052-a2ae-0e7ac867fb03"
  },
  "version": {
    "id": "abca8b6e-14d4-468a-b08c-0f035b7a44f1",
    "parentVersionId": "6ca04896-efb0-42c4-8ae0-7ec369475d5d",
    "createdDateTime": "2025-04-24T12:33:37.252848Z"
  }
}
3) Viewing Anonymized Extracted Text
To retrieve the anonymized text, you first need to request its download URL from blob storage. This generates a URL that is only valid for a short period of time, sufficient to download the document.
GET /contents/{{cn_content_id}}/download-url?asset-type=anonymized-text
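The response shape isn't reproduced here; assuming the service returns a short-lived SAS link to the blob, it would look something like this sketch (path and token elided):
{
  "url": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/.../6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized.txt?sv=...&sig=..."
}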
With the download URL in hand, you can now download the text file that has been anonymized:
Patient :
Date Service : 24 Apr 2025
Location : [ Clinic / Telehealth ]
# SUBJECTIVE
· Chief Complaint : " Congestion facial pressure 5 days . "
· HPI :
  o Onset 5 days ago upper-respiratory-type cold .
  o Symptoms : nasal congestion , thick yellow rhinorrhea , maxillary facial pressure , mild frontal headache , hyposmia , post-nasal drip causing throat clearing , fatigue .
  o fever > 38 ℃ , visual changes , periorbital swelling , tooth pain , altered mental status .
4) Run Summarization on the Sanitized Text
We can now proceed with the summarization task. Note that in the analysis definition for that summarization task, we specify through the sourceTextType property that only the anonymized text should be used as the basis of the analysis.
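For illustration, here's a hypothetical sketch of such a definition. Aside from the sourceTextType property pointing at anonymized-filtered-text, the field names and values below are assumptions rather than the platform's confirmed schema:
POST /analysis/definitions
{
  "analysisType": "Summarization",
  "title": "Clinical Note Summarization",
  "objective": "Produce a concise clinical overview of the note",
  "sourceTextType": "anonymized-filtered-text"
}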
Let’s trigger the content analysis through the following API call:
POST /contents/{{content_id}}/analysis/process?definition-id={{cn_review_summarization_definition_id}}
You can also see in the result which text type was used for the summarization task by observing the contentReferences property:

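The result itself isn't reproduced in full, but an illustrative excerpt of that property would look something like this (shape simplified; the reference points back to the anonymized text asset):
{
  ....truncated....
  "contentReferences": [
    {
      "structureType": "anonymized-filtered-text",
      "fileUrl": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/7526210e-7217-4052-a2ae-0e7ac867fb03/ce5e3aa6-c04a-4519-846e-4eeb4143b905/6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized.txt"
    }
  ]
}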
Notice that placeholders are not echoed in the summary—the model naturally omits them or uses generic language.
You might remember that when we created the scraping definition, we enabled knowledge storage. This means embeddings were created for the content, and they can now be used for retrieval-augmented generation (RAG) to answer ad-hoc questions.
In the following example, we’ll ask a question against the content to see if PII was properly anonymized before it was persisted to the knowledge store:
POST /contents/{contentId}/inquire
{
  "query": "What was the patient name?",
  "retrievalTasks": []
}
This will result in the following response:
{
  "query": "What was the patient name?",
  "answer": "The patient name is anonymized and not provided in the content.",
  "reasoning": "The source content includes a placeholder for the patient's name (<PERSON>), indicating that the actual name is not disclosed for privacy reasons.",
  "references": [
    {
      "extract": "Patient Name: <PERSON>",
      "contentType": "Clinical Note",
      "source": "/contents/ce5e3aa6-c04a-4519-846e-4eeb4143b905?segment-start=-1&segment-length=153"
    }
  ]
}
As you can see in the response, the name of the patient was anonymized in the source text and couldn’t be provided in the answer that was generated.
🔒 Why This Matters
Benefit | How the Content API Helps |
---|---|
Compliance | Built‑in anonymization or custom redaction rules. |
Security | Optional in‑place anonymization—original raw text never leaves secure storage. |
Continuity | Offsets & tokens preserve structure, so extractions, assertions, or annotations still align. |
Flexibility | After anonymization you can call any analysis method: summarization, review, assertion, generation, etc. |
🚀 Takeaways
- Anonymize first, analyze second: for most tasks, anonymization has little to no impact on model quality.
- One content object can keep multiple versions: raw, anonymized, enriched.
- Custom policies let you decide exactly what is redacted vs. retained.