
🩺 Privacy‑First Content Analysis: Summarizing a Medical Note After Anonymization
Electronic health records are goldmines of clinical insight—but they’re also brimming with protected health information (PHI).
With the Content API you can strip identifiers, keep context, and still run downstream AI tasks in a single pipeline.
Goal:
- Upload a medical note.
- Anonymize PHI (names, dates, MRNs, addresses).
- Summarize the sanitized note into a clinical overview.
📄 Sample Note (Before)
Here’s a snapshot of a fake clinical note before we proceed with the analysis.

1) Define Text Extraction Process
As a first step, let's define how we want to anonymize the clinical note's content. To do so, we'll use a new Analysis Definition of type Content Scraping.
POST /analysis/definitions
{
  "analysisType": "Content Scraping",
  "title": "Clinical Note Scraping",
  "objective": "",
  "scrapingDefinition": {
    "textStorage": [
      "anonymized-filtered-text",
      "anonymized-filtered-segments"
    ],
    "knowledgeStorage": true,
    "filters": [
      {
        "type": "stopwords",
        "instructions": ""
      },
      {
        "type": "llm",
        "instructions": "Remove all medical record identifiers"
      }
    ],
    "redaction": {
      "entityTypes": ["PERSON"],
      "validationMethod": "none",
      "detectionProvider": "presidio"
    }
  }
}
This configuration sets up an automated “content scraping” job that pulls text from stored clinical notes, anonymizes it, and removes identifiers before the data are used elsewhere.
Section | What it does |
---|---|
analysisType | Declares the job is a Content Scraping workflow. |
title | Friendly name: “Clinical Note Scraping.” |
objective | Empty for now – you could add a human‑readable purpose later. |
scrapingDefinition | Core settings for how the text is gathered and de‑identified. |
• textStorage | Specifies two text stores, anonymized-filtered-text and anonymized-filtered-segments, as the destinations for the de-identified text produced by the pipeline. |
• knowledgeStorage | true means the pipeline will also save structured facts/features to a separate knowledge base, which we'll use for ad-hoc inquiries against the content later in the post. |
• filters | Two successive preprocessing steps: 1. a stop-words filter that removes words irrelevant from a processing point of view (this also reduces the cost of downstream LLM tasks by cutting the number of input tokens); 2. an LLM filter instructed to "Remove all medical record identifiers." |
• redaction | Additional PII scrubbing using Microsoft Presidio: entityTypes redacts all entities tagged as PERSON; validationMethod is none (no human or LLM QA step); detectionProvider is presidio, whose NLP model performs the entity detection. |
In short: the pipeline ingests the clinical note text, removes stop-words, passes the content through an LLM to strip any medical-record identifiers, and then runs Presidio to redact person names, producing fully anonymized text stored in the two specified repositories along with structured knowledge persisted for later inquiries.
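Although the creation response isn't shown here, the call presumably returns the new definition's identifier, which is the value we'll pass as scraping-definition-id when uploading content in the next step. A minimal sketch of that response, with an illustrative id and assumed shape:
{
  "id": "3d2f8c41-5b7a-4e9d-9c1a-2f6e8b0d4a17",
  "analysisType": "Content Scraping",
  "title": "Clinical Note Scraping"
}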
2) Create & Upload the Content
Now let's create the content object in the platform, then upload the document so it can go through the OCR process and be anonymized.
Creating the Content object
POST /contents
{
  "contentType": "Clinical Note"
}
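The response (not shown in full) includes the generated content identifier, which is the {contentId} we'll use in the calls that follow. A minimal sketch, assuming the response echoes the new object:
{
  "contentId": "ce5e3aa6-c04a-4519-846e-4eeb4143b905",
  "contentType": "Clinical Note",
  "publishingStatus": "Draft"
}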
Now it's time to provide the actual content so it can be anonymized and then used for our summarization analysis. Since in our case we don't want the sensitive original data to be stored in the platform, we need to apply the scraping definition at upload time. By doing it this way, the following happens:
- The document is uploaded and kept in memory
- From memory, it is passed to the OCR engine, which returns the extracted text; that text, too, is stored only in memory
- The in-memory text then goes through the anonymization and filtering processes within the service
- As a final step, the resulting anonymized text is persisted from memory into blob storage
POST /contents/{contentId}/upload?scraping-definition-id={scrapingDefinitionId}
file: <the file>
fileName: clinical_note_sample_1.pdf
If we inspect the response below (truncated to keep the article lighter), we can note the following:
- The URL of the original document is now blank: the original was deleted from the platform because the scraping definition specified it should not be stored
- Only anonymized versions of the text segments and text are now stored in blob storage for future usage
{
  "id": "abca8b6e-14d4-468a-b08c-0f035b7a44f1",
  "contentId": "ce5e3aa6-c04a-4519-846e-4eeb4143b905",
  "contentType": "Clinical Note",
  "publishingStatus": "Draft",
  "scope": [],
  "url": "",
  ....truncated....
  "texts": [
    {
      "title": "",
      "source": "",
      "fileUrl": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/7526210e-7217-4052-a2ae-0e7ac867fb03/ce5e3aa6-c04a-4519-846e-4eeb4143b905/6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized-segments.txt",
      "contentType": "Clinical Note",
      "structureType": "anonymized-filtered-segments",
      "filters": []
    },
    {
      "title": "",
      "source": "",
      "fileUrl": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/7526210e-7217-4052-a2ae-0e7ac867fb03/ce5e3aa6-c04a-4519-846e-4eeb4143b905/6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized.txt",
      "contentType": "Clinical Note",
      "structureType": "anonymized-filtered-text",
      "filters": []
    }
  ],
  "segments": [],
  "branch": {
    "id": "7526210e-7217-4052-a2ae-0e7ac867fb03"
  },
  "version": {
    "id": "abca8b6e-14d4-468a-b08c-0f035b7a44f1",
    "parentVersionId": "6ca04896-efb0-42c4-8ae0-7ec369475d5d",
    "createdDateTime": "2025-04-24T12:33:37.252848Z"
  }
}
3) Viewing Anonymized Extracted Text
To retrieve the anonymized text, you first need to request its download URL from blob storage. This generates a URL that is only valid for a short period of time, sufficient to download the document.
GET /contents/{{cn_content_id}}/download-url?asset-type=anonymized-text
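The response shape isn't reproduced here; assuming the service returns a short-lived SAS link to the blob, it would look something like this sketch (path and token elided):
{
  "url": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/.../6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized.txt?sv=...&sig=..."
}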
With the download URL in hand, you can now download the text file that has been anonymized:
Patient :
Date Service : 24 Apr 2025
Location : [ Clinic / Telehealth ]
# SUBJECTIVE
· Chief Complaint : " Congestion facial pressure 5 days . "
· HPI :
  o Onset 5 days ago upper-respiratory-type cold .
  o Symptoms : nasal congestion , thick yellow rhinorrhea , maxillary facial pressure , mild frontal headache , hyposmia , post-nasal drip causing throat clearing , fatigue .
  o fever > 38 ℃ , visual changes , periorbital swelling , tooth pain , altered mental status .
4) Run Summarization on the Sanitized Text
We can now proceed with the summarization task. Note that in the analysis definition for that summarization task, we specify through the sourceTextType property that only the anonymized text should be used as the basis of the analysis.
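For illustration, here's a hypothetical sketch of such a definition. Aside from the sourceTextType property pointing at anonymized-filtered-text, the field names and values below are assumptions rather than the platform's confirmed schema:
POST /analysis/definitions
{
  "analysisType": "Summarization",
  "title": "Clinical Note Summarization",
  "objective": "Produce a concise clinical overview of the note",
  "sourceTextType": "anonymized-filtered-text"
}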
Let’s trigger the content analysis through the following API call:
POST /contents/{{content_id}}/analysis/process?definition-id={{cn_review_summarization_definition_id}}
You can also see in the result which text type was used for the summarization task by observing the contentReferences property:

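The result itself isn't reproduced in full, but an illustrative excerpt of that property would look something like this (shape simplified; the reference points back to the anonymized text asset):
{
  ....truncated....
  "contentReferences": [
    {
      "structureType": "anonymized-filtered-text",
      "fileUrl": "https://drcopvpvwebgencacn1.blob.core.windows.net/dretza-contents/7526210e-7217-4052-a2ae-0e7ac867fb03/ce5e3aa6-c04a-4519-846e-4eeb4143b905/6bdd0224-ab8e-4977-8c9f-67518c4fff25_anonymized.txt"
    }
  ]
}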
Notice that placeholders are not echoed in the summary—the model naturally omits them or uses generic language.
You might remember that when we created the scraping definition, we enabled knowledge storage. This means embeddings were created for the content, and they can now be used for retrieval-augmented generation (RAG) to answer ad-hoc questions.
In the following example, we’ll ask a question against the content to see if PII was properly anonymized before it was persisted to the knowledge store:
POST /contents/{contentId}/inquire
{
  "query": "What was the patient name?",
  "retrievalTasks": []
}
This will result in the following response:
{
  "query": "What was the patient name?",
  "answer": "The patient name is anonymized and not provided in the content.",
  "reasoning": "The source content includes a placeholder for the patient's name (<PERSON>), indicating that the actual name is not disclosed for privacy reasons.",
  "references": [
    {
      "extract": "Patient Name: <PERSON>",
      "contentType": "Clinical Note",
      "source": "/contents/ce5e3aa6-c04a-4519-846e-4eeb4143b905?segment-start=-1&segment-length=153"
    }
  ]
}
As you can see in the response, the name of the patient was anonymized in the source text and couldn’t be provided in the answer that was generated.
🔒 Why This Matters
Benefit | How the Content API Helps |
---|---|
Compliance | Built‑in anonymization or custom redaction rules. |
Security | Optional in‑place anonymization—original raw text never leaves secure storage. |
Continuity | Offsets & tokens preserve structure, so extractions, assertions, or annotations still align. |
Flexibility | After anonymization you can call any analysis method: summarization, review, assertion, generation, etc. |
🚀 Takeaways
- Anonymize first, analyze second: for most tasks, anonymization has little to no impact on model quality.
- One content object can keep multiple versions: raw, anonymized, enriched.
- Custom policies let you decide exactly what is redacted vs. retained.