Radiology Reports Pipeline

tooling · NLP · IBD · de-identification · infrastructure
A reproducible pipeline for parsing, de-identifying, and structuring ~9,800 free-text abdominal radiology reports to support longitudinal IBD research.
Published June 10, 2026

Building the infrastructure that makes longitudinal questions askable.

The problem

In IBD, imaging is often sparse by necessity. A patient may accumulate three or four MR enterographies across six years, each reported at a different stage of their disease, each sitting in a folder as a free-text Word document. The clinical information is there: bowel wall measurements, activity descriptors, structural findings noted and then effectively lost to follow-up. Whether a patient’s bowel is quietly accumulating damage between flares is a real and unanswered clinical question, and much of the data needed to approach it already exists. It is simply not in a form that research can use.

That gap, between data that exists and data that is reachable, is the kind of problem I find difficult to leave alone.

The corpus

The dataset is approximately 9,800 abdominal radiology reports from a single institution, spanning nearly a decade. Modalities include MR enterography for IBD, including Crohn’s disease, MRCP for primary sclerosing cholangitis, liver and biliary MRI, rectal cancer staging, and gynaecological pelvic MRI. Each report is stored as its own Word document. Reports follow a consistent free-text structure: modality, clinical information, optional prior-imaging reference, findings, and impression.
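To make that report structure concrete, here is a minimal sketch of splitting one report's text into its sections. It assumes each section opens with a heading line such as "Findings:" and that the text has already been pulled out of the Word document as plain lines. The heading labels, the `unparsed` bucket, and the function names are hypothetical illustrations, not the pipeline's actual rules.

```python
import re

# Hypothetical heading labels; the real reports may word these differently.
SECTION_HEADINGS = {
    "modality": r"modality|technique",
    "clinical_information": r"clinical (?:information|data|history)",
    "prior_imaging": r"comparison|prior imaging",
    "findings": r"findings",
    "impression": r"impression|conclusion",
}

HEADING_RE = re.compile(
    r"^\s*(?P<label>" + "|".join(SECTION_HEADINGS.values()) + r")\s*:\s*(?P<rest>.*)$",
    re.IGNORECASE,
)


def canonical_key(label):
    """Map a matched heading label back to its section key."""
    for key, pattern in SECTION_HEADINGS.items():
        if re.fullmatch(pattern, label, re.IGNORECASE):
            return key
    return "unparsed"


def split_sections(text):
    """Return {section_key: text}. Optional sections are simply absent."""
    sections = {}
    current = "unparsed"  # lines before the first heading are kept, not dropped
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            current = canonical_key(match.group("label"))
            line = match.group("rest")  # text on the same line as the heading
        if line.strip():
            sections.setdefault(current, []).append(line.strip())
    return {key: " ".join(parts) for key, parts in sections.items()}


report = (
    "Modality: MR enterography\n"
    "Clinical information: Known Crohn's disease.\n"
    "Findings:\n"
    "Wall thickening of the terminal ileum.\n"
    "No abscess.\n"
    "Impression: Active ileitis."
)
sections = split_sections(report)
```

Because the prior-imaging reference is optional, the sketch leaves that key out of the result rather than emitting an empty string, so a downstream step can tell "absent" from "present but blank".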

The clinical questions this dataset is positioned to answer, including longitudinal reconstruction of structural damage in IBD, require first building the infrastructure that makes those questions answerable. That is what this pipeline is.

Pipeline

[Figure: Radiology reports pipeline. Six phases: Phase 0, metadata capture (archive and timestamps); Phase 1, document parsing into structured sections; Phase 1b, repeat-patient clustering; Phase 2, de-identification (HIPAA Safe Harbor); Phase 3, classification by modality and disease; Phases 4–5, audit and longitudinal linkage (longitudinal table). Status: Phases 0–2 complete or in progress; Phases 3–5 designed and pending. Design commitments: every claim traceable to source; uncertainty flagged, not resolved; patient data never in repository.]

Where the difficulty was

Before any extraction or modelling, the corpus had to be parsed, de-identified, and structured so that every downstream claim is traceable to its source sentence. That is not the intellectually exciting part of the work. It is, however, the part that determines whether the exciting part is trustworthy.

De-identification turned out to be harder than anticipated, for reasons specific to this corpus. The reports span nearly a decade, and Persian names transliterated into English accumulate variation over time: different spellings, different name ordering, entries that are consistent within one reporting period but inconsistent from one period to the next. A generic named-entity approach would not hold up. Instead, the pipeline matches each report only against name tokens derived from its own filename, after a normalisation step that strips modality, anatomy, and date fragments from the filename before any matching begins.
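The filename-scoped matching described above might look roughly like the sketch below. It assumes filenames of the form "Ali Rezaei MRE abdomen 2019-03-14.docx" (an invented example). The vocabularies, the `[NAME]` placeholder, and both function names are hypothetical. The sketch does exact, case-insensitive token matching only and does not attempt to handle spelling variants.

```python
import re
from pathlib import Path

# Hypothetical vocabularies; a real list would be derived from the corpus.
MODALITY_TERMS = {"mr", "mri", "mre", "mrcp", "ct", "enterography"}
ANATOMY_TERMS = {"abdomen", "abdominal", "pelvis", "pelvic", "liver", "rectum", "rectal"}


def name_tokens_from_filename(filename):
    """Normalise a filename and return the tokens assumed to be name parts."""
    stem = Path(filename).stem.lower()
    # Splitting on non-letters also discards digits, so date fragments vanish.
    candidates = re.split(r"[^a-z]+", stem)
    stop_words = MODALITY_TERMS | ANATOMY_TERMS
    return {tok for tok in candidates if len(tok) >= 3 and tok not in stop_words}


def redact_names(report_text, filename, placeholder="[NAME]"):
    """Redact only the name tokens taken from this report's own filename."""
    tokens = name_tokens_from_filename(filename)
    if not tokens:
        return report_text
    # Longest tokens first so a longer name is never shadowed by a shorter one.
    alternatives = sorted((re.escape(tok) for tok in tokens), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)
    return pattern.sub(placeholder, report_text)


filename = "Ali Rezaei MRE abdomen 2019-03-14.docx"
redacted = redact_names("Patient: Ali REZAEI. Terminal ileum wall 6 mm.", filename)
```

Scoping the token set to a single filename keeps the match list tiny, which limits the chance of a name token colliding with ordinary clinical vocabulary elsewhere in the corpus.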

What complicated this further is that numbers in radiology reports play several different roles. A date fragment, a bowel wall measurement, and a reference to the interval since a prior study are all numeric, and all appear in close proximity. Redacting them indiscriminately would destroy clinical content. The de-identification works through ordered passes that distinguish these categories deliberately, flagging ambiguous cases for review rather than resolving them silently.
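One way to picture ordered passes is as a sequence of patterns in which earlier passes claim spans of text that later passes may not touch. The sketch below is an illustration under assumptions: the four categories, their regular expressions, the `[DATE]` placeholder, and the choice to leave ambiguous numbers in place while listing them for review are all hypothetical, not the pipeline's actual rules.

```python
import re

# Ordered passes: a span claimed by an earlier pass is off limits to later ones.
# All patterns are illustrative assumptions.
PASSES = [
    ("measurement", re.compile(r"\b\d+(?:\.\d+)?\s?(?:mm|cm)\b", re.IGNORECASE)),
    ("interval", re.compile(r"\b\d+\s(?:day|week|month|year)s?\b", re.IGNORECASE)),
    ("date", re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")),
    ("ambiguous", re.compile(r"\b\d+(?:\.\d+)?\b")),  # whatever is left over
]


def classify_numbers(text):
    """Return sorted (start, end, category) spans, one per numeric expression."""
    claimed = []
    for category, pattern in PASSES:
        for m in pattern.finditer(text):
            overlaps = any(m.start() < end and m.end() > start for start, end, _ in claimed)
            if not overlaps:
                claimed.append((m.start(), m.end(), category))
    return sorted(claimed)


def deidentify_numbers(text):
    """Replace dates, keep clinical numbers, and list ambiguous ones for review."""
    pieces, review, cursor = [], [], 0
    for start, end, category in classify_numbers(text):
        pieces.append(text[cursor:start])
        if category == "date":
            pieces.append("[DATE]")
        else:
            pieces.append(text[start:end])
            if category == "ambiguous":
                review.append(text[start:end])  # flagged, not silently resolved
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), review


sentence = "Compared with 12/03/2019, wall thickness 6 mm, stable over 8 months. Series 4."
cleaned, needs_review = deidentify_numbers(sentence)
```

Running the measurement and interval passes first is what protects "6 mm" and "8 months" from a broad date or digit pattern. The catch-all pass comes last so that only numbers no earlier pass could explain reach the review list.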

One design decision

When a radiologist writes “cannot exclude active inflammation,” that is not a positive finding. It is a hedge, and it belongs in a different category from “consistent with active inflammation.” Any extraction that scores them identically is clinically wrong in a way that matters for downstream research. The pipeline records hedged language as an explicit uncertainty flag and routes it for human review rather than assigning it a label.
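A minimal sketch of that distinction follows, assuming a simple cue-phrase lookup. The cue lists, the `FindingRecord` fields, and the "asserted" label are hypothetical. A real lexicon would need clinical review, and the pipeline's own representation may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative cue phrases only.
HEDGE_CUES = ("cannot exclude", "cannot be excluded", "not excluded", "may represent", "possibly")
ASSERTIVE_CUES = ("consistent with", "compatible with", "in keeping with")


@dataclass
class FindingRecord:
    sentence: str          # source sentence kept verbatim for traceability
    label: Optional[str]   # stays None when the sentence is hedged
    uncertain: bool        # explicit uncertainty flag
    needs_review: bool     # routed to a human rather than auto-labelled


def assess_sentence(sentence):
    lowered = sentence.lower()
    # Hedges are checked first, so a sentence mixing both kinds of cue stays uncertain.
    if any(cue in lowered for cue in HEDGE_CUES):
        return FindingRecord(sentence, None, uncertain=True, needs_review=True)
    if any(cue in lowered for cue in ASSERTIVE_CUES):
        return FindingRecord(sentence, "asserted", uncertain=False, needs_review=False)
    # No cue recognised: left unlabelled for downstream handling.
    return FindingRecord(sentence, None, uncertain=False, needs_review=False)


hedged = assess_sentence("Cannot exclude active inflammation.")
asserted = assess_sentence("Findings are consistent with active inflammation.")
```

The point of the record type is that a hedge never receives a label at all: the uncertainty travels with the source sentence until a person resolves it.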

That single distinction reflects a broader design commitment: a radiologist reading the outputs should find the uncertainty handling familiar. A system that collapsed ambiguous cases silently would not earn that trust, and should not.

Current state

Phases 0, 1, and 1b are complete on the full corpus. Phase 2 de-identification has been built and run; final validation is in progress. Phases 3 through 5 are designed and pending implementation. The research programme begins once the structured, de-identified corpus is locked.


What is next

Classification and longitudinal linkage are the next phases. What gets built on top of this foundation is worth describing once it is ready to be described properly.


Pipeline built in Python, with python-docx, pandas, and pytest. Phases 0–1b complete; Phase 2 in final validation.