Pull out structured data from reports¶

`Extractor` lets you turn free text into structured columns. This is useful for running statistics, tracking patterns, and comparing cases at scale.

Start by defining a feature model with pydantic. Each feature attribute should contain three pieces of information: the variable name (e.g. `"age"`), its type (e.g. `int`), and a short description to aid the LLM in extracting the desired information:
```python
from pydantic import BaseModel, Field
from pfd_toolkit import load_reports, LLM, Extractor

# Define feature model with pydantic
class MyFeatures(BaseModel):
    age: int = Field(description="The age of the deceased")
    sex: str = Field(description="The sex of the deceased. You may infer sex from pronouns (e.g. 'He', 'Her', etc.)")
    cause_of_death: str = Field(description="A one-sentence summary of the cause of death")
```
Note

As per the example above, your `BaseModel` subclass must specify each of the field types. Use `int` for numbers, `str` for strings, and `bool` for binary values.

`Extractor` accepts any valid `BaseModel` configuration. For more customisation, please read Pydantic's documentation.
Next, load some report data and set up your LLM client. Then pass the feature model, the reports, and the LLM client to an `Extractor` instance and call `extract_features()`:
```python
reports = load_reports(start_date="2024-01-01", end_date="2024-12-31")
llm_client = LLM(api_key="YOUR-API-KEY")

extractor = Extractor(
    reports=reports,
    llm=llm_client
)

result_df = extractor.extract_features(
    feature_model=MyFeatures,
    allow_multiple=True,
    force_assign=False
)
```
`result_df` now contains the new `age`, `sex`, and `cause_of_death` columns.
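Because the extracted features come back as DataFrame columns, they drop straight into ordinary pandas analysis. A minimal sketch, with made-up values standing in for real extractor output:

```python
import pandas as pd

# Illustrative stand-in for the extractor's output
df = pd.DataFrame({
    "age": [54, 71, 40],
    "sex": ["Male", "Female", "Male"],
    "cause_of_death": ["...", "...", "..."],
})

print(df["age"].mean())           # average age across reports
print(df["sex"].value_counts())   # case counts per category
```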
Note

Where the model is unable to extract a given piece of structured data, it will output a missing value. Setting `force_assign` to `True` forces the model to output a value for each feature, even if it cannot be found. In general, this is only recommended if you are working with binary values (of type `bool`).
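For example, a sketch using hypothetical `bool` features (the field names here are illustrative, and the extraction call assumes the `extractor` set up above):

```python
from pydantic import BaseModel, Field

# Hypothetical binary features -- with force_assign=True the model must
# commit to True or False for every report, which degrades more
# gracefully than forcing a free-text or numeric guess.
class BinaryFeatures(BaseModel):
    medication_involved: bool = Field(description="Whether medication was involved in the death")
    death_in_hospital: bool = Field(description="Whether the death occurred in a hospital")

# Re-using the extractor created earlier:
# result_df = extractor.extract_features(
#     feature_model=BinaryFeatures,
#     force_assign=True,
# )
```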
Choosing which sections the LLM reads¶

`Extractor` lets you decide exactly which parts of the report are presented to the model. Each `include_*` flag mirrors one of the columns loaded by `load_reports`. Turning fields off reduces the amount of text sent to the LLM, which often speeds up requests and lowers token usage.
```python
extractor = Extractor(
    llm=llm_client,
    reports=reports,
    include_investigation=True,
    include_circumstances=True,
    include_concerns=False  # Skip coroner's concerns if not relevant
)
```
In this example, only the investigation and circumstances sections are provided to the LLM; the coroner's concerns are omitted entirely. Limiting the excerpt like this may improve accuracy and reduce token costs. However, be careful not to turn off a report section that could contain information relevant to one of your features.
Re-run the extraction¶

By default, the `Extractor` class won't run on the same data with the same configuration twice. If you want to start fresh, call `reset()` to clear cached feature values, and chain it into a new `extract_features()` call:
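A sketch of that chained call, assuming `reset()` returns the `Extractor` (so the calls compose) and reusing the `extractor` and `MyFeatures` defined earlier:

```python
# Clear cached feature values, then extract again from scratch
result_df = extractor.reset().extract_features(
    feature_model=MyFeatures
)
```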