Extractor

Extract custom features from Prevention of Future Death reports using an LLM.

Parameters:

Name Type Description Default
llm LLM

Instance of the LLM helper used for prompting.

required
reports DataFrame

DataFrame of PFD reports. When provided it is copied and stored on the instance. Defaults to None.

None
include_date bool

Include the date column in prompts. Defaults to False.

False
include_coroner bool

Include the coroner column in prompts. Defaults to False.

False
include_area bool

Include the area column in prompts. Defaults to False.

False
include_receiver bool

Include the receiver column in prompts. Defaults to False.

False
include_investigation bool

Include the investigation column in prompts. Defaults to True.

True
include_circumstances bool

Include the circumstances column in prompts. Defaults to True.

True
include_concerns bool

Include the concerns column in prompts. Defaults to True.

True
verbose bool

Emit extra logging when True. Defaults to False.

False
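As a rough illustration of how the include_* flags could shape a prompt, the sketch below assembles the text for one report from only the enabled fields. The helper build_report_text, the "Field: value" layout, and the dictionary-based report with made-up sample values are assumptions for illustration; the toolkit's actual prompt format is not documented here.

```python
# Defaults mirror the parameter table above.
DEFAULT_FLAGS = {
    "include_date": False,
    "include_coroner": False,
    "include_area": False,
    "include_receiver": False,
    "include_investigation": True,
    "include_circumstances": True,
    "include_concerns": True,
}


def build_report_text(report, **flags):
    """Join the enabled fields of one report into prompt text (hypothetical helper)."""
    unknown = set(flags) - set(DEFAULT_FLAGS)
    if unknown:
        raise TypeError(f"Unknown flags: {sorted(unknown)}")
    enabled = {**DEFAULT_FLAGS, **flags}
    lines = []
    for flag, is_on in enabled.items():
        field = flag[len("include_"):]  # "include_date" -> "date"
        if is_on and report.get(field):
            lines.append(f"{field.capitalize()}: {report[field]}")
    return "\n".join(lines)


# Made-up sample report, not real PFD data.
report = {
    "date": "2024-01-15",
    "coroner": "A. Example",
    "investigation": "Inquest opened.",
    "circumstances": "Patient fell.",
    "concerns": "No risk assessment.",
}
default_text = build_report_text(report)
with_date = build_report_text(report, include_date=True)
```

With the defaults, only the investigation, circumstances, and concerns fields reach the prompt; metadata such as the date is added only when its flag is switched on.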

discover_themes

discover_themes(
    *,
    warn_exceed=100000,
    error_exceed=500000,
    max_themes=None,
    min_themes=None,
    extra_instructions=None,
    seed_topics=None,
)

Use an LLM to automatically discover report themes.

The method expects summarise to have been run so that a summary column exists. All summaries are concatenated into one prompt sent to the LLM. The LLM should return a JSON object mapping theme names to descriptions. A new pydantic model is built from this mapping and stored as feature_model.
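The step from a JSON mapping to a model class can be sketched as follows. This is a dependency-free approximation, not the toolkit's implementation: it uses dataclasses.make_dataclass as a stand-in for the pydantic model, and it assumes each theme becomes a boolean field. The names theme_field_name, build_theme_model, and DiscoveredThemes are hypothetical.

```python
import json
import re
from dataclasses import field, fields, make_dataclass


def theme_field_name(theme):
    """Turn a free-text theme name into a valid Python identifier."""
    name = re.sub(r"\W+", "_", theme.strip().lower()).strip("_")
    return name if name and not name[0].isdigit() else f"theme_{name}"


def build_theme_model(llm_response):
    """Build a record class with one boolean field per discovered theme.

    Assumption: themes are modelled as bool flags; a dataclass stands in
    for the pydantic model described in the text.
    """
    themes = json.loads(llm_response)
    specs = [
        (theme_field_name(name), bool,
         field(default=False, metadata={"description": description}))
        for name, description in themes.items()
    ]
    return make_dataclass("DiscoveredThemes", specs)


# Made-up LLM response mapping theme names to descriptions.
response = json.dumps({
    "Staffing shortages": "Concerns about insufficient staffing levels.",
    "Record keeping": "Missing or inaccurate documentation.",
})
ThemeModel = build_theme_model(response)
field_names = [f.name for f in fields(ThemeModel)]
```

Each theme description is kept as field metadata so it could later be included in an extraction prompt.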

Parameters:

Name Type Description Default
warn_exceed int

Emit a warning if the estimated token count exceeds this value. Defaults to 100000.

100000
error_exceed int

Raise a ValueError if the estimated token count exceeds this value. Defaults to 500000.

500000
max_themes int or None

Instruct the LLM to identify no more than this number of themes when provided.

None
min_themes int or None

Instruct the LLM to identify at least this number of themes when provided.

None
extra_instructions str

Additional instructions appended to the theme discovery prompt.

None
seed_topics str | list[str] | BaseModel

Optional seed topics to include in the prompt. These are treated as starting suggestions and the model should incorporate them into a broader list of themes.

None

Returns:

Type Description
type[BaseModel]

The generated feature model containing discovered themes.
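The warn_exceed and error_exceed thresholds described above amount to a two-level guard on the estimated prompt size. A minimal sketch, assuming a hypothetical helper named check_prompt_size and strict "greater than" comparisons (the toolkit's exact comparison and messages may differ):

```python
import warnings


def check_prompt_size(estimated_tokens, warn_exceed=100_000, error_exceed=500_000):
    """Raise above error_exceed, warn above warn_exceed, otherwise pass silently."""
    if estimated_tokens > error_exceed:
        raise ValueError(
            f"Estimated {estimated_tokens} tokens exceeds error_exceed={error_exceed}."
        )
    if estimated_tokens > warn_exceed:
        warnings.warn(
            f"Estimated {estimated_tokens} tokens exceeds warn_exceed={warn_exceed}.",
            stacklevel=2,
        )
    return estimated_tokens


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_prompt_size(50_000)    # below both thresholds: silent
    check_prompt_size(150_000)   # above warn_exceed only: one warning
warning_count = len(caught)
```

Checking the error threshold first means a very large prompt fails immediately instead of emitting a warning and then failing.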

estimate_tokens

estimate_tokens(col_name=None, return_series=False)

Estimate token counts for all rows of a given column using the tiktoken library.

Parameters:

Name Type Description Default
col_name str

Name of the column whose token counts are estimated. Defaults to summary_col, the summary column created by summarise.

None
return_series bool

If True, return a pandas.Series of per-row token counts for the column; if False, return a single integer total. Defaults to False.

False

Returns:

Type Description
Union[int, Series]

If return_series is False, returns an int: the sum of token counts across all rows of the given column. If return_series is True, returns a pandas.Series of per-row token counts aligned to self.reports.
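The two return shapes can be sketched without pandas or tiktoken. In this illustration a plain list stands in for the Series, a whitespace split stands in for the real tokeniser, and missing values are assumed to count as zero tokens; all three are simplifications, so the numbers will not match tiktoken's.

```python
def estimate_tokens(texts, return_series=False, count_tokens=None):
    """Return per-row counts (as a list) or their total.

    count_tokens is a pluggable callable; the default whitespace split
    is only a stand-in for a real tokeniser.
    """
    count_tokens = count_tokens or (lambda text: len(text.split()))
    per_row = [count_tokens(t) if isinstance(t, str) else 0 for t in texts]
    return per_row if return_series else sum(per_row)


# Made-up summaries, including one missing value.
summaries = [
    "Delay in ambulance response.",
    "No risk assessment was completed.",
    None,
]
total = estimate_tokens(summaries)
per_row = estimate_tokens(summaries, return_series=True)
```

The per-row form is useful for spotting unusually long reports, while the total is what matters when all summaries are sent in a single prompt.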

export_cache

export_cache(path='extractor_cache.pkl')

Save the current cache to path.

Parameters:

Name Type Description Default
path str

Full path to the cache file including the filename. If path is a directory, extractor_cache.pkl will be created inside it.

'extractor_cache.pkl'

Returns:

Type Description
str

The path to the written cache file.
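The path handling described above, where a directory path gets extractor_cache.pkl appended, can be sketched as below, together with the matching load step. This assumes the cache is a picklable dictionary (suggested by the .pkl extension, but not confirmed here); resolve_cache_path and the sample cache contents are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

DEFAULT_NAME = "extractor_cache.pkl"


def resolve_cache_path(path=DEFAULT_NAME):
    """Append the default filename when path points at an existing directory."""
    p = Path(path)
    return p / DEFAULT_NAME if p.is_dir() else p


def export_cache(cache, path=DEFAULT_NAME):
    """Pickle the cache and return the path that was written, as a string."""
    target = resolve_cache_path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as fh:
        pickle.dump(cache, fh)
    return str(target)


def import_cache(path=DEFAULT_NAME):
    """Load a previously exported cache."""
    with resolve_cache_path(path).open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # Passing a directory: the default filename is created inside it.
    written = export_cache({"report-1": {"theme": True}}, tmp)
    restored = import_cache(tmp)
```

As with any pickle file, a cache should only be loaded from a source you trust.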

extract_features

extract_features(
    reports=None,
    *,
    feature_model=None,
    produce_spans=False,
    drop_spans=False,
    force_assign=False,
    allow_multiple=False,
    schema_detail="minimal",
    extra_instructions=None,
    skip_if_present=True,
)

Run feature extraction for the given reports.

Parameters:

Name Type Description Default
reports DataFrame

DataFrame of reports to process. Defaults to the instance's stored reports if omitted.

None
feature_model type[BaseModel]

Pydantic model describing the features to extract. Must be provided on first call or after calling discover_themes.

None
produce_spans bool

When True, create spans_ versions of each feature to capture the supporting text snippets. Defaults to False.

False
drop_spans bool

When True and produce_spans is also True, remove all spans_ columns from the returned DataFrame after extraction. If produce_spans is False a warning is emitted and no columns are dropped. Defaults to False.

False
force_assign bool

When True, the LLM is instructed to avoid returning GeneralConfig.NOT_FOUND_TEXT for any feature. Defaults to False.

False
allow_multiple bool

Allow a report to be assigned to multiple categories when True.

False
schema_detail ('full', 'minimal')

Level of detail for the feature schema included in the prompt.

"minimal"
extra_instructions str

Additional instructions injected into each prompt before the schema.

None
skip_if_present bool

When True (default), skip any row in which at least one feature column already holds a non-missing value other than GeneralConfig.NOT_FOUND_TEXT. Such a row is assumed to have been processed previously and recorded in the instance's cache (Extractor.cache).

True
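The skip_if_present rule can be expressed as a small predicate over a row. The sketch below is an approximation under stated assumptions: rows are plain dictionaries, "missing" means None or NaN, and the NOT_FOUND_TEXT value shown is a hypothetical stand-in for GeneralConfig.NOT_FOUND_TEXT, whose real value is not given here.

```python
import math

NOT_FOUND_TEXT = "N/A: Not found"  # hypothetical stand-in value


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def should_skip(row, feature_cols, not_found=NOT_FOUND_TEXT):
    """True when any feature column already holds a real extracted value."""
    return any(
        not is_missing(row.get(col)) and row.get(col) != not_found
        for col in feature_cols
    )


rows = [
    {"age": 42, "setting": NOT_FOUND_TEXT},    # one real value: skip
    {"age": None, "setting": NOT_FOUND_TEXT},  # nothing extracted: process
    {"age": float("nan"), "setting": None},    # all missing: process
]
decisions = [should_skip(row, ["age", "setting"]) for row in rows]
```

Note that a single populated feature is enough to skip the row, so a partially extracted row is not retried unless skip_if_present is set to False.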

import_cache

import_cache(path='extractor_cache.pkl')

Load cache from path.

Parameters:

Name Type Description Default
path str

Full path to the cache file including the filename. If path is a directory, extractor_cache.pkl will be loaded from inside it.

'extractor_cache.pkl'

reset

reset()

Reset internal caches and intermediate state.

This clears any cached feature extraction results and token estimations so that extract_features can be run again on the same reports. The instance itself is returned to allow method chaining, e.g. extractor.reset().extract_features().
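The chaining pattern works because reset returns the instance itself. A toy class shows the idea; MiniExtractor and its attributes are hypothetical and do not reflect the real Extractor's internals.

```python
class MiniExtractor:
    """Toy stand-in that only demonstrates reset() returning self."""

    def __init__(self):
        self.cache = {}
        self.token_cache = {}

    def extract_features(self):
        # Pretend to extract: populate the cache and return a copy of it.
        self.cache.setdefault("report-1", {"theme": True})
        return dict(self.cache)

    def reset(self):
        # Clear cached state, then return self so calls can be chained.
        self.cache.clear()
        self.token_cache.clear()
        return self


extractor = MiniExtractor()
extractor.extract_features()
result = extractor.reset().extract_features()
```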

summarise

summarise(
    result_col_name="summary",
    trim_intensity="medium",
    extra_instructions=None,
)

Summarise selected report fields into one column using the LLM.

Parameters:

Name Type Description Default
result_col_name str

Name of the summary column. Defaults to "summary".

'summary'
trim_intensity ('low', 'medium', 'high', 'very high')

Controls how concise the summary should be. Defaults to "medium".

"medium"
extra_instructions str

Additional instructions to append to the prompt before the report excerpt.

None

Returns:

Type Description
DataFrame

A new DataFrame identical to the one provided at initialisation with an extra summary column.

tabulate

tabulate(
    columns=None,
    labels=None,
    *,
    count_col="Count",
    pct_col="Percentage",
    df=None,
)

Return a simple frequency table for extracted feature columns.

Parameters:

Name Type Description Default
columns str or list[str]

Column name or list of column names to summarise. Defaults to all feature columns added by extract_features (excluding any spans_ columns).

None
labels str or list[str]

Human friendly label or list of labels corresponding to columns. If omitted, column names are used.

None
count_col str

Name of the count column in the output DataFrame. Defaults to "Count".

'Count'
pct_col str

Name of the percentage column in the output DataFrame. Defaults to "Percentage".

'Percentage'
df DataFrame

DataFrame containing the columns to tabulate. Defaults to the reports stored on the instance.

None

Returns:

Type Description
DataFrame

A DataFrame summarising the frequencies of the specified columns.

Raises:

Type Description
RuntimeError

If extract_features has not been run yet.
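The kind of frequency table tabulate produces can be approximated in plain Python. This sketch assumes boolean feature columns, counts the rows where a feature is True, and reports the percentage of all rows; the function name frequency_table, the "Category" label column, and the rounding are assumptions, and the real method returns a pandas DataFrame.

```python
def frequency_table(rows, columns, labels=None,
                    count_col="Count", pct_col="Percentage"):
    """Count True values per column and express them as a share of all rows."""
    labels = labels or columns
    if len(labels) != len(columns):
        raise ValueError("labels must match columns in length")
    total = len(rows)
    table = []
    for col, label in zip(columns, labels):
        count = sum(1 for row in rows if row.get(col) is True)
        pct = round(100 * count / total, 2) if total else 0.0
        table.append({"Category": label, count_col: count, pct_col: pct})
    return table


# Made-up extracted features for four reports.
rows = [
    {"staffing": True, "records": False},
    {"staffing": True, "records": True},
    {"staffing": False, "records": False},
    {"staffing": False, "records": False},
]
table = frequency_table(
    rows,
    columns=["staffing", "records"],
    labels=["Staffing shortages", "Record keeping"],
)
```

Because a report may match several themes when allow_multiple is used, the percentages in such a table need not sum to 100.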