`Extractor`¶

Extract custom features from Prevention of Future Death reports using an LLM.

Parameters:

Name	Type	Description	Default
`llm`	`LLM`	Instance of the `LLM` helper used for prompting.	required
`reports`	`DataFrame`	DataFrame of PFD reports. When provided it is copied and stored on the instance. Defaults to `None`.	`None`
`include_date`	`bool`	Include the `date` column in prompts. Defaults to `False`.	`False`
`include_coroner`	`bool`	Include the `coroner` column in prompts. Defaults to `False`.	`False`
`include_area`	`bool`	Include the `area` column in prompts. Defaults to `False`.	`False`
`include_receiver`	`bool`	Include the `receiver` column in prompts. Defaults to `False`.	`False`
`include_investigation`	`bool`	Include the `investigation` column in prompts. Defaults to `True`.	`True`
`include_circumstances`	`bool`	Include the `circumstances` column in prompts. Defaults to `True`.	`True`
`include_concerns`	`bool`	Include the `concerns` column in prompts. Defaults to `True`.	`True`
`verbose`	`bool`	Emit extra logging when `True`. Defaults to `False`.	`False`

discover_themes ¶

discover_themes(
    *,
    warn_exceed=100000,
    error_exceed=500000,
    max_themes=None,
    min_themes=None,
    extra_instructions=None,
    seed_topics=None,
)

Use an LLM to automatically discover report themes.

The method expects `summarise` to have been run so that a summary column exists. All summaries are concatenated into one prompt sent to the LLM. The LLM should return a JSON object mapping theme names to descriptions. A new pydantic model is built from this mapping and stored as `feature_model`.

Parameters:

Name	Type	Description	Default
`warn_exceed`	`int`	Emit a warning if the estimated token count exceeds this value. Defaults to `100000`.	`100000`
`error_exceed`	`int`	Raise a `ValueError` if the estimated token count exceeds this value. Defaults to `500000`.	`500000`
`max_themes`	`int or None`	Instruct the LLM to identify no more than this number of themes when provided.	`None`
`min_themes`	`int or None`	Instruct the LLM to identify at least this number of themes when provided.	`None`
`extra_instructions`	`str`	Additional instructions appended to the theme discovery prompt.	`None`
`seed_topics`	`str \| list[str] \| BaseModel`	Optional seed topics to include in the prompt. These are treated as starting suggestions and the model should incorporate them into a broader list of themes.	`None`

Returns:

Type	Description
`type[BaseModel]`	The generated feature model containing discovered themes.
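
The interaction between `warn_exceed` and `error_exceed` can be illustrated with a minimal sketch. `check_token_budget` is a hypothetical helper written for this example, not part of the library; it only mirrors the documented behaviour (warn above one threshold, raise `ValueError` above the other).

```python
import warnings


def check_token_budget(estimated_tokens, warn_exceed=100000, error_exceed=500000):
    """Hypothetical helper mirroring the documented thresholds."""
    # The hard limit is checked first so an oversized prompt never just warns.
    if estimated_tokens > error_exceed:
        raise ValueError(
            f"Estimated {estimated_tokens} tokens exceeds error_exceed={error_exceed}."
        )
    if estimated_tokens > warn_exceed:
        warnings.warn(
            f"Estimated {estimated_tokens} tokens exceeds warn_exceed={warn_exceed}."
        )
    return estimated_tokens
```

Because every summary is concatenated into a single prompt, lowering `warn_exceed` may be a cheap way to get early notice before a large corpus is sent to the LLM.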

estimate_tokens ¶

estimate_tokens(col_name=None, return_series=False)

Estimate token counts for all rows of a given column using the tiktoken library.

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Name of the column containing report summaries. Defaults to `summary_col`, which is generated after running `summarise`.	`None`
`return_series`	`bool`	Return a `pandas.Series` of per-row token counts for that column if `True`, or a single integer total if `False`. Defaults to `False`.	`False`

Returns:

Type	Description
`Union[int, Series]`	If `return_series` is `False`, returns an `int` representing the total sum of all token counts across all rows for the provided field. If `return_series` is `True`, returns a `pandas.Series` of token counts aligned to `self.reports` for the provided field.
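
The two return shapes can be sketched without the library. This is an assumption-laden illustration: `estimate_tokens_sketch` is hypothetical, it works on a plain list of strings rather than a DataFrame column, and its default counter splits on whitespace as a crude stand-in for the tiktoken encoder the real method uses.

```python
def estimate_tokens_sketch(texts, return_series=False, count_tokens=None):
    """Hypothetical stand-in: per-row counts or their total."""
    # Whitespace splitting is only a placeholder for a real tokenizer.
    counter = count_tokens or (lambda text: len(text.split()))
    counts = [counter(text) for text in texts]
    # A list plays the role of the per-row Series; the sum is the integer total.
    return counts if return_series else sum(counts)


summaries = ["patient fell at home", "delay in ambulance response time"]
total = estimate_tokens_sketch(summaries)
per_row = estimate_tokens_sketch(summaries, return_series=True)
```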

export_cache ¶

export_cache(path='extractor_cache.pkl')

Save the current cache to path.

Parameters:

Name	Type	Description	Default
`path`	`str`	Full path to the cache file including the filename. If `path` is a directory, `extractor_cache.pkl` will be created inside it.	`'extractor_cache.pkl'`

Returns:

Type	Description
`str`	The path to the written cache file.
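
The directory-versus-file handling described for `path` could look roughly like the sketch below. The function names are hypothetical, the use of `pickle` is inferred only from the `.pkl` extension, and the cache is assumed to be a plain picklable object.

```python
import pickle
from pathlib import Path

DEFAULT_CACHE_NAME = "extractor_cache.pkl"


def resolve_cache_path(path=DEFAULT_CACHE_NAME):
    """Append the default filename when `path` points at a directory."""
    target = Path(path)
    if target.is_dir():
        target = target / DEFAULT_CACHE_NAME
    return target


def export_cache_sketch(cache, path=DEFAULT_CACHE_NAME):
    """Write the cache and return the path that was actually written."""
    target = resolve_cache_path(path)
    with target.open("wb") as fh:
        pickle.dump(cache, fh)
    return str(target)


def import_cache_sketch(path=DEFAULT_CACHE_NAME):
    """Load a cache written by export_cache_sketch."""
    with resolve_cache_path(path).open("rb") as fh:
        return pickle.load(fh)
```

Returning the resolved path lets callers log or reuse the exact file location even when they only supplied a directory.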

extract_features ¶

extract_features(
    reports=None,
    *,
    feature_model=None,
    produce_spans=False,
    drop_spans=False,
    force_assign=False,
    allow_multiple=False,
    schema_detail="minimal",
    extra_instructions=None,
    skip_if_present=True,
)

Run feature extraction for the given reports.

Parameters:

Name	Type	Description	Default
`reports`	`DataFrame`	DataFrame of reports to process. Defaults to the instance's stored reports if omitted.	`None`
`feature_model`	`type[BaseModel]`	Pydantic model describing the features to extract. Must be provided on first call or after calling `discover_themes`.	`None`
`produce_spans`	`bool`	When `True`, create `spans_` versions of each feature to capture the supporting text snippets. Defaults to `False`.	`False`
`drop_spans`	`bool`	When `True` and `produce_spans` is also `True`, remove all `spans_` columns from the returned DataFrame after extraction. If `produce_spans` is `False` a warning is emitted and no columns are dropped. Defaults to `False`.	`False`
`force_assign`	`bool`	When `True`, the LLM is instructed to avoid returning `GeneralConfig.NOT_FOUND_TEXT` for any feature.	`False`
`allow_multiple`	`bool`	Allow a report to be assigned to multiple categories when `True`.	`False`
`schema_detail`	`('full', 'minimal')`	Level of detail for the feature schema included in the prompt.	`"minimal"`
`extra_instructions`	`str`	Additional instructions injected into each prompt before the schema.	`None`
`skip_if_present`	`bool`	When `True` (default), skip rows where any feature column already holds a non-missing value that is not equal to `GeneralConfig.NOT_FOUND_TEXT`. This assumes the row was processed previously and is recorded in `Extractor.cache`.	`True`
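
The `skip_if_present` rule can be expressed as a small predicate. This is a sketch under stated assumptions: rows are plain dictionaries rather than DataFrame rows, `should_skip` and `is_missing` are hypothetical helpers, and the `NOT_FOUND_TEXT` value shown is a placeholder for whatever `GeneralConfig.NOT_FOUND_TEXT` actually holds.

```python
import math

# Placeholder value; the real constant is defined on GeneralConfig.
NOT_FOUND_TEXT = "N/A: Not found"


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def should_skip(row, feature_cols, not_found_text=NOT_FOUND_TEXT):
    """Skip when any feature column holds a real, previously extracted value."""
    return any(
        not is_missing(row.get(col)) and row.get(col) != not_found_text
        for col in feature_cols
    )
```

Under this reading, a row whose feature columns are all missing or all equal to the not-found marker is processed again, while a single populated feature is enough to skip it.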

import_cache ¶

import_cache(path='extractor_cache.pkl')

Load cache from path.

Parameters:

Name	Type	Description	Default
`path`	`str`	Full path to the cache file including the filename. If `path` is a directory, `extractor_cache.pkl` will be loaded from inside it.	`'extractor_cache.pkl'`

reset ¶

reset()

Reset internal caches and intermediate state.

This clears any cached feature extraction results and token estimations so that `extract_features` can be run again on the same reports. The instance itself is returned to allow method chaining, e.g. `extractor.reset().extract_features()`.
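
The chaining pattern relies only on `reset` returning the instance. The toy class below is a hypothetical stand-in, not the real `Extractor`; it shows how clearing state and returning `self` lets the next call follow on the same line.

```python
class ChainableCache:
    """Toy object with a clearable cache (not the real Extractor)."""

    def __init__(self):
        self.cache = {}

    def compute(self, key):
        # Cache the result so repeated calls with the same key are free.
        if key not in self.cache:
            self.cache[key] = key.upper()
        return self.cache[key]

    def reset(self):
        # Clear state, then return self so calls can be chained.
        self.cache.clear()
        return self


obj = ChainableCache()
obj.compute("a")
result = obj.reset().compute("b")
```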

summarise ¶

summarise(
    result_col_name="summary",
    trim_intensity="medium",
    extra_instructions=None,
)

Summarise selected report fields into one column using the LLM.

Parameters:

Name	Type	Description	Default
`result_col_name`	`str`	Name of the summary column. Defaults to `"summary"`.	`'summary'`
`trim_intensity`	`('low', 'medium', 'high', 'very high')`	Controls how concise the summary should be. Defaults to `"medium"`.	`"medium"`
`extra_instructions`	`str`	Additional instructions to append to the prompt before the report excerpt.	`None`

Returns:

Type	Description
`DataFrame`	A new DataFrame identical to the one provided at initialisation with an extra summary column.

tabulate ¶

tabulate(
    columns=None,
    labels=None,
    *,
    count_col="Count",
    pct_col="Percentage",
    df=None,
)

Return a simple frequency table for extracted feature columns.

Parameters:

Name	Type	Description	Default
`columns`	`str or list[str]`	Column name or list of column names to summarise. Defaults to all feature columns added by `extract_features` (excluding any `spans_` columns).	`None`
`labels`	`str or list[str]`	Human-friendly label or list of labels corresponding to `columns`. If omitted, column names are used.	`None`
`count_col`	`str`	Name of the count column in the output DataFrame. Defaults to `"Count"`.	`'Count'`
`pct_col`	`str`	Name of the percentage column in the output DataFrame. Defaults to `"Percentage"`.	`'Percentage'`
`df`	`DataFrame`	DataFrame containing the columns to tabulate. Defaults to the reports stored on the instance.	`None`

Returns:

Type	Description
`DataFrame`	A DataFrame summarising the frequencies of the specified columns.

Raises:

Type	Description
`RuntimeError`	If `extract_features` has not been run yet.
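
A frequency table of this kind can be sketched in plain Python. Several things here are assumptions rather than documented behaviour: feature columns are treated as boolean flags, the percentage is taken over all rows, rows are dictionaries instead of a DataFrame, and the `"Category"` key and the function name `tabulate_sketch` are hypothetical.

```python
def tabulate_sketch(rows, columns, labels=None, count_col="Count", pct_col="Percentage"):
    """Count True values per column and express them as a share of all rows."""
    labels = labels or columns
    total = len(rows)
    table = []
    for col, label in zip(columns, labels):
        count = sum(1 for row in rows if row.get(col) is True)
        pct = 100 * count / total if total else 0.0
        table.append({"Category": label, count_col: count, pct_col: pct})
    return table


rows = [
    {"falls": True, "sepsis": False},
    {"falls": True, "sepsis": True},
    {"falls": False, "sepsis": False},
    {"falls": True, "sepsis": False},
]
table = tabulate_sketch(rows, ["falls", "sepsis"], labels=["Falls", "Sepsis"])
```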
