Extractor
Extract custom features from Prevention of Future Deaths (PFD) reports using an LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
llm | LLM | Instance of the LLM class. | required |
reports | DataFrame | DataFrame of PFD reports. When provided, it is copied and stored on the instance. Defaults to None. | None |
include_date | bool | Include the date field. | False |
include_coroner | bool | Include the coroner field. | False |
include_area | bool | Include the area field. | False |
include_receiver | bool | Include the receiver field. | False |
include_investigation | bool | Include the investigation field. | True |
include_circumstances | bool | Include the circumstances field. | True |
include_concerns | bool | Include the concerns field. | True |
verbose | bool | Emit extra logging when True. | False |
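The include_* flags decide which report sections are passed on to the LLM. The following is a minimal sketch of that selection, not the library's implementation: it assumes reports are plain dictionaries, and both the field names and the build_report_text helper are hypothetical.

```python
# Hypothetical mapping from constructor flag to report field name.
# The real column names used by the library may differ.
FLAG_TO_FIELD = {
    "include_date": "date",
    "include_coroner": "coroner",
    "include_area": "area",
    "include_receiver": "receiver",
    "include_investigation": "investigation",
    "include_circumstances": "circumstances",
    "include_concerns": "concerns",
}

# Defaults as listed in the parameter table above.
DEFAULT_FLAGS = {
    "include_date": False,
    "include_coroner": False,
    "include_area": False,
    "include_receiver": False,
    "include_investigation": True,
    "include_circumstances": True,
    "include_concerns": True,
}


def build_report_text(report: dict, **flags: bool) -> str:
    """Join the selected, non-empty fields of one report into a text block."""
    unknown = set(flags) - set(DEFAULT_FLAGS)
    if unknown:
        raise TypeError(f"unknown flags: {sorted(unknown)}")
    active = {**DEFAULT_FLAGS, **flags}
    parts = []
    for flag, field_name in FLAG_TO_FIELD.items():
        value = report.get(field_name)
        if active[flag] and value:
            parts.append(f"{field_name}: {value}")
    return "\n".join(parts)
```

With the defaults, only the investigation, circumstances and concerns sections would reach the prompt; passing include_date=True would prepend the date.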
discover_themes
discover_themes(
*,
warn_exceed=100000,
error_exceed=500000,
max_themes=None,
min_themes=None,
extra_instructions=None,
seed_topics=None,
)
Use an LLM to automatically discover report themes.
The method expects summarise to have been run so that a summary column exists. All summaries are concatenated into one prompt sent to the LLM. The LLM should return a JSON object mapping theme names to descriptions. A new pydantic model is built from this mapping and stored as feature_model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
warn_exceed | int | Emit a warning if the estimated token count exceeds this value. Defaults to 100000. | 100000 |
error_exceed | int | Raise an error if the estimated token count exceeds this value. | 500000 |
max_themes | int or None | Instruct the LLM to identify no more than this number of themes when provided. | None |
min_themes | int or None | Instruct the LLM to identify at least this number of themes when provided. | None |
extra_instructions | str | Additional instructions appended to the theme discovery prompt. | None |
seed_topics | str, list[str] or BaseModel | Optional seed topics to include in the prompt. These are treated as starting suggestions, and the model should incorporate them into a broader list of themes. | None |
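The two thresholds can be read as a guard applied to the estimated prompt size before anything is sent to the LLM. A minimal sketch of such a guard follows; check_token_budget is a hypothetical helper, and ValueError is an assumption, since this reference does not name the exception type.

```python
import warnings


def check_token_budget(
    n_tokens: int,
    warn_exceed: int = 100_000,
    error_exceed: int = 500_000,
) -> None:
    """Warn or fail depending on how large the estimated prompt is."""
    if n_tokens > error_exceed:
        # ValueError is an assumption; the reference does not name the type.
        raise ValueError(
            f"Estimated {n_tokens} tokens exceeds error_exceed={error_exceed}."
        )
    if n_tokens > warn_exceed:
        warnings.warn(
            f"Estimated {n_tokens} tokens exceeds warn_exceed={warn_exceed}."
        )
```

Checking the hard limit first means a very large prompt fails immediately rather than emitting a warning and then failing.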
Returns:
Type | Description |
---|---|
type[BaseModel] | The generated feature model containing discovered themes. |
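To illustrate how a JSON mapping of theme names to descriptions can become a model class, here is a rough sketch. It deliberately uses the standard library's dataclasses.make_dataclass as a stand-in for a pydantic model so that it runs without third-party packages. The slugify and build_theme_model helpers are hypothetical, and giving each theme a boolean field is an assumption.

```python
import json
import re
from dataclasses import field, fields, make_dataclass


def slugify(name: str) -> str:
    """Turn a theme name into a valid Python identifier."""
    slug = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    return slug if slug and not slug[0].isdigit() else f"theme_{slug}"


def build_theme_model(llm_json: str, model_name: str = "DiscoveredThemes") -> type:
    """Create a class with one boolean field per discovered theme."""
    themes = json.loads(llm_json)  # {"theme name": "description", ...}
    specs = [
        # Assumption: each theme becomes a bool flag defaulting to False.
        (slugify(name), bool, field(default=False, metadata={"description": desc}))
        for name, desc in themes.items()
    ]
    return make_dataclass(model_name, specs)


response = (
    '{"Staff shortages": "Insufficient staffing levels.",'
    ' "Delayed response": "Slow emergency response."}'
)
ThemeModel = build_theme_model(response)
print([f.name for f in fields(ThemeModel)])
# ['staff_shortages', 'delayed_response']
```

Keeping the description in field metadata mirrors the idea that each theme carries both a name and an explanation that can later be shown to the LLM.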
estimate_tokens
Estimate token counts for all rows of a given column using the tiktoken library.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Name of the column containing report summaries. | None |
return_series | bool | If True, return a pandas.Series of per-row token counts for that field. | False |
Returns:
Type | Description |
---|---|
Union[int, Series] | A pandas.Series of per-row token counts if return_series is True; otherwise a single integer count. |
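The effect of return_series can be sketched without pandas or tiktoken. In the toy version below, rows are plain dictionaries, whitespace splitting is a crude stand-in for a real tokenizer, and treating the integer result as the sum over all rows is an assumption.

```python
def estimate_tokens(rows: list, col_name: str, return_series: bool = False):
    """Count tokens per row; return the per-row list or the overall total."""
    # Whitespace splitting is a crude stand-in for a tokenizer such as tiktoken.
    counts = [len(str(row.get(col_name) or "").split()) for row in rows]
    # Assumption: the scalar result is the total across all rows.
    return counts if return_series else sum(counts)
```

Missing or empty values count as zero tokens here, which keeps the per-row result aligned with the input rows.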
export_cache
Save the current cache to path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Full path to the cache file, including the filename. | 'extractor_cache.pkl' |
Returns:
Type | Description |
---|---|
str | The path to the written cache file. |
extract_features
extract_features(
reports=None,
*,
feature_model=None,
produce_spans=False,
drop_spans=False,
force_assign=False,
allow_multiple=False,
schema_detail="minimal",
extra_instructions=None,
skip_if_present=True,
)
Run feature extraction for the given reports.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reports | DataFrame | DataFrame of reports to process. Defaults to the instance's stored reports if omitted. | None |
feature_model | type[BaseModel] | Pydantic model describing the features to extract. Must be provided on first call or after calling | None |
produce_spans | bool | When | False |
drop_spans | bool | When | False |
force_assign | bool | When | False |
allow_multiple | bool | Allow a report to be assigned to multiple categories when True. | False |
schema_detail | ('full', 'minimal') | Level of detail for the feature schema included in the prompt. | "minimal" |
extra_instructions | str | Additional instructions injected into each prompt before the schema. | None |
skip_if_present | bool | When | True |
import_cache
Load cache from path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Full path to the cache file, including the filename. | 'extractor_cache.pkl' |
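The .pkl default suggests a pickle-based cache, although this reference does not say so explicitly. Under that assumption, and treating the cache as a plain dictionary, an export and import round trip could look like the sketch below. These standalone functions are illustrative and are not the Extractor methods themselves.

```python
import pickle
import tempfile
from pathlib import Path


def export_cache(cache: dict, path: str = "extractor_cache.pkl") -> str:
    """Write the cache to disk and return the path that was written."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as fh:
        pickle.dump(cache, fh)
    return str(target)


def import_cache(path: str = "extractor_cache.pkl") -> dict:
    """Load a previously exported cache from disk."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = export_cache(
        {"report-1": {"theme_a": True}},
        str(Path(tmp) / "extractor_cache.pkl"),
    )
    restored = import_cache(written)
```

Pickle files can execute arbitrary code when loaded, so a cache file should only be imported if it comes from a trusted source.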
reset
Reset internal caches and intermediate state.
This clears any cached feature extraction results and token estimations so that extract_features can be run again on the same reports. The instance itself is returned to allow method chaining, e.g. extractor.reset().extract_features().
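The chaining pattern works because reset returns the instance it was called on. A toy stand-in makes this concrete; MiniExtractor is hypothetical and only imitates the caching behaviour described above.

```python
class MiniExtractor:
    """Toy stand-in showing the chaining pattern, not the real Extractor."""

    def __init__(self) -> None:
        self._cache = {}

    def extract_features(self, texts=("report a", "report b")) -> dict:
        for text in texts:
            # Reuse cached results; only "process" texts not seen before.
            self._cache.setdefault(text, text.upper())
        return dict(self._cache)

    def reset(self) -> "MiniExtractor":
        self._cache.clear()
        return self  # returning self is what enables chaining


extractor = MiniExtractor()
extractor.extract_features(["old report"])
result = extractor.reset().extract_features(["new report"])
```

After the chained call, result holds only the entry for "new report", because the earlier cached entry was cleared by reset.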
summarise
Summarise selected report fields into one column using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
result_col_name | str | Name of the summary column. Defaults to 'summary'. | 'summary' |
trim_intensity | ('low', 'medium', 'high', 'very high') | Controls how concise the summary should be. Defaults to "low". | "low" |
extra_instructions | str | Additional instructions to append to the prompt before the report excerpt. | None |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame identical to the one provided at initialisation with an extra summary column. |
tabulate
Return a simple frequency table for extracted feature columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | str or list[str] | Column name or list of column names to summarise. Defaults to all feature columns added by extract_features. | None |
labels | str or list[str] | Human-friendly label or list of labels corresponding to columns. | None |
count_col | str | Column name for the count values in the output DataFrame. Defaults to 'Count'. | 'Count' |
pct_col | str | Column name for the percentage values in the output DataFrame. | 'Percentage' |
df | DataFrame | DataFrame containing the columns to tabulate. Defaults to the reports stored on the instance. | None |
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame summarising the frequencies of the specified columns. |
Raises:
Type | Description |
---|---|
RuntimeError | If :meth: |
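A frequency table of this kind reduces to counting how many reports have each feature set and expressing that as a share of all reports. The sketch below works on plain dictionaries rather than a DataFrame; the "Category" column name, the "Percentage" default and the rounding to one decimal place are assumptions made for illustration.

```python
def tabulate(rows, columns, labels=None, count_col="Count", pct_col="Percentage"):
    """Return one summary row per boolean feature column."""
    labels = labels or columns
    if len(labels) != len(columns):
        raise ValueError("labels must match columns in length")
    total = len(rows)
    table = []
    for column, label in zip(columns, labels):
        # Only values that are exactly True count as a positive assignment.
        count = sum(1 for row in rows if row.get(column) is True)
        pct = round(100 * count / total, 1) if total else 0.0
        table.append({"Category": label, count_col: count, pct_col: pct})
    return table
```

For four reports where a feature is True in three of them, this yields a count of 3 and a percentage of 75.0 for that feature.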