Produce summaries of report text¶
Short, consistent summaries make it possible to skim hundreds or even thousands of reports. You might use these summaries to quickly identify notable case studies, or to get an 'at a glance' understanding of the breadth contained within PFD reports.
Producing summaries also paves the way for automated theme discovery (see Discover recurring themes).
Use summarise() to condense each report into a short text snippet. The trim_intensity option controls how terse the summary should be. Calling summarise() adds a summary column to your stored reports and keeps a copy on the instance under extractor.summarised_reports for later reuse.
Getting started¶
You'll likely wish to screen/filter reports with Screener before generating summaries. For example:
from pfd_toolkit import load_reports, LLM, Screener
# Load reports
reports = load_reports()
# Set up your LLM client
llm_client = LLM(api_key="YOUR-API-KEY")
# Screen reports by user query
user_query = "Deaths in police custody **only**."
screener = Screener(
    llm=llm_client,
    reports=reports
)
police_df = screener.screen_reports(
    user_query=user_query)
Note
For more information on screening reports, see here.
Following this, we can generate summaries of our screened/filtered reports:
from pfd_toolkit import Extractor
# Set up Extractor
extractor = Extractor(
    llm=llm_client,
    reports=police_df  # Pass the screened reports from the previous step
)
summary_df = extractor.summarise(trim_intensity="medium")
The resulting DataFrame contains a new column (default name summary). You can specify a different column name via result_col_name, and you can set a different trim_intensity (options range from low to very high).
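For example, the following sketch requests a terser summary under a custom column name (the exact "high" intensity string is illustrative), then reuses the copy kept on the instance:

# Request a terser summary under a custom column name
short_df = extractor.summarise(
    trim_intensity="high",
    result_col_name="short_summary"
)

# The same summaries are kept on the instance for later reuse
stored_df = extractor.summarised_reports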
Specify which sections to summarise¶
By default, the summarise() method will trim the Investigation, Circumstances of Death and Coroner's Concerns sections. You can override this by setting the include_* flags when creating the Extractor. For example:
# Set up Extractor
extractor = Extractor(
    llm=llm_client,
    reports=reports,
    # Decide which sections to include:
    include_investigation=False,
    include_circumstances=False,
    include_concerns=True
)
summary_df = extractor.summarise(trim_intensity="medium")
All options and defaults¶
Flag | Report section | What it's useful for | Default |
---|---|---|---|
include_coroner | Coroner’s name | Simply the name of the coroner. | False |
include_area | Coroner’s area | The local area the coroner operates within. | False |
include_receiver | Receiver(s) of the report | The recipient(s) of the report. | False |
include_investigation | “Investigation & Inquest” section | Contains procedural detail about the inquest. | True |
include_circumstances | “Circumstances of Death” section | Describes what actually happened; holds key facts about the death. | True |
include_concerns | “Coroner’s Concerns” section | Lists the issues the coroner believes should be addressed. | True |
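As a brief sketch combining these flags (reusing the objects from the earlier examples), you could add the coroner's name and area alongside the default sections:

extractor = Extractor(
    llm=llm_client,
    reports=reports,
    include_coroner=True,  # add the coroner's name
    include_area=True,     # add the coroner's area
    # include_investigation, include_circumstances and
    # include_concerns remain True by default
)
summary_df = extractor.summarise(trim_intensity="medium")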
Estimating token counts¶
Token usage is important when working with paid APIs. The estimate_tokens() helper provides a quick approximation of how many tokens a text column will consume. estimate_tokens defaults to the summary column, but you can pass any text series via col_name. Set return_series=True to get a per-row estimate instead of the total.
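A minimal sketch, assuming estimate_tokens() is called on the Extractor instance and that "circumstances" is simply an illustrative column name from your reports DataFrame:

# Total token estimate for the default summary column
total_tokens = extractor.estimate_tokens()

# Per-row estimates for a different text column
# ("circumstances" is a hypothetical column name)
per_report_tokens = extractor.estimate_tokens(
    col_name="circumstances",
    return_series=True
)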