
Produce summaries of report text

Short, consistent summaries make it possible to skim hundreds or even thousands of reports. You might use these summaries to quickly identify notable case studies, or to get an 'at a glance' understanding of the breadth of content within PFD reports.

Producing summaries also paves the way for automated theme discovery (see Discover recurring themes).

Use summarise() to condense each report into a short text snippet. The trim_intensity option controls how terse the summary should be. Calling summarise adds a summary column to your stored reports and keeps a copy on the instance under extractor.summarised_reports for later reuse.


Getting started

You'll likely wish to screen/filter reports with Screener before generating summaries. For example:

from pfd_toolkit import load_reports, LLM, Screener

# Load reports
reports = load_reports()

# Set up your LLM client
llm_client = LLM(api_key="YOUR-API-KEY")  # Replace with your own API key

# Screen reports by user query
user_query = "Deaths in police custody **only**."

screener = Screener(
    llm=llm_client,
    reports=reports
)

police_df = screener.screen_reports(
    user_query=user_query)

Note

For more information on screening reports, see here.

Following this, we can generate summaries of our screened/filtered reports:

from pfd_toolkit import Extractor

# Set up Extractor
extractor = Extractor(
    llm=llm_client,
    reports=police_df  # Use the screened reports from above
)

summary_df = extractor.summarise(trim_intensity="medium")

The resulting DataFrame contains a new column (default name summary).
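
As a quick check, you can inspect the new column directly; a copy of the summarised DataFrame is also kept on the instance under extractor.summarised_reports (as noted above) for later reuse:

# Peek at the generated summaries
print(summary_df["summary"].head())

# The summarised reports are also cached on the extractor instance
cached_summaries = extractor.summarised_reports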

You can specify a different column name via result_col_name, and you can adjust trim_intensity (options range from low to very high).
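
For example, a brief sketch using a custom column name and a terser setting (the column name short_summary here is purely illustrative):

terse_df = extractor.summarise(
    result_col_name="short_summary",  # illustrative custom column name
    trim_intensity="high"             # terser than the earlier example
)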

Specify which sections to summarise

By default, the summarise() method will trim the Investigation, Circumstances of Death and Coroner's Concerns sections. You can override this by setting the include_* flags. For example:

# Set up Extractor
extractor = Extractor(
    llm=llm_client,
    reports=police_df,

    # Decide which sections to include:
    include_investigation=False,
    include_circumstances=False,
    include_concerns=True
)

summary_df = extractor.summarise(trim_intensity="medium")

All options and defaults

| Flag | Report section | What it's useful for | Default |
|------|----------------|----------------------|---------|
| include_coroner | Coroner’s name | Simply the name of the coroner. | False |
| include_area | Coroner’s area | The local area the coroner operates within. | False |
| include_receiver | Receiver(s) of the report | The recipient(s) of the report. | False |
| include_investigation | “Investigation & Inquest” section | Contains procedural detail about the inquest. | True |
| include_circumstances | “Circumstances of Death” section | Describes what actually happened; holds key facts about the death. | True |
| include_concerns | “Coroner’s Concerns” section | Lists the issues the coroner believes should be addressed. | True |
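
If you also want the coroner's name, area, or recipient(s) reflected in the summaries, switch on the corresponding flags at set-up. A brief sketch using the flags from the table above:

extractor = Extractor(
    llm=llm_client,
    reports=police_df,

    # Add report metadata on top of the default sections:
    include_coroner=True,
    include_area=True,
    include_receiver=True
)

summary_df = extractor.summarise(trim_intensity="medium")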

Estimating token counts

Token usage is important when working with paid APIs. The estimate_tokens() helper provides a quick approximation of how many tokens a text column will consume.

total = extractor.estimate_tokens()
print(f"Total tokens in summaries: {total}")

estimate_tokens defaults to the summary column, but you can pass any text series via col_name. Set return_series=True to get a per-row estimate instead of the total.
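
For example, a per-row estimate for the default summary column (assuming the per-row result is returned as a pandas Series):

# Estimate tokens for each summary individually
per_row = extractor.estimate_tokens(
    col_name="summary",
    return_series=True
)
print(per_row.head())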