
Scraper

Scrape UK “Prevention of Future Death” (PFD) reports into a pandas.DataFrame.

The extractor runs in three cascading layers (html → pdf → llm), each independently switchable.

  1. HTML scrape – parse metadata and rich sections directly from the web page.
  2. PDF fallback – download the attached PDF and extract text with PyMuPDF for any missing fields.
  3. LLM fallback – delegate unresolved gaps to a Large Language Model supplied via llm.

Each layer can be enabled or disabled via scraping_strategy.
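
As a minimal sketch of how the layers combine, the configuration below runs the HTML stage only, so no PDF downloads or LLM calls are made (and no llm client is needed)::

from pfd_toolkit import Scraper

# HTML stage runs first; -1 disables the PDF and LLM stages
html_only = Scraper(scraping_strategy=[1, -1, -1])
df = html_only.scrape_reports()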

Parameters:

llm : LLM | None, default None
    Client implementing _call_llm_fallback(); required only when the LLM stage is enabled.

category : str, default "all"
    Judiciary category slug (e.g. "suicide", "hospital_deaths") or "all".

start_date : str, default "2000-01-01"
    Inclusive lower bound for the report date, in YYYY-MM-DD format.

end_date : str, default "2050-01-01"
    Inclusive upper bound for the report date, in YYYY-MM-DD format.

max_workers : int, default 10
    Thread-pool size for concurrent scraping.

max_requests : int, default 5
    Maximum simultaneous requests per host (enforced with a semaphore).

delay_range : tuple[float, float] | None, default (1, 2)
    Random delay (in seconds) applied before every request. Use None to disable (not recommended).

timeout : int, default 60
    Per-request timeout in seconds.

scraping_strategy : list[int] | tuple[int, int, int], default [1, 2, 3]
    Defines the order in which the HTML, PDF, and LLM stages are attempted. The three positions correspond to (HTML, PDF, LLM), and lower numbers run first. Provide -1 to disable a stage. For example, [1, 2, -1] runs HTML first, then PDF, and disables the LLM stage.

include_url : bool, default True
    Include the url column.

include_id : bool, default True
    Include the id column.

include_date : bool, default True
    Include the date column.

include_coroner : bool, default True
    Include the coroner column.

include_area : bool, default True
    Include the area column.

include_receiver : bool, default True
    Include the receiver column.

include_investigation : bool, default True
    Include the investigation column.

include_circumstances : bool, default True
    Include the circumstances column.

include_concerns : bool, default True
    Include the concerns column.

include_time_stamp : bool, default False
    Include a date_scraped column.

verbose : bool, default False
    Emit debug-level logs when True.
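
For illustration, a more polite configuration might shrink the thread pool and widen the random delay; the values below are arbitrary, and the LLM stage is disabled so no llm client is needed::

from pfd_toolkit import Scraper

gentle = Scraper(
    max_workers=4,                 # smaller thread pool
    max_requests=2,                # at most two in-flight requests per host
    delay_range=(2.0, 5.0),        # wait 2-5 seconds before each request
    timeout=120,                   # allow slow responses
    scraping_strategy=[1, 2, -1],  # HTML then PDF; LLM disabled
)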

Attributes:

reports : DataFrame | None
    Cached result of the last call to scrape_reports or top_up.

report_links : list[str]
    URLs discovered by get_report_links.

NOT_FOUND_TEXT : str
    Placeholder value set when a field cannot be extracted.

Examples:

Construct a scraper, run a full scrape, then top up and apply the LLM fallback::

from pfd_toolkit import Scraper
scraper = Scraper(
    category="suicide",
    start_date="2020-01-01",
    end_date="2022-12-31",
    scraping_strategy=[1, 2, 3],
    llm=my_llm_client,
)
df = scraper.scrape_reports()          # full scrape
newer_df = scraper.top_up(df)          # later "top-up"
added_llm_df = scraper.run_llm_fallback(df)  # apply LLM retroactively

get_report_links

get_report_links()

Discover individual report URLs for the current query, across all pages.

Iterates through _get_report_href_values (which collects the URLs for a single page). Pagination continues until a page yields zero new links.

Returns:

list[str] | None
    All discovered URLs, or None if no links were found for the given category/date window.
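
A short usage sketch, guarding against the None return when nothing matches the query::

links = scraper.get_report_links()
if links is None:
    print("No reports found for this category/date window")
else:
    print(f"Discovered {len(links)} report URLs")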

run_llm_fallback

run_llm_fallback(reports_df=None)

Ask the LLM to fill cells still set to self.NOT_FOUND_TEXT.

Only the missing fields requested via include_* flags are sent to the model, along with the report’s PDF bytes (when available).

Parameters:

reports_df : DataFrame | None, default None
    DataFrame to process. Defaults to self.reports.

Returns:

DataFrame
    Same shape as reports_df, updated in place and re-cached to self.reports.

Raises:

ValueError
    If no LLM client was supplied at construction time.

Examples:

Run the fallback step after scraping::

updated_df = scraper.run_llm_fallback()
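
To gauge what the fallback recovered, you can count placeholder cells before and after; this sketch assumes string columns and compares against scraper.NOT_FOUND_TEXT::

before = int((df == scraper.NOT_FOUND_TEXT).sum().sum())
updated_df = scraper.run_llm_fallback(df)
after = int((updated_df == scraper.NOT_FOUND_TEXT).sum().sum())
print(f"LLM fallback filled {before - after} missing cells")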

scrape_reports

scrape_reports()

Execute a full scrape using the class configuration.

Workflow
  1. Call get_report_links.
  2. Extract each report according to scraping_strategy.
  3. Cache the final DataFrame to self.reports.

Returns:

DataFrame
    One row per report. Column presence matches the include_* flags. The DataFrame is empty if nothing was scraped.

Examples:

Scrape reports and inspect columns::

df = scraper.scrape_reports()
df.columns
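
A follow-on sketch for spotting fields that every enabled stage failed to extract, counting cells still set to the NOT_FOUND_TEXT placeholder::

missing = (df == scraper.NOT_FOUND_TEXT).sum()
print(missing[missing > 0])   # per-column count of unresolved fields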

top_up

top_up(
    old_reports=None,
    start_date=None,
    end_date=None,
    clean=False,
)

Check for and append new PFD reports within the current parameters.

If new links are found they are scraped and appended to self.reports. Any URL (or ID) already present in old_reports is skipped.

Optionally, start_date and end_date can override the scraper’s own date window for this call only.

Parameters:

old_reports : DataFrame | None, default None
    Existing DataFrame. Defaults to self.reports.

start_date : str | None, default None
    Optionally override the scraper’s date window for this call only.

end_date : str | None, default None
    Optionally override the scraper’s date window for this call only.

clean : bool, default False
    When True, run the Cleaner on the newly scraped rows before merging them with existing reports.

Returns:

DataFrame | None
    Updated DataFrame if new reports were added; None if no new records were found and old_reports was None.

Raises:

ValueError
    If old_reports lacks the columns required for duplicate checks.

Examples:

Add new reports to an existing DataFrame::

updated = scraper.top_up(df, end_date="2023-01-01")
len(updated) - len(df)  # number of new reports
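
As a sketch of a periodic refresh, the call below reuses the cached reports, cleans the newly scraped rows before merging, and handles the case where nothing was cached and nothing new was found::

updated = scraper.top_up(clean=True)   # old_reports defaults to self.reports
if updated is not None:
    print(f"Now tracking {len(updated)} reports")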