# Scraper

Scrape UK “Prevention of Future Death” (PFD) reports into a `pandas.DataFrame`.
The extractor runs in three cascading layers (HTML → PDF → LLM), each independently switchable:
- HTML scrape – parse metadata and rich sections directly from the web page.
- PDF fallback – download the attached PDF and extract text with PyMuPDF for any missing fields.
- LLM fallback – delegate unresolved gaps to a Large Language Model supplied via `llm`.
Each layer can be enabled or disabled via `scraping_strategy` (see the sketch below).
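For example, a minimal sketch, assuming that omitting a layer's index from the sequence disables that layer:

```python
from pfd_toolkit import Scraper

# Assumption: leaving 3 out of scraping_strategy skips the LLM layer,
# so only the HTML scrape and the PDF fallback are attempted.
scraper = Scraper(scraping_strategy=[1, 2])
```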
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`llm` | `LLM \| None` | Client implementing the `LLM` interface, used by the LLM fallback layer. | `None` |
`category` | `str` | Judiciary category slug (e.g. `'suicide'`). | `'all'` |
`start_date` | `str` | Inclusive lower bound for the report date, in `'YYYY-MM-DD'` format. | `'2000-01-01'` |
`end_date` | `str` | Inclusive upper bound for the report date, in `'YYYY-MM-DD'` format. | `'2050-01-01'` |
`max_workers` | `int` | Thread-pool size for concurrent scraping. | `10` |
`max_requests` | `int` | Maximum simultaneous requests per host (enforced with a semaphore). | `5` |
`delay_range` | `tuple[float, float] \| None` | Random delay (seconds) before every request; use `None` to disable. | `(1, 2)` |
`timeout` | `int` | Per-request timeout in seconds. | `60` |
`scraping_strategy` | `list[int] \| tuple[int, int, int]` | Defines the order in which HTML, PDF and LLM scraping are attempted; the indexes 1, 2 and 3 correspond to the HTML, PDF and LLM layers respectively. | `[1, 2, 3]` |
`include_url` | `bool` | Include the report URL column. | `True` |
`include_id` | `bool` | Include the report ID column. | `True` |
`include_date` | `bool` | Include the report date column. | `True` |
`include_coroner` | `bool` | Include the coroner name column. | `True` |
`include_area` | `bool` | Include the coroner area column. | `True` |
`include_receiver` | `bool` | Include the report receiver column. | `True` |
`include_investigation` | `bool` | Include the investigation section column. | `True` |
`include_circumstances` | `bool` | Include the circumstances-of-death section column. | `True` |
`include_concerns` | `bool` | Include the coroner's-concerns section column. | `True` |
`include_time_stamp` | `bool` | Include a time-stamp column recording when the report was scraped. | `False` |
`verbose` | `bool` | Emit debug-level logs when `True`. | `False` |
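As an illustration of the concurrency and politeness settings, a deliberately gentle configuration might look like this (a sketch; the values are arbitrary):

```python
from pfd_toolkit import Scraper

# Fewer worker threads, fewer simultaneous requests per host,
# a longer random delay before each request, and a generous timeout.
scraper = Scraper(
    max_workers=4,
    max_requests=2,
    delay_range=(2.0, 5.0),
    timeout=120,
)
```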
Attributes:

Name | Type | Description |
---|---|---|
`reports` | `DataFrame \| None` | Cached result of the last call to `scrape_reports`. |
`report_links` | `list[str]` | URLs discovered by `get_report_links`. |
`NOT_FOUND_TEXT` | `str` | Placeholder value set when a field cannot be extracted. |
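`NOT_FOUND_TEXT` makes it easy to spot rows with unresolved fields (a sketch; the `"coroner"` column name is a hypothetical inferred from the `include_coroner` flag):

```python
df = scraper.scrape_reports()

# Rows where no layer managed to extract the coroner field.
# "coroner" is a hypothetical column name, not confirmed by the docs.
unresolved = df[df["coroner"] == scraper.NOT_FOUND_TEXT]
```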
Examples:

```python
from pfd_toolkit import Scraper

scraper = Scraper(
    category="suicide",
    start_date="2020-01-01",
    end_date="2022-12-31",
    scraping_strategy=[1, 2, 3],
    llm=my_llm_client,
)

df = scraper.scrape_reports()                # full scrape
newer_df = scraper.top_up(df)                # later "top-up"
added_llm_df = scraper.run_llm_fallback(df)  # apply LLM retroactively
```
## get_report_links

Discover individual report URLs for the current query, across all pages.

Iterates through `_get_report_href_values` (which collects URLs for a single page). Pagination continues until a page yields zero new links.
Returns:

Type | Description |
---|---|
`list[str] \| None` | All discovered URLs, or `None` if no links were found for the given category/date window. |
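A minimal usage sketch, handling the `None` case:

```python
links = scraper.get_report_links()
if links is None:
    print("No reports found for this category/date window")
else:
    print(f"Discovered {len(links)} report URLs")
```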
## run_llm_fallback

Ask the LLM to fill cells still set to `self.NOT_FOUND_TEXT`.

Only the missing fields requested via the `include_*` flags are sent to the model, along with the report’s PDF bytes (when available).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`reports_df` | `DataFrame \| None` | DataFrame to process. Defaults to `self.reports`. | `None` |
Returns:

Type | Description |
---|---|
`DataFrame` | Same shape as the input, with previously missing fields filled in where the LLM succeeded. |
Raises:

Type | Description |
---|---|
`ValueError` | If no LLM client was supplied at construction time. |
Examples:

Run the fallback step after scraping:

```python
updated_df = scraper.run_llm_fallback()
```
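To gauge what the fallback achieved, you can count placeholder cells before and after the call (a sketch; `df` is assumed to come from a previous `scrape_reports` call):

```python
# Count cells the HTML/PDF layers could not resolve, run the LLM
# fallback, then count again to see how many cells it filled.
missing_before = (df == scraper.NOT_FOUND_TEXT).sum().sum()
updated_df = scraper.run_llm_fallback(df)
missing_after = (updated_df == scraper.NOT_FOUND_TEXT).sum().sum()
print(f"LLM fallback resolved {missing_before - missing_after} cells")
```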
## scrape_reports

Execute a full scrape with the class configuration.

Workflow:

1. Call `get_report_links`.
2. Extract each report according to `scraping_strategy`.
3. Cache the final DataFrame to `self.reports`.
Returns:

Type | Description |
---|---|
`DataFrame` | One row per report. Column presence matches the `include_*` flags. |
Examples:

Scrape reports and inspect columns:

```python
df = scraper.scrape_reports()
df.columns
```
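Because column presence follows the `include_*` flags, a slimmer DataFrame can be requested up front:

```python
from pfd_toolkit import Scraper

# Keep the narrative sections but drop coroner/area/receiver metadata.
scraper = Scraper(
    include_coroner=False,
    include_area=False,
    include_receiver=False,
)
df = scraper.scrape_reports()
df.columns  # the disabled columns are absent
```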
## top_up

Check for and append new PFD reports within the current parameters.

If new links are found, they are scraped and appended to `self.reports`. Any URL (or ID) already present in `old_reports` is skipped.

Optionally, you can override the `start_date` and `end_date` parameters from `self` for this call only.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`old_reports` | `DataFrame \| None` | Existing DataFrame. Defaults to `self.reports`. | `None` |
`start_date` | `str \| None` | Optionally override the scraper’s date window for this call only. | `None` |
`end_date` | `str \| None` | Optionally override the scraper’s date window for this call only. | `None` |
`clean` | `bool` | When `True`, clean the newly scraped reports before appending them. | `False` |
Returns:

Type | Description |
---|---|
`DataFrame \| None` | Updated DataFrame if new reports were added; `None` if no new records were found and `old_reports` was `None`. |
Raises:

Type | Description |
---|---|
`ValueError` | If `old_reports` lacks the columns required for duplicate checks. |
Examples:

Add new reports to an existing DataFrame:

```python
updated = scraper.top_up(df, end_date="2023-01-01")
len(updated) - len(df)  # number of new reports
```
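A periodic update workflow might persist the DataFrame between runs (a sketch; the CSV path is hypothetical):

```python
import pandas as pd

df = pd.read_csv("pfd_reports.csv")  # hypothetical path to a saved scrape
updated = scraper.top_up(df)
if updated is not None:
    updated.to_csv("pfd_reports.csv", index=False)
```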