# Scraper

Scrape UK “Prevention of Future Death” (PFD) reports into a `pandas.DataFrame`.
The extractor runs in three cascading layers (HTML → PDF → LLM), each independently switchable:
- HTML scrape – parse metadata and rich sections directly from the web page.
- PDF fallback – download the attached PDF and extract text with PyMuPDF for any missing fields.
- LLM fallback – delegate unresolved gaps to a Large Language Model supplied via `llm`.
Each layer can be enabled or disabled via `scraping_strategy` (see the sketch below).
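For example, a minimal sketch, assuming that omitting a layer's index from the sequence disables that layer:

```python
from pfd_toolkit import Scraper

# Assumption: leaving 3 out of scraping_strategy skips the LLM layer,
# so only the HTML scrape and the PDF fallback are attempted.
scraper = Scraper(scraping_strategy=[1, 2])
```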
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`llm` | `LLM \| None` | Client implementing the `LLM` interface, used by the LLM fallback layer. | `None` |
`category` | `str` | Judiciary category slug (e.g. `'suicide'`). | `'all'` |
`start_date` | `str` | Inclusive lower bound for the report date, in `'YYYY-MM-DD'` format. | `'2000-01-01'` |
`end_date` | `str` | Inclusive upper bound for the report date, in `'YYYY-MM-DD'` format. | `'2050-01-01'` |
`max_workers` | `int` | Thread-pool size for concurrent scraping. | `10` |
`max_requests` | `int` | Maximum simultaneous requests per host (enforced with a semaphore). | `5` |
`delay_range` | `tuple[float, float] \| None` | Random delay (seconds) before every request; use `None` to disable. | `(1, 2)` |
`timeout` | `int` | Per-request timeout in seconds. | `60` |
`scraping_strategy` | `list[int] \| tuple[int, int, int]` | Defines the order in which HTML, PDF and LLM scraping are attempted; the indexes 1, 2 and 3 correspond to the HTML, PDF and LLM layers respectively. | `[1, 2, 3]` |
`include_url` | `bool` | Include the report URL column. | `True` |
`include_id` | `bool` | Include the report ID column. | `True` |
`include_date` | `bool` | Include the report date column. | `True` |
`include_coroner` | `bool` | Include the coroner name column. | `True` |
`include_area` | `bool` | Include the coroner area column. | `True` |
`include_receiver` | `bool` | Include the report receiver column. | `True` |
`include_investigation` | `bool` | Include the investigation section column. | `True` |
`include_circumstances` | `bool` | Include the circumstances-of-death section column. | `True` |
`include_concerns` | `bool` | Include the coroner's-concerns section column. | `True` |
`include_time_stamp` | `bool` | Include a time-stamp column recording when the report was scraped. | `False` |
`verbose` | `bool` | Emit debug-level logs when `True`. | `False` |
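As an illustration of the concurrency and politeness settings, a deliberately gentle configuration might look like this (a sketch; the values are arbitrary):

```python
from pfd_toolkit import Scraper

# Fewer worker threads, fewer simultaneous requests per host,
# a longer random delay before each request, and a generous timeout.
scraper = Scraper(
    max_workers=4,
    max_requests=2,
    delay_range=(2.0, 5.0),
    timeout=120,
)
```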
Attributes:

Name | Type | Description |
---|---|---|
`reports` | `DataFrame \| None` | Cached result of the last call to `scrape_reports`. |
`report_links` | `list[str]` | URLs discovered by `get_report_links`. |
`NOT_FOUND_TEXT` | `str` | Placeholder value set when a field cannot be extracted. |
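`NOT_FOUND_TEXT` makes it easy to spot rows with unresolved fields (a sketch; the `"coroner"` column name is a hypothetical inferred from the `include_coroner` flag):

```python
df = scraper.scrape_reports()

# Rows where no layer managed to extract the coroner field.
# "coroner" is a hypothetical column name, not confirmed by the docs.
unresolved = df[df["coroner"] == scraper.NOT_FOUND_TEXT]
```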
Examples:

```python
from pfd_toolkit import Scraper

scraper = Scraper(
    category="suicide",
    start_date="2020-01-01",
    end_date="2022-12-31",
    scraping_strategy=[1, 2, 3],
    llm=my_llm_client,
)

df = scraper.scrape_reports()                # full scrape
newer_df = scraper.top_up(df)                # later "top-up"
added_llm_df = scraper.run_llm_fallback(df)  # apply LLM retroactively
```
## get_report_links

Discover individual report URLs for the current query, across all pages.

Iterates through `_get_report_href_values` (which collects URLs for a single page). Pagination continues until a page yields zero new links.
Returns:

Type | Description |
---|---|
`list[str] \| None` | All discovered URLs, or `None` if no links were found for the given category/date window. |
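A minimal usage sketch, handling the `None` case:

```python
links = scraper.get_report_links()
if links is None:
    print("No reports found for this category/date window")
else:
    print(f"Discovered {len(links)} report URLs")
```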
## run_llm_fallback

Ask the LLM to fill cells still set to `self.NOT_FOUND_TEXT`.

Only the missing fields requested via the `include_*` flags are sent to the model, along with the report’s PDF bytes (when available).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`reports_df` | `DataFrame \| None` | DataFrame to process. Defaults to `self.reports`. | `None` |
Returns:

Type | Description |
---|---|
`DataFrame` | Same shape as the input, with previously missing fields filled in where the LLM succeeded. |
Raises:

Type | Description |
---|---|
`ValueError` | If no LLM client was supplied at construction time. |
Examples:

Run the fallback step after scraping:

```python
updated_df = scraper.run_llm_fallback()
```
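To gauge what the fallback achieved, you can count placeholder cells before and after the call (a sketch; `df` is assumed to come from a previous `scrape_reports` call):

```python
# Count cells the HTML/PDF layers could not resolve, run the LLM
# fallback, then count again to see how many cells it filled.
missing_before = (df == scraper.NOT_FOUND_TEXT).sum().sum()
updated_df = scraper.run_llm_fallback(df)
missing_after = (updated_df == scraper.NOT_FOUND_TEXT).sum().sum()
print(f"LLM fallback resolved {missing_before - missing_after} cells")
```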
## scrape_reports

Execute a full scrape with the class configuration.

Workflow:

1. Call `get_report_links`.
2. Extract each report according to `scraping_strategy`.
3. Cache the final DataFrame to `self.reports`.
Returns:

Type | Description |
---|---|
`DataFrame` | One row per report. Column presence matches the `include_*` flags. |
Examples:

Scrape reports and inspect columns:

```python
df = scraper.scrape_reports()
df.columns
```
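Because column presence follows the `include_*` flags, a slimmer DataFrame can be requested up front:

```python
from pfd_toolkit import Scraper

# Keep the narrative sections but drop coroner/area/receiver metadata.
scraper = Scraper(
    include_coroner=False,
    include_area=False,
    include_receiver=False,
)
df = scraper.scrape_reports()
df.columns  # the disabled columns are absent
```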
## top_up

Check for and append new PFD reports within the current parameters.

If new links are found, they are scraped and appended to `self.reports`. Any URL (or ID) already present in `old_reports` is skipped.

Optionally, you can override the `start_date` and `end_date` parameters from `self` for this call only.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`old_reports` | `DataFrame \| None` | Existing DataFrame. Defaults to `self.reports`. | `None` |
`start_date` | `str \| None` | Optionally override the scraper’s date window for this call only. | `None` |
`end_date` | `str \| None` | Optionally override the scraper’s date window for this call only. | `None` |
`clean` | `bool` | When `True`, clean the newly scraped reports before appending them. | `False` |
Returns:

Type | Description |
---|---|
`DataFrame \| None` | Updated DataFrame if new reports were added; `None` if no new records were found and `old_reports` was `None`. |
Raises:

Type | Description |
---|---|
`ValueError` | If `old_reports` lacks the columns required for duplicate checks. |
Examples:

Add new reports to an existing DataFrame:

```python
updated = scraper.top_up(df, end_date="2023-01-01")
len(updated) - len(df)  # number of new reports
```
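A periodic update workflow might persist the DataFrame between runs (a sketch; the CSV path is hypothetical):

```python
import pandas as pd

df = pd.read_csv("pfd_reports.csv")  # hypothetical path to a saved scrape
updated = scraper.top_up(df)
if updated is not None:
    updated.to_csv("pfd_reports.csv", index=False)
```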