Cleaning scraped data

Cleaner provides an optional step for polishing scraped reports with the same LLM used in the fallback stage. It loops over the selected columns, sends field-specific prompts in batches, and writes the corrected text back into a copy of your DataFrame.

Basic usage

from pfd_toolkit import Cleaner

# `df` holds the scraped reports; `llm` is the model client used
# elsewhere in the toolkit.
cleaner = Cleaner(df, llm, include_receiver=False)
clean_df = cleaner.clean_reports(anonymise=True)

The class works field by field, so you decide which columns to modify. Use boolean flags such as include_receiver or include_area to toggle each field on or off. Custom prompt strings allow advanced users to steer how the LLM rewrites the text.
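To make the toggling concrete, here is a minimal sketch of how boolean flags might map to the set of columns that get cleaned. The flag and column names are illustrative, not the library's confirmed internals.

```python
# Illustrative only: map per-field boolean flags to column names.
def columns_to_clean(include_receiver: bool = True,
                     include_area: bool = True,
                     include_concerns: bool = True) -> list[str]:
    flags = {
        "receiver": include_receiver,
        "area": include_area,
        "concerns": include_concerns,
    }
    # Only columns whose flag is True are passed to the cleaning loop.
    return [col for col, enabled in flags.items() if enabled]

columns_to_clean(include_receiver=False)
# → ["area", "concerns"]
```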

Anonymising sensitive details

When anonymise=True, the investigation, circumstances and concerns fields are automatically de-identified: names and pronouns are swapped for gender-neutral placeholders.
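As a rough illustration of that swap (not the library's actual implementation, which delegates the rewriting to the LLM), a toy de-identification pass might look like:

```python
import re

# Illustrative only: replace known names with a placeholder and swap
# gendered pronouns for gender-neutral ones.
PRONOUNS = {"he": "they", "she": "they", "him": "them", "her": "them",
            "his": "their", "hers": "theirs"}

def anonymise(text: str, names: list[str]) -> str:
    for name in names:
        text = re.sub(re.escape(name), "[the deceased]", text)

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = PRONOUNS[word.lower()]
        # Preserve sentence-initial capitalisation.
        return replacement.capitalize() if word[0].isupper() else replacement

    pattern = r"\b(" + "|".join(PRONOUNS) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

anonymise("He said Mr Smith fell; his injuries were fatal.", ["Mr Smith"])
# → "They said [the deceased] fell; their injuries were fatal."
```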

Under the hood, the class constructs a field-specific prompt for every text snippet, sends the prompts to the LLM in batches, and writes the results back into a copy of your DataFrame, leaving the original untouched.
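The batching loop can be sketched as follows, with a stand-in call_llm function in place of the real model client; all names here are illustrative.

```python
import pandas as pd

def call_llm(prompt: str) -> str:
    # Stand-in for the batched LLM call; here it just upper-cases
    # the snippet so the example is self-contained.
    return prompt.rsplit("\n", 1)[-1].upper()

def clean_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    cleaned = df.copy()  # results go into a copy, never the original
    for col in columns:
        # One field-specific prompt per text snippet in the column.
        prompts = [f"Tidy the '{col}' field:\n{text}" for text in df[col]]
        cleaned[col] = [call_llm(p) for p in prompts]
    return cleaned

df = pd.DataFrame({"concerns": ["delay in care", "missed diagnosis"]})
clean_columns(df, ["concerns"])["concerns"].tolist()
# → ["DELAY IN CARE", "MISSED DIAGNOSIS"]
```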

Prompt templates and error handling

Cleaner exposes the final prompt for each field via generate_prompt_template(). This lets you inspect or tweak the wording before sending anything to the model. If the LLM returns a recognised error marker, the class reverts to the original text rather than introducing blank values.
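The revert-on-error behaviour can be sketched like this; the marker string and helper name are illustrative, not the library's actual internals.

```python
# Illustrative only: any response equal to the error marker is
# discarded in favour of the original text, so no blanks are written.
ERROR_MARKER = "<CLEANING_ERROR>"

def merge_results(originals: list[str], responses: list[str]) -> list[str]:
    return [orig if resp == ERROR_MARKER else resp
            for orig, resp in zip(originals, responses)]

merge_results(["text a", "text b"], ["cleaned a", "<CLEANING_ERROR>"])
# → ["cleaned a", "text b"]
```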

See the API reference for a detailed breakdown of all arguments and attributes.