API Reference: GDPR DSAR Workflow Automation
PIIPatternMatcher
Scans text for PII using compiled regex patterns with confidence scoring and contextual boosting.
Constructor
PIIPatternMatcher(custom_patterns=None)
| Parameter | Type | Description |
|---|---|---|
| `custom_patterns` | `dict or None` | Additional regex patterns to include in scanning |
Methods
scan_text(text, min_confidence=0.5)
Scan a string for PII matches.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Text to scan for PII |
| `min_confidence` | `float` | `0.5` | Minimum confidence threshold (0.0-1.0) |
Returns: list[dict] -- Each match contains type, value, description, confidence, gdpr_category, position.
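The following is a minimal, self-contained sketch of the scanning idea, not the library's implementation. The two regexes, the name `scan_text_sketch`, and the choice to report `position` as the match's start offset are assumptions made for illustration; contextual boosting is not modelled.

```python
import re

# Illustrative subset only. Base confidences and GDPR categories are taken
# from the built-in pattern table; the regexes themselves are assumptions.
PATTERNS = {
    "email": {
        "regex": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "description": "Email address",
        "confidence": 0.95,
        "gdpr_category": "contact_information",
    },
    "ipv4": {
        "regex": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
        "description": "IPv4 address",
        "confidence": 0.60,
        "gdpr_category": "online_identifier",
    },
}


def scan_text_sketch(text, min_confidence=0.5):
    """Return one dict per regex hit whose base confidence meets the threshold."""
    matches = []
    for name, spec in PATTERNS.items():
        if spec["confidence"] < min_confidence:
            continue  # pattern is below the requested threshold
        for m in spec["regex"].finditer(text):
            matches.append({
                "type": name,
                "value": m.group(0),
                "description": spec["description"],
                "confidence": spec["confidence"],
                "gdpr_category": spec["gdpr_category"],
                "position": m.start(),  # assumption: start offset in the text
            })
    return sorted(matches, key=lambda r: r["position"])


results = scan_text_sketch("Mail jane@example.com from 10.0.0.7", min_confidence=0.7)
```

With the threshold raised to 0.7, only the email match is returned, because the `ipv4` base confidence is 0.60.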
scan_file(file_path, min_confidence=0.5)
Scan a file on disk for PII matches.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `file_path` | `str` | required | Absolute path to the file |
| `min_confidence` | `float` | `0.5` | Minimum confidence threshold |
Returns: dict with file, size_bytes, matches, match_count, pii_types_found.
Built-in PII Patterns
| Pattern Name | Description | Confidence | GDPR Category |
|---|---|---|---|
| `email` | Email address | 0.95 | contact_information |
| `phone_international` | International phone number | 0.70 | contact_information |
| `uk_phone` | UK phone number | 0.80 | contact_information |
| `ssn_us` | US Social Security Number | 0.85 | government_id |
| `nino_uk` | UK National Insurance Number | 0.90 | government_id |
| `credit_card` | Credit/debit card number | 0.85 | financial_data |
| `iban` | International Bank Account Number | 0.80 | financial_data |
| `ipv4` | IPv4 address | 0.60 | online_identifier |
| `date_of_birth` | Date of birth (DD/MM/YYYY) | 0.65 | demographic_data |
| `uk_postcode` | UK postcode | 0.75 | location_data |
| `passport_uk` | UK passport number (9 digits) | 0.40 | government_id |
| `eu_vat` | EU VAT number | 0.50 | financial_data |
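To see how the base confidences in the table interact with `min_confidence`, the sketch below lists which pattern names would clear a given threshold. It assumes the comparison is inclusive (`>=`) and ignores contextual boosting, neither of which is confirmed by this reference; `patterns_reported` is a hypothetical helper.

```python
# Base confidences copied from the built-in pattern table.
BASE_CONFIDENCE = {
    "email": 0.95,
    "phone_international": 0.70,
    "uk_phone": 0.80,
    "ssn_us": 0.85,
    "nino_uk": 0.90,
    "credit_card": 0.85,
    "iban": 0.80,
    "ipv4": 0.60,
    "date_of_birth": 0.65,
    "uk_postcode": 0.75,
    "passport_uk": 0.40,
    "eu_vat": 0.50,
}


def patterns_reported(min_confidence=0.5):
    """Names of patterns whose base confidence meets the threshold (assumed inclusive)."""
    return sorted(
        name for name, conf in BASE_CONFIDENCE.items() if conf >= min_confidence
    )


default_set = patterns_reported()       # default threshold of 0.5
strict_set = patterns_reported(0.9)     # only the highest-confidence patterns
```

Under these assumptions, `passport_uk` (0.40) is excluded at the default threshold, so callers who need UK passport numbers would have to lower `min_confidence`.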
PIIDiscoveryEngine
Discovers PII across structured (database) and unstructured (files) data sources.
Constructor
PIIDiscoveryEngine(custom_patterns=None)
Methods
scan_database(connection_string, search_identifiers, tables=None)
Generate parameterized SQL queries for PII discovery in databases.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `connection_string` | `str` | required | Database connection string (redacted in output) |
| `search_identifiers` | `dict` | required | Key-value pairs to search for (e.g., `{"email": "user@example.com"}`) |
| `tables` | `list[str] or None` | auto | Tables to scan; defaults to common tables |
Returns: dict with source_type, connection, tables_scanned, queries_generated, queries.
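A rough sketch of what "parameterized queries" and "redacted in output" could look like is shown below. The helper names, the `%s` placeholder style (as used by psycopg-style drivers), and the output keys are assumptions for illustration, not the engine's actual query shape.

```python
import re

# Identifiers (table/column names) cannot be bound as parameters, so this
# sketch validates them against a conservative pattern before interpolation.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def redact_connection_string(connection_string):
    """Mask the password in scheme://user:password@host/db style strings."""
    return re.sub(r"(://[^:/@]+:)[^@]+@", r"\1***@", connection_string)


def build_queries(search_identifiers, tables):
    """Build one query per (table, identifier) pair; values stay out of the SQL text."""
    queries = []
    for table in tables:
        for column, value in search_identifiers.items():
            if not (_IDENTIFIER.fullmatch(table) and _IDENTIFIER.fullmatch(column)):
                raise ValueError(f"Unsafe identifier: {table!r}.{column!r}")
            queries.append({
                "table": table,
                "sql": f"SELECT * FROM {table} WHERE {column} = %s",
                "params": (value,),  # bound by the driver at execution time
            })
    return queries


safe_conn = redact_connection_string("postgresql://user:pass@localhost/appdb")
queries = build_queries({"email": "user@example.com"}, ["users", "orders"])
```

Keeping the searched value in `params` rather than in the SQL string is what makes the query parameterized; the identifier check covers the parts that cannot be bound.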
scan_files(directories, search_identifiers, file_extensions=None, max_file_size_mb=50)
Scan files in directories for PII matching identifiers.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `directories` | `list[str]` | required | Directory paths to scan |
| `search_identifiers` | `dict` | required | Identifiers to search for |
| `file_extensions` | `list[str] or None` | common types | File extensions to include |
| `max_file_size_mb` | `int` | `50` | Skip files larger than this |
Returns: dict with files_scanned, files_with_matches, matches, raw_text_matches.
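The extension and size filters described above can be sketched with `os.walk`. The default extension list, the case-insensitive substring match, and the per-match fields are assumptions; the sketch also omits `raw_text_matches`.

```python
import os


def iter_candidate_files(directories,
                         file_extensions=(".txt", ".log", ".csv", ".json"),
                         max_file_size_mb=50):
    """Yield paths that pass the extension filter and the size limit."""
    max_bytes = max_file_size_mb * 1024 * 1024
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                path = os.path.join(root, name)
                if not name.lower().endswith(tuple(file_extensions)):
                    continue  # extension not in the include list
                if os.path.getsize(path) > max_bytes:
                    continue  # skip files larger than the limit
                yield path


def scan_files_sketch(directories, search_identifiers, **filters):
    """Search each candidate file line by line for the identifier values."""
    files_scanned = 0
    matches = []
    for path in iter_candidate_files(directories, **filters):
        files_scanned += 1
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                for key, value in search_identifiers.items():
                    if value.lower() in line.lower():
                        matches.append(
                            {"file": path, "line": line_no, "identifier": key}
                        )
    return {
        "files_scanned": files_scanned,
        "files_with_matches": len({m["file"] for m in matches}),
        "matches": matches,
    }
```

A call such as `scan_files_sketch(["/data/exports"], {"email": "jane@example.com"})` would walk the directory tree and report the line number of each hit.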
scan_with_ner(text_corpus, entity_types=None, confidence_threshold=0.7)
Scan the files listed in text_corpus using Named Entity Recognition (spaCy NER, with a regex fallback).
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text_corpus` | `list[str]` | required | List of file paths to scan |
| `entity_types` | `list[str] or None` | common types | NER entity types to detect |
| `confidence_threshold` | `float` | `0.7` | Minimum confidence for results |
Supported Entity Types: PERSON, EMAIL, PHONE_NUMBER, LOCATION, DATE_OF_BIRTH, ORG, GPE
Returns: dict with files_processed, total_entities, results, model_used.
consolidate_results(*result_sets)
Merge results from database, file, and NER scans into a unified record set.
Returns: dict with total_records, source_count, sources, records.
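One way such a merge could work is sketched below. The input shape is hypothetical: each result set is assumed to carry a `source_type` and a `records` list, and duplicates are assumed to be identified by `(type, value)`. The real scan outputs documented above use different keys, so treat this purely as an illustration of the output structure.

```python
def consolidate_sketch(*result_sets):
    """Merge result sets into one de-duplicated record list, tagging each record's source."""
    records = []
    sources = []
    seen = set()
    for result in result_sets:
        source = result.get("source_type", "unknown")
        if source not in sources:
            sources.append(source)
        for record in result.get("records", []):
            key = (record.get("type"), record.get("value"))
            if key in seen:
                continue  # keep the first occurrence only
            seen.add(key)
            records.append({**record, "source": source})
    return {
        "total_records": len(records),
        "source_count": len(sources),
        "sources": sources,
        "records": records,
    }


db_result = {"source_type": "database",
             "records": [{"type": "email", "value": "jane@example.com"}]}
file_result = {"source_type": "files",
               "records": [{"type": "email", "value": "jane@example.com"},
                           {"type": "ipv4", "value": "10.0.0.7"}]}
merged = consolidate_sketch(db_result, file_result)
```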
full_scan(search_identifiers, sources=None, db_connection="", directories=None)
Run a complete PII discovery scan across all source types.
Returns: Consolidated dict from all scans.
DataMapper
Maps discovered PII to GDPR Article 15 disclosure categories.
Constructor
DataMapper(data_inventory_path=None)
| Parameter | Type | Description |
|---|---|---|
| `data_inventory_path` | `str or None` | Path to JSON data inventory for overrides |
Methods
map_to_article15(pii_records, data_subject_id)
Map PII records to Article 15 required categories including processing purposes, legal basis, retention periods, and recipients.
Returns: dict with categories, supplementary_info, article_15_reference.
Article 15 Categories Mapped
| Category | Article Reference | Contents |
|---|---|---|
| Processing Purposes | Art. 15(1)(a) | Why data is processed |
| Data Categories | Art. 15(1)(b) | Types of personal data |
| Recipients | Art. 15(1)(c) | Who receives the data |
| Retention Period | Art. 15(1)(d) | How long data is kept |
| Data Subject Rights | Art. 15(1)(e)-(f) | Rights to rectification, erasure, restriction, and objection (e); right to lodge a complaint with a supervisory authority (f) |
| Data Source | Art. 15(1)(g) | Where data was collected from |
| Automated Decisions | Art. 15(1)(h) | Profiling and automated decision-making |
| International Transfers | Art. 15(2) | Safeguards for cross-border transfers |
ExemptionReviewer
Reviews DSAR data against applicable GDPR/UK GDPR exemptions.
Methods
review_exemptions(mapped_data, exemption_checks=None)
Flag applicable exemptions for DPO review.
Returns: dict with exemption_count, exemptions, review_status.
apply_redactions(mapped_data, approved_exemptions)
Apply approved exemption redactions to the mapped data.
Returns: Redacted dict with redaction_log.
Supported Exemption Types
| Type | Legal Basis | Action |
|---|---|---|
| `third_party_data` | Art. 15(4) / DPA 2018 Sch. 2 Para 16 | redact |
| `legal_professional_privilege` | DPA 2018 Sch. 2 Para 19 | withhold |
| `trade_secrets` | Recital 63 GDPR | redact |
| `crime_prevention` | DPA 2018 Sch. 2 Para 2 | withhold |
| `management_forecasting` | DPA 2018 Sch. 2 Para 22 | withhold |
| `negotiations` | DPA 2018 Sch. 2 Para 24 | withhold |
| `regulatory_function` | DPA 2018 Sch. 2 Para 20 | withhold |
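The difference between the two actions in the table can be illustrated as follows. This is a sketch under assumptions: the flat `mapped_data` dict, the `field` key on each approved exemption, and the `[REDACTED]` marker are all hypothetical, chosen only to show "redact" replacing a value while "withhold" removes it.

```python
import copy

# Action per exemption type, copied from the table above.
EXEMPTION_ACTIONS = {
    "third_party_data": "redact",
    "legal_professional_privilege": "withhold",
    "trade_secrets": "redact",
    "crime_prevention": "withhold",
    "management_forecasting": "withhold",
    "negotiations": "withhold",
    "regulatory_function": "withhold",
}


def apply_redactions_sketch(mapped_data, approved_exemptions):
    """Return a redacted copy of mapped_data plus a log of what was changed."""
    redacted = copy.deepcopy(mapped_data)  # never mutate the caller's data
    log = []
    for exemption in approved_exemptions:
        action = EXEMPTION_ACTIONS[exemption["type"]]
        field = exemption["field"]  # hypothetical: name of the affected field
        if field not in redacted:
            continue
        if action == "redact":
            redacted[field] = "[REDACTED]"
        else:  # "withhold": drop the field entirely
            del redacted[field]
        log.append({"field": field, "type": exemption["type"], "action": action})
    redacted["redaction_log"] = log
    return redacted


original = {"email_thread": "Message from a colleague", "legal_memo": "Advice"}
approved = [
    {"type": "third_party_data", "field": "email_thread"},
    {"type": "legal_professional_privilege", "field": "legal_memo"},
]
redacted = apply_redactions_sketch(original, approved)
```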
DSARResponseGenerator
Generates compliant DSAR response packages per GDPR Article 15.
Constructor
DSARResponseGenerator(template_dir=None, organization_name="Organization", dpo_email="dpo@organization.com", controller_name="Data Protection Officer")
Methods
generate_response(dsar_id, data_subject, mapped_data, format="json", request_date=None)
Generate a complete response package with cover letter, data export, supplementary info, and audit metadata.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dsar_id` | `str` | required | DSAR reference ID |
| `data_subject` | `str` | required | Name of the data subject |
| `mapped_data` | `dict` | required | Output from DataMapper/ExemptionReviewer |
| `format` | `str` | `"json"` | Export format: json or csv |
| `request_date` | `str or None` | today | Date the request was received |
Returns: dict with documents list containing filename, type, and content for each document.
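As an illustration of the two export formats, the sketch below serializes a list of records into a single document entry with `filename`, `type`, and `content`. The function name, the `data_export` filename, and the flat record shape are assumptions; the generator's real documents (cover letter, supplementary info, audit metadata) are not reproduced here.

```python
import csv
import io
import json


def build_export_document(records, fmt="json"):
    """Serialize records as JSON or CSV and wrap them as one document entry."""
    if fmt == "json":
        content = json.dumps(records, indent=2)
    elif fmt == "csv":
        buffer = io.StringIO()
        # Assumes every record shares the keys of the first record.
        fieldnames = list(records[0].keys()) if records else []
        writer = csv.DictWriter(buffer, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
        content = buffer.getvalue()
    else:
        raise ValueError(f"Unsupported format: {fmt!r}")
    return {
        "filename": f"data_export.{fmt}",
        "type": "data_export",
        "content": content,
    }


export_records = [{"type": "email", "value": "jane@example.com"}]
csv_doc = build_export_document(export_records, "csv")
json_doc = build_export_document(export_records)
```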
save_response_package(response, output_dir)
Save all response documents to disk.
Returns: list[str] of saved file paths.
DSARWorkflowEngine
Manages the complete DSAR lifecycle: intake, tracking, deadlines, and compliance.
Constructor
DSARWorkflowEngine(config_path=None)
Methods
register_dsar(requester_name, requester_email, request_channel, request_text, identity_docs=None)
Register a new DSAR and start the 30-day compliance clock (GDPR Art. 12(3) requires a response within one month of receipt).
Returns: dict with dsar_id, deadline, status, identity_verified.
update_status(dsar_id, new_status, notes="")
Update DSAR processing status.
Valid Statuses: received, identity_verification, verification_failed, in_progress, pii_discovery, exemption_review, dpo_review, response_generation, response_sent, closed, refused.
apply_extension(dsar_id, reason)
Apply a 2-month extension for complex requests per Art. 12(3).
pause_clock(dsar_id, reason)
Pause the response clock (e.g., awaiting identity verification).
days_remaining(dsar_id)
Calculate remaining days until DSAR deadline. Returns: int.
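The deadline arithmetic behind register_dsar, apply_extension, and days_remaining can be sketched with the standard library. This assumes the extension is added to the existing deadline as two calendar months and that remaining days are counted from a supplied date; clock pauses are not modelled, and the engine's own rounding rules may differ.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Add calendar months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def initial_deadline(received):
    """30-day clock starting on the date the request was received."""
    return received + timedelta(days=30)


def extended_deadline(deadline):
    """Two further months for complex requests (Art. 12(3))."""
    return add_months(deadline, 2)


def days_remaining_sketch(deadline, today):
    """Negative values mean the DSAR is overdue."""
    return (deadline - today).days


received = date(2024, 1, 15)
deadline = initial_deadline(received)   # 2024-02-14
extended = extended_deadline(deadline)  # 2024-04-14
```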
get_overdue_dsars()
Get all DSARs past their deadline. Returns: list[dict].
generate_dashboard()
Generate a DSAR processing dashboard summary. Returns: dict with status breakdown and overdue info.
DSARAuditLogger
Maintains JSONL audit trails for DSAR processing lifecycle.
Constructor
DSARAuditLogger(log_path="dsar_audit_logs")
Methods
log_event(dsar_id, event_type, details=None)
Log a DSAR processing event to the JSONL audit file.
get_audit_trail(dsar_id)
Retrieve the complete audit trail. Returns: list[dict].
generate_compliance_report(dsar_id)
Generate a compliance report with pass/fail checks for all processing steps.
Returns: dict with compliance_checks, timeline, overall_compliance (COMPLIANT or REVIEW_REQUIRED).
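A JSONL audit trail is an append-only file with one JSON object per line. The sketch below shows that pattern; the entry fields (`timestamp`, `dsar_id`, `event_type`, `details`) and the function names are assumptions, not the logger's documented schema.

```python
import json
import os
from datetime import datetime, timezone


def log_event_sketch(log_file, dsar_id, event_type, details=None):
    """Append one event as a single JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dsar_id": dsar_id,
        "event_type": event_type,
        "details": details or {},
    }
    os.makedirs(os.path.dirname(log_file) or ".", exist_ok=True)
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def get_audit_trail_sketch(log_file, dsar_id):
    """Read the file back and keep only entries for the given DSAR, in write order."""
    if not os.path.exists(log_file):
        return []
    trail = []
    with open(log_file, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if entry["dsar_id"] == dsar_id:
                trail.append(entry)
    return trail
```

Because each line is an independent JSON document, entries can be appended without rewriting the file, and a partially written final line affects only that entry.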
CLI Usage
# Full automated pipeline
python agent.py --action full_pipeline \
--requester-name "Jane Smith" \
--requester-email "jane.smith@example.com" \
--scan-dirs /var/log/app /data/exports \
--db-connection "postgresql://user:pass@localhost/appdb" \
--output-dir dsar_output \
--format json
# Scan text for PII
python agent.py --action scan_pii \
--scan-text "Contact jane@example.com or call +44 20 7946 0958"
# Scan files only
python agent.py --action scan_files \
--scan-dirs /data/exports /var/log \
--requester-email "jane@example.com"
# Generate dashboard
python agent.py --action dashboard
CLI Arguments
| Argument | Default | Description |
|---|---|---|
| `--action` | `full_pipeline` | Action to perform |
| `--requester-name` | `Test Subject` | Data subject name |
| `--requester-email` | `test@example.com` | Data subject email |
| `--request-channel` | `email` | Request channel |
| `--scan-dirs` | `[]` | Directories to scan |
| `--db-connection` | `""` | Database connection string |
| `--output-dir` | `dsar_output` | Output directory |
| `--config` | `dsar_config.json` | Configuration file path |
| `--format` | `json` | Output format (json or csv) |
| `--min-confidence` | `0.5` | Minimum PII confidence threshold |
| `--scan-text` | `""` | Direct text to scan for PII |
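For readers wiring these options into their own tooling, the argument table maps naturally onto `argparse`. The parser below is a sketch built only from the documented flags and defaults; how agent.py actually declares them (types, `nargs`, `choices`) is an assumption.

```python
import argparse


def build_parser():
    """Parser mirroring the documented CLI flags and their defaults."""
    parser = argparse.ArgumentParser(description="DSAR workflow automation (sketch)")
    parser.add_argument("--action", default="full_pipeline")
    parser.add_argument("--requester-name", default="Test Subject")
    parser.add_argument("--requester-email", default="test@example.com")
    parser.add_argument("--request-channel", default="email")
    # Assumption: one or more space-separated directories, as in the usage examples.
    parser.add_argument("--scan-dirs", nargs="*", default=[])
    parser.add_argument("--db-connection", default="")
    parser.add_argument("--output-dir", default="dsar_output")
    parser.add_argument("--config", default="dsar_config.json")
    parser.add_argument("--format", default="json", choices=["json", "csv"])
    parser.add_argument("--min-confidence", type=float, default=0.5)
    parser.add_argument("--scan-text", default="")
    return parser


defaults = build_parser().parse_args([])
custom = build_parser().parse_args(
    ["--action", "scan_files", "--scan-dirs", "/data/exports", "/var/log",
     "--format", "csv", "--min-confidence", "0.8"]
)
```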