API Reference¶

GDPR Pseudonymizer - Module documentation for developers

Overview¶

The gdpr_pseudonymizer package is organized into the following subpackages:

Package	Purpose
`core`	Document processing orchestration
`nlp`	Named Entity Recognition (NER) pipeline
`pseudonym`	Pseudonym assignment and library management
`data`	Database, models, encryption, and repositories
`validation`	Interactive human-in-the-loop validation workflow
`cli`	Command-line interface (Typer-based)
`utils`	File handling, configuration, and logging

Core Module (`gdpr_pseudonymizer.core`)¶

DocumentProcessor¶

Orchestrates the single-document pseudonymization workflow: entity detection, validation, pseudonym assignment, and file output.

from gdpr_pseudonymizer.core.document_processor import DocumentProcessor

processor = DocumentProcessor(
    db_path="mappings.db",
    passphrase="your-secure-passphrase",
    theme="neutral",        # neutral | star_wars | lotr
    model_name="spacy"
)

result = processor.process_document(
    input_path="input.txt",
    output_path="output.txt",
    skip_validation=False,    # Set True for programmatic use (no UI)
    entity_type_filter=None   # Optional: set of types e.g. {"PERSON", "LOCATION"}
)

ProcessingResult¶

Dataclass returned by DocumentProcessor.process_document():

Attribute	Type	Description
`success`	`bool`	Whether processing completed successfully
`input_file`	`str`	Input file path
`output_file`	`str`	Output file path
`entities_detected`	`int`	Total entities detected
`entities_new`	`int`	Newly assigned pseudonyms
`entities_reused`	`int`	Reused pseudonyms (idempotency)
`processing_time_seconds`	`float`	Total processing time
`error_message`	`str | None`	Error description if failed
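When results from many documents are aggregated, a stand-in with the same fields can be convenient in tests. The sketch below is not the package's class: `ProcessingResultSketch` and `summarize` are hypothetical names, and the field types are copied from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingResultSketch:
    """Hypothetical stand-in mirroring the ProcessingResult fields listed above."""
    success: bool
    input_file: str
    output_file: str
    entities_detected: int
    entities_new: int
    entities_reused: int
    processing_time_seconds: float
    error_message: Optional[str] = None


def summarize(results: list[ProcessingResultSketch]) -> dict[str, float]:
    """Aggregate counts over successful results; failed ones are only counted."""
    ok = [r for r in results if r.success]
    return {
        "succeeded": len(ok),
        "failed": len(results) - len(ok),
        "entities_detected": sum(r.entities_detected for r in ok),
        "entities_reused": sum(r.entities_reused for r in ok),
        "total_seconds": sum(r.processing_time_seconds for r in ok),
    }
```

Keeping failed results out of the totals avoids mixing partial counts into a batch report.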

NLP Module (`gdpr_pseudonymizer.nlp`)¶

DetectedEntity¶

Dataclass representing a detected named entity:

Attribute	Type	Description
`text`	`str`	Original entity text
`entity_type`	`str`	Classification: `PERSON`, `LOCATION`, `ORG`
`start_pos`	`int`	Character offset start position
`end_pos`	`int`	Character offset end position
`confidence`	`float | None`	NER confidence score (0.0-1.0)
`gender`	`str | None`	Gender: `male`, `female`, `neutral`, `unknown`
`is_ambiguous`	`bool`	Whether flagged as ambiguous
`source`	`str`	Detection source: `spacy`, `regex`, or `hybrid`

EntityDetector (Abstract Base Class)¶

Interface for NER implementations. All detectors must implement:

from abc import ABC, abstractmethod

class EntityDetector(ABC):
    @abstractmethod
    def load_model(self, model_name: str) -> None: ...

    @abstractmethod
    def detect_entities(self, text: str) -> list[DetectedEntity]: ...

    @abstractmethod
    def get_model_info(self) -> dict[str, str]: ...

    @property
    @abstractmethod
    def supports_gender_classification(self) -> bool: ...

HybridDetector¶

Default detector combining spaCy NER with regex pattern matching. This is the detector used by DocumentProcessor.

from gdpr_pseudonymizer.nlp.hybrid_detector import HybridDetector

detector = HybridDetector()
detector.load_model("fr_core_news_lg")
entities = detector.detect_entities("Marie Dubois travaille à Paris.")

SpaCyDetector¶

Pure spaCy NER implementation:

from gdpr_pseudonymizer.nlp.spacy_detector import SpaCyDetector

detector = SpaCyDetector()
detector.load_model("fr_core_news_lg")
entities = detector.detect_entities(text)

RegexMatcher¶

Pattern-based entity matcher using French title patterns, compound names, name dictionaries, and organization suffixes:

from gdpr_pseudonymizer.nlp.regex_matcher import RegexMatcher

matcher = RegexMatcher()
matcher.load_patterns()  # Loads from config/detection_patterns.yaml
entities = matcher.match_entities(text)
stats = matcher.get_pattern_stats()

NameDictionary¶

French name dictionary for pattern-based detection:

from gdpr_pseudonymizer.nlp.name_dictionary import NameDictionary

names = NameDictionary()
names.load()
names.is_first_name("Marie")  # True
names.is_last_name("Dubois")  # True

EntityGrouping (`entity_grouping` module)¶

Groups variant forms of the same real-world entity into single validation items, reducing user validation fatigue. For example, "Marie Dubois", "Pr. Dubois", and "Dubois" are grouped as one item.

from gdpr_pseudonymizer.nlp.entity_grouping import group_entity_variants

groups = group_entity_variants(detected_entities)
for canonical, occurrences, variant_texts in groups:
    print(f"{canonical.text} (appears as: {variant_texts})")

Return type: list[tuple[DetectedEntity, list[DetectedEntity], set[str]]]

Each tuple contains:

Element	Type	Description
`canonical`	`DetectedEntity`	Representative entity (longest text form)
`occurrences`	`list[DetectedEntity]`	All entity instances in the group
`variant_texts`	`set[str]`	Unique text forms in the group

Grouping rules by entity type:

Type	Rule
`PERSON`	Title stripping + surname matching. "Marie Dubois" and "Dubois" group together. Different first names stay separate ("Marie Dubois" vs "Jean Dubois"). Ambiguous single-word surnames matching multiple people are isolated.
`LOCATION`	French preposition stripping. "à Lyon" and "Lyon" group together.
`ORG`	Case-insensitive matching. "ACME Corp" and "acme corp" group together.
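The LOCATION and ORG rules both come down to normalizing each text into a grouping key. The sketch below illustrates that idea only: `normalize_key`, `group_variants`, and the preposition list are assumptions, not the package's implementation, and PERSON handling (titles, surnames, ambiguity) is deliberately left out.

```python
import re
from collections import defaultdict

# Assumed, non-exhaustive set of French prepositions that may precede a place name.
_PREPOSITION = re.compile(r"^(?:à|a|au|aux|en|de|du|des)\s+|^d'", re.IGNORECASE)


def normalize_key(text: str, entity_type: str) -> str:
    """Return the key under which variants of one entity should be grouped."""
    key = text.strip()
    if entity_type == "LOCATION":
        return _PREPOSITION.sub("", key, count=1)
    if entity_type == "ORG":
        return key.casefold()
    return key


def group_variants(items: list[tuple[str, str]]) -> dict[tuple[str, str], set[str]]:
    """Group (text, entity_type) pairs by (entity_type, normalized key)."""
    groups: dict[tuple[str, str], set[str]] = defaultdict(set)
    for text, entity_type in items:
        groups[(entity_type, normalize_key(text, entity_type))].add(text)
    return dict(groups)
```

Including the entity type in the key keeps a location and an organization with the same text from merging.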

Pseudonym Module (`gdpr_pseudonymizer.pseudonym`)¶

LibraryBasedPseudonymManager¶

Loads pseudonym libraries from JSON files and assigns pseudonyms with gender matching:

from gdpr_pseudonymizer.pseudonym.library_manager import LibraryBasedPseudonymManager

manager = LibraryBasedPseudonymManager()
manager.load_library("star_wars")

assignment = manager.assign_pseudonym(
    entity_type="PERSON",
    first_name="Marie",
    last_name="Dubois",
    gender="female"
)
print(assignment.pseudonym_full)  # e.g., "Leia Organa"

PseudonymAssignment¶

Dataclass returned by pseudonym assignment:

Attribute	Type	Description
`pseudonym_full`	`str`	Complete pseudonym string
`pseudonym_first`	`str | None`	First name component (PERSON only)
`pseudonym_last`	`str | None`	Last name component (PERSON only)
`theme`	`str`	Library theme used
`exhaustion_percentage`	`float`	Fraction of the library already used (0.0-1.0)
`is_ambiguous`	`bool`	Ambiguity flag
`ambiguity_reason`	`str | None`	Reason for ambiguity

GenderDetector¶

Auto-detects French first name gender from a bundled 945-name INSEE-sourced dictionary. Used by CompositionalPseudonymEngine to assign gender-matched pseudonyms automatically.

from gdpr_pseudonymizer.pseudonym.gender_detector import GenderDetector

detector = GenderDetector()
detector.load()

detector.detect_gender("Marie")          # "female"
detector.detect_gender("Jean")           # "male"
detector.detect_gender("Camille")        # None (ambiguous)
detector.detect_gender("Xyzabc")         # None (unknown)

# Full name detection (extracts first name, checks entity type)
detector.detect_gender_from_full_name("Marie Dupont", "PERSON")    # "female"
detector.detect_gender_from_full_name("Jean-Pierre Martin", "PERSON")  # "male" (compound: uses first component)
detector.detect_gender_from_full_name("Paris", "LOCATION")         # None (non-PERSON)

Key methods:

Method	Description
`load()`	Load the gender lookup dictionary from JSON. If not called explicitly, the dictionary is lazy-loaded on the first detect call
`detect_gender(first_name)`	Detect gender from a single first name. Returns `"male"`, `"female"`, or `None`
`detect_gender_from_full_name(full_name, entity_type)`	Extract first name from full name and detect gender. Non-PERSON entities always return `None`

Dictionary stats: 470 male, 457 female, 18 ambiguous names. Case-insensitive matching.
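The lookup behaviour described above (case-insensitive match, `None` for ambiguous or unknown names, first component of a compound name) can be sketched with a plain dictionary. The three-entry table and the function bodies here are illustrative assumptions; the bundled dictionary and its file format are not reproduced.

```python
from typing import Optional

# Tiny illustrative table; None marks a name recorded as ambiguous.
_GENDERS: dict[str, Optional[str]] = {
    "marie": "female",
    "jean": "male",
    "camille": None,
}


def detect_gender(first_name: str) -> Optional[str]:
    """Case-insensitive lookup; unknown and ambiguous names both give None."""
    return _GENDERS.get(first_name.strip().casefold())


def detect_gender_from_full_name(full_name: str, entity_type: str) -> Optional[str]:
    """Use the first token, and the first part of a hyphenated compound."""
    if entity_type != "PERSON":
        return None
    parts = full_name.split()
    if not parts:
        return None
    first_component = parts[0].split("-")[0]
    return detect_gender(first_component)
```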

CompositionalPseudonymEngine¶

Handles compositional logic: "Marie Dubois" maps to "Leia Organa", and "Marie" alone maps to "Leia" for consistency. Optionally integrates GenderDetector for automatic gender-matched pseudonym assignment.

from gdpr_pseudonymizer.pseudonym.assignment_engine import CompositionalPseudonymEngine

engine = CompositionalPseudonymEngine(
    pseudonym_manager=manager,
    mapping_repository=repo,
    gender_detector=detector  # Optional: enables auto gender detection
)

result = engine.assign_compositional_pseudonym(
    entity_text="Marie Dubois",
    entity_type="PERSON",
    gender="female"
)

Key methods:

Method	Description
`assign_compositional_pseudonym(entity_text, entity_type, gender)`	Assign pseudonym with component reuse
`strip_titles(text)`	Remove French honorifics (Dr., Mme, Maître)
`strip_prepositions(text)`	Remove French prepositions from locations
`parse_full_name(entity_text)`	Split into (first, last, is_compound)
`find_standalone_components(component, component_type)`	Look up existing component mapping
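Component reuse is the central idea: once "Marie" has been mapped, every later name containing "Marie" receives the same first-name pseudonym. The sketch below shows that bookkeeping with in-memory dictionaries and fixed candidate pools. `ComponentMapper` and its pools are hypothetical, and title stripping, compound names, gender matching, and persistence are omitted.

```python
class ComponentMapper:
    """Hypothetical illustration of per-component pseudonym reuse."""

    def __init__(self, first_pool: list[str], last_pool: list[str]) -> None:
        self._pools = {"first": list(first_pool), "last": list(last_pool)}
        self._assigned: dict[tuple[str, str], str] = {}

    def _map(self, component: str, kind: str) -> str:
        key = (kind, component.casefold())
        if key not in self._assigned:
            # Take the next unused pseudonym; raises IndexError when the pool is exhausted.
            self._assigned[key] = self._pools[kind].pop(0)
        return self._assigned[key]

    def assign(self, entity_text: str) -> str:
        parts = entity_text.split()
        if len(parts) == 1:
            # A lone token is treated as a surname only if it was already seen as one.
            known_last = ("last", parts[0].casefold()) in self._assigned
            return self._map(parts[0], "last" if known_last else "first")
        first, last = parts[0], " ".join(parts[1:])
        return f"{self._map(first, 'first')} {self._map(last, 'last')}"
```

With pools `["Leia", "Luke"]` and `["Organa", "Skywalker"]`, assigning "Marie Dubois" and then "Marie" yields "Leia Organa" and "Leia".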

Data Module (`gdpr_pseudonymizer.data`)¶

Database Functions¶

from gdpr_pseudonymizer.data.database import init_database, open_database

# Initialize a new encrypted database
init_database("mappings.db", "your-passphrase")

# Open existing database (context manager)
with open_database("mappings.db", "your-passphrase") as db_session:
    # Use repositories with db_session
    pass

SQLAlchemy Models¶

Entity (`entities` table)¶

Column	Type	Description
`id`	`str` (UUID)	Primary key
`entity_type`	`str`	PERSON, LOCATION, ORG
`first_name`	`str | None`	Original first name (encrypted)
`last_name`	`str | None`	Original last name (encrypted)
`full_name`	`str`	Original entity text (encrypted, unique)
`pseudonym_first`	`str | None`	Assigned first name pseudonym
`pseudonym_last`	`str | None`	Assigned last name pseudonym
`pseudonym_full`	`str`	Complete pseudonym
`gender`	`str | None`	Gender classification
`confidence_score`	`float | None`	NER confidence
`theme`	`str`	Library theme used
`first_seen_timestamp`	`datetime`	First detection time

Operation (`operations` table)¶

Column	Type	Description
`id`	`str` (UUID)	Primary key
`timestamp`	`datetime`	Operation timestamp
`operation_type`	`str`	PROCESS, BATCH, VALIDATE, etc.
`files_processed`	`list[str]`	JSON array of file paths
`entity_count`	`int`	Entities processed
`processing_time_seconds`	`float`	Duration
`success`	`bool`	Outcome

Repositories¶

MappingRepository (Abstract)¶

class MappingRepository(ABC):
    def find_by_full_name(self, full_name: str) -> Entity | None: ...
    def find_by_component(self, component: str, component_type: str) -> list[Entity]: ...
    def save(self, entity: Entity) -> Entity: ...
    def save_batch(self, entities: list[Entity]) -> list[Entity]: ...
    def find_all(self, entity_type=None, is_ambiguous=None) -> list[Entity]: ...
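For unit tests it can help to have a repository that keeps everything in memory. The sketch below follows the method names above but is an assumption throughout: `EntitySketch` is a cut-down stand-in for `Entity`, `component_type` is assumed to be `"first_name"` or `"last_name"`, the `is_ambiguous` filter is left out, and no encryption or database is involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntitySketch:
    """Cut-down, hypothetical stand-in for the Entity model."""
    full_name: str
    entity_type: str
    pseudonym_full: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None


class InMemoryMappingRepository:
    def __init__(self) -> None:
        self._by_full_name: dict[str, EntitySketch] = {}

    def find_by_full_name(self, full_name: str) -> Optional[EntitySketch]:
        return self._by_full_name.get(full_name)

    def find_by_component(self, component: str, component_type: str) -> list[EntitySketch]:
        # component_type names the attribute to compare ("first_name" or "last_name").
        return [e for e in self._by_full_name.values()
                if getattr(e, component_type, None) == component]

    def save(self, entity: EntitySketch) -> EntitySketch:
        # full_name is unique, so saving again overwrites the earlier record.
        self._by_full_name[entity.full_name] = entity
        return entity

    def save_batch(self, entities: list[EntitySketch]) -> list[EntitySketch]:
        return [self.save(e) for e in entities]

    def find_all(self, entity_type: Optional[str] = None) -> list[EntitySketch]:
        return [e for e in self._by_full_name.values()
                if entity_type is None or e.entity_type == entity_type]
```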

AuditRepository¶

from gdpr_pseudonymizer.data.repositories.audit_repository import AuditRepository

repo = AuditRepository(session)
repo.log_operation(operation)
operations = repo.find_operations(operation_type="PROCESS", success=True)
repo.export_to_csv("audit_log.csv")
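The layout of the exported CSV is not documented here. As an illustration only, the sketch below writes records with the `operations` columns listed earlier using the standard `csv` module. The column order, the JSON encoding of `files_processed`, and the function name are assumptions.

```python
import csv
import io
import json

# Column names taken from the operations table; the order is an assumption.
COLUMNS = [
    "id", "timestamp", "operation_type", "files_processed",
    "entity_count", "processing_time_seconds", "success",
]


def operations_to_csv(operations: list[dict]) -> str:
    """Render operation records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for op in operations:
        row = dict(op)
        # Store the list of paths as a JSON string so it fits in one cell.
        row["files_processed"] = json.dumps(row["files_processed"])
        writer.writerow(row)
    return buffer.getvalue()
```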

EncryptionService¶

AES-256-SIV deterministic authenticated encryption:

from gdpr_pseudonymizer.data.encryption import EncryptionService

salt = EncryptionService.generate_salt()
valid, msg = EncryptionService.validate_passphrase("my-passphrase")

service = EncryptionService(passphrase="...", salt=salt)
ciphertext = service.encrypt("Marie Dubois")
plaintext = service.decrypt(ciphertext)
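How the key is derived from the passphrase and salt is not stated in this reference. The sketch below shows one common approach, PBKDF2-HMAC-SHA256 from the standard library, producing a 64-byte key (AES-SIV as specified in RFC 5297 splits its key into two equal halves). The KDF choice, iteration count, salt length, key length, and function names are all assumptions, and the cipher step itself is left out.

```python
import hashlib
import secrets


def generate_salt(length: int = 32) -> bytes:
    """Return a fresh random salt (length is an assumed default)."""
    return secrets.token_bytes(length)


def derive_key(passphrase: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 64-byte key from a passphrase; same inputs give the same key."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=64
    )
```

Because derivation is deterministic for a given passphrase and salt, the salt has to be kept with the database so the same key can be rebuilt later.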

Validation Module (`gdpr_pseudonymizer.validation`)¶

Validation Workflow¶

from gdpr_pseudonymizer.validation.workflow import run_validation_workflow

validated_entities = run_validation_workflow(
    entities=detected_entities,
    document_text=text,
    document_path="input.txt",
    pseudonym_assigner=my_assigner_fn  # Optional callback
)

Review States¶

State	Description
`PENDING`	Awaiting user review
`CONFIRMED`	Confirmed as correct entity
`REJECTED`	Rejected as false positive
`MODIFIED`	Entity text modified by user
`ADDED`	Manually added by user

Validation actions:

Action Class	Description
`ConfirmAction(entity)`	Accept entity and pseudonym
`RejectAction(entity)`	Mark as false positive
`ModifyAction(entity, new_text)`	Change entity text
`AddAction(text, entity_type, start_pos, end_pos)`	Add missed entity
`ChangePseudonymAction(entity, new_pseudonym)`	Override pseudonym

Utils Module (`gdpr_pseudonymizer.utils`)¶

File Handling¶

from gdpr_pseudonymizer.utils.file_handler import read_file, write_file, validate_file_path

text = read_file("input.txt")
write_file("output.txt", pseudonymized_text)
validate_file_path("doc.txt", allowed_extensions=[".txt", ".md"])
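The checks performed by `validate_file_path` are not spelled out here. A plausible minimal version of the extension check, using only `pathlib`, might look like the sketch below. The function name `check_path`, the returned `Path`, and the choice of `ValueError` are assumptions.

```python
from pathlib import Path


def check_path(path: str, allowed_extensions: list[str]) -> Path:
    """Return the path if its extension is allowed, else raise ValueError."""
    p = Path(path)
    allowed = {ext.lower() for ext in allowed_extensions}
    if p.suffix.lower() not in allowed:
        raise ValueError(
            f"unsupported extension {p.suffix!r}; expected one of {sorted(allowed)}"
        )
    return p
```

Comparing lower-cased suffixes means `doc.TXT` is accepted when `.txt` is allowed.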

Configuration¶

from gdpr_pseudonymizer.utils.config_manager import load_config, Config

config = load_config("config.yaml")  # Or None for defaults
# config.theme, config.model_name, config.db_path, etc.

Logging¶

from gdpr_pseudonymizer.utils.logger import configure_logging, get_logger

configure_logging("INFO")
logger = get_logger("my_module")
logger.info("entity_detected", entity_type="PERSON", confidence=0.92)

Exceptions¶

All exceptions inherit from PseudonymizerError:

Exception	When Raised
`ConfigurationError`	Invalid or missing configuration
`ModelNotFoundError`	NLP model cannot be loaded
`EncryptionError`	Encryption or decryption fails
`ValidationError`	Validation workflow error
`FileProcessingError`	File I/O operation fails

from gdpr_pseudonymizer.exceptions import (
    PseudonymizerError,
    ConfigurationError,
    ModelNotFoundError,
    EncryptionError,
    ValidationError,
    FileProcessingError,
)
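Because every exception shares one base class, callers can handle specific failures first and fall back to the base class. The sketch below redefines a two-class hierarchy locally so it runs without the package installed; `load_model_or_fail` is a hypothetical function used only to trigger the error.

```python
class PseudonymizerError(Exception):
    """Local stand-in for the package's base exception."""


class ModelNotFoundError(PseudonymizerError):
    """Local stand-in: raised when a model cannot be loaded."""


def load_model_or_fail(name: str) -> None:
    # Hypothetical loader that always fails, to exercise the handlers below.
    raise ModelNotFoundError(f"model {name!r} is not installed")


def describe_failure(name: str) -> str:
    try:
        load_model_or_fail(name)
    except ModelNotFoundError as exc:
        # Most specific handler first.
        return f"model problem: {exc}"
    except PseudonymizerError as exc:
        # Catch-all for any other package error.
        return f"other pseudonymizer problem: {exc}"
    return "ok"
```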

Extension Points¶

Custom Pseudonym Libraries¶

Create a JSON file in data/pseudonyms/ following this schema:

{
  "theme": "my_theme",
  "data_sources": [
    {
      "source_name": "Description",
      "url": "https://...",
      "license": "License type",
      "usage_justification": "Why this source",
      "extraction_date": "2026-01-01",
      "extraction_method": "How data was collected"
    }
  ],
  "first_names": {
    "male": ["Name1", "Name2"],
    "female": ["Name3", "Name4"],
    "neutral": ["Name5", "Name6"]
  },
  "last_names": ["LastName1", "LastName2"],
  "locations": {
    "cities": ["City1", "City2"],
    "regions": ["Region1", "Region2"]
  },
  "organizations": {
    "companies": ["Company1", "Company2"],
    "agencies": ["Agency1"],
    "institutions": ["Institute1"]
  }
}

Minimum requirements: 500+ first names, 500+ last names, 80+ locations, 35+ organizations.
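Before loading a custom library it can be useful to check it against these minimums. The sketch below counts entries in an already-parsed JSON object. The function names are hypothetical, and summing first names across the gender lists (and locations and organizations across their sub-lists) is an assumption about how the thresholds are counted.

```python
# Thresholds copied from the minimum requirements above.
MINIMUMS = {"first_names": 500, "last_names": 500, "locations": 80, "organizations": 35}


def count_entries(library: dict) -> dict[str, int]:
    """Count names per section, summing sub-lists where a section is a dict."""
    def total(section) -> int:
        if isinstance(section, dict):
            return sum(len(values) for values in section.values())
        return len(section)

    return {key: total(library.get(key, [])) for key in MINIMUMS}


def check_library(library: dict) -> list[str]:
    """Return one message per section that falls below its minimum."""
    counts = count_entries(library)
    return [
        f"{key}: {counts[key]} found, {minimum} required"
        for key, minimum in MINIMUMS.items()
        if counts[key] < minimum
    ]
```

An empty list from `check_library` means all four thresholds are met under these counting assumptions.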

Usage:

manager = LibraryBasedPseudonymManager()
manager.load_library("my_theme")

Or via CLI:

poetry run gdpr-pseudo process doc.txt --theme my_theme

Custom NLP Detector¶

Extend the EntityDetector abstract base class:

from gdpr_pseudonymizer.nlp.entity_detector import EntityDetector, DetectedEntity

class MyDetector(EntityDetector):
    def load_model(self, model_name: str) -> None:
        # Load your model
        pass

    def detect_entities(self, text: str) -> list[DetectedEntity]:
        # Return detected entities (an empty list when nothing is found)
        return []

    def get_model_info(self) -> dict[str, str]:
        return {"name": "my_model", "version": "1.0"}

    @property
    def supports_gender_classification(self) -> bool:
        return False
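To make the skeleton concrete, here is a toy detector that flags a capitalized word following a French title. It defines local stand-ins for `EntityDetector` and `DetectedEntity` so it runs without the package installed. The stand-ins carry only a subset of the documented fields and methods, and the title list and regex are illustrative assumptions.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DetectedEntity:
    """Local stand-in with a subset of the documented fields."""
    text: str
    entity_type: str
    start_pos: int
    end_pos: int
    source: str = "regex"


class EntityDetector(ABC):
    """Local stand-in exposing only the method used in this sketch."""

    @abstractmethod
    def detect_entities(self, text: str) -> list[DetectedEntity]: ...


class TitleDetector(EntityDetector):
    # Assumed title list: M., Mme, Dr, Pr, followed by one capitalized word.
    _PATTERN = re.compile(r"\b(?:M\.|Mme|Dr\.?|Pr\.?)\s+([A-ZÀ-Ý][\w-]+)")

    def detect_entities(self, text: str) -> list[DetectedEntity]:
        # Offsets refer to the name itself, not the title in front of it.
        return [
            DetectedEntity(m.group(1), "PERSON", m.start(1), m.end(1))
            for m in self._PATTERN.finditer(text)
        ]
```

Reporting offsets for the name alone keeps the title in the output text when the name is replaced.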

Programmatic Usage Example¶

Complete example of pseudonymizing a document programmatically:

from gdpr_pseudonymizer.core.document_processor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor(
    db_path="project.db",
    passphrase="MySecurePassphrase123",
    theme="neutral",
    model_name="spacy"
)

# Process document (skip_validation=True for non-interactive use)
result = processor.process_document(
    input_path="interview.txt",
    output_path="interview_pseudonymized.txt",
    skip_validation=True,
    entity_type_filter={"PERSON", "LOCATION"}  # Optional: only process these types
)

if result.success:
    print(f"Processed {result.entities_detected} entities")
    print(f"New: {result.entities_new}, Reused: {result.entities_reused}")
    print(f"Time: {result.processing_time_seconds:.2f}s")
else:
    print(f"Error: {result.error_message}")

See also:

CLI Reference - Command-line interface documentation
Methodology - Technical approach and GDPR compliance
Installation Guide - Setup instructions
Tutorial - Usage tutorials