Usage Tutorial¶

GDPR Pseudonymizer - Step-by-step tutorials for common workflows.

Tutorial 1: Single Document Pseudonymization (5 minutes)¶

Learn the basics by pseudonymizing a single French document.

Step 1: Create a Test Document¶

Create a file with French text containing personal information:

echo "Marie Dubois travaille a Paris pour Acme SA. Elle collabore avec Jean Martin de Lyon." > interview.txt

Entities in this text:

- PERSON: Marie Dubois, Jean Martin
- LOCATION: Paris, Lyon
- ORGANIZATION: Acme SA

Step 2: Set Your Passphrase¶

The tool encrypts all entity mappings. Set a passphrase (minimum 12 characters):

Windows (PowerShell):

$env:GDPR_PSEUDO_PASSPHRASE = "MySecurePassphrase123"

macOS/Linux:

export GDPR_PSEUDO_PASSPHRASE="MySecurePassphrase123"

Important: Store this passphrase securely. Without it, you cannot reverse pseudonymization.
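
Before starting a long run, it can help to confirm that the variable is actually set and meets the 12-character minimum. The following is a minimal sketch, not part of the tool: the helper name `passphrase_problem` is hypothetical, and only the variable name and the minimum length are taken from this tutorial.

```python
import os

MIN_PASSPHRASE_LENGTH = 12  # minimum length stated in this tutorial


def passphrase_problem(env=None):
    """Return a description of the problem, or None if the passphrase looks usable."""
    env = os.environ if env is None else env
    passphrase = env.get("GDPR_PSEUDO_PASSPHRASE")
    if not passphrase:
        return "GDPR_PSEUDO_PASSPHRASE is not set"
    if len(passphrase) < MIN_PASSPHRASE_LENGTH:
        return (
            f"passphrase has {len(passphrase)} characters, "
            f"need at least {MIN_PASSPHRASE_LENGTH}"
        )
    return None


if __name__ == "__main__":
    # Checks the real environment when run as a script.
    print(passphrase_problem() or "Passphrase is set and long enough")
```

The check only covers presence and length; it says nothing about how the tool itself validates or uses the passphrase.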

Step 3: Run Pseudonymization¶

poetry run gdpr-pseudo process interview.txt

You'll enter the interactive validation workflow; see Validation UI Walkthrough below.

Step 4: Review Output¶

After validation, check the pseudonymized file:

cat interview_pseudonymized.txt

Example output (with neutral theme):

Sophie Martin travaille a Marseille pour TechnoPlus SARL. Elle collabore avec Pierre Laurent de Bordeaux.

Step 5: View Mapping Database¶

See the entity mappings stored in the encrypted database:

poetry run gdpr-pseudo list-mappings

Tutorial 2: Batch Processing Multiple Documents¶

Process multiple documents with consistent pseudonyms across all files.

Step 1: Create Test Documents¶

mkdir documents
echo "Marie Dubois est directrice chez Acme SA." > documents/doc1.txt
echo "Jean Martin collabore avec Marie Dubois a Paris." > documents/doc2.txt
echo "Acme SA organise une reunion a Lyon avec Marie Dubois." > documents/doc3.txt

Step 2: Initialize Database¶

poetry run gdpr-pseudo init --db project.db

Enter a passphrase when prompted (or use environment variable).

Step 3: Process All Documents¶

poetry run gdpr-pseudo batch documents/ --db project.db -o output/

What happens:

- Each document enters validation workflow
- Same entities receive same pseudonyms across all documents
- Progress bar shows completion status
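
The reason a shared database gives consistent pseudonyms can be illustrated with a toy model. This is a sketch under stated assumptions, not the tool's implementation: `SharedMapping` and `pseudonymize` are hypothetical names, the pool of replacement names is made up, and a plain dictionary stands in for the encrypted database.

```python
class SharedMapping:
    """Toy stand-in for the mapping database: one pseudonym per original entity."""

    def __init__(self, pool):
        self._pool = iter(pool)   # hypothetical pool of replacement names
        self._assigned = {}       # original -> pseudonym

    def pseudonym_for(self, original):
        # Reuse the stored pseudonym if this entity was seen before.
        if original not in self._assigned:
            self._assigned[original] = next(self._pool)
        return self._assigned[original]


def pseudonymize(text, entities, mapping):
    """Replace each listed entity with its pseudonym from the shared mapping."""
    for entity in entities:
        text = text.replace(entity, mapping.pseudonym_for(entity))
    return text


mapping = SharedMapping(["Sophie Martin", "Pierre Laurent"])
doc1 = pseudonymize(
    "Marie Dubois est directrice chez Acme SA.", ["Marie Dubois"], mapping
)
doc2 = pseudonymize(
    "Jean Martin collabore avec Marie Dubois a Paris.",
    ["Jean Martin", "Marie Dubois"],
    mapping,
)
print(doc1)  # Sophie Martin est directrice chez Acme SA.
print(doc2)  # Pierre Laurent collabore avec Sophie Martin a Paris.
```

Because both documents go through the same mapping object, "Marie Dubois" resolves to the same pseudonym each time; a fresh mapping per document would lose that guarantee.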

Step 4: Verify Consistency¶

cat output/doc1_pseudonymized.txt
cat output/doc2_pseudonymized.txt
cat output/doc3_pseudonymized.txt

"Marie Dubois" should have the same pseudonym in all three documents.

Step 5: View Statistics¶

poetry run gdpr-pseudo stats --db project.db

Shows entity counts, themes used, and processing history.

Filtering by Entity Type¶

If you only need to pseudonymize certain entity types, use the --entity-types flag:

# Process only PERSON entities (skip LOCATION and ORG)
poetry run gdpr-pseudo batch documents/ --db project.db --entity-types PERSON

# Process PERSON and LOCATION entities (skip ORG)
poetry run gdpr-pseudo batch documents/ --db project.db --entity-types PERSON,LOCATION

This is useful when:

- You only need to anonymize people's names but want to keep real locations
- You want to process entity types in separate passes for review efficiency
- Your documents contain many irrelevant ORG detections you want to skip

The --entity-types flag also works with the process command for single documents.
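
Conceptually, the flag narrows the list of detected entities before validation. The sketch below shows that idea only; `parse_entity_types` and `filter_entities` are hypothetical helpers, and the dictionary shape used for an entity is an assumption, not the tool's data model.

```python
def parse_entity_types(value):
    """Split a comma-separated value such as "PERSON,LOCATION" into a set."""
    return {part.strip().upper() for part in value.split(",") if part.strip()}


def filter_entities(entities, entity_types):
    """Keep only entities whose type was requested; the rest stay as-is in the text."""
    return [entity for entity in entities if entity["type"] in entity_types]


detected = [
    {"text": "Marie Dubois", "type": "PERSON"},
    {"text": "Paris", "type": "LOCATION"},
    {"text": "Acme SA", "type": "ORG"},
]
selected = filter_entities(detected, parse_entity_types("PERSON,LOCATION"))
print([entity["text"] for entity in selected])  # ['Marie Dubois', 'Paris']
```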

Tutorial 3: Using Configuration Files¶

Create a configuration file to set default options.

Step 1: Generate Config Template¶

poetry run gdpr-pseudo config --init

This creates .gdpr-pseudo.yaml in the current directory.

Step 2: Edit Configuration¶

Open .gdpr-pseudo.yaml and customize:

database:
  path: project_mappings.db

pseudonymization:
  theme: star_wars    # neutral | star_wars | lotr | neutral_id
  model: spacy

batch:
  workers: 4          # 1-8 (use 1 for interactive validation)
  output_dir: pseudonymized_output

logging:
  level: INFO
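
A quick sanity check of the values above can catch typos before a batch run. This is a minimal sketch that works on an already-parsed dictionary; `config_problems` is a hypothetical helper, the allowed themes and the 1-8 worker range come from the comments in the template, and the fallback values used when a key is missing are assumptions.

```python
ALLOWED_THEMES = {"neutral", "star_wars", "lotr", "neutral_id"}  # from the template comment


def config_problems(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    # Fallbacks ("neutral", 1) are assumptions for this sketch.
    theme = config.get("pseudonymization", {}).get("theme", "neutral")
    if theme not in ALLOWED_THEMES:
        problems.append(f"unknown theme: {theme!r}")
    workers = config.get("batch", {}).get("workers", 1)
    if not isinstance(workers, int) or not 1 <= workers <= 8:
        problems.append(f"batch.workers must be an integer from 1 to 8, got {workers!r}")
    return problems


config = {
    "database": {"path": "project_mappings.db"},
    "pseudonymization": {"theme": "star_wars", "model": "spacy"},
    "batch": {"workers": 4, "output_dir": "pseudonymized_output"},
    "logging": {"level": "INFO"},
}
print(config_problems(config))  # []
```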

Step 3: View Effective Configuration¶

poetry run gdpr-pseudo config

Shows merged configuration from all sources.

Step 4: Update Config via CLI¶

Change settings without editing the file:

# Change theme
poetry run gdpr-pseudo config set pseudonymization.theme lotr

# Change database path
poetry run gdpr-pseudo config set database.path my_mappings.db

# View updated config
poetry run gdpr-pseudo config

Configuration Priority¶

Settings are applied in this order (highest to lowest priority):

1. CLI flags (e.g., --theme star_wars)
2. Custom config file (--config /path/to/config.yaml)
3. Project config (./.gdpr-pseudo.yaml)
4. Home config (~/.gdpr-pseudo.yaml)
5. Default values

Example: CLI flag overrides config file:

# Uses lotr theme even if config says neutral
poetry run gdpr-pseudo process doc.txt --theme lotr
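
The priority order can be pictured as a layered merge in which later layers win. The following is a sketch of that general technique, not the tool's code; `deep_merge` and `effective_config` are hypothetical names, and the sample layer contents are made up for illustration.

```python
def deep_merge(base, override):
    """Return a new dict where `override` wins over `base`, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_config(*layers):
    """Merge layers given from lowest to highest priority."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layer contents, listed from lowest to highest priority.
defaults = {"pseudonymization": {"theme": "neutral", "model": "spacy"}}
home_config = {"pseudonymization": {"theme": "star_wars"}}
project_config = {"database": {"path": "project_mappings.db"}}
cli_flags = {"pseudonymization": {"theme": "lotr"}}

config = effective_config(defaults, home_config, project_config, cli_flags)
print(config["pseudonymization"]["theme"])  # lotr
```

Merging nested dictionaries key by key means a layer that sets only the theme does not wipe out the model chosen by a lower-priority layer.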

Tutorial 4: Choosing a Pseudonym Theme¶

Compare the four available themes to pick the best one for your use case.

Theme Comparison¶

Entity Type	Neutral	Star Wars	Lord of the Rings	Neutral ID
PERSON	Sophie Martin	Leia Organa	Arwen Evenstar	PER-001
LOCATION	Marseille	Coruscant	Rivendell	LOC-001
ORGANIZATION	TechnoPlus SARL	Rebel Alliance	House of Elrond	ORG-001

Neutral Theme (Default)¶

Best for: Professional documents, legal compliance, realistic output.

poetry run gdpr-pseudo process doc.txt --theme neutral

Example transformation:

Input:  Marie Dubois travaille a Paris pour Acme SA.
Output: Sophie Martin travaille a Marseille pour TechnoPlus SARL.

Characteristics:

- French-sounding names
- Real French cities and regions
- Realistic company names with proper suffixes (SA, SARL, SAS)

Star Wars Theme¶

Best for: Easy identification of pseudonymized content, fun testing.

poetry run gdpr-pseudo process doc.txt --theme star_wars

Example transformation:

Input:  Marie Dubois travaille a Paris pour Acme SA.
Output: Leia Organa travaille a Coruscant pour Rebel Alliance.

Characteristics:

- Iconic Star Wars character names
- Planets and locations from the Star Wars universe
- Factions and organizations (Rebel Alliance, Galactic Empire, etc.)

Lord of the Rings Theme¶

Best for: Literary projects, distinctive pseudonymization, fantasy enthusiasts.

poetry run gdpr-pseudo process doc.txt --theme lotr

Example transformation:

Input:  Marie Dubois travaille a Paris pour Acme SA.
Output: Arwen Evenstar travaille a Rivendell pour House of Elrond.

Characteristics:

- Characters from Tolkien's Middle-earth
- Locations: Rivendell, Gondor, The Shire, etc.
- Organizations: Kingdoms, houses, and alliances

Neutral ID Theme¶

Best for: Formal/legal documents, audit trails, contexts where themed names feel unprofessional.

poetry run gdpr-pseudo process doc.txt --theme neutral_id

Example transformation:

Input:  Marie Dubois travaille a Paris pour Acme SA.
Output: PER-001 travaille a LOC-001 pour ORG-001.

Characteristics:

- Sequential counter-based identifiers (PER-001, PER-002, LOC-001, ORG-001)
- Compound names get sub-identifiers: "Marie Dupont" → PER-001 with PER-001-P (first) and PER-001-N (last)
- No exhaustion: counters are unlimited
- No cultural bias: fully neutral output
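
The counter-based scheme can be sketched as one counter per entity type plus a lookup of identifiers already handed out. This is an illustration only: `NeutralIdGenerator` is a hypothetical class, and it assumes person names with exactly a first and a last part. How the tool handles other name shapes is not covered here.

```python
from collections import defaultdict

PREFIXES = {"PERSON": "PER", "LOCATION": "LOC", "ORGANIZATION": "ORG"}


class NeutralIdGenerator:
    def __init__(self):
        self._counters = defaultdict(int)  # one counter per entity type
        self._assigned = {}                # (type, text) -> identifier

    def identifier_for(self, entity_type, text):
        """Return the existing identifier for this entity, or assign the next one."""
        key = (entity_type, text)
        if key not in self._assigned:
            self._counters[entity_type] += 1
            number = self._counters[entity_type]
            self._assigned[key] = f"{PREFIXES[entity_type]}-{number:03d}"
        return self._assigned[key]

    def name_parts(self, full_name):
        """Sub-identifiers for a two-part name: -P for the first name, -N for the last."""
        base = self.identifier_for("PERSON", full_name)
        first, last = full_name.split(" ", 1)
        return {first: f"{base}-P", last: f"{base}-N"}


generator = NeutralIdGenerator()
print(generator.identifier_for("PERSON", "Marie Dupont"))  # PER-001
print(generator.identifier_for("PERSON", "Jean Martin"))   # PER-002
print(generator.identifier_for("LOCATION", "Paris"))       # LOC-001
print(generator.name_parts("Marie Dupont"))  # {'Marie': 'PER-001-P', 'Dupont': 'PER-001-N'}
```

The `:03d` format pads to three digits but does not cap the counter, so the thousandth person simply becomes PER-1000.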

Switching Themes¶

Important: Once you've processed documents with a theme, stick with it for consistency. Changing themes mid-project creates inconsistent pseudonyms.

If you must switch:

# Create new database for new theme
poetry run gdpr-pseudo init --db star_wars_project.db --force
poetry run gdpr-pseudo batch documents/ --db star_wars_project.db --theme star_wars

Validation UI Walkthrough¶

Every document goes through interactive validation, so you confirm or correct each detected entity before any replacement is applied.

The Validation Screen¶

When you process a document, you'll see:

================================================================================
Entity Validation Session
================================================================================
Total entities detected: 5
Unique entities to validate: 3 (2 duplicates grouped)

Processing entity 1 of 3
--------------------------------------------------------------------------------
Entity: Marie Dubois
Type: PERSON (detected by spaCy NER)
Confidence: High (92%)

Context:
  "...travaille avec Marie Dubois depuis trois ans. Elle dirige..."
                      ^^^^^^^^^^^^

Proposed pseudonym: [Sophie Martin] (theme: neutral)
--------------------------------------------------------------------------------
[Space] Accept  [R] Reject  [E] Edit  [C] Change pseudonym  [H] Help

Keyboard Shortcuts¶

Core Actions:

| Key | Action | Description |
|-----|--------|-------------|
| Space | Accept | Confirm entity and pseudonym |
| R | Reject | Mark as false positive (keep original) |
| E | Edit | Modify entity text |
| A | Add | Add a missed entity manually |
| C | Change | Choose different pseudonym |

Navigation:

| Key | Action | Description |
|-----|--------|-------------|
| N / Right Arrow | Next | Go to next entity |
| P / Left Arrow | Previous | Go to previous entity |
| X | Cycle contexts | Show other occurrences of entity (dot indicator shows position: ● ○ ○) |
| Q | Quit | Exit validation (with confirmation) |

Batch Operations (hidden - press H to see):

| Key | Action | Description |
|-----|--------|-------------|
| Shift+A | Accept All Type | Accept all entities of current type (shows count: ✓ Accepted all 15 PERSON entities) |
| Shift+R | Reject All Type | Reject all entities of current type (shows count: ✗ Rejected all 8 LOCATION entities) |

Help:

| Key | Action |
|-----|--------|
| H / ? | Show help overlay (displays all shortcuts including batch operations) |

Entity Variant Grouping¶

The validation UI automatically groups related entity forms into single validation items. For example, if a document contains "Marie Dubois", "Pr. Dubois", and "Dubois", these are shown as one item:

Entity: Marie Dubois
Type: PERSON
Also appears as: Pr. Dubois, Dubois

Accepting or rejecting this item applies the decision to all variant forms. This reduces validation fatigue when documents use multiple forms to refer to the same person (titles, surnames, abbreviations).
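
One simple way to picture variant grouping is to bucket person mentions by their final token. This is a deliberately naive sketch and an assumption about the general idea, not the tool's grouping logic; `group_variants` is a hypothetical helper, and it would wrongly merge two different people who share a surname.

```python
def group_variants(mentions):
    """Group person mentions that share the same final token (treated as the surname)."""
    groups = {}
    for mention in mentions:
        surname = mention.split()[-1].lower()
        forms = groups.setdefault(surname, [])
        if mention not in forms:       # ignore exact repeats
            forms.append(mention)
    result = []
    for forms in groups.values():
        canonical = max(forms, key=len)  # show the longest form as the main entry
        variants = [form for form in forms if form != canonical]
        result.append({"entity": canonical, "also_appears_as": variants})
    return result


items = group_variants(["Marie Dubois", "Pr. Dubois", "Dubois", "Jean Martin", "Dubois"])
for item in items:
    print(item)
# {'entity': 'Marie Dubois', 'also_appears_as': ['Pr. Dubois', 'Dubois']}
# {'entity': 'Jean Martin', 'also_appears_as': []}
```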

Validation Workflow¶

1. Summary Screen: See entity counts by type (PERSON, LOCATION, ORG)
2. Review Entities: Go through each entity with context (variants grouped)
3. Make Decisions: Accept, reject, edit, or change pseudonym
4. Final Confirmation: Review summary of changes
5. Process Document: Pseudonymization applied

Tips for Efficient Validation¶

Press H for all shortcuts: Many shortcuts (like Shift+A/Shift+R) aren't shown on the main screen
Use batch operations: Press Shift+A to accept all PERSON entities if they look correct
Cycle contexts: Press X to see all occurrences of an entity before deciding — a dot indicator (● ○ ○ ○) shows your position; for >10 contexts, a truncated format (○ ○ … ● … ○ ○) is used
Trust high-confidence: Entities with >90% confidence are usually correct
Review low-confidence: Yellow/red confidence scores need careful review
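
The dot indicator mentioned in the tips is easy to reproduce for the untruncated case. This is a small sketch with a hypothetical function name, `dot_indicator`; it does not attempt the truncated layout used for more than 10 contexts, since the exact truncation rule is not described here.

```python
def dot_indicator(position, total):
    """Render a position marker such as "● ○ ○" (position is zero-based)."""
    if not 0 <= position < total:
        raise ValueError("position out of range")
    return " ".join("●" if index == position else "○" for index in range(total))


print(dot_indicator(0, 3))  # ● ○ ○
print(dot_indicator(1, 4))  # ○ ● ○ ○
```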

Common Workflows¶

Academic Research: Interview Transcripts¶

# Set up project
export GDPR_PSEUDO_PASSPHRASE="AcademicResearch2026!"
poetry run gdpr-pseudo init --db research_project.db

# Process all interviews
poetry run gdpr-pseudo batch interviews/ --db research_project.db -o anonymized/

# Export audit log for ethics board
poetry run gdpr-pseudo export audit_log.json --db research_project.db

# Share anonymized files (keep research_project.db secure!)

HR Analysis: Employee Feedback¶

# Pseudonymize for ChatGPT analysis
poetry run gdpr-pseudo process feedback_report.txt --theme star_wars

# Upload pseudonymized version to ChatGPT
# Analyze output (references "Luke Skywalker" instead of real names)
# Map insights back using list-mappings
poetry run gdpr-pseudo list-mappings --search "Luke"
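
Once the relevant mappings have been looked up, mapping insights back to real names is a plain text substitution. The sketch below assumes you have copied the pseudonym-to-original pairs by hand into a dictionary; `restore_names` is a hypothetical helper and the sample pairs are invented. Run it only in an environment where the real names may be handled.

```python
import re


def restore_names(text, pseudonym_to_original):
    """Replace each pseudonym in `text` with the original it stands for."""
    if not pseudonym_to_original:
        return text
    # Longest pseudonyms first so "Luke Skywalker" is matched before a shorter "Luke".
    names = sorted(pseudonym_to_original, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: pseudonym_to_original[match.group(0)], text)


# Hypothetical pairs, e.g. read off the list-mappings output.
lookup = {"Luke Skywalker": "Marie Dubois", "Tatooine": "Paris"}
restored = restore_names(
    "Luke Skywalker reported low morale in the Tatooine office.", lookup
)
print(restored)  # Marie Dubois reported low morale in the Paris office.
```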

Legal: Case Document Preparation¶

# Initialize with strong passphrase
poetry run gdpr-pseudo init --db case_12345.db

# Process case documents
poetry run gdpr-pseudo batch case_documents/ --db case_12345.db --theme neutral

# Export mappings for reference
poetry run gdpr-pseudo list-mappings --db case_12345.db --export mappings.csv

Tutorial 5: Managing Mappings¶

Review and manage your entity-to-pseudonym mappings.

Validate Existing Mappings¶

Review stored mappings without processing documents:

# View all mappings with metadata
poetry run gdpr-pseudo validate-mappings

# Interactive review mode
poetry run gdpr-pseudo validate-mappings --interactive

# Filter by entity type
poetry run gdpr-pseudo validate-mappings --type PERSON

Import Mappings from Another Project¶

Combine mappings from multiple databases:

# Import from old project to new
poetry run gdpr-pseudo import-mappings old_project.db --db new_project.db

# Prompt for each duplicate (instead of auto-skipping)
poetry run gdpr-pseudo import-mappings old_project.db --prompt-duplicates

Securely Destroy a Database¶

When a project is complete and mappings are no longer needed:

# Interactive destruction (safest - prompts for confirmation and passphrase)
poetry run gdpr-pseudo destroy-table --db project.db

# Force destruction (skips confirmation, still verifies passphrase)
poetry run gdpr-pseudo destroy-table --db project.db --force

Security features:

- Passphrase verification prevents accidental deletion of wrong database
- SQLite magic number check prevents deletion of non-database files
- Symlink protection prevents redirect attacks
- 3-pass secure wipe overwrites data before file deletion
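
Two of these checks, the SQLite magic number and the symlink guard, can be sketched with the standard library. This illustrates the general technique and is not the tool's code: `is_plain_sqlite_file` is a hypothetical helper, and the passphrase verification and secure wipe are left out.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # 16-byte header of a standard SQLite 3 file


def is_plain_sqlite_file(path):
    """True only for a regular, non-symlinked file that starts with the SQLite header."""
    if os.path.islink(path) or not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # A real (empty) SQLite database.
    db_path = os.path.join(tmp, "project.db")
    connection = sqlite3.connect(db_path)
    connection.execute("CREATE TABLE t (x INTEGER)")
    connection.commit()
    connection.close()

    # A file that is not a database.
    notes_path = os.path.join(tmp, "notes.txt")
    with open(notes_path, "w") as handle:
        handle.write("not a database")

    db_ok = is_plain_sqlite_file(db_path)
    notes_ok = is_plain_sqlite_file(notes_path)

print(db_ok, notes_ok)  # True False
```

Checking the header before deleting means a mistyped path that points at an ordinary document is refused rather than wiped.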

Tips and Best Practices¶

Security¶

Use environment variables for passphrases rather than the --passphrase flag (which appears in shell history)
Store mapping databases separately from pseudonymized documents — the combination allows re-identification
Use strong passphrases (12+ characters, mix of letters, numbers, symbols)
Back up your mapping database before batch operations — you cannot reverse pseudonymization without it
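
A timestamped copy is one simple way to back up the mapping database before a batch run. This is a minimal sketch with a hypothetical helper, `backup_database`; it assumes the tool is not writing to the database while the copy is made, and it does not cover where the backup should be stored.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir):
    """Copy the database file into backup_dir with a timestamp in its name."""
    source = Path(db_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

For example, `backup_database("project.db", "backups")` would produce a file such as `backups/project-<timestamp>.db`. Keep the backup as protected as the original, since it allows the same re-identification.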

Workflow Efficiency¶

Press H during validation to see all keyboard shortcuts (batch operations are hidden by default)
Use batch accept/reject (Shift+A / Shift+R) for entity types where detection is reliable
Process a small test file first to verify settings before running batch jobs
Use the same database for all related documents to ensure consistent pseudonyms
Choose your theme upfront — switching themes mid-project creates inconsistent pseudonyms

Organization¶

One database per project — keep separate mapping databases for unrelated projects
Use --output or -o to keep pseudonymized files in a separate directory
Export audit logs regularly for compliance documentation: poetry run gdpr-pseudo export audit.json
Use stats command to monitor processing history and entity counts

Known Limitations¶

French only — no other languages in v1.0
Supported formats — .txt, .md, .pdf, .docx, .xlsx, .csv (PDF/DOCX require pip install gdpr-pseudonymizer[formats], Excel requires pip install gdpr-pseudonymizer[excel])
Validation is mandatory — every entity must be reviewed (AI detection ~60% F1)
Passphrase is irrecoverable — if lost, existing mappings cannot be decrypted
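
The format list above can be turned into a quick pre-flight check on a file name. This is a sketch only: `check_format` is a hypothetical helper, and since no extra is named for .csv here, the code assumes it needs none.

```python
from pathlib import Path

# Extension -> optional extra, as listed under Known Limitations.
# Assumption: .csv needs no extra, because none is named for it.
REQUIRED_EXTRA = {
    ".txt": None,
    ".md": None,
    ".csv": None,
    ".pdf": "formats",
    ".docx": "formats",
    ".xlsx": "excel",
}


def check_format(filename):
    """Classify a file name as supported, needing an extra, or unsupported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in REQUIRED_EXTRA:
        return "unsupported"
    extra = REQUIRED_EXTRA[suffix]
    if extra is None:
        return "supported"
    return f"needs gdpr-pseudonymizer[{extra}]"


print(check_format("interview.txt"))  # supported
print(check_format("report.PDF"))     # needs gdpr-pseudonymizer[formats]
print(check_format("photo.png"))      # unsupported
```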

Troubleshooting¶

No entities detected¶

Cause: Document may not contain recognizable French text.

Solutions:

1. Ensure text is in French with proper encoding (UTF-8)
2. Check for French names, locations, or organizations
3. Verify the file is in a supported format (.txt, .md, .pdf, .docx, .xlsx, .csv; see Known Limitations)

Inconsistent pseudonyms across documents¶

Cause: Using different database files.

Solution: Always specify the same database:

poetry run gdpr-pseudo process doc1.txt --db shared.db
poetry run gdpr-pseudo process doc2.txt --db shared.db

Validation UI not responding¶

Cause: Terminal compatibility issue.

Solutions:

1. Use a standard terminal (PowerShell, Terminal.app, bash)
2. Avoid running in IDE terminals (may have input issues)
3. Try poetry shell then run command directly

Forgot passphrase¶

Consequence: Cannot access existing mappings or reverse pseudonymization.

Solution: Create new database (previous mappings are lost):

poetry run gdpr-pseudo init --force

Next Steps¶

CLI Reference - Complete command documentation
Installation Guide - Detailed setup instructions
FAQ - Common questions and answers
Troubleshooting Guide - Comprehensive error reference