Troubleshooting Guide

GDPR Pseudonymizer - Error reference and solutions


Installation Issues

poetry: command not found

Cause: Poetry is not in your system PATH.

Solution:

  1. Check the installation location:
      - Windows: %APPDATA%\Python\Scripts
      - macOS/Linux: ~/.local/bin
  2. Add it to PATH:
      - Windows (PowerShell): $env:PATH += ";$env:APPDATA\Python\Scripts"
      - macOS/Linux: export PATH="$HOME/.local/bin:$PATH" (add to ~/.bashrc or ~/.zshrc)
  3. Restart your terminal.
  4. Alternative: use python -m poetry instead of poetry.
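To see which of the two invocations is usable on a given machine, a short Python sketch along these lines may help. The helper name poetry_command is hypothetical, and the fallback assumes Poetry is installed for the same interpreter that runs the script.

```python
import shutil
import sys


def poetry_command() -> list[str]:
    """Return an argv prefix for invoking Poetry.

    Uses the poetry executable when it is on PATH; otherwise falls back
    to "python -m poetry", the alternative listed in step 4.
    """
    found = shutil.which("poetry")
    if found:
        return [found]
    return [sys.executable, "-m", "poetry"]


if __name__ == "__main__":
    # Prints the command prefix that would be used on this machine.
    print(" ".join(poetry_command()))
```

If the script prints the fallback form, the Poetry scripts directory is most likely still missing from PATH.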

Python version not supported

Error: The currently activated Python version X.Y.Z is not supported

Solution:

  1. Install Python 3.10, 3.11, or 3.12.

  2. Configure Poetry to use the correct version:

    poetry env use python3.11
    poetry install

  3. Verify with poetry env info (look for "Python: 3.10.x", "3.11.x", or "3.12.x").

Note: Python 3.9 is no longer supported (EOL October 2025). Python 3.13+ is not yet tested.
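As a quick cross-check, the supported range can be tested from inside the interpreter itself. The following is a minimal sketch, not part of the tool; the function name is_supported is hypothetical and the version set simply mirrors the range stated in this section.

```python
import sys

# Supported range as stated in this guide: 3.10, 3.11, 3.12.
SUPPORTED_MINORS = {(3, 10), (3, 11), (3, 12)}


def is_supported(version=None) -> bool:
    """Return True if the (major, minor) pair is in the supported set."""
    if version is None:
        version = sys.version_info
    return (version[0], version[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "not supported"
    print(f"Python {current}: {status}")
```

Running it through poetry run python shows the version Poetry's environment actually uses, which may differ from the system default.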

spaCy model download fails

Possible causes: Network issues, insufficient disk space (~1GB needed), firewall blocking.

Solutions:

  1. Check disk space:

    # macOS/Linux
    df -h
    # Windows PowerShell
    Get-PSDrive C
    

  2. Manual installation:

    poetry run python -m spacy download fr_core_news_lg
    

  3. Retry with verbose output:

    poetry run python -m spacy download fr_core_news_lg --verbose
    

  4. Behind a corporate firewall: contact IT for proxy configuration, or download the model package manually from https://github.com/explosion/spacy-models
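The disk space check in step 1 can also be done portably from Python. This is a sketch under two assumptions: the ~1GB figure from this section is used as the threshold, and the current directory is on the same filesystem as the Poetry environment (which may not be the case). The name has_free_space is hypothetical.

```python
import shutil

# Approximate space needed for the model download, per this guide (~1GB).
REQUIRED_BYTES = 1_000_000_000


def has_free_space(path: str = ".", required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem containing path has enough free space."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1e9
    print(f"Free space: {free_gb:.1f} GB, enough: {has_free_space()}")
```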

poetry install fails with dependency conflicts

Solution:

  1. Verify the Python version (must be 3.10-3.12).

  2. Clear the virtual environment and reinstall:

    poetry env remove python
    poetry install

  3. Update Poetry:

    poetry self update

gdpr-pseudo: command not found

Cause: The CLI requires the poetry run prefix during development.

Solution: Always use poetry run:

# CORRECT
poetry run gdpr-pseudo --help

# INCORRECT
gdpr-pseudo --help

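Alternative: Activate the Poetry shell (on Poetry 2.x, the shell command is provided by the separate poetry-plugin-shell plugin, which must be installed first):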

poetry shell
gdpr-pseudo --help
exit


Passphrase Issues

Passphrase must be at least 12 characters

Cause: Security requirement -- passphrases must be 12+ characters.

Solution:

  1. Use a passphrase with at least 12 characters.
  2. Or set it via environment variable:

# macOS/Linux
export GDPR_PSEUDO_PASSPHRASE="your-secure-passphrase-here"

# Windows PowerShell
$env:GDPR_PSEUDO_PASSPHRASE = "your-secure-passphrase-here"

Incorrect passphrase

Error: Incorrect passphrase. Please check your passphrase and try again.

Solution:

  - Verify you are using the exact passphrase used when creating the database.
  - Check for trailing spaces or invisible characters.
  - If using the environment variable, verify it: echo $GDPR_PSEUDO_PASSPHRASE (Linux/macOS) or echo $env:GDPR_PSEUDO_PASSPHRASE (PowerShell).
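Trailing spaces and invisible characters are hard to spot with echo. The sketch below inspects the environment variable without printing its value. The function name passphrase_problems is hypothetical, and counting length with len() is an assumption about how the tool measures the 12-character minimum.

```python
import os
import unicodedata

MIN_LENGTH = 12  # minimum stated in this guide; counting method is assumed


def passphrase_problems(passphrase: str) -> list[str]:
    """List likely problems without ever printing the passphrase itself."""
    problems = []
    if len(passphrase) < MIN_LENGTH:
        problems.append("shorter than 12 characters")
    if passphrase != passphrase.strip():
        problems.append("leading or trailing whitespace")
    # Unicode categories Cc (control) and Cf (format) cover characters such
    # as tabs, zero-width spaces, and byte-order marks.
    if any(unicodedata.category(ch) in ("Cc", "Cf") for ch in passphrase):
        problems.append("invisible control or format characters")
    return problems


if __name__ == "__main__":
    value = os.environ.get("GDPR_PSEUDO_PASSPHRASE")
    if value is None:
        print("GDPR_PSEUDO_PASSPHRASE is not set")
    else:
        print(passphrase_problems(value) or "no obvious problems")
```

An empty result does not prove the passphrase is correct; it only rules out the most common copy-and-paste damage.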

Forgot passphrase

Consequence: The mapping database cannot be decrypted. Existing mappings are permanently inaccessible and pseudonymization cannot be reversed.

Recovery: Create a new database (previous mappings are lost):

poetry run gdpr-pseudo init --force

Prevention: Store passphrases in a secure password manager.

Passphrase in config file is forbidden

Cause: A passphrase field was found in .gdpr-pseudo.yaml. Plaintext credential storage is blocked for security.

Solution: Remove the passphrase field from your config file. Use one of:

  - Environment variable: GDPR_PSEUDO_PASSPHRASE (recommended for automation)
  - Interactive prompt (most secure -- default behavior)
  - CLI flag: --passphrase (not recommended -- visible in shell history)
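To locate the offending line in a longer config file, a simple line-based scan is usually enough. This sketch is not a YAML parser and does not reproduce how the tool itself detects the field; the case-insensitive match and the name passphrase_lines are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches a line that defines a key named "passphrase" at any indent level.
# Commented-out lines (starting with "#") do not match.
PASSPHRASE_KEY = re.compile(r"^\s*passphrase\s*:", re.IGNORECASE)


def passphrase_lines(text: str) -> list[int]:
    """Return 1-based line numbers that define a passphrase key."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if PASSPHRASE_KEY.match(line)
    ]


if __name__ == "__main__":
    config = Path(".gdpr-pseudo.yaml")
    if config.exists():
        hits = passphrase_lines(config.read_text(encoding="utf-8"))
        print(f"passphrase key on lines: {hits}" if hits else "no passphrase key found")
```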


Database Errors

Database file not found: mappings.db

Cause: No database has been created yet.

Solution: Initialize a database:

poetry run gdpr-pseudo init

Database may be corrupted

Solution:

  1. If you have a backup, restore it.
  2. Try exporting data: poetry run gdpr-pseudo export backup.json
  3. Create a new database: poetry run gdpr-pseudo init --force
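Because creating a new database discards the previous mappings, it may be worth keeping a copy of the suspect file before running init --force. The sketch below assumes the database is a single file (mappings.db is the name used in this guide); the helper name backup_database is hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_database(db_path: str) -> Path:
    """Copy the database next to itself with a timestamp suffix."""
    source = Path(db_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target


if __name__ == "__main__":
    db = Path("mappings.db")
    if db.exists():
        print(f"Backup written to {backup_database(str(db))}")
```

The copy is still encrypted with the original passphrase, so it is only useful if that passphrase is known.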

Inconsistent pseudonyms across documents

Cause: Using different database files for related documents.

Solution: Always specify the same database:

poetry run gdpr-pseudo process doc1.txt --db shared.db
poetry run gdpr-pseudo process doc2.txt --db shared.db
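When many related documents are involved, generating the commands from one place makes it harder to mistype the database path. A minimal sketch follows: only the process subcommand and the --db flag come from this guide, and the helper names build_commands and process_all are hypothetical.

```python
import subprocess


def build_commands(documents, db="shared.db"):
    """Build one gdpr-pseudo invocation per document, all using the same db."""
    return [
        ["poetry", "run", "gdpr-pseudo", "process", doc, "--db", db]
        for doc in documents
    ]


def process_all(documents, db="shared.db"):
    """Run the commands in order, stopping at the first failure."""
    for command in build_commands(documents, db):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, as a dry run.
    for command in build_commands(["doc1.txt", "doc2.txt"]):
        print(" ".join(command))
```

For unattended runs, the passphrase would typically come from GDPR_PSEUDO_PASSPHRASE rather than the interactive prompt.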


NLP Processing Errors

No entities detected

Possible causes:

  - Document does not contain recognizable French text
  - File encoding is not UTF-8
  - File format not supported

Solutions:

  1. Ensure the text is in French and encoded as UTF-8 (accented characters such as é, è, à should display correctly).
  2. Verify the document contains names, locations, or organizations.
  3. Verify the file is a supported format (.txt, .md, .pdf, .docx, .xlsx, .csv).
  4. For PDF/DOCX, ensure the optional extras are installed: pip install "gdpr-pseudonymizer[formats]". For Excel, run pip install "gdpr-pseudonymizer[excel]".
  5. Test with a known-good sample:

echo "Marie Dubois travaille a Paris pour Acme SA." > test.txt
poetry run gdpr-pseudo process test.txt
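If the known-good sample works but a real document does not, the encoding is worth checking first. The following sketch reports the first byte that is not valid UTF-8; the function name utf8_error is hypothetical and the script only reads the files passed on the command line.

```python
import sys
from pathlib import Path


def utf8_error(data: bytes):
    """Return None if data is valid UTF-8, else a short description."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid byte 0x{data[exc.start]:02x} at offset {exc.start}"
    return None


if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, utf8_error(Path(name).read_bytes()) or "valid UTF-8")
```

A file saved as Latin-1 or Windows-1252 typically fails at its first accented character; re-saving it as UTF-8 in a text editor usually resolves the problem.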

Invalid theme 'xyz'

Solution: Use one of the valid themes: neutral, star_wars, lotr

poetry run gdpr-pseudo process doc.txt --theme neutral

Validation UI Issues

Validation UI not responding

Cause: Terminal compatibility issue with keyboard input capture.

Solutions:

  1. Use a standard terminal (PowerShell, Terminal.app, bash).
  2. Avoid running in IDE integrated terminals (VS Code, PyCharm), which may have input issues.
  3. Try poetry shell, then run the command directly.
  4. On Windows, try Windows Terminal instead of legacy cmd.exe.

Keyboard shortcuts not working

Solution: Press H or ? during validation to see the full help overlay with all available shortcuts. Some shortcuts (batch operations like Shift+A, Shift+R) are hidden by default.


Platform-Specific Issues

Windows: spaCy access violations

Symptom: Crash or access violation errors when running spaCy on Windows.

Solutions:

  1. Use WSL (recommended): install Windows Subsystem for Linux and run the tool there.

  2. Limit threads by setting OMP_NUM_THREADS=1 in the environment:

    $env:OMP_NUM_THREADS = 1

  3. Update dependencies:
      - Update Windows to the latest version
      - Update the Visual C++ Redistributable

Note: This is a known spaCy issue on Windows (spaCy #12659). CI skips spaCy-dependent tests on Windows for this reason.

macOS: Xcode Command Line Tools required

Error: Build errors during poetry install

Solution:

xcode-select --install

macOS: Apple Silicon (M1/M2/M3)

Python 3.10+ has native ARM support. If using Homebrew:

brew install python@3.11

Linux: Missing build tools

Error: Compilation errors during poetry install

Solution:

Ubuntu/Debian:

sudo apt install python3-dev build-essential

Fedora:

sudo dnf install python3-devel gcc

Permission denied errors

Solutions:

  - macOS/Linux: Check file permissions with ls -la and ensure write access.
  - Windows: Run PowerShell as Administrator for installation steps.
  - Ensure write access to the project directory and output paths.


Performance Issues

Processing slows or crashes on large batches

Solutions:

  - Process files in smaller batches
  - Reduce parallel workers: --workers 1
  - Close other applications to free memory (the spaCy model uses ~1.5GB per worker)
  - Monitor memory usage -- the tool requires up to 8GB RAM with 4 workers
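Splitting a large file list into smaller groups can be scripted. The sketch below only does the grouping; how each group is then handed to the tool depends on the batch command in use, which this section does not show. The name batches and the group size are illustrative assumptions.

```python
from itertools import islice


def batches(paths, size):
    """Yield successive lists of at most size paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    files = [f"doc{i}.txt" for i in range(1, 8)]
    for group in batches(files, 3):
        print(group)
```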

spaCy model loading is slow

Cause: The fr_core_news_lg model is ~571MB and takes a few seconds to load on first use.

Mitigation: The model is cached in memory after first load. Subsequent documents in a batch session process faster.


PDF/DOCX Processing Issues

PDF support requires 'pdfplumber'

Cause: The optional PDF dependency is not installed.

Solution:

pip install "gdpr-pseudonymizer[pdf]"
# Or install both PDF and DOCX support (quotes stop zsh from globbing the brackets):
pip install "gdpr-pseudonymizer[formats]"

DOCX support requires 'python-docx'

Cause: The optional DOCX dependency is not installed.

Solution:

pip install "gdpr-pseudonymizer[docx]"
# Or install both PDF and DOCX support (quotes stop zsh from globbing the brackets):
pip install "gdpr-pseudonymizer[formats]"

"This PDF appears to be scanned/image-based"

Cause: The PDF contains images instead of text (scanned document). pdfplumber can only extract embedded text, not perform OCR.

Solutions:

  1. Use an OCR tool (e.g., Adobe Acrobat, Tesseract) to convert the scanned PDF to a text-based PDF first.
  2. Export the scanned PDF to .txt manually, then process the text file.
  3. If partial text was extracted, review the output -- it may still contain useful content.

Note: OCR support is not planned for the current version.
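To check in advance whether a PDF carries any embedded text, the per-page output of pdfplumber can be inspected directly. This is a rough sketch: the 20-character threshold and the names looks_scanned and extract_page_texts are assumptions, and the tool's own detection logic may differ.

```python
import sys


def looks_scanned(page_texts, min_chars=20):
    """Return True if every page yielded fewer than min_chars characters."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Extract text per page; requires the optional pdfplumber dependency."""
    import pdfplumber  # imported lazily so looks_scanned works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]


if __name__ == "__main__":
    for name in sys.argv[1:]:
        verdict = "likely scanned" if looks_scanned(extract_page_texts(name)) else "has text"
        print(f"{name}: {verdict}")
```

Mixed documents, with some scanned pages and some text pages, are reported as having text, so the extracted output still needs a manual review.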

PDF is password-protected

Error: PDF is password-protected: <file>. Please provide an unprotected PDF.

Solution: Remove the password protection from the PDF before processing. Most PDF editors (Adobe Acrobat, Preview on macOS) can remove passwords if you know the original password.

Output file has .txt extension for PDF/DOCX input

This is expected behavior. PDF/DOCX inputs produce plaintext .txt output because the tool extracts text content only. Format-preserving output (PDF-to-PDF, DOCX-to-DOCX) is planned for v1.2+.


When to File Bug Reports

File a bug report on GitHub if you encounter:

  1. Crashes that are not explained by the troubleshooting entries above
  2. Incorrect pseudonymization (entity replaced with wrong type of pseudonym)
  3. Data loss (database corruption, missing mappings)
  4. Security concerns (passphrase exposure, encryption issues)

How to report:

  - GitHub Issues: https://github.com/LioChanDaYo/RGPDpseudonymizer/issues
  - Include: OS version, Python version, full error message, steps to reproduce
  - Do NOT include sensitive data (real names, document content) in bug reports
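The OS and Python details requested above can be gathered with the standard library. The sketch below is one possible way to do it; the function name environment_report is hypothetical, and the output contains no document content.

```python
import platform


def environment_report() -> str:
    """Return OS and Python details suitable for pasting into a bug report."""
    return "\n".join(
        [
            f"OS: {platform.system()} {platform.release()} ({platform.version()})",
            f"Python: {platform.python_version()} ({platform.python_implementation()})",
        ]
    )


if __name__ == "__main__":
    print(environment_report())
```

Run it with poetry run python so the reported version matches the environment the tool actually uses.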