Skip to content

Methodology: GDPR-Compliant Pseudonymization of French Text

GDPR Pseudonymizer - Technical Methodology and Academic Reference

Version: 1.0 Date: 2026-02-08 Suitable for: Ethics board review, DPIA documentation, academic citation


1. Introduction

GDPR Pseudonymizer is an AI-assisted pseudonymization tool designed for French-language text documents. It combines automated Named Entity Recognition (NER) with mandatory human validation to achieve 100% pseudonymization accuracy while maintaining document utility for downstream analysis.

This document describes the technical methodology, validation procedures, quality controls, and GDPR compliance mapping for use in academic research, ethics board submissions, and Data Protection Impact Assessments (DPIAs).

1.1 Design Philosophy

The tool follows a human-in-the-loop approach to pseudonymization:

  1. AI pre-detection identifies candidate entities using hybrid NLP + regex techniques
  2. Mandatory human validation ensures every entity is reviewed before replacement
  3. Consistent pseudonymization maintains document coherence across multi-document corpora
  4. Local-only processing ensures no data leaves the user's machine

This approach prioritizes zero false negatives (no missed entities) over full automation, recognizing that current NLP accuracy on French interview and business documents does not meet the threshold required for unsupervised pseudonymization.


2. Technical Approach

2.1 NER Library Selection

The NER component was selected through a controlled benchmark on a 25-document test corpus containing 1,855 manually annotated entities (PERSON, LOCATION, ORGANIZATION) drawn from French interview transcripts and business documents.

Benchmark Results (Story 1.2, 2026-01-16):

Library Model Precision Recall F1 Score
spaCy fr_core_news_lg 3.8.0 27.0% 32.7% 29.5%
Stanza fr_default 10.3% 14.1% 11.9%

Per-Entity Type Performance (spaCy):

Entity Type Ground Truth Precision Recall F1
PERSON 1,627 37.8% 31.3% 34.2%
LOCATION 123 29.6% 58.5% 39.3%
ORGANIZATION 105 3.8% 27.6% 6.7%

Selection rationale: spaCy was selected for its 2.5x better F1 score, faster inference speed, larger community, and better documentation. Both libraries failed the initial 85% F1 target, leading to the adoption of a hybrid detection strategy with mandatory human validation.

Reference: Full benchmark analysis available in docs/nlp-benchmark-report.md.

2.2 Hybrid Detection Strategy

To compensate for baseline NER limitations, a hybrid approach combines NLP with rule-based pattern matching (Story 1.8, 2026-01-22):

Detection pipeline:

Input Text
    ├── spaCy NER (fr_core_news_lg)  →  NLP-detected entities
    ├── Regex Pattern Matcher         →  Pattern-detected entities
    │   ├── French title patterns (M., Mme, Dr., Maître/Me)
    │   ├── Compound name patterns (Jean-Pierre, Marie-Claire)
    │   ├── French name dictionary matching
    │   ├── Organization suffix patterns (SA, SARL, SAS, Cabinet)
    │   └── False positive filtering (title-only, label words)
    └── Entity Deduplication          →  Merged, deduplicated entity set

Hybrid performance improvement:

Metric spaCy Only Hybrid (NLP + Regex) Improvement
Total entities detected 2,679 3,625 +35.3%
PERSON entities 1,612 2,454 +52.2%
Processing time per doc 0.07s 0.07s No overhead

Reference: Full hybrid benchmark in docs/hybrid-benchmark-report.md.

2.3 Pseudonymization Algorithm

The pseudonymization engine uses a compositional approach to maintain internal consistency:

Core principles:

  1. Component-based mapping: Full names are decomposed into components (e.g., "Marie Dubois" → ["Marie", "Dubois"]), each mapped to a pseudonym component. Partial references ("Marie" alone) resolve to the same pseudonym component ("Leia").

  2. Themed pseudonym libraries: Three libraries provide culturally distinct pseudonym sets:

  3. Neutral: French-sounding names, real French cities, realistic company names
  4. Star Wars: Characters, planets, and factions from the Star Wars universe
  5. Lord of the Rings: Characters and locations from Tolkien's Middle-earth

  6. Entity type coverage:

  7. PERSON: First name + last name with gender matching (when NER provides classification)
  8. LOCATION: Cities, regions, countries
  9. ORGANIZATION: Companies, institutions, agencies

  10. French name preprocessing:

  11. Title stripping: "Dr. Marie Dubois" → "Marie Dubois" (title preserved in output)
  12. Compound name handling: "Jean-Pierre" treated as atomic unit
  13. Deduplication: Identical entities validated once, applied to all occurrences

2.4 Encryption Scheme

Entity mappings are stored in an encrypted SQLite database using industry-standard cryptographic primitives:

Component Implementation Standard
Encryption AES-256-SIV (deterministic AEAD) RFC 5297, NIST approved
Key Derivation PBKDF2-HMAC-SHA256 NIST SP 800-132
Iterations 210,000 OWASP 2023 recommendation
Salt 32 bytes, cryptographically random Per-database unique

Deterministic encryption rationale: AES-256-SIV is used to enable encrypted field queries (e.g., looking up existing mappings for compositional pseudonymization). This is an industry-standard trade-off used by AWS DynamoDB, Google Tink, and MongoDB for searchable encrypted fields.

Security model: The database is local-only (no network exposure). An attacker requires both physical access to the machine and knowledge of the passphrase to decrypt mappings.


3. Validation Procedures and Quality Controls

3.1 Interactive Validation Workflow

Every document undergoes mandatory human validation before pseudonymization is applied:

  1. Entity detection: Hybrid NLP + regex pipeline identifies candidate entities
  2. Summary screen: Validator sees entity counts by type (PERSON, LOCATION, ORG)
  3. Entity-by-entity review: Each entity displayed with:
  4. Surrounding context (10 words before/after with highlighting)
  5. Entity type and detection source (NLP or regex)
  6. Confidence score (color-coded: green >80%, yellow 60-80%, red <60%)
  7. Proposed pseudonym from themed library
  8. Validator actions: Accept, reject (false positive), edit entity text, add missed entity, change pseudonym
  9. Batch operations: Accept/reject all entities of a given type
  10. Final confirmation: Summary of all decisions before processing

Validation performance: <2 minutes for typical documents (20-30 entities). Entity deduplication reduces validation time by ~66% for large documents with repeated entities.

3.2 NER Accuracy Metrics

Stage Detection Rate False Positives Notes
spaCy baseline 29.5% F1 High Pre-trained on news text
Hybrid (NLP + regex) ~40-50% F1 Moderate +35.3% recall improvement
Post-validation 100% accuracy 0% Human review catches all errors

3.3 LLM Utility Preservation

Pseudonymized documents were validated for downstream utility using LLM-as-judge methodology (Story 4.1, 2026-02-06):

Test design: A/B comparison of LLM responses on original vs. pseudonymized versions of 10 representative documents (5 interview transcripts, 5 business documents).

Evaluation dimensions (1-5 scale):

Dimension Mean Score Interpretation
Thematic Accuracy 4.27/5.0 Themes fully preserved
Relationship Coherence 4.27/5.0 Entity relationships maintained
Factual Preservation 4.27/5.0 Facts accurately extracted
Overall Utility 4.27/5.0 (85.4%) Exceeds 80% threshold

Conclusion: Pseudonymized documents retain sufficient utility for LLM-based analysis (summarization, theme extraction, relationship mapping).

Reference: Full validation report in docs/llm-validation-report.md.


4. Limitations and Edge Cases

4.1 NER Accuracy

  • Baseline NER accuracy is 29.5% F1 (below the 85% target for unsupervised processing)
  • Hybrid detection improves to ~40-50% F1 but still requires mandatory human validation
  • ORG detection is particularly weak (3.8% precision = 96% false positive rate)
  • Pre-trained models optimized for news text, not interview/business documents
  • Fine-tuning with user validation data is planned for v3.0

4.2 Language and Format Limitations

  • French only in v1.0 (English, Spanish, German planned for future versions)
  • Text formats only: .txt and .md (no PDF, DOCX, or HTML support in v1.0)
  • Text must use proper French encoding (UTF-8 with accents: é, è, à, etc.)

4.3 Entity Detection Edge Cases

  • Uncommon French names not in spaCy's training data may be missed by NLP (caught during validation)
  • Compound names ("Jean-Pierre") treated as atomic units (not split)
  • Titles ("Dr.", "Maître", "Me") are stripped for matching, preserved in output
  • Abbreviations may not be detected by NLP or regex patterns
  • Gender-aware pseudonym assignment not yet implemented for all cases (FE-007, planned for v1.1)

4.4 Operational Constraints

  • Passphrase is irrecoverable: If lost, existing mappings cannot be decrypted and pseudonymization cannot be reversed
  • Theme consistency: Switching themes mid-project creates inconsistent pseudonyms
  • Single-user design: No multi-user access control in v1.0 (SQLite with WAL mode supports concurrent reads)

5. GDPR Compliance Mapping

5.1 Pseudonymization Definition (Article 4(5))

"Pseudonymisation means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."

-- GDPR Article 4(5)

GDPR Pseudonymizer compliance:

Requirement Implementation
Personal data replaced with pseudonyms PERSON, LOCATION, ORG entities replaced with themed pseudonyms
Additional information kept separately Mapping database stored separately from pseudonymized documents
Technical measures for separation AES-256-SIV encryption with passphrase protection
Reversibility Authorized users can de-pseudonymize with correct passphrase

5.2 Pseudonymization vs. Anonymization

Important distinction: Pseudonymized data is still personal data under GDPR (Recital 26). The mapping table allows re-identification, which means:

  • GDPR obligations still apply to pseudonymized data
  • The mapping database must be protected with appropriate security measures
  • Pseudonymization is a risk reduction measure, not an exemption from GDPR

Anonymization (irreversible removal of all identifying information) is outside the scope of this tool. GDPR Pseudonymizer specifically preserves reversibility for use cases requiring it (academic research, legal proceedings, audit requirements).

5.3 GDPR Article Compliance Matrix

GDPR Article Requirement Implementation Status
Art. 4(5) Pseudonymization definition Consistent replacement with separate encrypted mapping Compliant
Art. 25 Data protection by design 100% local processing, no cloud dependencies, encrypted storage Compliant
Art. 32 Security of processing AES-256-SIV encryption, PBKDF2 key derivation (210K iterations), passphrase protection Compliant
Art. 30 Records of processing Comprehensive audit logs with timestamp, entity count, model version, success/failure; JSON/CSV export Compliant
Art. 35 Privacy Impact Assessment Transparent methodology (this document), auditable processing pipeline Supports DPIA
Art. 89 Research safeguards Pseudonymization as technical safeguard for research/statistical purposes Compliant
Recital 26 Pseudonymized data scope Tool documentation clearly states pseudonymized data remains personal data Acknowledged
Recital 28 Technical safeguard Pseudonymization applied as organizational/technical measure Compliant

5.4 Audit Trail

Every pseudonymization operation is logged in an operations table:

Field Description
Timestamp When the operation occurred
Operation type PROCESS, BATCH, IMPORT, etc.
Input file(s) Source document path(s)
Entity count Number of entities processed
Model version NLP model used (spaCy fr_core_news_lg version)
Theme Pseudonym library theme
Success/failure Operation outcome
Processing time Duration in seconds

Audit logs can be exported in JSON or CSV format for compliance reporting:

poetry run gdpr-pseudo export audit_log.json

6. Academic Citation

6.1 How to Cite

When referencing this tool in academic publications, ethics board submissions, or DPIAs:

Deveaux, L. (2026). GDPR Pseudonymizer: AI-Assisted Pseudonymization for French Documents with Human Verification (Version 1.0) [Computer software]. GitHub. https://github.com/LioChanDaYo/RGPDpseudonymizer

BibTeX:

@software{deveaux2026gdpr,
  author = {Deveaux, Lionel},
  title = {{GDPR Pseudonymizer: AI-Assisted Pseudonymization for French Documents with Human Verification}},
  year = {2026},
  version = {1.0},
  url = {https://github.com/LioChanDaYo/RGPDpseudonymizer}
}

6.2 References

  1. European Parliament and Council. (2016). Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88.

  2. Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://spacy.io

  3. Harkins, P. (2023). RFC 5297: Synthetic Initialization Vector (SIV) Authenticated Encryption Using the Advanced Encryption Standard (AES). IETF.

  4. OWASP Foundation. (2023). Password Storage Cheat Sheet. https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html

  5. NIST. (2010). SP 800-132: Recommendation for Password-Based Key Derivation. National Institute of Standards and Technology.



Disclaimer: This tool assists with GDPR compliance but does not constitute legal advice. Consult your Data Protection Officer (DPO) or legal counsel for specific compliance guidance applicable to your context.