Methodology: GDPR-Compliant Pseudonymization of French Text¶
GDPR Pseudonymizer - Technical Methodology and Academic Reference
Version: 1.0 Date: 2026-02-08 Suitable for: Ethics board review, DPIA documentation, academic citation
1. Introduction¶
GDPR Pseudonymizer is an AI-assisted pseudonymization tool designed for French-language text documents. It combines automated Named Entity Recognition (NER) with mandatory human validation to achieve 100% pseudonymization accuracy while maintaining document utility for downstream analysis.
This document describes the technical methodology, validation procedures, quality controls, and GDPR compliance mapping for use in academic research, ethics board submissions, and Data Protection Impact Assessments (DPIAs).
1.1 Design Philosophy¶
The tool follows a human-in-the-loop approach to pseudonymization:
- AI pre-detection identifies candidate entities using hybrid NLP + regex techniques
- Mandatory human validation ensures every entity is reviewed before replacement
- Consistent pseudonymization maintains document coherence across multi-document corpora
- Local-only processing ensures no data leaves the user's machine
This approach prioritizes zero false negatives (no missed entities) over full automation, recognizing that current NLP accuracy on French interview and business documents does not meet the threshold required for unsupervised pseudonymization.
2. Technical Approach¶
2.1 NER Library Selection¶
The NER component was selected through a controlled benchmark on a 25-document test corpus containing 1,855 manually annotated entities (PERSON, LOCATION, ORGANIZATION) drawn from French interview transcripts and business documents.
Benchmark Results (Story 1.2, 2026-01-16):
| Library | Model | Precision | Recall | F1 Score |
|---|---|---|---|---|
| spaCy | fr_core_news_lg 3.8.0 |
27.0% | 32.7% | 29.5% |
| Stanza | fr_default |
10.3% | 14.1% | 11.9% |
Per-Entity Type Performance (spaCy):
| Entity Type | Ground Truth | Precision | Recall | F1 |
|---|---|---|---|---|
| PERSON | 1,627 | 37.8% | 31.3% | 34.2% |
| LOCATION | 123 | 29.6% | 58.5% | 39.3% |
| ORGANIZATION | 105 | 3.8% | 27.6% | 6.7% |
Selection rationale: spaCy was selected for its 2.5x better F1 score, faster inference speed, larger community, and better documentation. Both libraries failed the initial 85% F1 target, leading to the adoption of a hybrid detection strategy with mandatory human validation.
Reference: Full benchmark analysis available in docs/nlp-benchmark-report.md.
2.2 Hybrid Detection Strategy¶
To compensate for baseline NER limitations, a hybrid approach combines NLP with rule-based pattern matching (Story 1.8, 2026-01-22):
Detection pipeline:
Input Text
├── spaCy NER (fr_core_news_lg) → NLP-detected entities
├── Regex Pattern Matcher → Pattern-detected entities
│ ├── French title patterns (M., Mme, Dr., Maître/Me)
│ ├── Compound name patterns (Jean-Pierre, Marie-Claire)
│ ├── French name dictionary matching
│ ├── Organization suffix patterns (SA, SARL, SAS, Cabinet)
│ └── False positive filtering (title-only, label words)
└── Entity Deduplication → Merged, deduplicated entity set
Hybrid performance improvement:
| Metric | spaCy Only | Hybrid (NLP + Regex) | Improvement |
|---|---|---|---|
| Total entities detected | 2,679 | 3,625 | +35.3% |
| PERSON entities | 1,612 | 2,454 | +52.2% |
| Processing time per doc | 0.07s | 0.07s | No overhead |
Reference: Full hybrid benchmark in docs/hybrid-benchmark-report.md.
2.3 Pseudonymization Algorithm¶
The pseudonymization engine uses a compositional approach to maintain internal consistency:
Core principles:
-
Component-based mapping: Full names are decomposed into components (e.g., "Marie Dubois" → ["Marie", "Dubois"]), each mapped to a pseudonym component. Partial references ("Marie" alone) resolve to the same pseudonym component ("Leia").
-
Themed pseudonym libraries: Three libraries provide culturally distinct pseudonym sets:
- Neutral: French-sounding names, real French cities, realistic company names
- Star Wars: Characters, planets, and factions from the Star Wars universe
-
Lord of the Rings: Characters and locations from Tolkien's Middle-earth
-
Entity type coverage:
- PERSON: First name + last name with gender matching (when NER provides classification)
- LOCATION: Cities, regions, countries
-
ORGANIZATION: Companies, institutions, agencies
-
French name preprocessing:
- Title stripping: "Dr. Marie Dubois" → "Marie Dubois" (title preserved in output)
- Compound name handling: "Jean-Pierre" treated as atomic unit
- Deduplication: Identical entities validated once, applied to all occurrences
2.4 Encryption Scheme¶
Entity mappings are stored in an encrypted SQLite database using industry-standard cryptographic primitives:
| Component | Implementation | Standard |
|---|---|---|
| Encryption | AES-256-SIV (deterministic AEAD) | RFC 5297, NIST approved |
| Key Derivation | PBKDF2-HMAC-SHA256 | NIST SP 800-132 |
| Iterations | 210,000 | OWASP 2023 recommendation |
| Salt | 32 bytes, cryptographically random | Per-database unique |
Deterministic encryption rationale: AES-256-SIV is used to enable encrypted field queries (e.g., looking up existing mappings for compositional pseudonymization). This is an industry-standard trade-off used by AWS DynamoDB, Google Tink, and MongoDB for searchable encrypted fields.
Security model: The database is local-only (no network exposure). An attacker requires both physical access to the machine and knowledge of the passphrase to decrypt mappings.
3. Validation Procedures and Quality Controls¶
3.1 Interactive Validation Workflow¶
Every document undergoes mandatory human validation before pseudonymization is applied:
- Entity detection: Hybrid NLP + regex pipeline identifies candidate entities
- Summary screen: Validator sees entity counts by type (PERSON, LOCATION, ORG)
- Entity-by-entity review: Each entity displayed with:
- Surrounding context (10 words before/after with highlighting)
- Entity type and detection source (NLP or regex)
- Confidence score (color-coded: green >80%, yellow 60-80%, red <60%)
- Proposed pseudonym from themed library
- Validator actions: Accept, reject (false positive), edit entity text, add missed entity, change pseudonym
- Batch operations: Accept/reject all entities of a given type
- Final confirmation: Summary of all decisions before processing
Validation performance: <2 minutes for typical documents (20-30 entities). Entity deduplication reduces validation time by ~66% for large documents with repeated entities.
3.2 NER Accuracy Metrics¶
| Stage | Detection Rate | False Positives | Notes |
|---|---|---|---|
| spaCy baseline | 29.5% F1 | High | Pre-trained on news text |
| Hybrid (NLP + regex) | ~40-50% F1 | Moderate | +35.3% recall improvement |
| Post-validation | 100% accuracy | 0% | Human review catches all errors |
3.3 LLM Utility Preservation¶
Pseudonymized documents were validated for downstream utility using LLM-as-judge methodology (Story 4.1, 2026-02-06):
Test design: A/B comparison of LLM responses on original vs. pseudonymized versions of 10 representative documents (5 interview transcripts, 5 business documents).
Evaluation dimensions (1-5 scale):
| Dimension | Mean Score | Interpretation |
|---|---|---|
| Thematic Accuracy | 4.27/5.0 | Themes fully preserved |
| Relationship Coherence | 4.27/5.0 | Entity relationships maintained |
| Factual Preservation | 4.27/5.0 | Facts accurately extracted |
| Overall Utility | 4.27/5.0 (85.4%) | Exceeds 80% threshold |
Conclusion: Pseudonymized documents retain sufficient utility for LLM-based analysis (summarization, theme extraction, relationship mapping).
Reference: Full validation report in docs/llm-validation-report.md.
4. Limitations and Edge Cases¶
4.1 NER Accuracy¶
- Baseline NER accuracy is 29.5% F1 (below the 85% target for unsupervised processing)
- Hybrid detection improves to ~40-50% F1 but still requires mandatory human validation
- ORG detection is particularly weak (3.8% precision = 96% false positive rate)
- Pre-trained models optimized for news text, not interview/business documents
- Fine-tuning with user validation data is planned for v3.0
4.2 Language and Format Limitations¶
- French only in v1.0 (English, Spanish, German planned for future versions)
- Text formats only:
.txtand.md(no PDF, DOCX, or HTML support in v1.0) - Text must use proper French encoding (UTF-8 with accents: é, è, à, etc.)
4.3 Entity Detection Edge Cases¶
- Uncommon French names not in spaCy's training data may be missed by NLP (caught during validation)
- Compound names ("Jean-Pierre") treated as atomic units (not split)
- Titles ("Dr.", "Maître", "Me") are stripped for matching, preserved in output
- Abbreviations may not be detected by NLP or regex patterns
- Gender-aware pseudonym assignment not yet implemented for all cases (FE-007, planned for v1.1)
4.4 Operational Constraints¶
- Passphrase is irrecoverable: If lost, existing mappings cannot be decrypted and pseudonymization cannot be reversed
- Theme consistency: Switching themes mid-project creates inconsistent pseudonyms
- Single-user design: No multi-user access control in v1.0 (SQLite with WAL mode supports concurrent reads)
5. GDPR Compliance Mapping¶
5.1 Pseudonymization Definition (Article 4(5))¶
"Pseudonymisation means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."
-- GDPR Article 4(5)
GDPR Pseudonymizer compliance:
| Requirement | Implementation |
|---|---|
| Personal data replaced with pseudonyms | PERSON, LOCATION, ORG entities replaced with themed pseudonyms |
| Additional information kept separately | Mapping database stored separately from pseudonymized documents |
| Technical measures for separation | AES-256-SIV encryption with passphrase protection |
| Reversibility | Authorized users can de-pseudonymize with correct passphrase |
5.2 Pseudonymization vs. Anonymization¶
Important distinction: Pseudonymized data is still personal data under GDPR (Recital 26). The mapping table allows re-identification, which means:
- GDPR obligations still apply to pseudonymized data
- The mapping database must be protected with appropriate security measures
- Pseudonymization is a risk reduction measure, not an exemption from GDPR
Anonymization (irreversible removal of all identifying information) is outside the scope of this tool. GDPR Pseudonymizer specifically preserves reversibility for use cases requiring it (academic research, legal proceedings, audit requirements).
5.3 GDPR Article Compliance Matrix¶
| GDPR Article | Requirement | Implementation | Status |
|---|---|---|---|
| Art. 4(5) | Pseudonymization definition | Consistent replacement with separate encrypted mapping | Compliant |
| Art. 25 | Data protection by design | 100% local processing, no cloud dependencies, encrypted storage | Compliant |
| Art. 32 | Security of processing | AES-256-SIV encryption, PBKDF2 key derivation (210K iterations), passphrase protection | Compliant |
| Art. 30 | Records of processing | Comprehensive audit logs with timestamp, entity count, model version, success/failure; JSON/CSV export | Compliant |
| Art. 35 | Privacy Impact Assessment | Transparent methodology (this document), auditable processing pipeline | Supports DPIA |
| Art. 89 | Research safeguards | Pseudonymization as technical safeguard for research/statistical purposes | Compliant |
| Recital 26 | Pseudonymized data scope | Tool documentation clearly states pseudonymized data remains personal data | Acknowledged |
| Recital 28 | Technical safeguard | Pseudonymization applied as organizational/technical measure | Compliant |
5.4 Audit Trail¶
Every pseudonymization operation is logged in an operations table:
| Field | Description |
|---|---|
| Timestamp | When the operation occurred |
| Operation type | PROCESS, BATCH, IMPORT, etc. |
| Input file(s) | Source document path(s) |
| Entity count | Number of entities processed |
| Model version | NLP model used (spaCy fr_core_news_lg version) |
| Theme | Pseudonym library theme |
| Success/failure | Operation outcome |
| Processing time | Duration in seconds |
Audit logs can be exported in JSON or CSV format for compliance reporting:
poetry run gdpr-pseudo export audit_log.json
6. Academic Citation¶
6.1 How to Cite¶
When referencing this tool in academic publications, ethics board submissions, or DPIAs:
Deveaux, L. (2026). GDPR Pseudonymizer: AI-Assisted Pseudonymization for French Documents with Human Verification (Version 1.0) [Computer software]. GitHub. https://github.com/LioChanDaYo/RGPDpseudonymizer
BibTeX:
@software{deveaux2026gdpr,
author = {Deveaux, Lionel},
title = {{GDPR Pseudonymizer: AI-Assisted Pseudonymization for French Documents with Human Verification}},
year = {2026},
version = {1.0},
url = {https://github.com/LioChanDaYo/RGPDpseudonymizer}
}
6.2 References¶
-
European Parliament and Council. (2016). Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88.
-
Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://spacy.io
-
Harkins, P. (2023). RFC 5297: Synthetic Initialization Vector (SIV) Authenticated Encryption Using the Advanced Encryption Standard (AES). IETF.
-
OWASP Foundation. (2023). Password Storage Cheat Sheet. https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html
-
NIST. (2010). SP 800-132: Recommendation for Password-Based Key Derivation. National Institute of Standards and Technology.
7. Related Documents¶
- NLP Benchmark Report - Full NER library evaluation
- Hybrid Detection Report - Hybrid approach performance analysis
- LLM Validation Report - Utility preservation testing
- Installation Guide - Setup instructions
- CLI Reference - Command documentation
- FAQ - Common questions
Disclaimer: This tool assists with GDPR compliance but does not constitute legal advice. Consult your Data Protection Officer (DPO) or legal counsel for specific compliance guidance applicable to your context.