Methodology: GDPR-Compliant Pseudonymization of French Text¶

GDPR Pseudonymizer - Technical Methodology and Academic Reference

Version: 1.0 Date: 2026-02-08 Suitable for: Ethics board review, DPIA documentation, academic citation

1. Introduction¶

GDPR Pseudonymizer is an AI-assisted pseudonymization tool designed for French-language text documents. It combines automated Named Entity Recognition (NER) with mandatory human validation to achieve 100% pseudonymization accuracy while maintaining document utility for downstream analysis.

This document describes the technical methodology, validation procedures, quality controls, and GDPR compliance mapping for use in academic research, ethics board submissions, and Data Protection Impact Assessments (DPIAs).

1.1 Design Philosophy¶

The tool follows a human-in-the-loop approach to pseudonymization:

AI pre-detection identifies candidate entities using hybrid NLP + regex techniques
Mandatory human validation ensures every entity is reviewed before replacement
Consistent pseudonymization maintains document coherence across multi-document corpora
Local-only processing ensures no data leaves the user's machine

This approach prioritizes zero false negatives (no missed entities) over full automation, recognizing that current NLP accuracy on French interview and business documents does not meet the threshold required for unsupervised pseudonymization.

2. Technical Approach¶

2.1 NER Library Selection¶

The NER component was selected through a controlled benchmark on a 25-document test corpus containing 1,855 manually annotated entities (PERSON, LOCATION, ORGANIZATION) drawn from French interview transcripts and business documents.

Benchmark Results (Story 1.2, 2026-01-16):

Library	Model	Precision	Recall	F1 Score
spaCy	`fr_core_news_lg` 3.8.0	27.0%	32.7%	29.5%
Stanza	`fr_default`	10.3%	14.1%	11.9%

Per-Entity Type Performance (spaCy):

Entity Type	Ground Truth	Precision	Recall	F1
PERSON	1,627	37.8%	31.3%	34.2%
LOCATION	123	29.6%	58.5%	39.3%
ORGANIZATION	105	3.8%	27.6%	6.7%

Selection rationale: spaCy was selected for its 2.5x better F1 score, faster inference speed, larger community, and better documentation. Both libraries failed the initial 85% F1 target, leading to the adoption of a hybrid detection strategy with mandatory human validation.

Reference: Full benchmark analysis available in docs/nlp-benchmark-report.md.

2.2 Hybrid Detection Strategy¶

To compensate for baseline NER limitations, a hybrid approach combines NLP with rule-based pattern matching (Story 1.8, 2026-01-22):

Detection pipeline:

Input Text
    ├── spaCy NER (fr_core_news_lg)  →  NLP-detected entities
    ├── Regex Pattern Matcher         →  Pattern-detected entities
    │   ├── French title patterns (M., Mme, Dr., Maître/Me)
    │   ├── Compound name patterns (Jean-Pierre, Marie-Claire)
    │   ├── French name dictionary matching
    │   ├── Organization suffix patterns (SA, SARL, SAS, Cabinet)
    │   └── False positive filtering (title-only, label words)
    └── Entity Deduplication          →  Merged, deduplicated entity set

Hybrid performance improvement:

Metric	spaCy Only	Hybrid (NLP + Regex)	Improvement
Total entities detected	2,679	3,625	+35.3%
PERSON entities	1,612	2,454	+52.2%
Processing time per doc	0.07s	0.07s	No overhead

Reference: Full hybrid benchmark in docs/hybrid-benchmark-report.md.

2.3 Pseudonymization Algorithm¶

The pseudonymization engine uses a compositional approach to maintain internal consistency:

Core principles:

Component-based mapping: Full names are decomposed into components (e.g., "Marie Dubois" → ["Marie", "Dubois"]), each mapped to a pseudonym component. Partial references ("Marie" alone) resolve to the same pseudonym component ("Leia").
Themed pseudonym libraries: Three libraries provide culturally distinct pseudonym sets:
Neutral: French-sounding names, real French cities, realistic company names
Star Wars: Characters, planets, and factions from the Star Wars universe
Lord of the Rings: Characters and locations from Tolkien's Middle-earth
Entity type coverage:
PERSON: First name + last name with gender matching (when NER provides classification)
LOCATION: Cities, regions, countries
ORGANIZATION: Companies, institutions, agencies
French name preprocessing:
Title stripping: "Dr. Marie Dubois" → "Marie Dubois" (title preserved in output)
Compound name handling: "Jean-Pierre" treated as atomic unit
Deduplication: Identical entities validated once, applied to all occurrences

2.4 Encryption Scheme¶

Entity mappings are stored in an encrypted SQLite database using industry-standard cryptographic primitives:

Component	Implementation	Standard
Encryption	AES-256-SIV (deterministic AEAD)	RFC 5297, NIST approved
Key Derivation	PBKDF2-HMAC-SHA256	NIST SP 800-132
Iterations	210,000	OWASP 2023 recommendation
Salt	32 bytes, cryptographically random	Per-database unique

Deterministic encryption rationale: AES-256-SIV is used to enable encrypted field queries (e.g., looking up existing mappings for compositional pseudonymization). This is an industry-standard trade-off used by AWS DynamoDB, Google Tink, and MongoDB for searchable encrypted fields.

Security model: The database is local-only (no network exposure). An attacker requires both physical access to the machine and knowledge of the passphrase to decrypt mappings.

3. Validation Procedures and Quality Controls¶

3.1 Interactive Validation Workflow¶

Every document undergoes mandatory human validation before pseudonymization is applied:

Entity detection: Hybrid NLP + regex pipeline identifies candidate entities
Summary screen: Validator sees entity counts by type (PERSON, LOCATION, ORG)
Entity-by-entity review: Each entity displayed with:
Surrounding context (10 words before/after with highlighting)
Entity type and detection source (NLP or regex)
Confidence score (color-coded: green >80%, yellow 60-80%, red <60%)
Proposed pseudonym from themed library
Validator actions: Accept, reject (false positive), edit entity text, add missed entity, change pseudonym
Batch operations: Accept/reject all entities of a given type
Final confirmation: Summary of all decisions before processing

Validation performance: <2 minutes for typical documents (20-30 entities). Entity deduplication reduces validation time by ~66% for large documents with repeated entities.

3.2 NER Accuracy Metrics¶

Stage	Detection Rate	False Positives	Notes
spaCy baseline	29.5% F1	High	Pre-trained on news text
Hybrid (NLP + regex)	~40-50% F1	Moderate	+35.3% recall improvement
Post-validation	100% accuracy	0%	Human review catches all errors

3.3 LLM Utility Preservation¶

Pseudonymized documents were validated for downstream utility using LLM-as-judge methodology (Story 4.1, 2026-02-06):

Test design: A/B comparison of LLM responses on original vs. pseudonymized versions of 10 representative documents (5 interview transcripts, 5 business documents).

Evaluation dimensions (1-5 scale):

Dimension	Mean Score	Interpretation
Thematic Accuracy	4.27/5.0	Themes fully preserved
Relationship Coherence	4.27/5.0	Entity relationships maintained
Factual Preservation	4.27/5.0	Facts accurately extracted
Overall Utility	4.27/5.0 (85.4%)	Exceeds 80% threshold

Conclusion: Pseudonymized documents retain sufficient utility for LLM-based analysis (summarization, theme extraction, relationship mapping).

Reference: Full validation report in docs/llm-validation-report.md.

4. Limitations and Edge Cases¶

4.1 NER Accuracy¶

Baseline NER accuracy is 29.5% F1 (below the 85% target for unsupervised processing)
Hybrid detection improves to ~40-50% F1 but still requires mandatory human validation
ORG detection is particularly weak (3.8% precision = 96% false positive rate)
Pre-trained models optimized for news text, not interview/business documents
Fine-tuning with user validation data is planned for v3.0

4.2 Language and Format Limitations¶

French only in v1.0 (English, Spanish, German planned for future versions)
Text formats only: .txt and .md (no PDF, DOCX, or HTML support in v1.0)
Text must use proper French encoding (UTF-8 with accents: é, è, à, etc.)

4.3 Entity Detection Edge Cases¶

Uncommon French names not in spaCy's training data may be missed by NLP (caught during validation)
Compound names ("Jean-Pierre") treated as atomic units (not split)
Titles ("Dr.", "Maître", "Me") are stripped for matching, preserved in output
Abbreviations may not be detected by NLP or regex patterns
Gender-aware pseudonym assignment not yet implemented for all cases (FE-007, planned for v1.1)

4.4 Operational Constraints¶

Passphrase is irrecoverable: If lost, existing mappings cannot be decrypted and pseudonymization cannot be reversed
Theme consistency: Switching themes mid-project creates inconsistent pseudonyms
Single-user design: No multi-user access control in v1.0 (SQLite with WAL mode supports concurrent reads)

5.1 Pseudonymization Definition (Article 4(5))¶

"Pseudonymisation means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."

-- GDPR Article 4(5)

GDPR Pseudonymizer compliance:

Requirement	Implementation
Personal data replaced with pseudonyms	PERSON, LOCATION, ORG entities replaced with themed pseudonyms
Additional information kept separately	Mapping database stored separately from pseudonymized documents
Technical measures for separation	AES-256-SIV encryption with passphrase protection
Reversibility	Authorized users can de-pseudonymize with correct passphrase

5.2 Pseudonymization vs. Anonymization¶

Important distinction: Pseudonymized data is still personal data under GDPR (Recital 26). The mapping table allows re-identification, which means:

GDPR obligations still apply to pseudonymized data
The mapping database must be protected with appropriate security measures
Pseudonymization is a risk reduction measure, not an exemption from GDPR

Anonymization (irreversible removal of all identifying information) is outside the scope of this tool. GDPR Pseudonymizer specifically preserves reversibility for use cases requiring it (academic research, legal proceedings, audit requirements).

GDPR Article	Requirement	Implementation	Status
Art. 4(5)	Pseudonymization definition	Consistent replacement with separate encrypted mapping	Compliant
Art. 25	Data protection by design	100% local processing, no cloud dependencies, encrypted storage	Compliant
Art. 32	Security of processing	AES-256-SIV encryption, PBKDF2 key derivation (210K iterations), passphrase protection	Compliant
Art. 30	Records of processing	Comprehensive audit logs with timestamp, entity count, model version, success/failure; JSON/CSV export	Compliant
Art. 35	Privacy Impact Assessment	Transparent methodology (this document), auditable processing pipeline	Supports DPIA
Art. 89	Research safeguards	Pseudonymization as technical safeguard for research/statistical purposes	Compliant
Recital 26	Pseudonymized data scope	Tool documentation clearly states pseudonymized data remains personal data	Acknowledged
Recital 28	Technical safeguard	Pseudonymization applied as organizational/technical measure	Compliant

5.4 Audit Trail¶

Every pseudonymization operation is logged in an operations table:

Field	Description
Timestamp	When the operation occurred
Operation type	PROCESS, BATCH, IMPORT, etc.
Input file(s)	Source document path(s)
Entity count	Number of entities processed
Model version	NLP model used (spaCy fr_core_news_lg version)
Theme	Pseudonym library theme
Success/failure	Operation outcome
Processing time	Duration in seconds

Audit logs can be exported in JSON or CSV format for compliance reporting:

poetry run gdpr-pseudo export audit_log.json

6. Academic Citation¶

6.1 How to Cite¶

When referencing this tool in academic publications, ethics board submissions, or DPIAs:

Deveaux, L. (2026). GDPR Pseudonymizer: AI-Assisted Pseudonymization for French Documents with Human Verification (Version 1.0) [Computer software]. GitHub. https://github.com/LioChanDaYo/RGPDpseudonymizer

BibTeX:

@software{deveaux2026gdpr,
  author = {Deveaux, Lionel},
  title = {{GDPR Pseudonymizer: AI-Assisted Pseudonymization for French Documents with Human Verification}},
  year = {2026},
  version = {1.0},
  url = {https://github.com/LioChanDaYo/RGPDpseudonymizer}
}

6.2 References¶

European Parliament and Council. (2016). Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88.
Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://spacy.io
Harkins, P. (2023). RFC 5297: Synthetic Initialization Vector (SIV) Authenticated Encryption Using the Advanced Encryption Standard (AES). IETF.
OWASP Foundation. (2023). Password Storage Cheat Sheet. https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html
NIST. (2010). SP 800-132: Recommendation for Password-Based Key Derivation. National Institute of Standards and Technology.

NLP Benchmark Report - Full NER library evaluation
Hybrid Detection Report - Hybrid approach performance analysis
LLM Validation Report - Utility preservation testing
Installation Guide - Setup instructions
CLI Reference - Command documentation
FAQ - Common questions

Disclaimer: This tool assists with GDPR compliance but does not constitute legal advice. Consult your Data Protection Officer (DPO) or legal counsel for specific compliance guidance applicable to your context.