The NACE Transition at Insee
Insee — French National Institute of Statistics and Economic Studies
2026-04-06
The Problem
Public statistics rely on shared classifications
What NACE is
The Problem
A recurring challenge for every statistical institute (NACE, COICOP, ISCO, …).
Standard ML maintenance
Classification revision
Central question: how can we leverage years of accumulated production knowledge in the old classification to bootstrap a model in the new one?
The Problem
| | Multivocal | Univocal |
|---|---|---|
| Correspondence table | 25 % | 75 % |
| Actual training set | 52 % | 48 % |
~1.4 M observations that cannot be recoded by table lookup
The problem cannot be solved by a deterministic mapping table alone.
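The gap between the two rows of the table can be illustrated with a minimal sketch. It assumes that the correspondence-table share counts codes while the training-set share counts observations, so a few frequent multivocal codes can dominate the data. The table excerpt, the placeholder codes and the function names below are hypothetical.

```python
from collections import Counter

# Hypothetical excerpt of a NACE 2.0 -> NACE 2.1 correspondence table.
# Codes are placeholders, not real classification entries.
CORRESPONDENCE = {
    "A1": ["X1"],              # univocal: exactly one possible target
    "A2": ["X2"],
    "A3": ["X3"],
    "B1": ["Y1", "Y2", "Y3"],  # multivocal: several possible targets
}

def recode_by_lookup(old_code, table=CORRESPONDENCE):
    """Return the new code if the mapping is univocal, else None."""
    targets = table[old_code]
    return targets[0] if len(targets) == 1 else None

def multivocal_share_of_codes(table=CORRESPONDENCE):
    """Share of old codes that map to more than one new code."""
    multivocal = sum(1 for targets in table.values() if len(targets) > 1)
    return multivocal / len(table)

def multivocal_share_of_observations(old_codes, table=CORRESPONDENCE):
    """Share of observations whose old code is multivocal."""
    counts = Counter(old_codes)
    multivocal = sum(n for code, n in counts.items() if len(table[code]) > 1)
    return multivocal / sum(counts.values())

# Toy register: the single multivocal code is also the most frequent one.
observations = ["A1", "A2", "B1", "B1", "B1", "A3", "B1", "B1"]
share_codes = multivocal_share_of_codes()
share_obs = multivocal_share_of_observations(observations)
```

In this toy register one code in four is multivocal, yet it covers five observations in eight: a lookup leaves most of the data unlabelled even though most of the table is univocal.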
Human and Virtual Annotation
Expert agreement is imperfect
Explicit reasoning protocol
Methodology
Why not ask the LLM directly?
Constrained selection instead:
Methodology
For each observation:
A form of Rule-Based Augmented Generation (RBAG) with a deterministic retriever.
Inputs to the prompt
| Element | Source |
|---|---|
| Activity text + precisions | Production register |
| NACE 2.0 code | Insee NACE Experts |
| Candidate set | Correspondence table |
| Explanatory notes | Official NACE 2.1 documentation |
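A deterministic retriever of this kind can be sketched as two dictionary lookups: the correspondence table gives the candidate codes, and the explanatory notes are attached to each of them. This is only a sketch under stated assumptions: the `Candidate` class, the function name and the `86.90` entry are illustrative, and the mapping shown is not a copy of the official table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    code: str
    title: str
    notes: str

# Illustrative excerpt only: the NACE 2.0 code and its targets are
# assumptions, not the official correspondence table.
CORRESPONDENCE = {"86.90": ["86.95", "86.96", "86.99"]}

# Titles and notes as they would be read from the NACE 2.1 documentation.
# Elided notes are kept elided.
EXPLANATORY_NOTES = {
    "86.95": ("Physiotherapy activities",
              "Includes: physiotherapy, medical massage, occupational therapy, praxitherapy…"),
    "86.96": ("Traditional, complementary, and alternative medicine", "[...]"),
    "86.99": ("Other human health activities n.e.c.", "[...]"),
}

def retrieve_candidates(nace_old: str) -> list[Candidate]:
    """Same old code in, same candidate list out: no embeddings, no ranking."""
    targets = CORRESPONDENCE.get(nace_old, [])
    return [Candidate(code, *EXPLANATORY_NOTES[code]) for code in targets]

candidates = retrieve_candidates("86.90")
```

Because retrieval is a lookup, the candidate set is reproducible and auditable; its quality depends entirely on the old code being right.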
Methodology
System prompt — same for every observation
You are an expert in the NACE classification. Your task is to assign a NACE 2.1 code to a business based on its activity description, using a candidate list derived from the existing NACE 2.0 code…
User prompt — observation-specific
# Main activity : {{activity}}
# NACE 2.0 code : {{nace_old}}
# Candidate NACE 2.1 codes + notes : {{proposed_codes}}
========
# Instructions (7 rules)
→ candidate list only — no external code
→ first activity only
→ strict JSON — no explanation
Example of {{proposed_codes}}:
86.95: Physiotherapy activities
Includes: physiotherapy, medical massage, occupational therapy, praxitherapy…
Does not include: osteopaths, chiropractors, non-medical massage parlors…
86.96: Traditional, complementary, and alternative medicine [...]
86.99: Other human health activities n.e.c. [...]
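The user prompt and the "candidate list only" rule can be sketched as a template fill followed by a check on the model's answer. The JSON key `code`, the function names, the sample activity text and the `86.90` input are assumptions made for illustration; the actual output schema is not shown in these slides.

```python
import json

USER_TEMPLATE = (
    "# Main activity : {activity}\n"
    "# NACE 2.0 code : {nace_old}\n"
    "# Candidate NACE 2.1 codes + notes : {proposed_codes}\n"
)

def format_candidates(candidates):
    """Render (code, title, notes) tuples as 'code: title' followed by notes."""
    lines = []
    for code, title, notes in candidates:
        lines.append(f"{code}: {title}")
        if notes:
            lines.append(notes)
    return "\n".join(lines)

def build_user_prompt(activity, nace_old, candidates):
    return USER_TEMPLATE.format(
        activity=activity,
        nace_old=nace_old,
        proposed_codes="\n" + format_candidates(candidates),
    )

def parse_answer(raw, candidates):
    """Return the chosen code, or None if the answer is not valid JSON
    or names a code outside the candidate list."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    code = payload.get("code")  # hypothetical key name
    allowed = {c for c, _, _ in candidates}
    return code if isinstance(code, str) and code in allowed else None

cands = [
    ("86.95", "Physiotherapy activities", "Includes: physiotherapy, medical massage…"),
    ("86.99", "Other human health activities n.e.c.", ""),
]
prompt = build_user_prompt("Physiotherapy practice", "86.90", cands)
```

Rejecting any answer outside the candidate list turns the constraint in the prompt into a guarantee in the pipeline, whatever the model returns.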
Methodology
| Model | Size | Inference speed | Behaviour |
|---|---|---|---|
| Qwen3-6-35B MoE | 35B | Very fast (13 it/s) | Moderately restrictive |
| Qwen3-6-35B MoE + thinking | 35B | Slower (~1 it/s) | Less restrictive |
| Gemma4-26B MoE | 26B | Fast (8 it/s) | Moderately restrictive |
➕️ Aggregation: majority vote.
Different sizes, different throughput.
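A majority vote over the three models can be sketched as follows. The tie-breaking rule is an assumption, since the slides do not say what happens when all three models disagree: here the function returns a caller-supplied fallback, and invalid answers (`None`) never win.

```python
from collections import Counter

def majority_vote(predictions, fallback=None):
    """Return the code chosen by a strict majority of models.

    `predictions` holds one code per model, or None for an invalid answer.
    If no code is chosen by more than half of the models, return `fallback`
    (an assumption: the real pipeline may break ties differently).
    """
    votes = [p for p in predictions if p is not None]
    if not votes:
        return fallback
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(predictions) / 2 else fallback

agreed = majority_vote(["86.95", "86.95", "86.99"])
split = majority_vote(["86.95", "86.96", "86.99"])
```

With three voters a strict majority means at least two agreeing models; observations without one can be flagged for review instead of being labelled.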
Infrastructure
SSPCloud is Insee’s data science platform, built on the open-source Onyxia project and running on Kubernetes.
Main elements of the technical stack:
Entirely open-source, containerised, reproducible.
Results
Accuracies at level 5 (sub-class) — % of cases where the predicted code matches the manual annotation.
| Model | Inference | Overall | Codable | LLM-only |
|---|---|---|---|---|
| Qwen3-6-35B MoE | 13 it/s | 75.2 % | 78.3 % | 83.3 % |
| Qwen3-6-35B MoE + thinking | ~1 it/s | 75.7 % | 80.4 % | 84.1 % |
| Gemma4-26B MoE | 8 it/s | 75.2 % | 80.0 % | 83.5 % |
| Majority vote | — | 78.3 % | — | 86.9 % |
The codable filter is per-model, hence the dash for the ensemble vote.
Ensemble accuracy comes at a real cost: three full inference passes, one of them in thinking mode (~10× slower).
A substantial share of disagreements are not LLM errors but ambiguous descriptions or legacy-coding mistakes.
Results
Self-reported confidence is not a reliable filter on its own.
Results
Once the LLM performs well enough on the benchmark, we apply it at scale:
Tip
This is a semi-synthetic training set: real businesses, real descriptions, but algorithmic labels.
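One way to assemble such a corpus is sketched below. It assumes that univocal observations are recoded by table lookup and that only multivocal ones are sent to the LLM; the record fields, the `label_source` column and the stub labeller are hypothetical.

```python
def build_semi_synthetic(records, table, llm_label):
    """Attach a NACE 2.1 label to each record.

    records   : dicts with 'text' and 'nace_old' (hypothetical field names)
    table     : old code -> list of candidate new codes
    llm_label : callable(text, candidates) -> code or None
    """
    corpus = []
    for rec in records:
        targets = table[rec["nace_old"]]
        if len(targets) == 1:
            label, source = targets[0], "table"
        else:
            label, source = llm_label(rec["text"], targets), "llm"
        if label is not None:  # drop observations the LLM could not code
            corpus.append({"text": rec["text"], "label": label,
                           "label_source": source})
    return corpus

# Stub standing in for the LLM call, for illustration only.
def first_candidate(text, candidates):
    return candidates[0]

table = {"A1": ["X1"], "B1": ["Y1", "Y2"]}
records = [{"text": "bakery", "nace_old": "A1"},
           {"text": "massage", "nace_old": "B1"}]
corpus = build_semi_synthetic(records, table, first_candidate)
```

Keeping the label source next to each label makes it possible to weight or audit the algorithmic labels separately when retraining.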
Results
TorchTextClassifiers retrained on the semi-synthetic corpus
~80 % overall accuracy on a representative NACE 2.1 test set, ≈ the legacy NACE 2.0 classifier on its NACE 2.0 task
Last retrain used an earlier vintage of the corpus; today’s pipeline produces better labels.
Semi-synthetic LLM-generated labels can substitute for manual annotations in the early stages of a classification transition.
An Alternative: Pure RAG
Limitations of RBAG
What RAG brings
Even where RBAG works, RAG is worth mastering — and quantifying.
An Alternative: Pure RAG
qwen3-6-35b-moe · 28,499 multivocal obs · only the candidate-set source differs
Level-5 accuracy decomposition
| | RBAG | Pure RAG |
|---|---|---|
| Truth in candidate set | 90.0 % | 83.5 % |
| LLM picks right (when in set) | 83.4 % | 78.1 % |
| Overall | 75.3 % | 65.3 % |
The retriever is the bottleneck
Closing the gap needs better retriever quality (embeddings, reranker) — not a wider top-k.
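The decomposition in the table can be computed from per-observation results, as in the sketch below. It assumes the model may only answer from the candidate set, so overall accuracy is the product of retriever coverage and conditional accuracy. The toy rows and the function name are hypothetical.

```python
def decompose(rows):
    """rows: (truth, candidate_set, prediction) triples.

    Returns (coverage, conditional_accuracy, overall_accuracy).
    """
    n = len(rows)
    in_set = [(t, c, p) for t, c, p in rows if t in c]
    coverage = len(in_set) / n
    conditional = (sum(1 for t, _, p in in_set if p == t) / len(in_set)
                   if in_set else 0.0)
    overall = sum(1 for t, _, p in rows if p == t) / n
    return coverage, conditional, overall

# Toy data: predictions are always drawn from the candidate set.
rows = [
    ("X1", {"X1", "X2"}, "X1"),  # truth retrieved, LLM right
    ("X2", {"X1", "X2"}, "X2"),  # truth retrieved, LLM right
    ("X3", {"X3", "X4"}, "X4"),  # truth retrieved, LLM wrong
    ("X5", {"X6", "X7"}, "X6"),  # truth missing: cannot be right
]
coverage, conditional, overall = decompose(rows)
```

Applied to the figures above, 0.900 × 0.834 ≈ 0.751 and 0.835 × 0.781 ≈ 0.652, close to the reported 75.3 % and 65.3 %; the small gaps presumably come from rounding or evaluation details not shown here. Both factors drop under pure RAG, but a truth missing from the candidate set can never be recovered by the LLM.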
Conclusion and Perspectives
Take-aways
Limits & ongoing work