MT masking - astilba

When you send a translation string to a machine-translation (MT) or LLM engine, three things must survive the round-trip unmodified: the interpolation variables and formatter keywords ({{count}}, {{date, datetime}}), the $t() nesting refs, and any markup tags. @astilba/core provides the masking and validation logic to make that reliable.

The problem masking solves

Left unprotected, an MT engine will happily “translate” the parts it shouldn’t:

{{count}} items might come back as {{cuenta}} elementos — the variable renamed, so interpolation silently breaks.
A formatter keyword like one/other inside a token, or a $t() ref name, can be translated to uno/otros, which then resolves to nothing.
A markup tag can be dropped or rewritten.

astilba’s answer is to mask every non-text token with an opaque sentinel before the MT call, and validate that every placeholder came back unmodified after it.

Mask, translate, unmask

maskTokens replaces every non-text token (interpolation, nesting, markup) with an opaque sentinel drawn from the Unicode private-use area (U+E000…U+E001). The formatter keyword and the $t() ref name live inside the masked span, so the engine never even sees them to translate.

import { maskTokens, unmask } from "@astilba/core";

const tokens = [
  { type: "text", raw: "Hello, " },
  { type: "interpolation", raw: "{{name}}", variable: "name" },
  { type: "text", raw: "! You have " },
  { type: "interpolation", raw: "{{count}}", variable: "count" },
  { type: "text", raw: " messages." },
] as const;

const { masked, parts } = maskTokens(tokens);
// masked → "Hello, \uE0000\uE001! You have \uE0001\uE001 messages."
// parts  → ["{{name}}", "{{count}}"]

// ...send `masked` to your MT engine and keep its output in `translated`...

const restored = unmask(translated, parts); // splices the originals back in

Sentinels use private-use-area delimiters so they carry no linguistic content for the engine to “helpfully” translate, while still being detectable if the engine mangles them.
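
To make the round trip concrete, here is a minimal sketch of how masking and unmasking could work. It is not the @astilba/core implementation: maskSketch, unmaskSketch and the SketchToken type are hypothetical names, and the sentinel layout (U+E000, the part index in decimal, U+E001) is read off the example output above rather than taken from the library source.

```typescript
// Hypothetical local sketch; the real maskTokens/unmask may differ.
type SketchToken = {
  type: "text" | "interpolation" | "nesting" | "markup";
  raw: string;
};

const SENTINEL_OPEN = "\uE000";
const SENTINEL_CLOSE = "\uE001";

// Replace every non-text token with an indexed sentinel and remember its raw form.
function maskSketch(tokens: readonly SketchToken[]): { masked: string; parts: string[] } {
  const parts: string[] = [];
  let masked = "";
  for (const token of tokens) {
    if (token.type === "text") {
      masked += token.raw;
    } else {
      masked += SENTINEL_OPEN + String(parts.length) + SENTINEL_CLOSE;
      parts.push(token.raw);
    }
  }
  return { masked, parts };
}

// Splice the originals back in; unknown indices are left untouched.
function unmaskSketch(translated: string, parts: readonly string[]): string {
  return translated.replace(/\uE000(\d+)\uE001/g, (whole: string, index: string) => {
    const original = parts[Number(index)];
    return original === undefined ? whole : original;
  });
}

const sketchResult = maskSketch([
  { type: "text", raw: "Hello, " },
  { type: "interpolation", raw: "{{name}}" },
  { type: "text", raw: "! You have " },
  { type: "interpolation", raw: "{{count}}" },
  { type: "text", raw: " messages." },
]);

// Stand-in for an engine response that reordered the two placeholders.
const fakeTranslation =
  "Tienes " + SENTINEL_OPEN + "1" + SENTINEL_CLOSE +
  " mensajes, " + SENTINEL_OPEN + "0" + SENTINEL_CLOSE + ".";

const restoredSketch = unmaskSketch(fakeTranslation, sketchResult.parts);
```

Because each sentinel carries its index, the restore step does not depend on the order in which the engine returns the placeholders.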

One guard on masking

maskTokens throws (MASK_VALIDATION) if the literal text already contains a reserved sentinel delimiter (U+E000 / U+E001). This is rare but legal in real values — private-use glyphs from icon fonts like Material Icons or Nerd Fonts — and masking it would be ambiguous. Strip or escape those characters before masking.
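
One way to avoid that error is to check values before they reach the masker. The sketch below is an assumption about how a caller might pre-screen text; hasReservedDelimiter and stripReservedDelimiters are hypothetical helpers, not part of @astilba/core.

```typescript
// Hypothetical pre-check; @astilba/core is not known to export these helpers.
function hasReservedDelimiter(text: string): boolean {
  // U+E000 and U+E001 are the delimiters reserved for sentinels.
  return /[\uE000\uE001]/.test(text);
}

// Lossy fallback: drops the reserved characters entirely.
function stripReservedDelimiters(text: string): string {
  return text.replace(/[\uE000\uE001]/g, "");
}

const iconLabel = "\uE000 Home"; // e.g. an icon-font glyph followed by a label
const safeLabel = hasReservedDelimiter(iconLabel)
  ? stripReservedDelimiters(iconLabel)
  : iconLabel;
```

Stripping discards the glyph, so a reversible escape may be preferable when the icon has to survive translation.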

After translation: two complementary checks

Once the translation comes back, astilba offers two checks, each suited to a different point in the pipeline.

`validateSentinels` — operate on the still-masked string

If you still have the masked string the engine returned (before unmasking), validateSentinels checks that every sentinel was returned exactly once, unmodified, and that the engine invented none. Reordering is allowed (target languages reorder freely); pass requireOrder: true to also assert original order.

import { validateSentinels } from "@astilba/core";

const check = validateSentinels(translatedMasked, parts);
check.ok;     // true if every placeholder survived exactly once
check.errors; // e.g. ['placeholder #0 ({{name}}) was dropped by MT']

It also detects a corrupted sentinel — stray delimiter characters that aren’t part of a valid token.
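
The checks described above can be sketched as follows. This is an illustration only: validateSentinelsSketch, its options object, and the error wording are assumptions, and how the real validateSentinels accepts requireOrder is not shown in this page.

```typescript
// Hypothetical re-implementation for illustration; not the library code.
type SentinelCheck = { ok: boolean; errors: string[] };

function validateSentinelsSketch(
  translatedMasked: string,
  parts: readonly string[],
  options: { requireOrder?: boolean } = {}
): SentinelCheck {
  const errors: string[] = [];
  const seen: number[] = [];
  const pattern = /\uE000(\d+)\uE001/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(translatedMasked)) !== null) {
    seen.push(Number(match[1]));
  }

  // Every original placeholder must come back exactly once.
  parts.forEach((raw, index) => {
    const count = seen.filter((i) => i === index).length;
    if (count === 0) errors.push(`placeholder #${index} (${raw}) was dropped`);
    if (count > 1) errors.push(`placeholder #${index} (${raw}) was duplicated`);
  });

  // The engine must not invent sentinels that were never issued.
  seen.forEach((index) => {
    if (index >= parts.length) errors.push(`placeholder #${index} was invented`);
  });

  // Any delimiter left after removing well-formed sentinels is corruption.
  const leftover = translatedMasked.replace(/\uE000\d+\uE001/g, "");
  if (/[\uE000\uE001]/.test(leftover)) errors.push("corrupted sentinel: stray delimiter");

  // Optional: require the original left-to-right order.
  if (options.requireOrder) {
    const sorted = seen.slice().sort((a, b) => a - b);
    if (!seen.every((value, i) => value === sorted[i])) errors.push("placeholders were reordered");
  }

  return { ok: errors.length === 0, errors };
}

const sketchParts = ["{{name}}", "{{count}}"];
const reorderedOk = validateSentinelsSketch("\uE0001\uE001 mensajes para \uE0000\uE001", sketchParts);
const reorderedStrict = validateSentinelsSketch(
  "\uE0001\uE001 mensajes para \uE0000\uE001",
  sketchParts,
  { requireOrder: true }
);
const droppedOne = validateSentinelsSketch("Hola, \uE0000\uE001", sketchParts);
const corruptedOne = validateSentinelsSketch("Hola, \uE0000\uE001 \uE0001", sketchParts);
```

In this sketch, reorderedOk passes, reorderedStrict fails only on order, droppedOne reports the missing second placeholder, and corruptedOne reports both the missing placeholder and the stray delimiter.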

`validatePlaceholderTokens` — operate on restored tokens

validatePlaceholderTokens is the fail-closed placeholder validator. It compares a source value’s tokens against its translation’s tokens and fails if any placeholder was added, dropped, or modified. Placeholder identity is taken directly from the canonical fields (variable + format for interpolation, ref + options for nesting, raw for markup), so a value and its own translation carry byte-identical placeholders, and no syntax-specific normalisation is needed.

import { validatePlaceholderTokens } from "@astilba/core";

const check = validatePlaceholderTokens(sourceTokens, translatedTokens);
check.ok;     // true if the placeholder multisets match
check.errors; // e.g. ['source placeholder "interp:name|" is missing or altered...']
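
A multiset comparison of this kind can be sketched as below. The token shape, the key format (modelled on the "interp:name|" error string above), the string type of options, and the names placeholderKey and comparePlaceholdersSketch are all assumptions for illustration, not the library's internals.

```typescript
// Hypothetical token shape and key format; illustration only.
type PlaceholderToken =
  | { type: "text"; raw: string }
  | { type: "interpolation"; raw: string; variable: string; format?: string }
  | { type: "nesting"; raw: string; ref: string; options?: string }
  | { type: "markup"; raw: string };

// Identity comes straight from the canonical fields; text has no identity.
function placeholderKey(token: PlaceholderToken): string | null {
  switch (token.type) {
    case "interpolation":
      return `interp:${token.variable}|${token.format ?? ""}`;
    case "nesting":
      return `nest:${token.ref}|${token.options ?? ""}`;
    case "markup":
      return `markup:${token.raw}`;
    default:
      return null;
  }
}

function countKeys(tokens: readonly PlaceholderToken[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const token of tokens) {
    const key = placeholderKey(token);
    if (key !== null) counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Fail closed: any key whose count differs between the two sides is an error.
function comparePlaceholdersSketch(
  source: readonly PlaceholderToken[],
  translated: readonly PlaceholderToken[]
): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  const sourceCounts = countKeys(source);
  const translatedCounts = countKeys(translated);
  Object.keys(sourceCounts).forEach((key) => {
    if ((translatedCounts[key] ?? 0) < (sourceCounts[key] ?? 0)) {
      errors.push(`source placeholder "${key}" is missing or altered`);
    }
  });
  Object.keys(translatedCounts).forEach((key) => {
    if ((sourceCounts[key] ?? 0) < (translatedCounts[key] ?? 0)) {
      errors.push(`translation has unexpected placeholder "${key}"`);
    }
  });
  return { ok: errors.length === 0, errors };
}

const sourceSide: PlaceholderToken[] = [
  { type: "interpolation", raw: "{{count}}", variable: "count" },
  { type: "text", raw: " items" },
];
const goodSide: PlaceholderToken[] = [
  { type: "interpolation", raw: "{{count}}", variable: "count" },
  { type: "text", raw: " elementos" },
];
const renamedSide: PlaceholderToken[] = [
  { type: "interpolation", raw: "{{cuenta}}", variable: "cuenta" },
  { type: "text", raw: " elementos" },
];

const goodCheck = comparePlaceholdersSketch(sourceSide, goodSide);
const renamedCheck = comparePlaceholdersSketch(sourceSide, renamedSide);
```

A renamed variable shows up twice in this sketch: once as a missing source placeholder and once as an unexpected one in the translation.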

The string-entry form

The one place a raw string must be re-tokenized is a translation returned from MT — it was never in the model, so it has no token view. validatePlaceholders(source, translated, tokenize) takes the adapter’s tokenizer by injection, tokenizes both sides, and defers to validatePlaceholderTokens. The i18next adapter pre-binds its tokenizer so you get an ergonomic two-argument validatePlaceholders(source, translated) — see the adapter reference.
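
The injection and pre-binding pattern described here might look roughly like this. Everything in the sketch is hypothetical: toyTokenize only recognises {{...}} spans and stands in for a real adapter tokenizer, and validateWithTokenizer is a simplified local stand-in, not the library's validatePlaceholders.

```typescript
// Hypothetical stand-ins; the real tokenizer and validator are richer.
type SimpleToken = { type: "text" | "interpolation"; raw: string };
type Tokenize = (value: string) => SimpleToken[];

// Toy tokenizer: splits a string into text and {{...}} spans.
const toyTokenize: Tokenize = (value) => {
  const tokens: SimpleToken[] = [];
  const pattern = /\{\{[^}]*\}\}/g;
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(value)) !== null) {
    if (match.index > cursor) {
      tokens.push({ type: "text", raw: value.slice(cursor, match.index) });
    }
    tokens.push({ type: "interpolation", raw: match[0] });
    cursor = match.index + match[0].length;
  }
  if (cursor < value.length) tokens.push({ type: "text", raw: value.slice(cursor) });
  return tokens;
};

// Three-argument form: the tokenizer is injected by the caller.
function validateWithTokenizer(source: string, translated: string, tokenize: Tokenize): boolean {
  const placeholders = (value: string): string[] =>
    tokenize(value)
      .filter((token) => token.type !== "text")
      .map((token) => token.raw)
      .sort();
  return JSON.stringify(placeholders(source)) === JSON.stringify(placeholders(translated));
}

// Two-argument form: an adapter pre-binds its own tokenizer.
const validateBound = (source: string, translated: string): boolean =>
  validateWithTokenizer(source, translated, toyTokenize);

const keptIntact = validateBound("{{count}} items", "{{count}} elementos");
const wasRenamed = validateBound("{{count}} items", "{{cuenta}} elementos");
```

Injecting the tokenizer keeps the core check independent of any one placeholder syntax, while the bound form gives adapter users the shorter call.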

An adapter wanting looser placeholder matching (for example, normalising whitespace) can pre-normalise its tokens before calling validatePlaceholderTokens. By default the check is strict, because a value and its translation carry byte-identical placeholders.
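
As a rough illustration of pre-normalising, the sketch below trims whitespace from the identity fields before a strict comparison. The LooseInterp shape, strictKey, and normaliseInterp are hypothetical names, and whether a real tokenizer would ever emit padded variable names is an assumption.

```typescript
// Hypothetical shapes and helpers; illustration only.
type LooseInterp = { variable: string; format: string };

// Strict identity: the fields are compared byte for byte.
const strictKey = (token: LooseInterp): string => `interp:${token.variable}|${token.format}`;

// Adapter-side normalisation applied before the strict check.
const normaliseInterp = (token: LooseInterp): LooseInterp => ({
  variable: token.variable.trim(),
  format: token.format.trim(),
});

const fromSource: LooseInterp = { variable: "count", format: "" };
const fromTranslation: LooseInterp = { variable: " count ", format: "" };

const strictMatch = strictKey(fromSource) === strictKey(fromTranslation);
const looseMatch =
  strictKey(normaliseInterp(fromSource)) === strictKey(normaliseInterp(fromTranslation));
```

Here strictMatch is false and looseMatch is true: the comparison itself stays strict, and the leniency lives entirely in the adapter's normalisation step.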

Related

The canonical model: the ValueToken kinds masking operates on.
@astilba/core API: the masking function signatures.
