Educational game analytics · onboarding guide

Stealth assessment for educational games

A practical bridge from backward design to evidence-centered design, then onward to embedded gameplay evidence, adaptive support, and learning-engineering iteration. The point is not just to score learners quietly. The point is to make gameplay evidence interpretable enough to support learning, feedback, and redesign.

Beginner-friendly · Researcher-safe wording · Games · simulations · process data · ECD before IRT/CDM · pipeline to learner support

The stable sequence is now clear: backward design sets learning goals, ECD specifies evidence, stealth assessment embeds that evidence logic into gameplay, and learning engineering wraps an iterative improvement loop around the whole system.

Why this guide exists

The original stealth assessment and ECD literature is strong, but newcomers usually enter the topic from the wrong side. They start with logs, dashboards, or model names, then try to work backward toward what those traces might mean. The literature moves in the opposite direction.

This guide reorganizes the field into an easier sequence for instructional designers, learning scientists, and educational game researchers. It keeps the foundational sources visible, but turns them into a workflow you can use when designing a game, instrumenting events, or deciding whether a given inference is actually defensible.

Positioning

The original books and papers explain stealth assessment, ECD, and game-based assessment in depth. This page is a practical onboarding guide that reorders those ideas for people who need to move from concept to implementation without overclaiming what telemetry can do.

The core idea: telemetry is not assessment

Gameplay logs are only raw traces of action until they are linked to a coherent interpretive argument. Stealth assessment begins when a game is deliberately designed so that learner actions can serve as evidence for claims about understanding, strategy, misconception, sensemaking, persistence, or other target constructs.

ECD LADDER · Telemetry becomes assessment only after it enters a claim, evidence, task, and inference structure.
  • Claim: What do we want to say about the learner? Without a claim, logs stay descriptive.
  • Evidence: What observable pattern would support that claim? This is where log interpretation begins.
  • Task: What situation should elicit that behavior? Games do not automatically produce useful evidence.
  • Inference: How do observations become estimates, profiles, or alerts? Different models answer different questions.
Raw events become evidence only when each rung is explicitly designed and defended.
What to avoid

Do not treat “high click density,” “lots of retries,” or “fast completion” as self-evident indicators of learning. Those are candidate signals. Whether they count as evidence depends on the task structure, construct theory, dependence structure, and validation work around them.

Backward design, ECD, and stealth assessment are related but not identical

Layer 01
Backward design
Clarifies the desired understandings, transfer goals, and big ideas that matter instructionally.
Layer 02
Evidence-centered design
Makes the evidentiary logic explicit: what observations would justify claims about those goals?
Layer 03
Stealth assessment
Embeds that evidentiary logic inside gameplay so learners can be assessed without breaking flow.
Layer 04
Analytics implementation
Adds event design, feature extraction, modeling, reporting, and feedback mechanisms.

The cleanest formulation is this: backward design clarifies what learners should understand and do; ECD clarifies what evidence would justify claims about that learning; stealth assessment embeds that evidentiary logic into gameplay.

Caution

Backward design and ECD both “keep the end in mind,” but they operate at different levels. Backward design is pedagogical and curricular. ECD is evidentiary and inferential. Treating them as interchangeable makes the guide too loose for researchers and too vague for designers.

Learning engineering turns stealth assessment into a design-and-improvement loop

A stealth-assessment-only framing can make this topic sound like hidden scoring. A learning-engineering framing is broader. It asks how theory, design, instrumentation, feedback, and iteration can work together to improve learning experiences over time.

Interpretation

Under a learning engineering lens, stealth assessment is an embedded evidence layer inside a learning system. Its value is not only that it estimates learner state. Its value is also that it helps teams improve tasks, revise feedback timing, compare design versions, detect stuck points, and redesign environments more intelligently.

Learning goals (backward design) -> Evidence logic (ECD / e-ECD) -> Gameplay tasks (instrumented events) -> Inference + support (feedback / dashboards / adaptation) -> Redesign loop (learning engineering). Use evidence not only to infer learning, but to improve the experience itself.

From raw telemetry to interpretable evidence

The crucial move is not from logs to model, but from logs to designed indicators, then from indicators to evidence, then from evidence to inference, support, and revision.

This is also where sequence data enters. Ordered clicks, dialogue turns, action transitions, hint use, retries, and timing are not an optional sidecar. They are often the raw material from which episode-level evidence is built before any BN, IRT, CDM, or learner-facing support rule can do useful work.

PIPELINE MAP · Use the time-ordered trace first, then compress only what the inference layer actually needs.
  • Raw events: clicks, moves, turns, hints, timing
  • Segmentation: sessionize and define comparable episodes
  • Sequence integration: motifs, transitions, latency windows, states
  • Indicator design: revision, variable control, stuck loops
  • Inference: BN, IRT, CDM, process models
  • Support: hint, dashboard, adaptation, redesign
Keep event order, timing, and local context long enough to separate revision from churn. Avoid flattening time into totals too early, or picking a model before the question is clear. Use sequence summaries as feeder evidence, while some traces remain evidence themselves.
Practical rule

If the game produces rich sequence data, do not ask whether sequence methods replace psychometric models. Ask which parts of the temporal trace should be summarized into evidence for BN, IRT, or CDM, and which parts should remain process evidence in their own right.
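The segmentation and indicator-design rungs above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not part of any referenced package: the event names (`mismatch`, `revise`), the 30-second episode gap, and the sample trace are all hypothetical.

```python
from itertools import groupby

# Hypothetical event tuples: (user_id, timestamp_s, action)
events = [
    ("P01", 0.0, "predict"), ("P01", 4.1, "run"), ("P01", 5.0, "mismatch"),
    ("P01", 9.3, "revise"), ("P01", 12.0, "run"), ("P01", 13.1, "match"),
    ("P02", 0.0, "predict"), ("P02", 3.0, "run"), ("P02", 4.2, "mismatch"),
    ("P02", 6.0, "run"), ("P02", 7.5, "mismatch"),
]

def episodes(stream, gap=30.0):
    """Split one learner's time-ordered events into episodes at long pauses."""
    out, current, last_t = [], [], None
    for user, t, action in stream:
        if last_t is not None and t - last_t > gap:
            out.append(current)
            current = []
        current.append((t, action))
        last_t = t
    if current:
        out.append(current)
    return out

def revision_after_mismatch(episode):
    """Indicator: did a 'revise' follow a 'mismatch' within the episode?"""
    actions = [a for _, a in episode]
    return any(a == "mismatch" and "revise" in actions[i + 1:]
               for i, a in enumerate(actions))

for user, stream in groupby(events, key=lambda e: e[0]):
    for ep in episodes(list(stream)):
        print(user, int(revision_after_mismatch(ep)))  # P01 1, then P02 0
```

The point of the sketch is the order of operations: segment first, then design the indicator against the time-ordered episode, and only then hand a compressed value to the inference layer.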

How data enters the game system, moves through models, and returns to the learner

A useful stealth-assessment guide needs more than model names. It needs an operational pipeline showing where telemetry is captured, how it is transformed, where inference happens, and what actually goes back to the learner, teacher, or design team.

Game client (player acts) -> Event store (ordered logs) -> Evidence layer (episodes, features, motifs) -> Inference layer (BN, IRT, CDM, process models) -> Action layer (hint, next task, dashboard, redesign)
  • Sequence-derived evidence: predict -> test -> inspect -> revise; hint after repeated same-state failure; variable-control attempts per episode
  • Model-specific outputs: BN posterior belief over facets; IRT proficiency-linked estimate; CDM mastery profile over attributes
  • Learner-facing return path: next hint or scaffold; next task or difficulty shift; teacher or designer dashboard
Use outputs to support the learner now and improve the game later.
System layer | What goes in | What comes out | Who uses it
Telemetry capture | timestamped gameplay events, states, attempts, hints, dialogue | ordered raw event stream | data pipeline
Evidence layer | sequences, timings, episode boundaries, task metadata | scored indicators, motifs, episode labels, task responses | modeling layer
Inference layer | evidence-coded inputs | belief states, proficiency estimates, mastery profiles, strategy labels | support logic
Action layer | model outputs plus business rules | feedback, adaptive next step, dashboard flag, redesign insight | learner, teacher, designer

A practical guide to integrating sequence, score, and network data

Educational games often produce several data types at once: ordered action traces, scored task outcomes, hint and timing records, and sometimes social or network data from collaboration. The practical challenge is not collecting them. It is aligning them at the right unit and deciding which signals become direct evidence, which remain context, and which should stay in a parallel analytic track.

Animated Integration Map

This D3-based diagram shows the recommended flow: keep modalities separate first, align them at the episode grain, transform each into meaningful evidence, and only then send them into model and support layers.

Data type | Typical raw form | Best immediate transformation | Common downstream role
Sequence data | ordered events, actions, turns, transitions, dwell times | episodes, motifs, transition counts, state labels, revision indicators | process evidence or feeder features for BN, IRT, or CDM
Score data | correctness, level completion, rubric score, challenge result | task response table at item or episode level | IRT, CDM, growth models, reporting layer
Hint and support data | hint request, hint timing, scaffold exposure, feedback viewed | support-opportunity features and response-process flags | fairness checks, explanatory covariates, adaptation rules
SNA or collaboration data | who interacted with whom, reply network, help network, co-action graph | network measures, role labels, group-position indicators | context variables, team-level evidence, dashboard layer
The most practical architecture

Do not throw all modalities into one flat table immediately. Use a layered architecture instead: raw tables by modality, then a shared alignment key, then modality-specific evidence features, then an inference layer, then a learner-facing action layer. This keeps temporal and social meaning alive long enough to be useful.

Step | What to do | Practical rule
1. Align keys | Make every source share learner ID, session ID, task or episode ID, and timestamp or order index. | If two sources cannot be aligned cleanly, do not pretend they belong in one model yet.
2. Pick the grain | Choose whether inference happens at event, episode, task, session, learner, or team level. | Most stealth-assessment systems work better at the episode or task level than at the raw click level.
3. Transform by modality | Build sequence features from traces, scored-response tables from outcomes, and contextual indicators from network data. | Each modality should be transformed according to what makes it meaningful, not according to what is easiest to merge.
4. Assign evidence roles | Mark each feature as direct evidence, explanatory covariate, opportunity variable, or contextual descriptor. | Network position often belongs in the context layer before it belongs in the direct competence layer.
5. Choose integration strategy | Fuse early, fuse late, or run parallel models with a decision layer. | When constructs are still uncertain, parallel modeling is often safer than one giant fused model.
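Step 1, shared alignment keys, can be sketched with plain dictionaries before any modeling library enters the picture. The field names, key structure, and values below are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical per-modality tables, already keyed on (learner_id, episode_id).
seq_features = {("P01", "E01"): {"revisions": 2, "stuck_loops": 0}}
score_table  = {("P01", "E01"): {"correct": 1}}
net_context  = {("P01", "E01"): {"help_degree": 3}}

def align(*tables):
    """Inner-join modality tables on the shared (learner_id, episode_id) key.
    Rows missing from any table are dropped rather than silently imputed."""
    keys = set(tables[0])
    for t in tables[1:]:
        keys &= set(t)
    return {k: {f: v for t in tables for f, v in t[k].items()}
            for k in sorted(keys)}

print(align(seq_features, score_table, net_context))
# {('P01', 'E01'): {'revisions': 2, 'stuck_loops': 0, 'correct': 1, 'help_degree': 3}}
```

Dropping unmatched rows is a deliberately conservative choice: it surfaces alignment failures early instead of hiding them inside a fused model.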
Pattern 01
Feature fusion
Sequence, score, and network features are all summarized to the same episode or learner grain and sent into one inference model. This is compact, but it risks flattening temporal meaning too early.
Pattern 02
Evidence fusion
Each modality is first turned into evidence-coded indicators, and only then combined. This is often the best fit for BN-style or ECD-grounded systems.
Pattern 03
Parallel models
A score model, a sequence model, and a network model run separately, then a decision layer uses their outputs together. This is often the safest architecture early in design.
Pattern 04
Multilevel integration
Individual evidence feeds learner-level inference while network or group measures remain team-level context. Use this when collaboration structure matters but should not be mistaken for the same construct as individual mastery.
If your main goal is... | Use sequence data as... | Use score data as... | Use SNA data as...
skill diagnosis | evidence features such as revision, variable control, or stuck loops | primary task-response layer | context or opportunity structure
adaptive support | real-time state or trigger features | stability anchor for learner estimate | collaboration signal for who should help whom
team or CPS analysis | interaction sequences and turn-taking patterns | shared task outcome | primary structural layer
design iteration | where learners get stuck or revise | which tasks fail most often | who gets isolated or over-centralized
Best beginner rule

Integrate at the level of meaning, not just the level of rows. Sequence data tells you how the learner got somewhere. Score data tells you whether the task outcome was successful. Network data tells you the social position and interaction structure around that performance. A good stealth-assessment system keeps those roles distinct before bringing them together.

When multimodal ECD is actually feasible

Multimodality does not automatically improve stealth assessment. It becomes defensible only when each modality contributes something the construct argument genuinely needs, and when those signals can be aligned at a meaningful grain without collapsing into noise.

MULTIMODAL ECD FEASIBILITY · Use more modalities only when each one adds construct-relevant evidence, not just more data volume.
  • Feasible when: each modality maps to a clear evidentiary role; alignment keys are stable; the episode grain is defensible.
  • Design steps: 1. define the construct claim · 2. assign modality roles · 3. align at task or episode · 4. validate each signal path.
  • Not feasible when: signals only weakly relate to the construct; timing and IDs do not align; modalities are added “just in case”.
  • What to report: why each modality exists, what it measures, how it is aligned, and what remains only context.
The key ECD question is not “can I log this modality?” but “what claim would this modality help justify better than the others?”

Chapter 09 · D3 multilayer evidentiary network

This D3 view shows a recommended multilayer structure for multimodal stealth assessment: raw modality, transformed evidence, inference layer, and learner-facing action.

How to read it: click a modality on the left to highlight its recommended path. This is intentionally selective, not a spaghetti map of every imaginable edge.
Practical rule

In multimodal stealth assessment, every modality should have one of four roles: direct evidence, supporting evidence, context or opportunity structure, or design-diagnostic trace. If you cannot assign one of those roles, the modality is probably premature.

Question | What to answer before building
Why this modality? | What construct-relevant evidence does it add that clickstream or score data does not?
At what grain? | Will it align at event, episode, task, learner, or team level?
Direct evidence or context? | Does it justify a claim about competence, or only describe exposure, opportunity, or coordination?
How will it be validated? | What response-process, external, or fairness evidence would support using it in assessment?

For 3D immersive learning, spatial data changes the whole problem

In VR, AR, MR, and other immersive environments, the assessment problem is not just what the learner clicked. It is where they looked, how they moved, what they approached or ignored, how they oriented their body, what objects were within reach, and how the environment itself constrained or afforded action. That makes spatial data a first-class evidence source, but also a major source of validity risk.

SPATIAL EVIDENCE MAP · Do not infer from raw XYZ alone. Translate movement into relations and episodes.
  • What the system logs: head/body XYZ, yaw, pitch, facing, gaze ray and dwell, grab, place, rotate, path, revisits, teleports. Raw coordinates are noisy and device-specific.
  • What makes it meaningful: zones and AOIs, proximity to critical objects, shared orientation with peers, approach-avoid loops, return-after-feedback episodes. Spatial meaning is relational, not coordinate-based.
  • What the analyst should ask: Which region was visited? What cue or collaborator mattered? Was the route systematic? Did support interrupt productive exploration? Check opportunity structure before comparing learners.
  • Evidence outputs: entered hazard zone first, looked at gauge after cue, returned to panel after hint, shared visual orientation. Trigger support only after multi-signal evidence accumulates.
Spatial data becomes defensible when it is translated into zones, relations, and comparable episodes.
Most realistic design pattern

Translate spatial traces into zones, relations, and episodes. For example: entered the hazard area before reading instructions, looked at the gauge after the anomaly cue, returned to the control panel after feedback, or maintained shared visual orientation with a collaborator. These are far more interpretable than raw x-y-z streams.
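A minimal sketch of the zone-translation idea, under stated assumptions: the zone boxes, zone names, and the toy position trace are invented, and a real immersive pipeline would use engine-side colliders or ray casts rather than hand-defined rectangles.

```python
# Hypothetical zone definitions as axis-aligned floor boxes: (xmin, xmax, zmin, zmax).
ZONES = {
    "hazard_area":   (0.0, 2.0, 0.0, 2.0),
    "control_panel": (4.0, 6.0, 0.0, 2.0),
}

def zone_of(x, z):
    """Translate one raw position sample into a zone label (or None)."""
    for name, (x0, x1, z0, z1) in ZONES.items():
        if x0 <= x <= x1 and z0 <= z <= z1:
            return name
    return None

def zone_visits(samples):
    """Collapse a dense position trace into an ordered list of zone entries."""
    visits, last = [], None
    for x, z in samples:
        zone = zone_of(x, z)
        if zone is not None and zone != last:
            visits.append(zone)
        last = zone
    return visits

trace = [(1.0, 1.0), (1.5, 1.2), (3.0, 1.0), (5.0, 1.0), (1.0, 0.5)]
print(zone_visits(trace))  # ['hazard_area', 'control_panel', 'hazard_area']
```

The ordered visit list is already episode-like evidence: a return to the hazard area after visiting the control panel is interpretable in a way that the raw coordinate stream is not.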

Common spatial problem | Current practical solution | Example use
Raw gaze in a moving 3D scene is hard to interpret | map gaze rays to AOIs, objects, or social targets; use fixation and transition indicators | attention to teacher, peer, control panel, or hazard source in VR classrooms and simulations
Movement traces are too dense and device-specific | summarize into path length, dwell, entropy, revisits, or zone-transition motifs | distinguishing systematic exploration from aimless wandering
Collaboration is spatial as well as verbal | combine proximity, facing direction, and turn-taking with dialogue or network events | shared attention, co-navigation, and coordination analysis
Intervention timing is fragile | trigger support only after multi-signal evidence accumulates, not after one glance or one step | spatially aware hints that wait for repeated missed cues or return loops
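Once gaze rays have been mapped to AOI labels, as in the first row above, dwell and transition summaries are straightforward. The AOI names and sample stream below are invented for illustration.

```python
from collections import Counter

# Hypothetical gaze samples already ray-cast to AOI labels by the engine.
gaze = ["gauge", "gauge", "gauge", "panel", "panel", "gauge", "peer"]

def dwell_and_transitions(labels):
    """Summarize an AOI label stream into dwell counts and AOI-to-AOI transitions."""
    dwell = Counter(labels)
    transitions = Counter(
        (a, b) for a, b in zip(labels, labels[1:]) if a != b
    )
    return dwell, transitions

dwell, trans = dwell_and_transitions(gaze)
print(dwell["gauge"], trans[("gauge", "panel")])  # 4 1
```

Transition counts like these are the raw material for gaze-based network analysis, but, as the table notes, they stay candidate signals until tied to cues and opportunity structure.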
Current research directions

Recent immersive-learning analytics reviews emphasize multimodal integration, stronger theoretical framing, and better use of spatial process data. Recent case work also shows more interest in gaze-based network analysis, head and hand movement as behavioral indicators, and eye-tracking plus VR for self-regulated learning. These are promising, but they still require careful evidence design before they become defensible assessment signals.

Concrete cases to learn from
  • VR classroom eye-tracking pipelines now map gaze to dynamic attention targets rather than treating eye points as isolated samples.
  • Immersive learning analytics reviews in 2025 point to weak theoretical integration and uneven multimodal practice, which means many systems still collect more spatial data than they can justify.
  • Recent VR studies are using hand and head movement features as indicators of curiosity, cognitive load, or self-regulated learning, but these signals are still best treated as evidence candidates rather than automatic construct measures.

Packages and tutorials worth embedding into your workflow

The goal is not to list every package. The goal is to match each data type or inference layer to a package that is mature, documented, and realistic for educational game analytics work.

Need | Recommended package | Why it fits | Tutorial or docs
Bayesian networks | bnlearn (R) | Structure learning, parameter learning, and inference in one mature package. | bnlearn documentation
IRT and explanatory IRT | mirt (R) | Strong for unidimensional and multidimensional IRT, mixed designs, and item or person covariates. | mirt site · manual
CDM and Q-matrix work | GDINA (R) | Broad CDM family support, Q-matrix validation tools, item fit, and diagnostic modeling. | GDINA homepage
Sequence analysis | TraMineR (R) | Well-established toolkit for state or event sequences, sequence visualization, distances, and subsequences. | TraMineR site · install guide
Network and SNA | igraph and ggraph/tidygraph (R) | Fast graph computation with a flexible plotting layer for social, reply, help, or attention networks. | igraph docs · ggraph + tidygraph vignette
Animated explanatory diagrams | D3 (JS) | Best fit for bespoke SVG teaching diagrams and pipeline animations inside this guide. | d3-transition docs
Tested in this environment on May 3, 2026
Package | Load status | Observed note
TraMineR | loaded | Available here, version 2.2.13.
igraph | loaded | Available here, version 2.2.1.
ggraph | loaded | Available here, version 2.2.2.
tidygraph | loaded | Available here, version 1.3.1.
bnlearn | loaded | Installed and loaded here, version 5.1.
mirt | loaded | Installed and loaded here, version 1.46.1.
GDINA | loaded | Installed and loaded here, version 2.9.12.
D3 | embedded | Used successfully on this page for the animated integration map.
Practical starter stack

If you are building a first serious pipeline, a pragmatic stack is often enough: TraMineR for sequence summaries, mirt or GDINA for the scored diagnostic layer, bnlearn when you need explicit evidence networks, and igraph/ggraph when collaboration or attention structure matters.

Troubleshooting notes
  • First check namespace loading: after installation, verify requireNamespace() before debugging model syntax. In this environment, bnlearn, mirt, and GDINA all loaded successfully after installation.
  • For sequence and SNA work: this environment already has TraMineR, igraph, ggraph, and tidygraph, so those are the easiest places to prototype first.
  • For Windows setups: record the exact R path and package versions before troubleshooting modeling errors. Here the tests were run from C:\Program Files\R\R-4.5.2\bin\Rscript.exe with user-library installs under the local R 4.5 library.
  • If install succeeds but scripts still fail: check whether another R installation or another library path is being used by your pipeline, especially when scripts run from editors, schedulers, or separate shells.
  • For immersive or multimodal pipelines: debug the data alignment layer before the modeling layer. Missing keys, inconsistent episode grains, and time-sync problems are more common than model-specific bugs.
  • For D3-based explanations: if the animation does not appear, check whether external script loading is blocked. The diagram on this page depends on the D3 v7 CDN script rather than a bundled local copy.

How to integrate GPT-5.5 into game development for stealth assessment

GPT-5.5 is most useful here when it is treated as a workflow coordinator and specification engine, not as a magical black box that directly decides learner state. In game development, its strongest role is helping teams translate messy design goals into concrete telemetry plans, evidence maps, adaptation logic, and implementation artifacts that can be checked by humans and by the game system.

Practical positioning

As of April 23, 2026, OpenAI announced GPT-5.5 rollout in ChatGPT and Codex. Official help also stated that GPT-5.5 and GPT-5.5 Pro were not launching to the API that same day. So the safest design assumption is this: use GPT-5.5 today for design, coding, and tool-assisted workflow orchestration in ChatGPT or Codex, and treat API deployment details as something to verify against current official platform docs before implementation.

Game-dev stage | Good GPT-5.5 role | What to ask it for
Mechanic design | design translator | convert learning goals into tasks, failure states, evidence opportunities, and support triggers
Telemetry planning | schema generator | enumerate events, payload fields, IDs, timestamps, and episode boundaries
Evidence design | ECD assistant | map claims to indicators, and indicators to in-game actions or spatial relations
Implementation | agentic coding partner | write logging hooks, validation scripts, dashboards, and analysis starters
Adaptive support | rule author | draft support logic, but return structured rules rather than free-form pedagogical prose
Review and QA | skeptical auditor | check overclaiming, missing keys, fairness risks, and weak evidence mappings
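One way to keep a model-drafted telemetry plan checkable is to validate every logged event against the declared schema. The event names and field sets below are hypothetical examples of what such a schema might contain, not a standard.

```python
# Hypothetical schema a model (or a human) might draft for one mechanic.
EVENT_SCHEMA = {
    "hint_requested":    {"learner_id", "session_id", "episode_id", "ts", "hint_level"},
    "attempt_submitted": {"learner_id", "session_id", "episode_id", "ts", "correct"},
}

def validate_event(event):
    """Return a list of problems; an empty list means the event conforms."""
    name = event.get("event_name")
    if name not in EVENT_SCHEMA:
        return [f"unknown event_name: {name!r}"]
    missing = EVENT_SCHEMA[name] - event.keys()
    return [f"missing field: {f}" for f in sorted(missing)]

ok = {"event_name": "hint_requested", "learner_id": "P01",
      "session_id": "S1", "episode_id": "E01", "ts": 12.5, "hint_level": 1}
bad = {"event_name": "hint_requested", "learner_id": "P01"}
print(validate_event(ok))   # []
print(validate_event(bad))  # four missing-field problems
```

Running such a validator in the logging path catches missing alignment keys at capture time, long before they surface as unexplainable modeling failures.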
Prompt Pattern 01
Telemetry planner
Ask GPT-5.5 to produce an event schema with event name, trigger condition, required fields, grain, and downstream evidence role.
Prompt Pattern 02
Evidence mapper
Ask it to convert a mechanic into claims, indicators, and possible false interpretations, so the team can stress-test construct validity early.
Prompt Pattern 03
Adaptive rule writer
Ask for support rules in structured fields such as trigger, confidence threshold, blocked conditions, learner-facing action, and teacher-facing note.
Prompt Pattern 04
Integration reviewer
Ask it to inspect sequence, score, and network pipelines together and identify where alignment or episode-grain assumptions could break.
Prompting rule

For stealth-assessment work, prefer prompts that ask GPT-5.5 for structured intermediate artifacts rather than final truth claims. Event schemas, JSON-like evidence maps, support-rule tables, AOI definitions, and test cases are far easier to validate than free-form interpretations.

Use case | Prompt starter
Event schema | You are a telemetry architect for an educational game. Given the mechanic below, produce a table with event_name, trigger, required_fields, example_payload, unit_of_analysis, and evidence_role. Do not infer learner state yet.
Evidence map | You are an evidence-centered design assistant. Convert this gameplay loop into claims, direct evidence, alternative explanations, and missing observations we would still need before making an assessment claim.
Adaptive support rule | Write learner-support rules for this game mechanic. Return JSON fields for trigger, minimum_evidence, blocked_if, learner_action, teacher_signal, and redesign_note. Use conservative thresholds.
Spatial telemetry plan | Given this 3D immersive task, list the spatial variables worth logging, the AOIs or zones to define, and which features should remain context rather than direct evidence.
Pipeline audit | Review this multimodal pipeline. Identify any mismatches in learner ID, session ID, episode grain, time alignment, and opportunities for false causal interpretation.
Best current OpenAI-side practice

If you later move from ChatGPT or Codex prototyping into API deployment, use structured outputs or tool schemas so the model returns machine-readable artifacts instead of prose. That is especially important for event definitions, adaptation rules, and dashboard annotations that must flow into a game system without manual cleanup.
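A minimal sketch of that idea: before a drafted support rule enters the game system, check it for completeness and a conservative evidence threshold. The field names mirror the adaptive-support prompt starter above; the 0.5 floor on `minimum_evidence` is an illustrative assumption, not a recommended constant.

```python
REQUIRED = {"trigger", "minimum_evidence", "blocked_if",
            "learner_action", "teacher_signal", "redesign_note"}

def check_rule(rule):
    """Reject a drafted support rule unless it is complete and conservative."""
    errors = sorted(REQUIRED - rule.keys())
    if not errors and not (0.5 <= rule.get("minimum_evidence", 0) <= 1.0):
        errors.append("minimum_evidence should be a probability >= 0.5")
    return errors

rule = {
    "trigger": "repeated_same_state_failure",
    "minimum_evidence": 0.8,          # posterior belief threshold
    "blocked_if": "hint_shown_recently",
    "learner_action": "offer_targeted_hint",
    "teacher_signal": "flag_stuck_loop",
    "redesign_note": "task may under-cue the control variable",
}
print(check_rule(rule))  # []
```

Because the rule is structured data rather than prose, the same check can run in CI every time the rule set changes, which is the point of asking for machine-readable artifacts.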

Different model families answer different questions

MODEL CHOOSER · Choose the family by the question, not by the model's prestige.
  • Bayesian networks. Question: which facets are active? Input: multiple evidence indicators. Output: posterior beliefs. Best fit: scaffold or alert.
  • IRT. Question: where is the learner? Input: scored task responses. Output: proficiency estimate. Best fit: matching and trends.
  • CDM. Question: which skills are mastered? Input: attribute-tagged outcomes. Output: mastery profile. Best fit: diagnostic feedback.
  • Process-data models. Question: what strategy appears? Input: ordered events and timings. Output: states and motifs. Best fit: stuck detection.
Common rule: BN, IRT, and CDM usually consume transformed evidence. Sequence fit differs: BN uses indicators, IRT uses scores, CDM uses Q-linked outcomes.
Framing rule

Do not write as if IRT/CDM is the whole of stealth assessment. Historically, the stealth-assessment literature is at least as strongly tied to Bayesian evidence models and ECD-based design as it is to psychometric latent-trait models.

How sequence data integrates with these families

Sequence data can feed all four families, but not in the same way. For BN, temporal traces often become evidence indicators such as revision-after-mismatch or hint-after-repeat-failure. For IRT, sequences are usually compressed into scored episodes or explanatory process features. For CDM, sequences can help generate task-level evidence for specific attributes, such as variable control or representational switching. For process-data models, the sequence itself remains the main analytic object rather than a pre-compressed score.

Q-Matrix Visual

Instead of defining a Q-matrix only in prose, map each task to the attributes it truly requires. The highlighted row cycles to show how one task can load on one, two, or several attributes.

Read it this way: rows are tasks or episodes, columns are attributes. Filled cells mean the task is intended to elicit evidence about that attribute. If too many cells are turned on without a theory, the Q-matrix becomes hard to defend.
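That warning about hard-to-defend Q-matrices can be made mechanical. This sketch flags overloaded tasks and unmeasured attributes; the task names, attribute names, cell values, and the max-load threshold of 2 are illustrative assumptions.

```python
# Hypothetical Q-matrix: rows are tasks, columns follow ATTRS.
ATTRS = ["var_control", "revision", "graph"]
Q = {
    "task_01": [1, 0, 1],
    "task_02": [1, 1, 0],
    "task_03": [0, 1, 1],
}

def audit_q_matrix(q, attrs, max_load=2):
    """Flag tasks loading on too many attributes and attributes no task measures."""
    warnings = []
    for task, row in q.items():
        if sum(row) > max_load:
            warnings.append(f"{task} loads on {sum(row)} attributes")
        if sum(row) == 0:
            warnings.append(f"{task} provides evidence for nothing")
    for j, attr in enumerate(attrs):
        if not any(row[j] for row in q.values()):
            warnings.append(f"no task measures {attr}")
    return warnings

print(audit_q_matrix(Q, ATTRS))  # []
```

An empty warning list is not a theoretical justification; it only guards against the mechanical failure modes, leaving the substantive task-to-attribute argument to the design team.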
Minimal decision rule

Use BN when you need structured belief updates across facets. Use IRT when you need a comparable proficiency scale. Use CDM when you need a skill-by-skill diagnostic profile. Use process-data models when the time-ordered behavior is itself the substance of the inference, or when you need to build the evidence features that later feed BN, IRT, or CDM.

Why Bayesian networks matter for stealth assessment

Bayesian networks matter when stealth assessment needs a structured evidence argument about several related facets at once. Instead of reducing everything to one score, BN lets the system update beliefs about strategy, misconception, or subskill states as new evidence accumulates during play.

BN DATA FLOW · Stealth assessment uses BN when multiple evidence indicators should update beliefs about several facets.
  • Gameplay evidence: revision after mismatch, hint after repeated failure, variable-control attempt
  • Evidence network: map indicators to facets, misconceptions, or states
  • BN update: posterior belief shifts as evidence accumulates
  • Outputs for stealth assessment: facet-level probabilities, misconception flags, targeted scaffold or alert
BN is strongest when the design question is about which hidden facets or misconception states are becoming more or less plausible during play.
What the BN dataframe usually looks like
user_id | episode_id | rev_after_mismatch | hint_after_repeat | var_control | graph_check
P01 | E01 | 1 | 0 | 1 | 1
P01 | E02 | 0 | 1 | 0 | 1
P02 | E01 | 1 | 1 | 1 | 0
P02 | E02 | 0 | 0 | 1 | 0
Why it connects to stealth assessment

Stealth assessment is often built around an evidence model rather than a single outcome variable. BN fits that logic well because it can encode how multiple gameplay indicators relate to several latent facets and then update those beliefs continuously as the learner acts.
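To make the belief-update idea concrete, here is a deliberately tiny stand-in for a real Bayesian network: a single binary facet updated naively by independent binary indicators. A real BN, such as one built in bnlearn, would model dependence among facets and indicators; the likelihood values below are invented for illustration.

```python
def update_belief(prior, observations, likelihoods):
    """Naive-Bayes style posterior over one binary facet.
    likelihoods[indicator] = (P(obs=1 | facet), P(obs=1 | not facet))."""
    p, q = prior, 1.0 - prior
    for indicator, value in observations.items():
        l1, l0 = likelihoods[indicator]
        p *= l1 if value else (1.0 - l1)
        q *= l0 if value else (1.0 - l0)
    return p / (p + q)

likelihoods = {                      # assumed values, for illustration only
    "rev_after_mismatch": (0.8, 0.3),
    "var_control":        (0.7, 0.2),
}
obs = {"rev_after_mismatch": 1, "var_control": 1}
print(round(update_belief(0.5, obs, likelihoods), 3))  # 0.903
```

Even this toy version shows the stealth-assessment-relevant behavior: the belief moves continuously as episode-level indicators arrive, rather than waiting for a final score.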

BN emphasis | What it assumes or needs | What to watch out for
facet-level belief update | explicit evidence model | weakly justified edges make the network look smarter than it is
multiple indicators per construct | theory-based mapping from events to evidence | do not confuse raw event frequency with diagnostic evidence
real-time support | posterior thresholds linked to action rules | support triggers need separate validation from model fit

Why IRT can matter for stealth assessment

IRT becomes relevant when stealth assessment needs a comparable proficiency scale rather than only a narrative process description. In that case, gameplay has to be translated into scored task or episode responses that behave enough like item responses to support a latent-trait interpretation.

IRT DATA FLOW · Stealth assessment uses IRT when gameplay can be converted into comparable scored responses.
  • Gameplay evidence: attempts, hints, timing, episode outcomes
  • Scoring layer: convert episodes into 0/1 or rubric responses
  • IRT model: response matrix + task parameters
  • Outputs for stealth assessment: theta or growth trend, difficulty matching, progress dashboard
IRT is strongest when the main claim is about position on a scale, not about which exact subskills are on or off.
What the IRT dataframe usually looks like
user_id | item_01 | item_02 | item_03 | item_04 | item_05 (total or theta added later)
P01 | 1 | 0 | 1 | 1 | 0
P02 | 1 | 1 | 1 | 0 | 1
P03 | 0 | 0 | 1 | 0 | 0
P04 | 1 | 1 | 0 | 1 | 1
Why it connects to stealth assessment

Stealth assessment often needs a learner estimate that can be compared across tasks, times, or design variants. IRT supports that kind of inference when the evidence layer can be converted into reasonably comparable scored responses. The tradeoff is that temporal richness is usually compressed before estimation.
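To show what "position on a scale" means operationally, here is a toy Rasch (1PL) estimate of theta by grid search. The scored responses and task difficulties are invented; mirt estimates this properly, with standard errors, multidimensional options, and model-fit diagnostics.

```python
import math

def rasch_p(theta, b):
    """P(correct) under the Rasch (1PL) model for task difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties):
    """Crude maximum-likelihood theta via grid search over [-4, 4]."""
    grid = [x / 100.0 for x in range(-400, 401)]
    def loglik(theta):
        return sum(math.log(rasch_p(theta, b)) if r
                   else math.log(1.0 - rasch_p(theta, b))
                   for r, b in zip(responses, difficulties))
    return max(grid, key=loglik)

# Hypothetical episode scores for one learner against assumed difficulties.
theta = estimate_theta([1, 0, 1, 1, 0], [-1.0, 0.0, 0.5, 1.0, 1.5])
print(theta)
```

The compression is visible in the code: everything temporal has already been reduced to the 0/1 response vector before estimation, which is exactly the tradeoff the paragraph above describes.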

| IRT emphasis | What it assumes or needs | What to watch out for |
| --- | --- | --- |
| latent trait scale | comparable scored tasks or episodes | do not pretend raw clicks are items |
| task response matrix | stable unit of analysis and scoring rule | repeated attempts can create local dependence |
| adaptive difficulty | link between theta and task challenge | support logic should not be confused with measurement itself |
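The scale idea can be sketched with a minimal Rasch-style estimate, assuming five scored gameplay episodes with made-up difficulties and the illustrative P01 response pattern. This is a grid-search maximum likelihood estimate, not a production IRT fit; real work would use an estimation package and check model assumptions first.

```python
import math

# A minimal Rasch sketch: estimate one learner's theta by grid search.
# Difficulties and responses are illustrative, matching learner P01's row.

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # one difficulty per scored episode
responses    = [1, 0, 1, 1, 0]               # P01's 0/1 episode outcomes

def p_correct(theta, b):
    """Rasch probability of a correct response to a task of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta):
    return sum(
        math.log(p_correct(theta, b)) if x == 1 else math.log(1.0 - p_correct(theta, b))
        for b, x in zip(difficulties, responses)
    )

# Grid search over a plausible theta range; the log-likelihood is concave.
grid = [i / 100 for i in range(-300, 301)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)
```

Note how much is compressed before estimation: attempts, hints, and timing have already been flattened into five 0/1 responses, which is exactly the tradeoff described above.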

Why CDM can matter for stealth assessment

CDM becomes relevant when stealth assessment needs a diagnostic mastery profile rather than one continuous score. Here the central design problem is not only scoring performance, but defining which tasks provide evidence for which attributes and defending that mapping with a Q-matrix.

CDM data flow. Stealth assessment uses CDM when gameplay evidence is meant to diagnose specific subskills: gameplay tasks (episodes designed to elicit revision, variable control, graph reading, explanation) → Q-matrix + scoring (which task maps to which attributes; task outcomes by attribute) → CDM model (response matrix + Q-matrix) → outputs for stealth assessment (attribute mastery profile, which subskill to support next, fine-grained diagnostic feedback). CDM is strongest when the design goal is to say which attributes appear mastered, partial, or missing, not only how high the learner scores overall.
What the CDM dataframes usually look like
Response matrix

| user_id | task_01 | task_02 | task_03 |
| --- | --- | --- | --- |
| P01 | 1 | 0 | 1 |
| P02 | 1 | 1 | 0 |
| P03 | 0 | 1 | 1 |

Q-matrix

| task | var_control | revision | graph |
| --- | --- | --- | --- |
| task_01 | 1 | 0 | 1 |
| task_02 | 1 | 1 | 0 |
| task_03 | 0 | 1 | 1 |
Why it connects to stealth assessment

Stealth assessment often aims to support learners in a targeted way during or after play. CDM is attractive when that support should be tied to specific component skills. The price is that the attribute model and Q-matrix have to be defended carefully; otherwise the diagnosis looks more precise than the evidence really is.

| CDM emphasis | What it assumes or needs | What to watch out for |
| --- | --- | --- |
| skill-by-skill diagnosis | clear attribute definitions | vague attributes make the whole model unstable |
| Q-matrix mapping | defensible task-to-attribute links | overloading too many attributes per task weakens interpretation |
| targeted support | feedback tied to specific mastery gaps | diagnostic output still needs response-process validation |
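The Q-matrix logic can be sketched with a DINA-style "ideal response" check, assuming the illustrative three-task Q-matrix above and a candidate attribute profile. This deterministic version omits the slip and guess parameters that real cognitive diagnosis models estimate; it only shows how the task-to-attribute mapping drives the diagnosis.

```python
# A minimal DINA-style sketch: given a hypothetical Q-matrix and a candidate
# attribute profile, compute the ideal response pattern (no slip or guess).

ATTRIBUTES = ["var_control", "revision", "graph"]

# Q-matrix: which attributes each task requires (1 = required).
Q = {
    "task_01": [1, 0, 1],
    "task_02": [1, 1, 0],
    "task_03": [0, 1, 1],
}

def ideal_response(profile):
    """A task is solvable iff every attribute it requires is mastered."""
    return {
        task: int(all(p >= q for p, q in zip(profile, required)))
        for task, required in Q.items()
    }

# Learner hypothesized to have mastered var_control and graph, not revision.
print(ideal_response([1, 0, 1]))  # {'task_01': 1, 'task_02': 0, 'task_03': 0}
```

Even in this toy form, the warning from the table holds: if the Q-matrix rows are wrong, every downstream mastery claim inherits that error.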

Simulation 1: same score, different process

Both learners reach the same final outcome, but they get there through very different play processes. Comparing their traces, not just their scores, shows why gameplay process matters before you infer understanding.

Simulation 2: streaming events are not the same as stable inference

In a live session the raw stream grows much faster than the pool of usable evidence: 12 events may have arrived while only 5 qualify as evidence, leaving the inference status tentative. The lesson is simple: a live event stream can grow quickly while defensible inference remains tentative.
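The gap between a raw stream and usable evidence can be sketched with a toy filter. The event names, the evidence rule, and the stability threshold below are all hypothetical; the point is only that most raw events never become evidence, so inference stays tentative even as the stream grows.

```python
# A toy illustration of raw events versus usable evidence in a live stream.
# Event vocabulary and the evidence rule are assumptions for this sketch.

stream = [
    "move", "move", "open_panel", "change_variable", "move", "run_trial",
    "move", "change_variable", "run_trial", "open_hint", "move", "move",
]

EVIDENCE_EVENTS = {"change_variable", "run_trial", "open_hint"}
MIN_EVIDENCE = 8  # illustrative threshold for calling an inference stable

evidence = [e for e in stream if e in EVIDENCE_EVENTS]
status = "stable" if len(evidence) >= MIN_EVIDENCE else "tentative"

print(len(stream), len(evidence), status)  # 12 events, 5 usable, tentative
```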

Validity and fairness need to stay in the same frame

Game-based evidence can be richer than conventional item scores, but it also creates more ways to overclaim. Unobtrusive measurement is not automatically valid measurement, and real-time traces are not automatically fair traces.

Watch for these risks
  • Local dependence: repeated attempts by the same learner can inflate apparent information.
  • Opportunity structure: some learners may encounter more or different tasks than others.
  • Timing artifacts: speed may reflect interface friction or device constraints, not skill.
  • Process ambiguity: the same action can mean exploration, confusion, or persistence depending on context.
  • Construct drift: features that predict an outcome may still fail to map cleanly to the intended competence.

The field is moving forward, but several hard problems remain open

Challenge 01
Process validity
Many systems can predict an outcome from log features, but that does not by itself prove those features are valid evidence of the intended construct. The mapping from behavior to competence still needs theory, task analysis, and validation.
Challenge 02
Dependence and opportunity
Game data are rarely IID. Repeated attempts, unequal task exposure, branching paths, and collaborative contexts complicate inference and make naive comparisons risky.
Challenge 03
Fairness and bias
As stealth assessment becomes more adaptive and more algorithmic, fairness moves from a side note to a design requirement. Bias can enter through tasks, evidence rules, model fitting, and support decisions.
Challenge 04
Actionability
A score or mastery label is not yet useful support. The hard problem is turning inference into timely, instructionally sound interventions without overreacting to noisy evidence.
| Problem area | Why it matters now | What a stronger design does |
| --- | --- | --- |
| Construct interpretation | Feature-rich logs encourage overclaiming. | Link each indicator back to a claim, task, and response-process rationale. |
| Temporal complexity | Sequences, latency, and branching paths are often where meaning lives. | Preserve time long enough to build evidence before flattening to totals. |
| Fairness | Adaptive systems can differentially misread groups or play styles. | Audit opportunity structure, response process, and differential algorithmic behavior. |
| Deployment | Research prototypes often stop at post hoc analysis. | Design explicit return paths to hints, next-task logic, teacher dashboards, and redesign loops. |

Current research is pushing toward joint modeling, fairness, and actionable support

Recent work is no longer asking only whether stealth assessment is possible. It is increasingly asking how process data can be integrated with psychometric models, how adaptive support should be triggered, how fairness should be audited, and how the whole system can operate as part of a learning-engineering loop.

Active directions as of May 3, 2026
  • Joint modeling of outcomes and process data: recent work is combining gameplay process indicators with assessment outcomes rather than analyzing them separately.
  • Adaptive feedback pipelines: explanatory IRT and related approaches are being used to connect diagnostic inference to concrete feedback decisions inside educational games.
  • Goal-aware and sequence-aware inference: newer studies are treating immediate gameplay goals and temporal traces as evidence, not just background context.
  • Fairness-centric stealth assessment: the field is starting to treat algorithmic bias and group-differential interpretation as central concerns rather than afterthoughts.
  • Learning-engineering deployment: more work is framing stealth assessment as part of an iterative design system that feeds redesign, not just measurement.
| Emerging direction | What is changing | Why it matters for this guide |
| --- | --- | --- |
| Process plus psychometrics | Sequence/process indicators are being modeled together with item or outcome data. | This supports the guide's emphasis on an evidence layer between telemetry and inference. |
| Adaptive feedback | Models are increasingly judged by what support they trigger, not only by fit. | The pipeline must continue all the way back to the learner. |
| Goal recognition and richer state inference | Immediate intent, strategy, and local goals are being treated as evidence sources. | Sequence data should not be reduced too early. |
| Fairness auditing | Bias analysis is expanding from traditional score fairness to algorithmic decision behavior. | Support rules need as much scrutiny as model outputs. |
| From prototype to system | Research is moving from one-off validation studies toward deployable learning systems. | Backward design, ECD, modeling, and support logic need to stay connected. |
New direction to watch

A notable shift in 2025-2026 is that stealth assessment is being discussed less as quiet measurement and more as a design problem involving ethical support, dynamic evidence accumulation, and system-level actionability. That is a better fit for educational game analytics and for learning engineering.

A practical workflow for a first stealth-assessment design pass

  1. Write the instructional goal in backward-design language: what should learners understand or be able to transfer?
  2. Translate that into ECD language: what claim do you want to make and what observations would count as evidence?
  3. Define comparable gameplay episodes rather than treating the whole log as one undifferentiated stream.
  4. Build an evidence layer: decide which raw events, sequences, timings, or state changes become indicators, scored responses, motifs, or episode labels.
  5. Choose an inference family only after you know whether you want a trait estimate, mastery profile, dynamic belief update, or process-pattern description.
  6. Specify how outputs will be used: formative feedback, teacher dashboard, adaptation rule, next-task selection, or redesign insight.
  7. Validate conservatively: response process, external relations, fairness, and dependence checks before promotional claims.
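Step 3 of the workflow, defining comparable episodes, can be sketched as a simple segmentation pass over a raw log. The field names (`ts`, `task_id`, `event`) are assumptions about the telemetry schema; real logs will need their own keys and boundary rules.

```python
from itertools import groupby

# A sketch of step 3: segment a raw event log into comparable episodes,
# one episode per contiguous run of the same task. Schema is hypothetical.

log = [
    {"ts": 1,  "task_id": "t1", "event": "start"},
    {"ts": 5,  "task_id": "t1", "event": "submit"},
    {"ts": 9,  "task_id": "t2", "event": "start"},
    {"ts": 12, "task_id": "t2", "event": "hint"},
    {"ts": 20, "task_id": "t2", "event": "submit"},
]

episodes = [
    {"task_id": task, "events": [e["event"] for e in group]}
    for task, group in groupby(log, key=lambda e: e["task_id"])
]

print(episodes)
```

Once the log is segmented like this, the evidence layer in step 4 operates on episodes rather than on an undifferentiated stream, which is what makes later scoring rules comparable across learners.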
Best beginner rule

If you are not yet sure what evidence would justify your claim, you are not ready to choose between IRT, CDM, or any other model family.


Practical FAQ

Is stealth assessment just hidden testing?

No. Hiddenness is only one feature. The stronger definition is that assessment is embedded in a designed activity so that learner behavior can be interpreted without interrupting the experience. If a game quietly records actions but has no defensible evidence model, that is logging, not stealth assessment.

Do I need IRT or CDM to do this well?

Not necessarily. Those are important model families, but they are not the whole field. Bayesian networks, dynamic Bayesian networks, explanatory IRT, cross-classified IRT, and process-data approaches all have legitimate roles. The right question is not “which model is fashionable?” but “what inferential question am I actually trying to answer?”

What exactly gets piped from the game into BN, IRT, CDM, or process models?

Usually not raw clicks by themselves. The game emits timestamped events and task metadata first. Those are segmented into comparable episodes. Sequence features or rules then turn those events into evidence-coded inputs, such as scored task responses, revision indicators, hint-after-failure flags, state-transition counts, or attribute-tagged task outcomes.

From there, BN consumes structured evidence indicators, IRT consumes scored responses or explanatory task features, CDM consumes attribute-linked task outcomes plus a Q-matrix, and process-data models consume the ordered trace more directly. The outputs then feed a separate action layer: hints, next-task recommendations, dashboards, or redesign notes.
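One of the evidence-coding rules mentioned above, a hint-after-failure flag, can be sketched in a few lines. The event vocabulary is an assumption about the telemetry; the rule itself is deliberately simple so that its interpretation can be defended.

```python
# A sketch of one evidence-coding rule: did the learner request a hint
# immediately after a failed attempt within an episode?
# Event names ("fail", "hint", etc.) are hypothetical telemetry labels.

def hint_after_failure(events):
    """True if any 'hint' request directly follows a 'fail' outcome."""
    return any(a == "fail" and b == "hint" for a, b in zip(events, events[1:]))

episode_a = ["attempt", "fail", "hint", "attempt", "success"]
episode_b = ["attempt", "fail", "attempt", "success", "hint"]

print(hint_after_failure(episode_a))  # True: hint directly follows a failure
print(hint_after_failure(episode_b))  # False: the hint is not failure-adjacent
```

Rules like this sit at the evidence layer: their output can feed a BN as an indicator, an explanatory IRT model as a task feature, or a CDM scoring step, while the raw clicks stay behind in the log.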

Can I infer learning in real time from a live event stream?

Sometimes, but carefully. Real-time event capture is easier than real-time psychometric inference. A stream of actions may arrive instantly while stable evidence remains thin or noisy. Distinguish event capture, feature aggregation, evidence accumulation, inference, and recalibration.

Where do sequence methods fit if I also want BN, IRT, or CDM?

Often at the evidence layer. Sequence mining, HMM-style state modeling, transition analysis, or hand-built motif rules can help define what counts as productive exploration, stuck behavior, revision, or variable control. Those sequence-derived features can then feed BN, IRT, or CDM instead of being discarded.

In other cases, sequence analysis remains a parallel analytic lens rather than a feeder model. One track supports psychometric inference, while the other helps interpret response processes, uncover strategy structure, or explain why the psychometric outputs behave as they do.

How do I combine sequence, score, and network data without making a mess?

Start by aligning learner, session, task, and time keys across all sources. Then decide the inference grain, usually episode or task rather than raw click. After that, transform each modality separately: sequence data into motifs or state labels, score data into response tables, and network data into contextual indicators.

Only then decide whether to fuse them into one model or keep parallel models with a shared decision layer. In many early-stage stealth-assessment designs, parallel modeling is safer because it preserves interpretation and makes it easier to see which modality is actually carrying the evidence.
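The key-alignment step can be sketched by joining two modalities on shared learner and task keys at the episode grain. The keys and values below are illustrative; the useful habit is making visible which episodes are missing from one modality before any fusion decision.

```python
# A sketch of aligning two modalities on shared (learner, task) keys before
# deciding whether to fuse them. Keys and labels are illustrative only.

scores = {("P01", "t1"): 1, ("P01", "t2"): 0, ("P02", "t1"): 1}
motifs = {("P01", "t1"): "systematic", ("P01", "t2"): "guessing"}

# Keep only episodes observed in both modalities; record what was dropped.
shared = sorted(set(scores) & set(motifs))
merged = [
    {"learner": l, "task": t, "score": scores[(l, t)], "motif": motifs[(l, t)]}
    for l, t in shared
]
missing_motif = sorted(set(scores) - set(motifs))

print(merged)
print(missing_motif)  # episodes with a score but no sequence label
```

Keeping the unmatched keys explicit, rather than silently inner-joining, is a small habit that makes it easier to see which modality is actually carrying the evidence.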

What changes when I take a learning-engineering perspective?

The question expands from “How do we infer learner state?” to “How do we improve a learning system over time?” In that frame, stealth assessment supports feedback, adaptation, and redesign, not just score reporting.

What do reviewers usually challenge first?

Usually one of five things: whether the task really elicits the claimed construct, whether the evidence mapping is theory-driven, whether dependence and repeated attempts were handled, whether fairness issues were considered, and whether the model output was overinterpreted as deeper understanding than the data justify.

Foundational and bridge references

Arieli-Attali, M., Ward, S., Thomas, J., Deonovic, B., & von Davier, A. A. (2019). The expanded evidence-centered design (e-ECD) for learning and assessment systems. Frontiers in Psychology, 10, 853. https://doi.org/10.3389/fpsyg.2019.00853

Baker, R. S., Boser, U., & Snow, E. L. (2022). Learning engineering: A view on where the field is at, where it’s going, and the research needed. Technology, Mind, and Behavior, 3(1). https://doi.org/10.1037/tmb0000058

Bley, S. (2017). Developing and validating a technology-based diagnostic assessment using the evidence-centered game design approach. Empirical Research in Vocational Education and Training, 9, 6. https://doi.org/10.1186/s40461-017-0049-0

Chen, F., Cui, Y., & Chu, M.-W. (2020). Utilizing game analytics to inform and validate digital game-based assessment with evidence-centered game design. International Journal of Artificial Intelligence in Education, 30(3), 481–503. https://doi.org/10.1007/s40593-020-00202-6

Hansen, E. G. (2011). Evidence-centered design for learning (ETS RM-11-02). ETS.

Kim, Y. J., Almond, R. G., & Shute, V. J. (2016). Applying evidence-centered design for the development of game-based assessments in Physics Playground. International Journal of Testing, 16(2), 142–163. https://doi.org/10.1080/15305058.2015.1108322

Kim, Y. J., & Shute, V. J. (2015). The interplay of game elements with psychometric qualities, learning, and enjoyment in game-based assessment. Computers & Education, 87, 340–356. https://doi.org/10.1016/j.compedu.2015.07.009

Kolodner, J. L. (2023). Learning engineering: What it is, why I’m involved, and why I think more of you should be. Journal of the Learning Sciences, 32(2), 305–323. https://doi.org/10.1080/10508406.2023.2190717

Lee, V. R. (2023). Learning sciences and learning engineering: A natural or artificial distinction? Journal of the Learning Sciences, 32(2), 288–304. https://doi.org/10.1080/10508406.2022.2100705

Levy, R. (2019). Dynamic Bayesian network modeling of game-based diagnostic assessments. Multivariate Behavioral Research, 54(6), 771–794. https://doi.org/10.1080/00273171.2019.1590794

Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A brief introduction to evidence-centered design. ETS. https://doi.org/10.1002/j.2333-8504.2003.tb01908.x

Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment. Journal of Educational Data Mining, 4(1), 11–48. https://doi.org/10.5281/zenodo.3554641

Rupp, A. A., Gushta, M., Mislevy, R. J., & Shaffer, D. W. (2010). Evidence-centered design of epistemic games. Journal of Technology, Learning, and Assessment, 8(4). https://ejournals.bc.edu/index.php/jtla/article/view/1623

Shute, V. J., & Ventura, M. (2013). Stealth assessment: Measuring and supporting learning in video games. MIT Press.

Shute, V. J., Wang, L., Greiff, S., Zhao, W., & Moore, G. (2016). Measuring problem solving skills via stealth assessment in an engaging video game. Computers in Human Behavior, 63, 106–117. https://doi.org/10.1016/j.chb.2016.05.047

Udeozor, C., Chan, P., Russo Abegão, F., & Glassey, J. (2023). Game-based assessment framework for virtual reality, augmented reality and digital game-based learning. International Journal of Educational Technology in Higher Education, 20, 36. https://doi.org/10.1186/s41239-023-00405-6

Anghel, E., Khorramdel, L., & von Davier, M. (2024). The use of process data in large-scale assessments: A literature review. Large-scale Assessments in Education, 12, 13. https://doi.org/10.1186/s40536-024-00202-1

Bijl, A., Veldkamp, B. P., Wools, S., & de Klerk, S. (2024). Serious games in high-stakes assessment contexts: A systematic literature review into the game design principles for valid game-based performance assessment. Educational Technology Research and Development, 72(4), 2041–2064. https://doi.org/10.1007/s11423-024-10362-0

Demedts, F., Said-Metwaly, S., Kiili, K., Ninaus, M., Lindstedt, A., Reynvoet, B., Sasanguie, D., & Depaepe, F. (2025). Adaptive feedback in digital educational games: An explanatory item response theory approach. Journal of Computer Assisted Learning, 41(5), e70104. https://doi.org/10.1111/jcal.70104

Feng, T., & Cai, L. (2024). Sensemaking of process data from evaluation studies of educational games: An application of cross-classified item response theory modeling. Journal of Educational Measurement. https://doi.org/10.1111/jedm.12396

Rahimi, S., Shute, V. J., & Almond, R. G. (2026). Stealth assessments in digital learning environments: Current trends, new directions, and ethical considerations. Journal of Research on Technology in Education, 58(1), 1–10. https://doi.org/10.1080/15391523.2025.2587551

Rahimi, S., Shute, V. J., Khodabandelou, R., Kuba, R., Babaee, M., & Esmaeiligoujar, S. (2023). Stealth assessment: A systematic review of the literature. In Proceedings of the 17th International Conference of the Learning Sciences. https://doi.org/10.22318/icls2023.395429

Sakr, A., & Abdullah, T. (2024). Virtual, augmented reality and learning analytics impact on learners, and educators: A systematic review. Education and Information Technologies, 29, 19913–19962. https://doi.org/10.1007/s10639-024-12602-5

Tao, L., Cukurova, M., & Song, Y. (2025). Learning analytics in immersive virtual learning environments: A systematic literature review. Smart Learning Environments, 12, 43. https://doi.org/10.1186/s40561-025-00381-6

Vanecek, D., Rehman, I. U., & Dobias, M. (2026). Integrating virtual reality and eye-tracking as a gamified assessment tool for self-regulated learning. Education and Information Technologies. https://doi.org/10.1007/s10639-026-13967-5

Wiggins, G., & McTighe, J. (2005). Understanding by design (Expanded 2nd ed.). ASCD.