The World of Sequence Mining — beginner blueprint

Chapter 01 · Map

Why a family of methods, not just one?

An event log — classroom turns, clinical actions, MOOC clicks, gameplay moves — is a stream of categorical events ordered in time. Researchers want very different things from the same stream:

"How likely is each next move given the current one?" → TNA
"Does B follow A more than chance would predict?" → LSA
"Which long sub-sequences recur across many sessions?" → SPM
"Are observed moves driven by a few hidden cognitive states?" → HMM

Each method makes different assumptions and produces a different artefact. Treating them as competitors is a category error — they answer different questions. The next chapter maps method to question.

Chapter 02 · Map

Method × question decision tree

Your question	Method	R package	Output you'll get
"What's the typical flow between codes?"	TNA	`tna`	Directed weighted graph with centrality + bootstrap stability
"Did groups differ in their flow structure?"	TNA (group)	`tna::group_model`	Per-group networks + permutation contrast
"Does B significantly follow A at lag 1?"	LSA	`LagSequential` / custom z	z-score matrix per lag
"Which 3-step patterns recur in ≥ 30% of sessions?"	SPM	`arulesSequences` (cSPADE)	Ranked list of frequent ordered patterns
"Are there latent states behind the codes?"	HMM	`seqHMM` / `depmixS4`	K hidden states + transition + emission matrices
"Do hidden states differ by group / covariates?"	Mixture HMM	`seqHMM::build_mhmm`	Sub-populations with distinct dynamics
"How long does each state typically last?"	Semi-Markov	`mhsmm`	State + duration distributions

Visual decision tree

Decision tree · method by questionRead top-to-bottom: question → method → refinement. Cross-method chaining is encouraged (Chapter 8).

If you start here

Confirmatory hypothesis

You wrote a hypothesis like "after teacher feedback, students respond more often than chance." → LSA with the Allison–Liker correction.

If you start here

Exploratory mapping

You don't know what to expect. You want to see the dialogue's "shape" — which moves dominate, which transitions are typical. → TNA.

If you start here

Pattern discovery

You suspect specific motifs (ASK→EXPLAIN→EXAMPLE) repeat across many sessions and you want to rank them. → SPM.

Chapter 02b · Statistical background

Why ordinary IID statistics are often the wrong starting point

Many standard statistical procedures quietly assume that observations are independent and identically distributed (IID), or at least close enough that large-sample approximations behave well. Sequence data usually break that intuition. In an event stream, what happens at time t is often constrained by what happened at t-1, t-2, or by the broader local state of the interaction.

Core idea

Order is not noise in sequence data. Order is the object of study.

If you collapse the sequence into totals or proportions too early, you destroy the very dependence structure you are trying to analyze. Sequence methods exist because temporal dependence is meaningful, not because it is a nuisance to be averaged away.

Why IID fails hereThe same events can be counted as flat rows or understood as a dependent process. Sequence analysis starts from the second view.

So does the central limit theorem fail?

Not exactly. The central limit theorem is not simply "false" for all dependent data, but you do not get to assume the simple IID version by default. Some dependent processes still admit asymptotic normality under additional mixing or stationarity conditions, but most applied behavioral datasets do not justify that leap automatically. Classroom talk, gameplay actions, clinical routines, and clickstreams are often path-dependent, bursty, and heterogeneous across sessions.

That is why sequence analysis often leans on resampling, permutation, simulation, or model-based likelihood logic instead of casually treating every event as if it were an independent draw from one big population.

What goes wrong

Naive event-level tests

If you run a standard test on raw event counts as though every turn were independent, standard errors can look too small and effects can look too certain. The dependence inflates the apparent sample size.

What to ask instead

What null structure is plausible?

For process data, the right question is often not "is the mean different?" but "different from what kind of ordered null process?" That is where permutation and sequence-specific null models become useful.

Why permutation shows up so often

Permutation logic is attractive in sequence work because it can compare the observed structure to a rearranged world where the target effect is absent, while preserving parts of the data you still believe. For example, you may want to preserve session boundaries, overall code frequencies, or group labels, but break the specific alignment that would make one transition pattern look stronger than another.

That is also why permutation is not one single thing. A good permutation scheme must respect the unit of analysis. Sometimes you permute group labels across sessions. Sometimes you shuffle within session under constraints. Sometimes you simulate from a fitted null process instead of permuting raw events. The right choice depends on which dependence structure you are trying to keep and which structure you are testing against.

Practical reading rule

When a sequence paper says "permutation test," ask two follow-up questions immediately: what exactly was permuted, and what structure was intentionally preserved?

Permutation logicA useful permutation test breaks only the alignment tied to the hypothesis while preserving the dependence structure you still believe.

Interactive permutation lab

Permutations 400

Observed group gap 0.27

Observed

0.27

Null mean

0.00

Approx. p

0.00

This mini simulation keeps the session-level scores fixed and shuffles only the group labels. That is the point: preserve the session structure, break the group assignment, and ask how unusual the observed gap looks under that null.

Observed sessions

Top row = group A, bottom row = group B

Press Run shuffle to generate a null distribution and compare the observed group gap to permutation outcomes.

How this maps onto the four methods here

TNA: often uses bootstrap or permutation-style comparisons because edge weights and centralities come from dependent event transitions, not IID rows.
LSA: explicitly asks whether an observed lagged transition is above a chance model, so the null model is built into the logic of the z statistic.
SPM: is descriptive first; the statistical challenge is usually choosing support thresholds and filtering trivial patterns, not pretending the discovered motifs came from IID events.
HMM: handles dependence by modeling it directly through latent states and transition probabilities rather than pretending successive events are independent.

Common beginner mistake

Treating a sequence dataset like a flat spreadsheet of unrelated rows is usually the fastest way to get misleading certainty. Before testing anything, decide whether the meaningful unit is the event, the session, the transition, or the entire sequence.

Chapter 03 · Data

A shared dataframe to learn on

Every figure in this guide comes from one synthetic CSV bundled at example_data/fake_dialogue_long.csv. It is not real data — transition matrices were hand-tuned so each method has something interesting to find.

Column	Type	Carries
user_id	P001..P100	Session identifier (50 treatment + 50 control).
cond	treatment / control	Group label.
turn_idx	int	Monotonic turn index within a session.
speaker	A / B	Who spoke this turn.
ASK ... ELABORATE	0 / 1	One-hot code per turn (5 codes).

First ten rows

Each row is one turn. The five code columns are 0/1; exactly one is on per turn (in this synthetic data).

user_id, cond,      turn_idx, speaker, ASK, EXPLAIN, EXAMPLE, AGREE, ELABORATE
P001,    treatment, 1,        A,        0,   1,       0,       0,     0
P001,    treatment, 2,        B,        0,   0,       0,       0,     1
P001,    treatment, 3,        A,        0,   0,       0,       0,     1
P001,    treatment, 4,        A,        0,   0,       1,       0,     0
P001,    treatment, 5,        B,        0,   0,       0,       0,     1
P001,    treatment, 6,        A,        0,   0,       0,       0,     1
P001,    treatment, 7,        B,        0,   1,       0,       0,     0
P001,    treatment, 8,        A,        0,   0,       0,       0,     1
P001,    treatment, 9,        B,        0,   0,       0,       0,     1
P001,    treatment, 10,       B,        0,   0,       0,       1,     0

Sample-size cheat sheet

Method	Practical minimum	Comfortable
TNA	~30 sequences per group, ~1,000 transitions total	100+ sequences for stable bootstrap
LSA	Each lag-1 cell expected count ≥ 5 (Bakeman & Quera)	Several hundred coded events per session
SPM	50+ sequences (otherwise “frequent” is unreliable)	Hundreds-thousands; cSPADE shines at scale
HMM (K=3-4)	~100 sequences, ~20 obs/sequence	Mixture HMM needs substantially more

Per-method, this CSV is reshaped exactly once into the structure that method needs:

Method	Reshape
TNA / HMM	Wide matrix: one row per `user_id`, columns t1..t60, cells are code names. Built via `TraMineR::seqdef()`.
LSA	Single ordered vector per session, then aggregated lag-1 transitions.
SPM	Transaction format: `sequenceID eventID size item` rows, written to a basket file and read by `arulesSequences::read_baskets()`.

Chapter 03b · Quick start

Five-minute first figure

Open R. Paste the block. You should see a TNA network of the synthetic data within a minute.

# 1) Install once (skips already-installed)
pkgs <- c("tna", "TraMineR", "arulesSequences", "seqHMM", "ggplot2")
new <- setdiff(pkgs, rownames(installed.packages()))
if (length(new)) install.packages(new,
       repos = "https://cloud.r-project.org")

# 2) Load + grab the bundled fake CSV from GitHub
library(tna); library(TraMineR)
url <- "https://educatian.github.io/sequence-mining/example_data/fake_dialogue_long.csv"
long <- read.csv(url, stringsAsFactors = FALSE)

# 3) Reshape one-hot to a single ‘code’ column, then to wide matrix
CODES <- c("ASK","EXPLAIN","EXAMPLE","AGREE","ELABORATE")
long$code <- CODES[apply(long[, CODES], 1, which.max)]
ids <- unique(long$user_id)
Lmax <- max(table(long$user_id))
W <- matrix(NA_character_, length(ids), Lmax,
            dimnames = list(ids, paste0("t", 1:Lmax)))
for (uid in ids) {
  s <- long[long$user_id == uid, ]
  W[uid, 1:nrow(s)] <- s$code[order(s$turn_idx)]
}

# 4) Fit + plot
seq_obj <- seqdef(W, alphabet = CODES)
m <- build_model(seq_obj)
plot(m)              # your first TNA figure
centralities(m)      # in/out degree etc.

Tip · Once you have seq_obj

You can immediately call seqfplot(seq_obj) for code-frequency plots and seqIplot(seq_obj) for index plots — from TraMineR. Those visualisations are method-agnostic and worth knowing before any modelling.

Chapter 04 · Method

TNA — Transition Network Analysis

TNA models your stream as a first-order Markov chain (next state depends only on current state) and renders the resulting transition matrix as a directed weighted graph — with centrality, bootstrap edge stability, community detection, and permutation-based group comparison built in.

Use when

"I want a network picture of how my codes flow."

Especially good when you want one figure that summarises an entire cohort's typical dynamics, or when you want to compare the structure of two cohorts.

Original guides

Use this page as the on-ramp, then jump to the primary TNA materials.

Primer + tutorial: Advanced Learning Analytics Methods, Chapter 15 · Package site: sonsoles.me/tna · Software paper: Applied Psychological Measurement article

Follow-on extensions: FTNA tutorial (Chapter 16) and TNA clusters / heterogeneity tutorial (Chapter 17).

Minimal code

library(tna); library(TraMineR)
seq_obj <- seqdef(wide_matrix, alphabet = CODES)
m <- build_model(seq_obj)            # fit first-order TNA
plot(m)                                # transition graph
centralities(m)                        # in/out degree, betweenness, closeness, ...
boot <- bootstrap(m)                 # edge stability
gm <- group_model(seq_obj, group = cond)  # per-group
permutation_test(gm)                   # test group difference

TNA · full cohortEdge labels = first-order transition probability. Self-loops show how often a state stays put. The strong ASK→EXPLAIN→EXAMPLE→ELABORATE backbone reflects the synthetic transition matrix.

TNA · per-groupTreatment vs Control side-by-side. Each panel uses the same node positions; visual differences reflect transition-probability differences.

Famous gotcha

TNA's default model is first-order Markov. Long-range dependencies and durations are invisible. If your data has clear "phases" (e.g. exploration → exploitation), consider FTNA, clustered TNA, or HMM instead.

Chapter 05 · Method

LSA — Lag Sequential Analysis

LSA is the classic confirmatory test: given antecedent A, is consequent B significantly more likely at lag k than chance? It produces a z-score per (A, B, lag) cell. Popularised by Bakeman & Gottman; Allison & Liker (1982) corrected the variance for the dependence between row and column counts.

Use when

"I have a specific hypothesis about which transitions matter."

And you want a defensible significance test rather than just a pretty graph.

Original guides

LSA has a longer methods history than the R ecosystem around it.

Canonical book: Bakeman & Quera's Sequential Analysis and Observational Methods for the Behavioral Sciences · Chapter preview: Time-Window and Log-Linear Sequential Analysis

R implementation docs: LagSequential manual, package overview + vignette index, and O'Connor (1999) for the original SAS/SPSS program lineage.

Minimal code (no external package)

# Reshape long-format CSV into a list of code-vectors (one per session)
sessions <- split(long$code, long$user_id)
CODES <- c("ASK","EXPLAIN","EXAMPLE","AGREE","ELABORATE")
trans <- matrix(0, length(CODES), length(CODES),
                dimnames = list(CODES, CODES))
# Build lag-1 transition counts
for (s in sessions) {
  for (i in 1:(length(s) - 1))
    trans[s[i], s[i+1]] <- trans[s[i], s[i+1]] + 1
}
# Allison-Liker z
N <- sum(trans); rsum <- rowSums(trans); csum <- colSums(trans)
pA <- rsum / N; pB <- csum / N
expected <- outer(rsum, pB)
var_AL  <- outer(rsum, pB * (1 - pB)) * outer(1 - pA, rep(1, length(pB)))
z <- (trans - expected) / sqrt(var_AL)

LSA · AL z-scoresRead row to column. The +20.6 cell (ASK→EXPLAIN) confirms the synthetic backbone. Negative z (teal) means the consequent occurs less than chance after that antecedent.

Reading the z values

When is a cell “significant”?

Under the standard normal approximation, |z| > 1.96 is the two-sided p < .05 threshold; |z| > 2.58 is p < .01. In our heatmap, ASK→EXPLAIN at z = 20.6 is many orders past chance — expected for hand-tuned synthetic data, not realistic for a real corpus.

Multiple-comparison warning. A 5×5 lag-1 table tests 25 cells at once; if you also test multiple lags (1, 2, 3), the family-wise error rate inflates fast. Apply Bonferroni (divide alpha by the number of cells you test) or report only effects with |z| > 3 as a coarse safeguard.

Lag k > 1

To test "does B follow A two turns later," replace s[i+1] with s[i+k] in the inner loop and adjust the loop bound to length(s) - k. The Allison–Liker formula stays the same. In practice researchers report lag 1 and lag 2; longer lags require much more data.

Famous gotcha

Raw z-scores are inflated when codes can repeat. Always use the Allison–Liker correction (or its equivalent) for sampling dependence; otherwise spurious "significance" is routine.

Chapter 06 · Method

SPM — Sequential Pattern Mining

SPM looks for variable-length sub-sequences that occur in at least minSupport fraction of your sessions — without imposing any Markov assumption. The cSPADE algorithm (Zaki 2001) implemented in arulesSequences is the standard R route; SPMF (Java) and PrefixSpan (Python) cover the rest.

Use when

"I want frequent ordered motifs across many sessions."

Especially useful when motifs span more than two events — LSA only looks at lags one at a time, SPM finds sub-sequences of arbitrary length.

Original guides

For SPM, the algorithm paper and software manuals matter as much as the concept.

Foundational algorithm: Zaki (2001) SPADE paper · R docs: arulesSequences on CRAN and cSPADE function manual

Broader algorithm library: SPMF documentation + examples · Python route: PrefixSpan-py.

Minimal code

library(arulesSequences)
# Step 1 — write a basket file (one row per event):
#   sequenceID  eventID  size  item
# Example first 4 lines of "baskets.txt":
#   P001 1 1 ASK
#   P001 2 1 EXPLAIN
#   P001 3 1 EXAMPLE
#   P001 4 1 ELABORATE
# Step 2 — read it back as a transactions object:
trans <- read_baskets("baskets.txt",
                       info = c("sequenceID","eventID","SIZE"),
                       sep = " ")
patterns <- cspade(trans,
                   parameter = list(support = 0.3,
                                    maxsize = 1, maxlen = 4))
df <- as(patterns, "data.frame")
df$n_items <- stringr::str_count(df$sequence, ",") + 1
df_focus <- subset(df, n_items >= 2 & support < 0.98)
df_focus[order(-df_focus$support, -df_focus$n_items), ]   # more interpretable top patterns

SPM · support ceiling problemLeft: raw top-support bars flatten the story. Right: a filtered 2+ item motif view gives support spread and a more interpretable narrative.

Practical plotting rule

Do not make the raw top-support bar chart your final figure.

If the first ranks are all 1.00 or 0.99, the chart is telling you only that some events occur almost everywhere. For an interpretable figure, filter to multi-item patterns, optionally remove near-ceiling supports, and rank within that reduced set.

Bad first plot

Singletons + ceiling support

Bars all look the same, the labels carry no narrative, and readers cannot tell whether the dataset has meaningful motif structure or just globally common codes.

Better final plot

2+ item motifs with spread

Keep only patterns of length at least 2, then plot the top 10 or top 15 with visible support spread. That is where recurring instructional or behavioral motifs start to become legible.

Famous gotcha

minSupport is a tuning knife. Too low → combinatorial explosion (millions of patterns, mostly noise). Too high → only trivial patterns survive. Always pair with closed/maximal pattern filtering.

Chapter 07 · Method

HMM — Hidden Markov Models

HMM lets the state be latent. Observed codes are noisy emissions from a smaller set of unobserved cognitive/behavioural states. The model learns (1) emission probabilities P(code | state), (2) transition probabilities P(state_t+1 | state_t), and (3) the most-likely state sequence per observation (Viterbi).

Use when

"My codes are measurements of a deeper construct, not the construct itself."

e.g. you suspect "exploring", "consolidating", and "stuck" states drive the visible moves but you can't observe them directly.

Original guides

HMM is the one chapter where the package vignettes are almost mandatory reading.

Main R entry point: seqHMM package index · Core paper: seqHMM Journal of Statistical Software article · Estimation tips: seqHMM estimation vignette

Algorithm details: seqHMM algorithms vignette · Alternative R package: depmixS4 JSS paper · Python route: hmmlearn tutorial.

Minimal code

library(seqHMM); library(TraMineR)
seq_obj <- seqdef(wide_matrix, alphabet = CODES)
hmm0 <- build_hmm(observations = seq_obj, n_states = 3)
fit  <- fit_model(hmm0,
                  control_em = list(restart = list(times = 5)))
fit$model$transition_probs        # K x K state-transition matrix
fit$model$emission_probs          # K x V code-emission matrix
plot(fit$model)                   # observed + hidden state plot

HMM · K=3 hidden statesTop: observed code per turn per session. Bottom: most-likely hidden state assignment. Compact contiguous bands suggest the state model captures real structure rather than noise.

Choosing K with BIC

# Fit K = 2..6 and compare BIC
bic_table <- sapply(2:6, function(K) {
  set.seed(K)
  fit <- fit_model(build_hmm(seq_obj, n_states = K),
                    control_em = list(restart = list(times = 5)))
  BIC(fit$model)
})
best_K <- (2:6)[which.min(bic_table)]
cat("Best K by BIC:", best_K, "\n")

Pick the K with the lowest BIC, but also check that the gain from K to K+1 is meaningful. A drop of 2–6 BIC is "weak"; >10 is "strong" (Kass & Raftery 1995). For very long sequences BIC tends to over-penalise — cross-validated log-likelihood is more honest but takes longer to compute.

Hard state assignments (Viterbi)

paths <- hidden_paths(fit$model)   # most-likely state per turn per session
table(paths)                       # overall state usage

Run TNA on paths to get a transition graph in latent space — often more interpretable than the raw-code TNA.

Famous gotcha

Choosing K (number of hidden states) is hard and consequential. EM is local-optimum prone — run many random initialisations and compare BIC / AIC / cross-validated log-likelihood. Picking K by eyeballing is the #1 reproducibility failure in applied HMM papers.

Chapter 08 · Compare

Side-by-side: same data, four lenses

To make the family relationship vivid: all four figures above were fit on exactly the same 4,136-turn synthetic CSV. They each tell a different story.

Method	Headline number	What it answers
TNA	P(EXPLAIN \| ASK) = 0.53	In the synthetic example used here, ASK was most often followed by EXPLAIN.
LSA	z(ASK→EXPLAIN) = +20.6	In the same synthetic example, that follow-up occurred more often than the chance model would predict.
SPM	support(EXPLAIN,EXPLAIN,ASK) = 0.99	In the same synthetic example, this 3-step pattern appeared in 99% of sessions.
HMM (K=3)	logLik = −6271.79	In the same synthetic example, a K=3 state model produced an interpretable transition structure.

When to chain them

TNA → HMM: Use TNA to confirm the data has structure; use HMM to ask whether that structure is driven by latent states.
SPM → LSA: Use SPM exploratorily to surface frequent motifs; use LSA to test whether the surfaced motifs are statistically meaningful.
LSA → TNA: Use LSA to identify which transitions are above chance; visualise only those edges in TNA for a cleaner figure.
HMM → TNA: Run TNA on the inferred hidden-state sequence to get a transition graph in latent space.

Chapter 09 · Troubleshoot

Troubleshooting matrix

Install issues (do this first)

Symptom	Cause	Fix
`installation of package 'tna' had non-zero exit status`	Windows: missing RTools	Install RTools45 at default path; reopen R.
`arulesSequences` won't install	Depends on `arules` & C++ toolchain	`install.packages("arules"); install.packages("arulesSequences")` in that order.
SPMF Java errors	SPMF (the Java library) needs JRE 8+	Install OpenJDK 17 from adoptium.net; or stick to `arulesSequences`.
`seqHMM` compile fails on macOS	Missing Fortran compiler	Install gfortran via CRAN tools page.
CRAN timeout / 404	Default mirror down	Try `install.packages(..., repos = "https://cloud.r-project.org")`.

Runtime issues

Symptom	Likely cause	Fix
TNA: `could not find function "build_tna"`	Old API name	Use `build_model()` (tna 1.2.x).
seqdef: "found missing values ('NA')"	Sessions of unequal length	That's fine — `seqdef` codes voids; the message is informational.
LSA: NaN cells in z matrix	Row or column count is 0	Drop never-occurring codes or replace NaN with 0 after computing z.
cSPADE: 0 patterns returned	support too high	Lower `parameter = list(support = ...)` until you get patterns.
cSPADE: millions of patterns	support too low / no maxlen	Add `maxlen`, `maxsize` constraints; consider closed patterns.
HMM: degenerate states (all P≈0)	EM stuck in local optimum	Pass `control_em = list(restart = list(times = N))` with N ≥ 5.
HMM: K too high → overfit	BIC keeps rising with K	Use cross-validated log-likelihood; pick K where CV peaks, not BIC.
All methods: very few sessions per group	n < 30 per group	Bootstrap heavily; report uncertainty; prefer descriptive over inferential framing.

Universal caveat

None of these methods are causal. A heavy ASK→EXPLAIN edge tells you about the structure of the dialogue, not about what would happen if you intervened. Causal claims need a causal design.

Chapter 09b · FAQ

Practical FAQ for first sequence-mining projects

Why this section exists

Most early mistakes in sequence mining happen before the code runs: the wrong method is chosen, the event stream is too thin, or the interpretation overreaches what the output can support. These are the questions reviewers and beginners usually ask first.

How do I choose between TNA, LSA, SPM, and HMM?

Start with the research question, not the most sophisticated-looking method. TNA is best when you want to show the overall flow between codes. LSA is best when you need to test whether a specific lagged transition occurs more or less than chance. SPM is best when you want to discover longer recurring motifs. HMM is best when you think the visible code stream is an imperfect signal of a smaller set of hidden states.

The main mistake is using a method because it produces an attractive figure. A network graph is not automatically better than a motif table, and a latent-state model is not automatically deeper than a transition model. Each one answers a different question.

Practical rule: write the research question in one sentence first, then ask which output object answers that sentence most directly.

Can I use more than one method on the same dataset?

Yes. In many papers, the strongest design is a small sequence-mining pipeline rather than a single method. For example, TNA can describe the overall flow, LSA can test whether key lag-1 transitions are above chance, and SPM can surface longer motifs that are not obvious in the graph.

The important part is keeping the questions distinct. If you use multiple methods, each one should add a non-redundant layer of evidence rather than repeating the same descriptive point in a different format.

Practical rule: use method A to surface structure and method B to test, sharpen, or reinterpret that structure.

What is the minimum viable dataset for sequence mining?

There is no universal cutoff, but event sparsity is the common failure mode. If most sessions contain only a few coded events, TNA edges become unstable, LSA cells collapse toward zero counts, SPM returns either nothing or only trivial patterns, and HMM has almost no temporal signal to learn from.

What matters more than raw participant count is repeated coded structure: multiple events per session, recurring codes across sessions, and enough cases to stabilize estimates. A smaller dataset with dense sessions is often more usable than a larger dataset with only one or two events per case.

Practical rule: before modeling, inspect session length, code frequency, and recurrence across cases. If those are weak, the sequence method will usually be weak too.

Do I need timestamps, and what if my sessions have different lengths?

Stable order within each session is usually enough. TNA, LSA, SPM, and HMM all rely on sequence order first. Exact timestamps matter when you want to make claims about spacing, elapsed time, pauses, or tempo, but not for basic ordered-event analysis.

Unequal session lengths are normal. TNA, LSA, and SPM can all handle them as long as session boundaries are preserved. HMM can also handle unequal lengths, but extremely short sessions contribute little information about state transitions.

Practical rule: protect within-session order and session boundaries first; treat clock time as optional unless timing itself is part of the theory.

Can I compare treatment and control groups with these methods?

Yes, but be explicit about what differs. TNA compares flow structure, LSA compares transition tendencies, SPM compares motif prevalence, and HMM compares latent-state dynamics or emission structure. Those are process differences, not automatically intervention effects.

In a paper, it is safer to say that groups differed in sequence structure or estimated state dynamics unless the study design supports causal language. Reviewers often object when process patterns are narrated as if they proved what the intervention did.

Practical rule: describe group differences in structure, motifs, or states; reserve causal language for designs that justify causality.

Why do my top SPM patterns all have support 1.00 or 0.99?

This usually means you are looking at trivial, nearly universal patterns rather than informative motifs. Single items or very common short subsequences often dominate the ranking because they appear in almost every session, not because they tell the most interesting story.

That is why the raw top-support plot is often a poor final figure. The more interpretable view is usually a filtered ranking of multi-item patterns, optionally with ceiling-support patterns removed and closed or maximal filtering applied.

Practical rule: for presentation, filter to 2+ item patterns and rank within a support range that shows real spread.

When is HMM worth the extra complexity?

Use HMM when you have a real latent-state theory. If your codes are already the phenomenon of interest, a direct transition or motif method may be enough. HMM becomes worthwhile when you believe observed actions are noisy emissions from a smaller hidden process such as exploration, consolidation, or confusion.

HMM also demands more modeling discipline: choosing K, checking local optima, examining emission interpretability, and avoiding overfit. If you do not need a latent-state claim, it may be complexity without payoff.

Practical rule: only choose HMM if your paper needs a latent-state argument, not just a temporal pattern description.

Should I report only the prettiest figure?

No. Sequence-mining figures are persuasive, but reviewers also need the analytic spine: how sessions were defined, how codes were constructed, which method was chosen and why, what uncertainty or diagnostics were checked, and what preprocessing decisions affected the result.

A good figure should be the visual summary of a coherent analytic decision chain, not a substitute for that chain. This is especially true when the figure is based on synthetic or highly tuned examples.

Practical rule: report the question, sequence definition, preprocessing, model settings, and one figure that directly answers the question.

What do reviewers usually challenge in sequence-mining papers?

The three most common targets are method fit, coding reliability, and over-interpretation. Reviewers want to know why this particular method fits the question, whether the codebook is stable and meaningful, and whether the claims go beyond what the output can support.

They also watch for visualization artifacts: thick edges without uncertainty, motif lists with no filtering rationale, or latent states that have been named more confidently than the emissions justify.

Practical rule: be ready to justify the method, defend the code construction, and explain why the visual pattern is analytically meaningful rather than merely decorative.

Green flags

Repeated event codes, clear session boundaries, a defensible codebook, and a method whose output directly answers the stated question.

Red flags

One-off events, unstable coding, tiny per-group sample sizes, or a paper claim that treats exploratory pattern structure as if it were intervention evidence.

Chapter 10 · Cheat sheet

One-page cheat sheet

TNA

tna::build_model

build_model(seq_obj)
group_model(seq_obj, group=g)
centralities(m)
bootstrap(m); plot(boot)
permutation_test(group_model)

Output: directed weighted graph
+ centrality + edge stability.

LSA

Lag-Sequential (custom)

trans <- count_lag1(sessions)
expected <- outer(rsum, csum/N)
var_AL   <- outer(rsum, p*(1-p)) *
            outer(1-pA, 1)
z <- (trans - expected) /
     sqrt(var_AL)

Output: K×K z-score matrix.
Significance: |z| > 1.96.

SPM

arulesSequences::cspade

trans <- read_baskets(file)
patterns <- cspade(trans,
  parameter = list(
    support = 0.3,
    maxlen  = 4,
    maxsize = 1))
as(patterns, "data.frame")

Output: ranked frequent
sub-sequences with support.

HMM

seqHMM::build_hmm

hmm0 <- build_hmm(seq_obj,
                  n_states = K)
fit  <- fit_model(hmm0,
  control_em =
    list(restart =
         list(times = 5)))
hidden_paths(fit$model)

Output: K hidden states +
transition + emission matrices.

Inputs / outputs at a glance

Method	Input	Output	Headline metric
TNA	TraMineR `stslist`	Directed weighted graph object	edge weight = P(next \| current)
LSA	List of code-vectors per session	K×K z-score matrix	z above ±1.96 = chance-rejected
SPM	Basket-format transactions	data.frame of patterns + support	support = fraction of sessions
HMM	TraMineR `stslist`	K-state model + paths	logLik (compare with BIC)

Chapter 11 · Credits

References — cite the originals

TNA

Tikka, S., López-Pernas, S., & Saqr, M. (2025). tna: An R Package for Transition Network Analysis. Applied Psychological Measurement. doi:10.1177/01466216251348840
Saqr, M., López-Pernas, S., Törmänen, T., Kaliisa, R., Misiejuk, K., & Tikka, S. (2025). Transition Network Analysis: A Novel Framework. LAK '25 Proceedings. doi:10.1145/3706468.3706513
Package site: sonsoles.me/tna/ · tutorials at lamethods.org/book2/ (chapters 15–17 cover TNA / FTNA / TNA-clusters).

Lag Sequential Analysis

Bakeman, R., & Quera, V. (2011). Sequential Analysis and Observational Methods for the Behavioral Sciences. Cambridge University Press. Book page · Chapter 11 preview
Allison, P. D., & Liker, J. K. (1982). Analyzing sequential categorical data on dyadic interaction. Psychological Bulletin, 91(3), 393–403. doi:10.1037/0033-2909.91.3.393 (The variance correction we use.)
O'Connor, B. P. (1999). Simple and flexible SAS and SPSS programs for analyzing lag-sequential categorical data. Behavior Research Methods, Instruments, & Computers, 31, 718–726. doi:10.3758/BF03200753
Draper, Z. A., & O'Connor, B. P. LagSequential [R package]. CRAN · reference manual · vignette / docs index.

Sequential Pattern Mining

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42, 31–60. Springer page (cSPADE adds constraints.)
Buchta, C., & Hahsler, M. arulesSequences [R package]. CRAN · cSPADE function manual.
Fournier-Viger, P. et al. SPMF — Java reference library with 55+ algorithms. official docs + examples
Python: PrefixSpan-py.

HMM

Helske, S., & Helske, J. (2019). Mixture Hidden Markov Models for Sequence Data: The seqHMM Package in R. Journal of Statistical Software, 88(3). doi:10.18637/jss.v088.i03
seqHMM docs: package index · estimation vignette · algorithms vignette.
Visser, I., & Speekenbrink, M. (2010). depmixS4: An R Package for Hidden Markov Models. Journal of Statistical Software, 36(7). JSS page.
Python: hmmlearn · tutorial.

Important

This is a community on-ramp, not the canonical source.

Numbers in this guide come from a synthetic dataset bundled here for demonstration. Wording is summarised in our own words to be beginner-accessible. For methodological depth, defer to the publications above.