Preview: this leaderboard is in preview mode. Rows shown below are illustrative; real submissions begin Jul 1, 2026 (warm-up) and Aug 1, 2026 (sealed final phase).
Public leaderboard · live during final phase

Leaderboard.

One ranking per track. Refreshed every 15 minutes during the sealed phase. Top-3 per track go through a reproducibility audit at test-freeze — only audited entries appear in the final NeurIPS rankings.

Warm-up: Jul 1 – Jul 31 · Sealed: Aug 1 – Sep 1 · Audit: Oct 1 · Final: Nov 1
Submissions snapshot · placeholder refresh · 15 min
0 teams
0 submissions
5 tracks
Track 01 · EEG-to-IMG placeholder rows

Evoked visual retrieval, top-5 accuracy.

Rank held-out candidate images from a single EEG epoch. Targets are frozen DINOv2-giant embeddings, so the score isolates the EEG side. Controlled shift: test stimuli are unseen during training. Higher is better.

Metric: Top-5 retrieval accuracy
Tie-break: Top-1 accuracy
Sponsor: Alljoined
# Submission Affiliation Top-5 Δ
#01 placeholder-model-1 Team A XX.X
#02 placeholder-model-2 Team B XX.X −X.X
#03 placeholder-model-3 Team C XX.X −X.X
#04 placeholder-model-4 Team D XX.X −XX.X
#05 placeholder-model-5 Team E XX.X −XX.X
#06 placeholder-model-6 Team F XX.X −XX.X
N = X of XX submissions · last update X min ago
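To make the Track 01 metric concrete, here is a minimal sketch of top-k retrieval accuracy. The function name, the array shapes, and the use of cosine similarity are assumptions made for illustration; the official evaluator may score similarity differently.

```python
import numpy as np

def top_k_retrieval_accuracy(pred_emb, cand_emb, target_idx, k=5):
    """Fraction of queries whose true candidate is among the k most similar.

    pred_emb:   (n_queries, d) embeddings predicted from EEG epochs
    cand_emb:   (n_candidates, d) frozen image embeddings
    target_idx: (n_queries,) index of the correct candidate per query
    Cosine similarity is an assumption of this sketch.
    """
    pred = np.asarray(pred_emb, dtype=float)
    cand = np.asarray(cand_emb, dtype=float)
    target_idx = np.asarray(target_idx)
    # L2-normalise so the dot product equals cosine similarity.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    sim = pred @ cand.T
    # Indices of the k most similar candidates for each query.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = (top_k == target_idx[:, None]).any(axis=1)
    return float(hits.mean())
```

Calling the same function with `k=1` gives the top-1 accuracy used as the tie-break.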
Track 02 · BCI decoding placeholder rows

Calibration-stable command decoding.

Predict the cued command (motor imagery, mental math, word association) on later sessions of the same subject, with no per-session recalibration. The score reflects calibration-free stability across session drift. Higher is better.

Metric: Balanced accuracy
Tie-break: Mean per-class F1
Sponsor: Meta FAIR Brain & AI
# Submission Affiliation Bal. Acc Δ
#01 placeholder-model-1 Team A XX.X
#02 placeholder-model-2 Team B XX.X −X.X
#03 placeholder-model-3 Team C XX.X −X.X
#04 placeholder-model-4 Team D XX.X −X.X
#05 placeholder-model-5 Team E XX.X −X.X
#06 placeholder-model-6 Team F XX.X −XX.X
N = X of XXX submissions · last update X min ago
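The Track 02 metric and tie-break can be sketched as follows. This is an illustrative NumPy version using the standard definitions (mean per-class recall, unweighted mean of per-class F1); the function names are hypothetical and the official scorer may handle edge cases differently.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every command class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def mean_per_class_f1(y_true, y_pred):
    """Unweighted mean of F1 over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        # tp + fn >= 1 because class c occurs in y_true, so no division by zero.
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(f1s))
```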
Track 03 · Sleep onset placeholder rows

Latency to stable N2 on wearable EEG.

Predict seconds from recording start to the first stable N2 epoch, on consumer-grade wearable EEG. The shift is to a sparse home-wearable montage, which is too narrow to support full per-epoch staging but still resolves onset timing. Lower is better.

Metric: Mean absolute error (s)
Tie-break: Median absolute error
Sponsor: InteraXon
# Submission Affiliation MAE (s) Δ
#01 placeholder-model-1 Team A XXX.X
#02 placeholder-model-2 Team B XXX.X +X.X
#03 placeholder-model-3 Team C XXX.X +X.X
#04 placeholder-model-4 Team D XXX.X +X.X
#05 placeholder-model-5 Team E XXX.X +XX.X
#06 placeholder-model-6 Team F XXX.X +XX.X
N = X of XX submissions · last update XX min ago
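For Track 03, both the metric and the tie-break are functions of the absolute latency error per recording. A minimal sketch, with a hypothetical function name and made-up latencies:

```python
import numpy as np

def latency_errors(pred_s, true_s):
    """Return (mean, median) absolute error in seconds over recordings."""
    err = np.abs(np.asarray(pred_s, dtype=float) - np.asarray(true_s, dtype=float))
    return float(err.mean()), float(np.median(err))

# Illustrative values only: three recordings, latencies in seconds.
mae, medae = latency_errors([600, 900, 1500], [660, 900, 1200])
```

The median is less sensitive than the mean to a single recording with a very late or very early prediction, which is why the two can order submissions differently.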
Track 04 · EMG-to-Text placeholder rows

Wristband EMG to typed text.

Transduce typed keystrokes from wristband surface EMG. The controlled shift is cross-user: held-out subjects vary in forearm anatomy, typing strategy, and sensor re-placement, so the score rewards user-invariant features rather than per-user templates. Lower is better.

Metric: Character error rate (%)
Tie-break: Word error rate
Sponsor: Meta Reality Labs
# Submission Affiliation CER (%) Δ
#01 placeholder-model-1 Team A XX.X
#02 placeholder-model-2 Team B XX.X +X.X
#03 placeholder-model-3 Team C XX.X +X.X
#04 placeholder-model-4 Team D XX.X +X.X
#05 placeholder-model-5 Team E XX.X +XX.X
N = X of XX submissions · last update X min ago
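Character error rate for Track 04 is conventionally the Levenshtein edit distance divided by the reference length. The sketch below pools edits and characters over all utterances before dividing; that pooling, and the function names, are assumptions of this example rather than a statement of how the evaluator aggregates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(refs, hyps):
    """CER in percent, pooled over all reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars
```

Word error rate follows from the same routine by passing lists of words, for example `edit_distance(ref.split(), hyp.split())`, and dividing by the reference word count.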
Track 05 · Foundation transfer placeholder rows

One shared encoder across all four tracks.

Rank shared encoders by their mean rank across EEG-to-IMG, BCI, Sleep, and EMG. The visible score is presented on a 0–100 scale where 100 indicates first place on every track; internally it is the negated mean rank, so that confidence intervals stay higher-is-better. Higher is better.

Metric: Mean rank score (0–100)
Constraint: Single shared encoder
Audit: Weights identity check
# Encoder Affiliation Mean rank Δ
#01 placeholder-model-1 Team A XX.X
#02 placeholder-model-2 Team B XX.X −X.X
#03 placeholder-model-3 Team C XX.X −X.X
#04 placeholder-model-4 Team D XX.X −XX.X
#05 placeholder-model-5 Team E XX.X −XX.X
#06 placeholder-model-6 Team F XX.X −XX.X
N = X of XX submissions · last update XX min ago
Scoring & refresh policy

How a submission becomes a number on this page.

The scoring code is open-source and identical between local NeuralBench task runs and the Codabench server. The only thing the server adds is the sealed test split.

Refresh cadence 15 min

Live during the final phase.

Codabench evaluates each upload immediately. The leaderboard page on this site is regenerated every 15 minutes, so there can be a short lag between submission and what you see here.

Final phase: Aug 1 – Sep 1, 2026
Daily cap: 5 / team / day
Aggregation BEST-OF-5

Final score = best of last five.

The public board shows your best-ever number. The final NeurIPS ranking, however, only considers your last five submissions. This rewards focused iteration over exhaustive lottery search.

Public board: Best ever
Final ranking: Best of last 5
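The difference between the two aggregation rules is easiest to see on a submission history. A minimal sketch, assuming scores are listed in chronological order; the function names and the numbers are hypothetical.

```python
def public_board_score(scores, higher_is_better=True):
    """Best score over the whole submission history."""
    pick = max if higher_is_better else min
    return pick(scores)

def final_ranking_score(scores, higher_is_better=True, window=5):
    """Best score among the last `window` submissions only."""
    pick = max if higher_is_better else min
    return pick(scores[-window:])

# Seven submissions, oldest first; the early 74.9 is outside the last five.
history = [71.2, 74.9, 70.1, 70.5, 72.3, 71.8, 73.0]
public = public_board_score(history)   # 74.9
final = final_ranking_score(history)   # 73.0
```

For the lower-is-better tracks (MAE, CER), pass `higher_is_better=False` so the minimum is taken instead.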
Reproducibility audit OCT 1

Top-3 per track replay from config.

We re-run the committed training pipeline against the sealed split. If the replayed score falls within ±2 σ of the submitted score, the entry stays on the board; otherwise it is dropped. The audit is led by Arnaud Delorme (EEGLAB).

Tolerance: ±2 σ on metric
Audit window: Oct 1 – Nov 1
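The tolerance rule amounts to a single comparison. In the sketch below, `sigma` is taken as an input because this page does not state how it is estimated; the function name and the numbers are hypothetical.

```python
def passes_audit(submitted, replayed, sigma, n_sigma=2.0):
    """True if the replayed score lies within n_sigma * sigma of the submitted one."""
    return abs(replayed - submitted) <= n_sigma * sigma

# With sigma = 0.5 the tolerance band is 1.0 metric units on either side.
stays = passes_audit(submitted=74.0, replayed=73.1, sigma=0.5)   # True
drops = passes_audit(submitted=74.0, replayed=72.8, sigma=0.5)   # False
```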
Formal definitions

Scoring math, mirrored from the proposal.

These are the equations the evaluator actually runs. They match the proposal sections on error bars, test-set sizing, and overall ranking, and are reproduced here for participants who want to reason about score variance and ranking before submitting.

Prediction, unit score, and track score

A submission \(a\) receives a hidden signal \(X_{t,i}\) and metadata \(m_{t,i}\) for track \(t\), then writes a prediction \(\hat{y}_{a,t,i}\). The evaluator keeps \(y_{t,i}\) hidden and computes the score.

\[ \hat{y}_{a,t,i} = f_a(X_{t,i}, m_{t,i}) \]

Examples are first collapsed into independent bootstrap units \(u \in U_t\): subject-image query blocks for EEG-to-IMG, subject-session-context cells for BCI, recordings for sleep onset, and user-session blocks for EMG-to-text. Each unit gets an oriented contribution \(s_{a,t,u}\), where higher is always better; for MAE and CER we use the negative error internally.

\[ s_{a,t,u} = \mathrm{score}_t(\hat{y}_{a,t,u}, y_{t,u}) \]

\[ \mathcal{S}_{a,t} = \frac{1}{|U_t|}\sum_{u \in U_t} s_{a,t,u} \]

The visible leaderboard for track \(t\) is the point-estimate ordering of \(\mathcal{S}_{a,t}\).

python · predictions to S_a,t
import numpy as np

def build_unit_scores(y_pred, y_true, unit_ids, score_unit, lower_is_better=False):
    """Return s_{a,t,u} after collapsing examples into units."""
    y_pred, y_true, unit_ids = map(np.asarray, (y_pred, y_true, unit_ids))
    scores = []
    for unit in np.unique(unit_ids):
        idx = unit_ids == unit  # boolean mask selecting this unit's examples
        value = score_unit(y_pred[idx], y_true[idx])
        # Negate errors (MAE, CER) so every unit score is higher-is-better.
        scores.append(-value if lower_is_better else value)
    return np.asarray(scores, dtype=float)

def track_score(unit_scores):
    """Compute S_{a,t}; all returned scores are higher-is-better."""
    return float(np.mean(unit_scores))

Confidence interval, p-value, and rank stability

For bootstrap draw \(b\), the evaluator resamples independent units \(U_t^{(b)}\) and recomputes each team's score. Pairwise uncertainty is calculated on the paired score difference, not from two separate confidence intervals.

\[ \mathcal{S}^{(b)}_{a,t} = \frac{1}{|U_t^{(b)}|}\sum_{u \in U_t^{(b)}} s_{a,t,u} \]

\[ \Delta^{(b)}_{a,c,t} = \mathcal{S}^{(b)}_{a,t} - \mathcal{S}^{(b)}_{c,t} \]

\[ \begin{aligned} \mathrm{CI}_{95}(\Delta_{a,c,t}) = \big[&q_{0.025}(\Delta^{(b)}_{a,c,t}),\\ &q_{0.975}(\Delta^{(b)}_{a,c,t})\big] \end{aligned} \]

If this interval contains zero, neighbouring teams are flagged as statistically indistinguishable. For prize-relevant comparisons, the two-sided bootstrap p-value is Holm-adjusted.

\[ \begin{aligned} p_{\mathrm{boot}} = 2\min\big(&\Pr_b[\Delta^{(b)}_{a,c,t} \le 0],\\ &\Pr_b[\Delta^{(b)}_{a,c,t} \ge 0]\big) \end{aligned} \]
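For readers unfamiliar with the Holm adjustment, the following NumPy-only sketch shows the step-down procedure on a list of raw p-values. The function name is hypothetical and the input values are made up.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                    # smallest p-value first
    scaled = p[order] * (m - np.arange(m))   # multiply by m, m-1, ..., 1
    # Enforce monotonicity, then cap at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted        # undo the sort
    return adjusted

adjusted = holm_adjust([0.01, 0.04, 0.03])   # approximately [0.03, 0.06, 0.06]
```

The smallest p-value is multiplied by the number of comparisons, the next by one fewer, and so on, so a comparison that looked significant in isolation can lose significance once the whole family of prize-relevant comparisons is taken into account.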

Rank stability is computed by re-ranking all teams inside each bootstrap draw.

\[ r^{(b)}_{a,t} = \mathrm{rank}\left(\mathcal{S}^{(b)}_{a,t}\right) \]

\[ \begin{gathered} \Pr(r^{(b)}_{a,t}\le1),\\ \Pr(r^{(b)}_{a,t}\le3),\\ \Pr(r^{(b)}_{a,t}\le5) \end{gathered} \]
python · CI, p_boot, Holm, ranks
import numpy as np
from confidence_intervals import get_bootstrap_indices, get_conf_int
from statsmodels.stats.multitest import multipletests

def bootstrap_track(unit_scores, unit_ids=None, n_boot=10_000):
    """Resample units n_boot times; return per-team bootstrap scores and ranks."""
    teams = list(unit_scores)
    n_units = len(unit_scores[teams[0]])
    score_boot = {team: np.empty(n_boot) for team in teams}
    rank_boot = {team: np.empty(n_boot, dtype=int) for team in teams}
    for b in range(n_boot):
        # One shared index draw per iteration keeps team comparisons paired.
        idx = get_bootstrap_indices(n_units, conditions=unit_ids, random_state=b)
        scores = {team: float(np.mean(np.asarray(vals)[idx])) for team, vals in unit_scores.items()}
        # Rank 1 goes to the highest score in this draw.
        for rank, team in enumerate(sorted(teams, key=scores.get, reverse=True), start=1):
            score_boot[team][b] = scores[team]
            rank_boot[team][b] = rank
    return score_boot, rank_boot

def pair_summary(score_boot, team_a, team_c, alpha=5):
    """Summarise the paired difference between two teams across bootstrap draws."""
    delta = score_boot[team_a] - score_boot[team_c]
    ci_low, ci_high = get_conf_int(delta, alpha=alpha)
    # Two-sided bootstrap p-value, capped at 1 in the returned summary.
    p_boot = 2 * min(np.mean(delta <= 0), np.mean(delta >= 0))
    return {"delta": float(np.mean(delta)), "ci95": (float(ci_low), float(ci_high)),
            "p_boot": min(float(p_boot), 1.0), "indistinguishable": bool(ci_low <= 0 <= ci_high)}

def add_holm(rows, alpha=0.05):
    """Attach Holm-adjusted p-values and rejection flags to each comparison row."""
    reject, p_holm, _, _ = multipletests([r["p_boot"] for r in rows], method="holm", alpha=alpha)
    for row, adj_p, keep in zip(rows, p_holm, reject):
        row.update(p_holm=float(adj_p), significant_after_holm=bool(keep))
    return rows

def rank_stability(rank_boot):
    """Share of bootstrap draws in which each team lands in the top 1, 3, and 5."""
    return {team: {"top1": float(np.mean(r <= 1)), "top3": float(np.mean(r <= 3)),
                   "top5": float(np.mean(r <= 5))} for team, r in rank_boot.items()}

Foundation Transfer special case

Track 5 evaluates one shared biosignal encoder through organizer-fitted heads on all four tracks (three EEG, one EMG). In each bootstrap draw, the evaluator recomputes the per-track leaderboards, takes the encoder's rank on each one, and averages ranks. Lower mean rank is better; for confidence-interval code we orient it as a higher-is-better score by negating the mean rank.

\[ R^{(b)}_{a,5} = \frac{1}{4}\sum_{t \in \{\mathrm{IMG}, \mathrm{BCI}, \mathrm{Sleep}, \mathrm{EMG}\}} r^{(b)}_{a,t} \]

\[ \mathcal{S}^{(b)}_{a,5} = -R^{(b)}_{a,5} \]
python · Track 5 mean rank
import numpy as np

def foundation_transfer_score(track_rank_boot, tracks=("img", "bci", "sleep", "emg")):
    """Return S^{(b)}_{a,5} = -mean rank across all four tracks for each Track 5 team."""
    teams = list(track_rank_boot[tracks[0]])
    return {
        # Stack the per-draw rank arrays by track, then average over tracks (axis 0).
        team: -np.mean([track_rank_boot[t][team] for t in tracks], axis=0)
        for team in teams
    }
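A small worked example may help in reading the Track 5 score. The ranks below are invented for two hypothetical encoders over three bootstrap draws; the computation mirrors the negated mean rank defined above.

```python
import numpy as np

tracks = ("img", "bci", "sleep", "emg")
# Hypothetical per-draw ranks: one array entry per bootstrap draw.
track_rank_boot = {
    "img":   {"enc_a": np.array([1, 1, 2]), "enc_b": np.array([2, 2, 1])},
    "bci":   {"enc_a": np.array([1, 2, 1]), "enc_b": np.array([2, 1, 2])},
    "sleep": {"enc_a": np.array([2, 1, 1]), "enc_b": np.array([1, 2, 2])},
    "emg":   {"enc_a": np.array([1, 2, 1]), "enc_b": np.array([2, 1, 2])},
}

# Negated mean rank per draw, so that higher is better.
scores = {
    team: -np.mean([track_rank_boot[t][team] for t in tracks], axis=0)
    for team in ("enc_a", "enc_b")
}
# Share of draws in which enc_a strictly beats enc_b.
p_a_ahead = float(np.mean(scores["enc_a"] > scores["enc_b"]))
```

Here `enc_a` scores -1.25, -1.5, -1.25 and `enc_b` scores -1.75, -1.5, -1.75, so `enc_a` is strictly ahead in two of the three draws and tied in the other.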

Test-set sizing

The hidden-test size for each track is chosen so the expected half-width of the 95% interval is at most \(\nu_t\), the smallest practically meaningful difference for that track. \(\hat{\sigma}_t\) is the pilot standard deviation at the top-level bootstrap unit and \(n_{\mathrm{eff},t}\) is the number of independent held-out units. If a dataset cannot support this target, intervals widen and ties are reported rather than over-interpreting small margins.

\[ 1.96\,\hat{\sigma}_t / \sqrt{n_{\mathrm{eff},t}} \le \nu_t \]
python · test-set sizing
import math

def ci_half_width(sigma_hat, n_eff, z=1.96):
    """Normal-approximation half-width of the interval for n_eff independent units."""
    return z * sigma_hat / math.sqrt(n_eff)

def required_n_eff(sigma_hat, nu_t, z=1.96):
    """Smallest independent hidden-test count satisfying the target half-width."""
    return math.ceil((z * sigma_hat / nu_t) ** 2)

def meets_resolution_target(sigma_hat, n_eff, nu_t):
    """True if the half-width at n_eff is no larger than nu_t."""
    return ci_half_width(sigma_hat, n_eff) <= nu_t
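As a worked example of the sizing rule, suppose a pilot gives a unit-level standard deviation of 0.20 and the smallest meaningful difference is 0.02. Both numbers are invented for illustration and do not come from any track.

```python
import math

sigma_hat = 0.20   # hypothetical pilot standard deviation
nu_t = 0.02        # hypothetical smallest meaningful difference
z = 1.96

# Solve z * sigma / sqrt(n) <= nu for n and round up to a whole unit.
n_required = math.ceil((z * sigma_hat / nu_t) ** 2)
half_width = z * sigma_hat / math.sqrt(n_required)
half_width_one_fewer = z * sigma_hat / math.sqrt(n_required - 1)
```

This gives 385 independent units: the half-width is just under 0.02 at that size and just over it with one unit fewer. Because the required count grows with the square of `sigma_hat / nu_t`, halving the target difference quadruples the number of independent units needed.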

Overall ranking

Each valid submission gets rank points \(P_{\mathrm{team},t}\) on its track (linearly interpolated against the field, so the top of the field scores 1 and the bottom scores 0). The submitted-track average summarises a team's record across the tracks it entered; the all-track score averages over all four task-specific tracks, padding missing tracks with zero so transfer is rewarded over single-track wins. \(r_{\mathrm{team},t}\) is the team's rank, \(N_t\) is the number of valid submissions on the track, \(T\) is the set of the four task-specific tracks, and \(T_{\mathrm{team}} \subseteq T\) is the set of tracks the team submitted.

\[ P_{\mathrm{team},t} = \begin{cases} 1-\dfrac{r_{\mathrm{team},t}-1}{N_t-1}, & N_t>1, \\ 1, & N_t=1 \end{cases} \]

\[ \mathcal{S}_{\mathrm{submitted}}(\mathrm{team}) = \frac{1}{|T_{\mathrm{team}}|}\sum_{t\in T_{\mathrm{team}}} P_{\mathrm{team},t} \]

\[ \mathcal{S}_{\mathrm{all}}(\mathrm{team}) = \frac{1}{4}\sum_{t\in T} P^{\star}_{\mathrm{team},t} \]

\[ P^{\star}_{\mathrm{team},t} = \begin{cases} P_{\mathrm{team},t}, & t\in T_{\mathrm{team}}, \\ 0, & t\notin T_{\mathrm{team}} \end{cases} \]
python · rank-point aggregation
def rank_points(rank, n_submissions):
    """Linear rank points: 1.0 for first place, 0.0 for last; 1.0 if the field has one entry."""
    return 1.0 if n_submissions == 1 else 1.0 - (rank - 1) / (n_submissions - 1)

def submitted_track_score(points_by_track, submitted_tracks):
    """Average rank points over the tracks the team actually entered."""
    return sum(points_by_track[t] for t in submitted_tracks) / len(submitted_tracks)

def all_track_score(points_by_track, all_tracks=("img", "bci", "sleep", "emg")):
    """Missing tracks get zero points."""
    return sum(points_by_track.get(t, 0.0) for t in all_tracks) / len(all_tracks)
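To see how zero-padding changes the picture, consider a hypothetical team that places 2nd of 5 on EEG-to-IMG and 1st of 3 on EMG-to-Text and enters nothing else. The ranks and field sizes are invented; the arithmetic follows the formulas above.

```python
# track -> (rank, number of valid submissions); values are made up.
results = {"img": (2, 5), "emg": (1, 3)}
all_tracks = ("img", "bci", "sleep", "emg")

# Rank points per entered track: 1 for first place, 0 for last.
points = {
    t: 1.0 if n == 1 else 1.0 - (r - 1) / (n - 1)
    for t, (r, n) in results.items()
}
# Average over entered tracks only.
submitted = sum(points.values()) / len(points)
# Average over all four tracks, with zero for tracks not entered.
overall = sum(points.get(t, 0.0) for t in all_tracks) / len(all_tracks)
```

The team earns 0.75 and 1.0 rank points, so its submitted-track average is 0.875 while its all-track score is 0.4375. Entering the two remaining tracks can raise the all-track score even with mid-field results.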
Start-kit drops Jun 1, 2026 · Warm-up Jul 1

Take a baseline and beat it.

Every track ships at least one fully-trained baseline. The start-kit walks you from clone to submission.parquet in fifteen minutes. From there, it's a leaderboard fight.