Live during the final phase.
Codabench evaluates each upload immediately. The leaderboard page on this site is regenerated every 15 minutes, so there can be a short lag between submission and what you see here.
One ranking per track, refreshed every 15 minutes during the sealed phase. The top three entries per track go through a reproducibility audit at test-freeze; only audited entries appear in the final NeurIPS rankings.
Rank held-out candidate images from a single EEG epoch. Targets are frozen DINOv2-giant embeddings, so the score isolates the EEG side. Controlled shift: test stimuli are unseen during training. Higher is better.
Predict the cued command (motor imagery, mental math, word association) on later sessions of the same subject, with no per-session recalibration. The score reflects calibration-free stability across session drift. Higher is better.
Predict seconds from recording start to the first stable N2 epoch, on consumer-grade wearable EEG. The shift is to a sparse home-wearable montage, which is too narrow to support full per-epoch staging but still resolves onset timing. Lower is better.
Transduce typed keystrokes from wristband surface EMG. The controlled shift is cross-user: held-out subjects vary in forearm anatomy, typing strategy, and sensor re-placement, so the score rewards user-invariant features rather than per-user templates. Lower is better.
Rank shared encoders by their mean rank across EEG-to-IMG, BCI, Sleep, and EMG. The visible score is presented on a 0–100 scale where 100 indicates first place on every track; internally it is the negated mean rank, so that confidence intervals stay higher-is-better. Higher is better.
The scoring code is open-source and identical between local NeuralBench task runs and the Codabench server. The only thing the server adds is the sealed test split.
The public board shows your best-ever score. The final NeurIPS ranking, however, considers only your last five submissions. This rewards focused iteration over exhaustive lottery search.
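The difference between the two views can be sketched as follows. This is a minimal illustration, not the Codabench implementation: the `Submission` record and both function names are hypothetical, and scores are assumed to be already oriented so that higher is better. How the final ranking picks among the last five is not stated here, so the sketch only returns that pool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Submission:
    # Hypothetical record; the real Codabench schema is not shown on this page.
    submitted_at: datetime
    score: float  # assumed oriented so that higher is better


def public_board_score(submissions):
    """Best-ever oriented score, as displayed on the public board."""
    return max(s.score for s in submissions)


def final_ranking_pool(submissions, k=5):
    """The k most recent submissions, newest first."""
    return sorted(submissions, key=lambda s: s.submitted_at, reverse=True)[:k]
```

Under this reading, an early lucky score can stay visible on the public board while no longer being eligible for the final ranking.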
We re-run the committed training pipeline against the sealed split. If the re-run score falls within ±2σ of the submitted score, the entry stays on the board; otherwise it is dropped. The audit is led by Arnaud Delorme (EEGLAB).
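As a rough sketch of the tolerance check, assuming σ is a standard deviation of the score supplied by the organizers (this page does not say how it is estimated), the decision reduces to a single comparison. The function name `passes_audit` and its parameters are hypothetical.

```python
def passes_audit(submitted_score, rerun_score, sigma, k=2.0):
    """True if the re-run lands within +/- k * sigma of the submitted score.

    sigma is an assumed, organizer-supplied standard deviation of the score;
    k=2.0 mirrors the two-sigma band described in the text.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return abs(rerun_score - submitted_score) <= k * sigma
```

The check is symmetric: a re-run that scores much higher than the submitted number fails just as one that scores much lower does.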
These are the equations the evaluator actually runs. They match the proposal sections on error bars, test-set sizing, and overall ranking, and are reproduced here for participants who want to reason about score variance and ranking before submitting.
A submission \(a\) receives a hidden signal \(X_{t,i}\) and metadata \(m_{t,i}\) for track \(t\), then writes a prediction \(\hat{y}_{a,t,i}\). The evaluator keeps \(y_{t,i}\) hidden and computes the score.
\[ \hat{y}_{a,t,i} = f_a(X_{t,i}, m_{t,i}) \]
Examples are first collapsed into independent bootstrap units \(u \in U_t\): subject-image query blocks for EEG-to-IMG, subject-session-context cells for BCI, recordings for sleep onset, and user-session blocks for EMG-to-text. Each unit gets an oriented contribution \(s_{a,t,u}\), where higher is always better; for MAE and CER we use the negative error internally.
\[ s_{a,t,u} = \mathrm{score}_t(\hat{y}_{a,t,u}, y_{t,u}) \]
\[ \mathcal{S}_{a,t} = \frac{1}{|U_t|}\sum_{u \in U_t} s_{a,t,u} \]
The visible leaderboard for track \(t\) is the point-estimate ordering of \(\mathcal{S}_{a,t}\).
```python
import numpy as np

def build_unit_scores(y_pred, y_true, unit_ids, score_unit, lower_is_better=False):
    """Return s_{a,t,u} after collapsing examples into units."""
    y_pred, y_true, unit_ids = map(np.asarray, (y_pred, y_true, unit_ids))
    scores = []
    for unit in np.unique(unit_ids):
        idx = unit_ids == unit  # boolean mask selecting this unit's examples
        value = score_unit(y_pred[idx], y_true[idx])
        # Negate error metrics (e.g. MAE, CER) so every score is higher-is-better.
        scores.append(-value if lower_is_better else value)
    return np.asarray(scores, dtype=float)

def track_score(unit_scores):
    """Compute S_{a,t}; all returned scores are higher-is-better."""
    return float(np.mean(unit_scores))
```
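The `score_unit` callable is track-specific and is not spelled out on this page. As a sketch only, the two error metrics named earlier could look like the functions below, assuming MAE is a plain mean absolute error over a unit and CER is total Levenshtein edits divided by total reference characters. The names `mae_unit`, `edit_distance`, and `cer_unit` are hypothetical.

```python
import numpy as np


def mae_unit(y_pred, y_true):
    """Mean absolute error within one unit (lower is better)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


def edit_distance(hyp, ref):
    """Levenshtein distance between two strings, one DP row at a time."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # drop a character from hyp
                curr[j - 1] + 1,         # insert a character from ref
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer_unit(y_pred, y_true):
    """Character error rate within one unit (lower is better)."""
    edits = sum(edit_distance(h, r) for h, r in zip(y_pred, y_true))
    chars = sum(len(r) for r in y_true)
    if chars == 0:
        raise ValueError("reference text is empty")
    return edits / chars
```

Either function would be passed as `score_unit` together with `lower_is_better=True`, so that the unit contribution is negated before averaging.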
For bootstrap draw \(b\), the evaluator resamples independent units \(U_t^{(b)}\) and recomputes each team's score. Pairwise uncertainty is calculated on the paired score difference, not from two separate confidence intervals.
\[ \mathcal{S}^{(b)}_{a,t} = \frac{1}{|U_t^{(b)}|}\sum_{u \in U_t^{(b)}} s_{a,t,u} \]
\[ \Delta^{(b)}_{a,c,t} = \mathcal{S}^{(b)}_{a,t} - \mathcal{S}^{(b)}_{c,t} \]
\[ \begin{aligned} \mathrm{CI}_{95}(\Delta_{a,c,t}) = \big[&q_{0.025}(\Delta^{(b)}_{a,c,t}),\\ &q_{0.975}(\Delta^{(b)}_{a,c,t})\big] \end{aligned} \]
If this interval contains zero, neighbouring teams are flagged as statistically indistinguishable. For prize-relevant comparisons, the two-sided bootstrap p-value is Holm-adjusted.
\[ \begin{aligned} p_{\mathrm{boot}} = 2\min\big(&\Pr_b[\Delta^{(b)}_{a,c,t} \le 0],\\ &\Pr_b[\Delta^{(b)}_{a,c,t} \ge 0]\big) \end{aligned} \]
Rank stability is computed by re-ranking all teams inside each bootstrap draw.
\[ r^{(b)}_{a,t} = \mathrm{rank}\left(\mathcal{S}^{(b)}_{a,t}\right) \]
\[ \begin{gathered} \Pr(r^{(b)}_{a,t}\le1),\\ \Pr(r^{(b)}_{a,t}\le3),\\ \Pr(r^{(b)}_{a,t}\le5) \end{gathered} \]
```python
import numpy as np
from confidence_intervals import get_bootstrap_indices, get_conf_int
from statsmodels.stats.multitest import multipletests

def bootstrap_track(unit_scores, unit_ids=None, n_boot=10_000):
    """Resample units n_boot times; return per-team bootstrap scores and ranks."""
    teams = list(unit_scores)
    n_units = len(unit_scores[teams[0]])
    score_boot = {team: np.empty(n_boot) for team in teams}
    rank_boot = {team: np.empty(n_boot, dtype=int) for team in teams}
    for b in range(n_boot):
        # The same resampled indices are used for every team, so differences are paired.
        idx = get_bootstrap_indices(n_units, conditions=unit_ids, random_state=b)
        scores = {team: float(np.mean(np.asarray(vals)[idx])) for team, vals in unit_scores.items()}
        # Rank 1 is the highest score; exact ties fall back to the input team order.
        for rank, team in enumerate(sorted(teams, key=scores.get, reverse=True), start=1):
            score_boot[team][b] = scores[team]
            rank_boot[team][b] = rank
    return score_boot, rank_boot

def pair_summary(score_boot, team_a, team_c, alpha=5):
    """Paired difference summary; alpha is in percent (5 gives a 95% interval)."""
    delta = score_boot[team_a] - score_boot[team_c]
    ci_low, ci_high = get_conf_int(delta, alpha=alpha)
    p_boot = 2 * min(np.mean(delta <= 0), np.mean(delta >= 0))
    return {"delta": float(np.mean(delta)), "ci95": (float(ci_low), float(ci_high)),
            "p_boot": min(float(p_boot), 1.0), "indistinguishable": bool(ci_low <= 0 <= ci_high)}

def add_holm(rows, alpha=0.05):
    """Attach Holm-adjusted p-values to a list of pair_summary rows."""
    reject, p_holm, _, _ = multipletests([r["p_boot"] for r in rows], method="holm", alpha=alpha)
    for row, adj_p, rejected in zip(rows, p_holm, reject):
        row.update(p_holm=float(adj_p), significant_after_holm=bool(rejected))
    return rows

def rank_stability(rank_boot):
    """Fraction of bootstrap draws in which each team lands in the top 1, 3, and 5."""
    return {team: {"top1": float(np.mean(r <= 1)), "top3": float(np.mean(r <= 3)),
                   "top5": float(np.mean(r <= 5))} for team, r in rank_boot.items()}
```
Track 5 evaluates one shared biosignal encoder through organizer-fitted heads on all four tracks (three EEG, one EMG). In each bootstrap draw, the evaluator recomputes the per-track leaderboards, takes the encoder's rank on each one, and averages ranks. Lower mean rank is better; for confidence-interval code we orient it as a higher-is-better score by negating the mean rank.
\[ R^{(b)}_{a,5} = \frac{1}{4}\sum_{t \in \{\mathrm{IMG}, \mathrm{BCI}, \mathrm{Sleep}, \mathrm{EMG}\}} r^{(b)}_{a,t} \]
\[ \mathcal{S}^{(b)}_{a,5} = -R^{(b)}_{a,5} \]
```python
import numpy as np

def foundation_transfer_score(track_rank_boot, tracks=("img", "bci", "sleep", "emg")):
    """Return S^{(b)}_{a,5} = -mean rank across all four tracks for each Track 5 team."""
    teams = list(track_rank_boot[tracks[0]])
    return {
        team: -np.mean([track_rank_boot[t][team] for t in tracks], axis=0)
        for team in teams
    }
```
The hidden-test size for each track is chosen so the expected half-width of the 95% interval falls below \(\nu_t\), the smallest practically meaningful difference for that track. \(\hat{\sigma}_t\) is the pilot standard deviation at the top-level bootstrap unit and \(n_{\mathrm{eff},t}\) is the number of independent held-out units. If a dataset cannot support this target, intervals widen and ties are reported rather than over-interpreting small margins.
\[ 1.96\,\hat{\sigma}_t / \sqrt{n_{\mathrm{eff},t}} \le \nu_t \]
```python
import math

def ci_half_width(sigma_hat, n_eff, z=1.96):
    return z * sigma_hat / math.sqrt(n_eff)

def required_n_eff(sigma_hat, nu_t, z=1.96):
    """Smallest independent hidden-test count satisfying the target half-width."""
    return math.ceil((z * sigma_hat / nu_t) ** 2)

def meets_resolution_target(sigma_hat, n_eff, nu_t):
    return ci_half_width(sigma_hat, n_eff) <= nu_t
```
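To get a feel for the sizing rule, here is a worked example with purely hypothetical pilot values, \(\hat{\sigma}_t = 0.10\) and \(\nu_t = 0.02\); neither number comes from an actual track. Solving the inequality for \(n_{\mathrm{eff},t}\) and rounding up gives:
\[ n_{\mathrm{eff},t} \ge \left(\frac{1.96 \times 0.10}{0.02}\right)^2 = 9.8^2 = 96.04 \;\Rightarrow\; n_{\mathrm{eff},t} = 97 \]
Because the requirement scales with \(1/\nu_t^2\), halving the smallest meaningful difference would roughly quadruple the number of independent held-out units needed.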
Each valid submission gets rank points \(P_{\mathrm{team},t}\) on its track (linearly interpolated against the field, so the top of the field scores 1 and the bottom scores 0). The submitted-track average summarises a team's record across the tracks it entered; the all-track score averages over all four task-specific tracks, padding missing tracks with zero so transfer is rewarded over single-track wins. \(r_{\mathrm{team},t}\) is the team's rank, \(N_t\) is the number of valid submissions on the track, \(T_{\mathrm{team}}\) is the set of tracks the team submitted, and \(T\) is the set of all four task-specific tracks.
\[ P_{\mathrm{team},t} = \begin{cases} 1-\dfrac{r_{\mathrm{team},t}-1}{N_t-1}, & N_t>1, \\ 1, & N_t=1 \end{cases} \]
\[ \mathcal{S}_{\mathrm{submitted}}(\mathrm{team}) = \frac{1}{|T_{\mathrm{team}}|}\sum_{t\in T_{\mathrm{team}}} P_{\mathrm{team},t} \]
\[ \mathcal{S}_{\mathrm{all}}(\mathrm{team}) = \frac{1}{4}\sum_{t\in T} P^{\star}_{\mathrm{team},t} \]
\[ P^{\star}_{\mathrm{team},t} = \begin{cases} P_{\mathrm{team},t}, & t\in T_{\mathrm{team}}, \\ 0, & t\notin T_{\mathrm{team}} \end{cases} \]
```python
def rank_points(rank, n_submissions):
    return 1.0 if n_submissions == 1 else 1.0 - (rank - 1) / (n_submissions - 1)

def submitted_track_score(points_by_track, submitted_tracks):
    return sum(points_by_track[t] for t in submitted_tracks) / len(submitted_tracks)

def all_track_score(points_by_track, all_tracks=("img", "bci", "sleep", "emg")):
    """Missing tracks get zero points."""
    return sum(points_by_track.get(t, 0.0) for t in all_tracks) / len(all_tracks)
```
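A worked example may help show how the two averages diverge. The numbers are hypothetical: a team that ranks 2nd of 5 valid submissions on EEG-to-IMG, ranks 1st of 3 on EMG, and skips BCI and Sleep.
\[ P_{\mathrm{team},\mathrm{IMG}} = 1 - \frac{2-1}{5-1} = 0.75, \qquad P_{\mathrm{team},\mathrm{EMG}} = 1 - \frac{1-1}{3-1} = 1 \]
\[ \mathcal{S}_{\mathrm{submitted}} = \frac{0.75 + 1}{2} = 0.875, \qquad \mathcal{S}_{\mathrm{all}} = \frac{0.75 + 0 + 0 + 1}{4} = 0.4375 \]
The zero padding halves this team's all-track score relative to its submitted-track average, which is how entering more tracks is rewarded.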
Every track ships at least one fully-trained baseline. The start-kit walks you from clone to submission.parquet in fifteen minutes. From there, it's a leaderboard fight.