Live during the final phase.
Codabench evaluates each upload immediately. The leaderboard page on this site is regenerated every 15 minutes — there can be a short lag between submission and what you see here.
One ranking per track. Refreshed every 15 minutes during the sealed phase. Top-3 per track go through a reproducibility audit at test-freeze — only audited entries appear in the final NeurIPS rankings.
Rank held-out candidate images from EEG epochs. Targets are frozen DINOv2-giant embeddings. Higher is better.
Generalize motor imagery, mental math, and word association labels to later sessions without recalibration. Higher is better.
Estimate seconds from recording start to stable sleep onset on consumer-grade EEG. Lower is better.
Decode typed keystrokes from surface EMG across users, anatomy, and sensor re-placement. Lower is better.
Rank shared encoders by their per-track score averaged across EEG-to-IMG, BCI, and Sleep. Normalized so 100 = perfect on every track. Higher is better.
The scoring code is open-source and identical between local neuralbench score and the Codabench server. The only thing the server adds is the sealed test split.
The public board shows your best-ever number. The final NeurIPS ranking, however, only considers your last five submissions. This rewards focused iteration over exhaustive lottery search.
We re-run the committed training pipeline against the sealed split. If the re-run lands within ±2 σ of the submitted score, you stay on the board; outside that band, you drop. The audit is led by Arnaud Delorme (EEGLAB).
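Read as a check, the audit criterion is just a two-sided band. A minimal sketch in Python (the function and variable names are illustrative, and the σ estimator is whatever the open-source scoring code defines):

def passes_audit(submitted: float, reproduced: float, sigma: float) -> bool:
    # Keep the entry iff the audited re-run lands within +/- 2 sigma
    # of the originally submitted score.
    return abs(reproduced - submitted) <= 2.0 * sigma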
You can submit under an anonymized handle during the public phase. Affiliations only appear on the board after you opt in — usually right before the final ranking publishes on Nov 1. Useful for double-blind paper submissions.
These are the equations the evaluator actually runs. They appear in §Error bars, §Test-set sizing, and §Overall ranking of the NeurIPS proposal and are reproduced here for participants who want to reason about score variance and ranking before submitting.
Higher-is-better tracks (EEG-to-IMG, BCI) maximise the primary metric; lower-is-better tracks (sleep onset, EMG-to-text) minimise the error. For pairwise significance within each track, each of \(B = 10{,}000\) paired bootstrap draws yields a difference between two teams' scores. Let \(\mathcal{S}^{(b)}_{\mathrm{team}_1}\) and \(\mathcal{S}^{(b)}_{\mathrm{team}_2}\) denote the two teams' scores at draw \(b\), with \(\Pr_b[\cdot]\) the empirical probability over those draws.
\[
\Delta^{(b)}_{\mathrm{team}_1, \mathrm{team}_2} = \mathcal{S}^{(b)}_{\mathrm{team}_1} - \mathcal{S}^{(b)}_{\mathrm{team}_2}
\]
The two-sided bootstrap p-value \(p_{\mathrm{boot}} = 2\min(\Pr_b[\Delta^{(b)} \le 0],\; \Pr_b[\Delta^{(b)} \ge 0])\) is Holm-adjusted across the family of prize-relevant comparisons to control the family-wise error rate in the post-competition analysis. The public leaderboard ranks by point estimate and flags neighbours with unadjusted \(p_{\mathrm{boot}} > 0.05\) (equivalently, a paired CI of \(\Delta\) that contains zero) as statistically indistinguishable. Top-1, top-3, and top-5 rank stability (the bootstrap probability that a team holds those positions) accompanies each rank.
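To reason about the flagging rule locally, here is a minimal paired-bootstrap sketch in Python. It assumes each team's track score is the mean of its per-unit scores on the same held-out units (an illustrative simplification; the open-source scoring code is authoritative):

import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_draws=10_000, seed=0):
    # Per-unit scores for two teams on the *same* held-out units, so each
    # draw resamples units once and applies that draw to both teams.
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(a), size=(n_draws, len(a)))   # B resamples of the units
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)      # Delta^(b) per draw
    p_le = np.mean(deltas <= 0.0)                           # Pr_b[Delta <= 0]
    p_ge = np.mean(deltas >= 0.0)                           # Pr_b[Delta >= 0]
    return min(1.0, 2.0 * min(p_le, p_ge))                  # two-sided p-value

Neighbours for whom this value exceeds 0.05 are exactly the ones the public board marks as statistically indistinguishable.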
The hidden-test size for each track is chosen so the expected half-width of the 95% interval falls below \(\nu_t\), the smallest practically meaningful difference for that track. \(\hat{\sigma}_t\) is the pilot standard deviation at the top-level bootstrap unit and \(n_{\mathrm{eff},t}\) is the number of independent held-out units. If a dataset cannot support this target, intervals widen and ties are reported rather than over-interpreting small margins.
\[
1.96\,\hat{\sigma}_t / \sqrt{n_{\mathrm{eff},t}} \le \nu_t
\]
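Rearranging gives the minimum test size directly, \(n_{\mathrm{eff},t} \ge (1.96\,\hat{\sigma}_t/\nu_t)^2\). A minimal sketch with hypothetical pilot numbers (the real \(\hat{\sigma}_t\) and \(\nu_t\) come from each track's pilot data, not from here):

import math

def min_effective_units(sigma_pilot: float, nu: float) -> int:
    # Smallest n_eff satisfying 1.96 * sigma / sqrt(n_eff) <= nu.
    return math.ceil((1.96 * sigma_pilot / nu) ** 2)

# Hypothetical pilot values: sigma_hat = 0.12, smallest meaningful gap nu = 0.01.
print(min_effective_units(0.12, 0.01))   # -> 554 independent held-out units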
Each valid submission gets rank points \(P_{\mathrm{team},t}\) on its track (linearly interpolated against the field, so the top of the field scores 1 and the bottom scores 0). The submitted-track average summarises a team's record across the tracks it entered; the all-track score averages over all four task-specific tracks, padding missing tracks with zero, so transfer is rewarded over single-track wins. \(r_{\mathrm{team},t}\) is the team's rank, \(N_t\) is the number of valid submissions on the track, and \(T_{\mathrm{team}}\) is the set of tracks the team submitted.
\[
P_{\mathrm{team},t} = \begin{cases} 1-\dfrac{r_{\mathrm{team},t}-1}{N_t-1}, & N_t>1, \\[4pt] 1, & N_t=1, \end{cases} \qquad \mathcal{S}_{\mathrm{submitted}}(\mathrm{team}) = \frac{1}{|T_{\mathrm{team}}|}\sum_{t\in T_{\mathrm{team}}} P_{\mathrm{team},t}
\]
\[
\mathcal{S}_{\mathrm{all}}(\mathrm{team}) = \frac{1}{4}\sum_{t\in T} P^{\star}_{\mathrm{team},t}, \qquad P^{\star}_{\mathrm{team},t} = \begin{cases} P_{\mathrm{team},t}, & t\in T_{\mathrm{team}}, \\ 0, & t\notin T_{\mathrm{team}}. \end{cases}
\]
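A minimal sketch of the two aggregates, assuming per-track ranks are already computed (the track keys and example numbers are illustrative; the open-source scoring code is authoritative):

def rank_points(rank: int, n_valid: int) -> float:
    # Linear interpolation against the field: rank 1 -> 1.0, rank N -> 0.0.
    # A lone valid submission on a track scores 1 by convention.
    if n_valid == 1:
        return 1.0
    return 1.0 - (rank - 1) / (n_valid - 1)

def overall_scores(ranks: dict[str, int], field_sizes: dict[str, int],
                   all_tracks=("eeg2img", "bci", "sleep", "emg2text")):
    # ranks: track -> this team's rank, only for the tracks the team entered.
    points = {t: rank_points(r, field_sizes[t]) for t, r in ranks.items()}
    submitted = sum(points.values()) / len(points)                   # S_submitted
    all_track = sum(points.get(t, 0.0) for t in all_tracks) / 4.0    # S_all pads missing tracks with 0
    return submitted, all_track

# A team that entered two of four tracks: 2nd of 11 on BCI, 5th of 21 on Sleep.
print(overall_scores({"bci": 2, "sleep": 5}, {"bci": 11, "sleep": 21}))
# -> (0.85, 0.425): submitted-track average 0.85, all-track average 0.425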
Every track ships at least one fully-trained baseline. The start-kit walks you from clone to submission.parquet in fifteen minutes. From there, it's a leaderboard fight.

$ neuralbench score bci --out submission.parquet
$ neuralbench upload submission.parquet --track bci
↳ uploaded · queued #1248