Benchmarks

The benchmark suite compares GLOSS against four baselines on three chemistry datasets. Their target values come from experiment, DFT calculations, and a virtual reaction surface, respectively.

Setup

| Dataset | \(n\) (pool size) | Value source | Notes |
|---|---|---|---|
| Buchwald–Hartwig | 3,955 | Experimental yields | One-hot reaction features |
| QM9 HOMO–LUMO gap | 100,000 | DFT properties | 20 RDKit descriptors |
| Arrhenius-2D | 10,000 | Virtual reaction surface | Three peak basins; exposes the secondary-peak trap |

Algorithms: GLOSS (4:2:2 stream ratio, \(q=8\)), UCB-BO (an ablation that separates the effect of the multi-stream architecture from the choice of acquisition function), BO(EI) (expected improvement), GA (genetic algorithm), and Random.

Protocol: All algorithms share an identical random-forest (RF) surrogate with 100 trees. Each starts from 8 points drawn from the bottom 20% of its pool, so no run benefits from a lucky initialization, and runs for 20 rounds with 5 seeds.
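
The bottom-20% initialization can be sketched as follows. This is a minimal illustration and not the code in `benchmarks.bench_main`. The helper name `bottom_fraction_init` is hypothetical. The sketch assumes that larger target values are better and that the 8 starting points are drawn uniformly at random, without replacement, from the lowest-valued 20% of the pool.

```python
import numpy as np


def bottom_fraction_init(y, n_init=8, frac=0.2, seed=0):
    """Hypothetical helper: draw n_init distinct pool indices uniformly at random
    from the lowest-valued `frac` of the pool (assumes larger y is better)."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Size of the "bottom" slice; never smaller than the number of points requested
    n_bottom = max(n_init, int(round(frac * len(y))))
    # argsort is ascending, so the first n_bottom indices hold the lowest values
    bottom = np.argsort(y)[:n_bottom]
    # Sample without replacement so the initial design has n_init distinct points
    return rng.choice(bottom, size=n_init, replace=False)


# Usage on a synthetic 10,000-point pool: 5 seeds, each with its own start set
pool_y = np.random.default_rng(0).normal(size=10_000)
starts = {seed: bottom_fraction_init(pool_y, seed=seed) for seed in range(5)}
```

Every algorithm starts from the same low-quality region under this scheme. Any gap in \(t_{95}\) therefore reflects search behavior and not the quality of the starting set.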

Headline result — QM9-100k

| Algorithm | \(t_{95}\) (rounds) | Seeds reaching 95% |
|---|---|---|
| GLOSS (4:2:2) | 7.2 | 5 / 5 |
| UCB-BO | 16.6 | 3 / 5 |
| BO(EI) | 18.4 | 2 / 5 |
| GA | 20.0 | 0 / 5 |
| Random | 20.0 | 0 / 5 |

GLOSS reaches the 95% threshold in 2.31× fewer rounds than UCB-BO (16.6 / 7.2) and 2.56× fewer than BO(EI) (18.4 / 7.2).
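
This section does not spell out how \(t_{95}\) is computed. The sketch below rests on two assumptions. First, \(t_{95}\) is the first round at which the best-so-far value reaches 95% of the pool optimum. Second, seeds that never reach that value are counted at the 20-round budget, which would explain the 20.0 entries for GA and Random. The helpers `t95`, `summarize`, and `toy_run` are hypothetical, and the trajectories are synthetic.

```python
import numpy as np


def t95(best_so_far, y_opt, frac=0.95):
    """Hypothetical helper: first 1-based round whose best-so-far value reaches
    frac * y_opt, or None if the run never gets there (assumes y_opt > 0)."""
    hits = np.flatnonzero(np.asarray(best_so_far, dtype=float) >= frac * y_opt)
    return int(hits[0]) + 1 if hits.size else None


def summarize(trajectories, y_opt, budget=20):
    """Mean t95 over seeds, counting unreached seeds at `budget`, plus #seeds reaching."""
    times = [t95(traj, y_opt) for traj in trajectories]
    reached = sum(t is not None for t in times)
    mean_t = float(np.mean([budget if t is None else t for t in times]))
    return mean_t, reached


def toy_run(cross_round=None, budget=20):
    """Synthetic best-so-far curve that crosses 95% of y_opt=1.0 at cross_round."""
    return [1.0 if cross_round is not None and r >= cross_round else 0.9
            for r in range(1, budget + 1)]


# Five synthetic seeds: three reach the threshold, two never do
runs = [toy_run(5), toy_run(7), toy_run(9), toy_run(), toy_run()]
mean_t, reached = summarize(runs, y_opt=1.0)
print(mean_t, f"{reached} / {len(runs)}")  # 12.2 3 / 5

# Speedup = baseline t95 / GLOSS t95, using the QM9-100k table values
speedup_ucb = 16.6 / 7.2
speedup_ei = 18.4 / 7.2
print(round(speedup_ucb, 2), round(speedup_ei, 2))  # 2.31 2.56
```

Under this convention, each unreached seed adds the full budget to the mean. A \(t_{95}\) near 20 therefore says more about how many seeds failed than about how slowly the successful seeds converged.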

Scaling — GLOSS’s advantage grows with pool size

Sweeping QM9 pool size from \(n=5{,}000\) to \(n=100{,}000\):

  • GLOSS: \(t_{95}\) stays between 3.4 and 7.6 rounds across the entire sweep — no systematic trend with \(n\).

  • BO(EI): 6.8 → 18.4 rounds (degrades).

  • UCB-BO: 5.2 → 16.6 rounds (degrades).

  • Random: Never reaches 95% at any scale.

Note

A larger pool means the bottom-20% initialization sits, on average, farther from the global optimum, so the initial surrogate has high uncertainty over a larger fraction of the space. Without a dedicated exploration stream, BO concentrates each batch near whichever locally promising region its few initial points are closest to. GLOSS’s Unexplored stream keeps injecting geometrically diverse points whose acquisition is independent of the current surrogate.
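
Greedy maximin (farthest-point) sampling in feature space is one way to choose points without consulting the surrogate. The sketch below shows only that idea. It is an assumption, not necessarily how GLOSS's Unexplored stream selects its points, and `farthest_point_batch` is a hypothetical name.

```python
import numpy as np


def farthest_point_batch(X_pool, X_observed, k):
    """Greedy maximin selection (hypothetical helper): pick k pool indices, each
    the pool point farthest from everything observed or already picked.
    Uses feature-space distances only; no surrogate prediction is involved."""
    X_pool = np.asarray(X_pool, dtype=float)
    # Distance from each pool point to its nearest observed point
    d_min = np.full(len(X_pool), np.inf)
    for x in np.asarray(X_observed, dtype=float):
        d_min = np.minimum(d_min, np.linalg.norm(X_pool - x, axis=1))
    chosen = []
    for _ in range(k):
        i = int(np.argmax(d_min))
        chosen.append(i)
        # The new pick now counts as covered space for the next selection
        d_min = np.minimum(d_min, np.linalg.norm(X_pool - X_pool[i], axis=1))
    return chosen


# Usage: 8 observed points crowded into one corner of a 2-D pool
rng = np.random.default_rng(0)
X_pool = rng.uniform(size=(5_000, 2))
X_obs = rng.uniform(0.0, 0.2, size=(8, 2))
batch = farthest_point_batch(X_pool, X_obs, k=2)
```

The picks depend only on where points already sit, so a surrogate that has locked onto one basin cannot steer them. The trade-off is that some picks land in regions that are genuinely poor.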

Trajectory analysis (Arrhenius-2D, seed 61)

The Arrhenius surface has a primary peak at \((0.44, 0.45)\) with \(y_{\mathrm{opt}}=18.35\) and a wider secondary peak at \((0.83, 0.66)\) reaching \(\approx 74\%\) of the global value.
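
A hypothetical toy surface can show why the wider secondary peak acts as a trap. It keeps the peak locations and heights given above: a narrow primary bump at \((0.44, 0.45)\) with height 18.35, and a broader secondary bump at \((0.83, 0.66)\) at 74% of that height. The widths (0.06 and 0.15) and the max-of-Gaussians form are invented for illustration. This is not the Arrhenius-2D generator, and it models only two of that surface's three basins.

```python
import numpy as np

# Hypothetical toy surface: NOT the Arrhenius-2D generator used in the benchmark.
Y_OPT = 18.35                          # global optimum value quoted in the text
PRIMARY = np.array([0.44, 0.45])       # primary (global) peak location
SECONDARY = np.array([0.83, 0.66])     # secondary peak location
SECONDARY_RATIO = 0.74                 # secondary height as a fraction of Y_OPT


def bumps(x, width_primary=0.06, width_secondary=0.15):
    """Return the two Gaussian bumps at points x (shape [m, 2]); widths are invented."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    d1 = np.sum((x - PRIMARY) ** 2, axis=1)
    d2 = np.sum((x - SECONDARY) ** 2, axis=1)
    g1 = Y_OPT * np.exp(-d1 / (2 * width_primary**2))
    g2 = SECONDARY_RATIO * Y_OPT * np.exp(-d2 / (2 * width_secondary**2))
    return g1, g2


def toy_surface(x):
    """Max of the two bumps, so each peak keeps its exact height."""
    g1, g2 = bumps(x)
    return np.maximum(g1, g2)


# Evaluate on a 201 x 201 grid over the unit square
xs = np.linspace(0.0, 1.0, 201)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
g1, g2 = bumps(grid)
f = np.maximum(g1, g2)

# Among grid points scoring at least 50% of Y_OPT, how many sit on the secondary bump?
high = f >= 0.5 * Y_OPT
share_secondary = float(np.mean(g2[high] > g1[high]))
print(f"high-value area on the secondary peak: {share_secondary:.0%}")
```

On this toy surface, most of the area scoring above half the optimum lies on the secondary bump. A sparse sample is therefore more likely to find a good point there first. That is one plausible reason a surrogate fit to a few points favors the secondary region early.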

| Algorithm | Peak % of \(y_{\mathrm{opt}}\) @ round | Behavior |
|---|---|---|
| GLOSS | 100% @ 11 | Probes the secondary-peak region first, then locks onto the primary peak. |
| UCB-BO | 63% @ 17 | Trapped at the secondary peak from round 1 onward. |
| BO(EI) | 75% @ 17 | Same secondary-peak trap, with marginally wider exploration. |
| GA | 100% @ 19 | Tournament selection wanders broadly; moves between basins around round 15. |

Reproducing

git clone https://github.com/zbc0315/gloss.git
cd gloss
pip install -e ".[all]"
python -m benchmarks.bench_main --study all

Individual studies:

python -m benchmarks.bench_main --study main         # 3 datasets × 5 algorithms
python -m benchmarks.bench_main --study scaling      # QM9 n=5k → 100k
python -m benchmarks.bench_main --study complexity   # Arrhenius C1 → C5
python -m benchmarks.bench_main --study pilot        # quick smoke (1 seed, 10 rounds)