To find default noise settings, we ran 50 random seeds for all combinations of nine annealing schedules and seven noise values. Average performance over the runs was typically 5% better than without noise, while the best NLL score over the 50 runs was 12% better than without noise. Given that the typical mode of running SAM is to generate many models and pick the best, this 12% gain is a substantial improvement. Our chosen default is 5 sequences' worth of noise using an exponential annealing schedule with factor 0.8. This choice is somewhat arbitrary: it is based on the range of scores obtained, and no clear winner emerged among the settings. The tested settings required between 20% and 350% more reestimation cycles than the noiseless case. If less time is available, we suggest a linear schedule with 1 noise sequence. In general, one should create as many models as possible and then refine the best one further. This procedure is automated in SAM.
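As a rough illustration of the two schedule families named above, the following Python sketch generates the noise level used at each reestimation cycle. It is not SAM's implementation. The function names, the `cutoff` stopping threshold, and the `steps` parameter are assumptions made for this example, as is the interpretation that the exponential schedule multiplies the noise level by the factor once per cycle and that the linear schedule falls to zero over a fixed number of cycles.

```python
def exponential_schedule(initial_noise, factor, cutoff=1e-3):
    """Yield noise levels initial_noise * factor**k, one per cycle.

    Assumption: annealing stops once the level drops below `cutoff`
    (a hypothetical threshold, not a documented SAM parameter).
    """
    level = initial_noise
    while level >= cutoff:
        yield level
        level *= factor


def linear_schedule(initial_noise, steps):
    """Yield noise levels that fall linearly toward zero.

    Assumption: the level decreases in `steps` equal decrements,
    so the cycle after the last yielded value would have zero noise.
    """
    for k in range(steps):
        yield initial_noise * (1 - k / steps)


# Default-like setting from the text: 5 sequences' worth of noise, factor 0.8.
default_levels = list(exponential_schedule(5.0, 0.8, cutoff=1.0))

# Faster alternative from the text: linear schedule starting at 1 noise sequence.
# The number of steps (4) is chosen only for illustration.
fast_levels = list(linear_schedule(1.0, 4))
```

Under these assumptions, the exponential schedule spends more cycles at low noise than a linear schedule of similar length, which is one plausible reason the noisier settings add reestimation cycles over the noiseless case.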
The histograms in Figure 5 show average test set NLL scores for 1000 training runs on 50 training globins under three conditions: default noise only, random model lengths without noise, and all heuristics combined (noise, random model lengths, and surgery). The vertical bar at 334 indicates the NLL score for training without noise. Note in particular how the combination of noise and surgery both improves the test set scores and sharpens their distribution, indicating that far fewer than 1000 runs are needed to generate good models.