
Simulated Annealing (SA)

Video tutorial: Challenging Luck, 2021

Practicing Python code: challengingLuck: Using Annealing Algorithm to Solve the Sudo Challenge

Simulated Annealing (SA) is a probabilistic technique used for finding an approximate solution to an optimization problem. It is particularly useful for problems where the search space is large and complex, and other methods might get stuck in local optima. Here’s a structured way to start learning about the Simulated Annealing algorithm:

$$
P(E, E', T) = \begin{cases}
1 & \text{if } E' < E \\
\exp\left(\frac{-(E' - E)}{T}\right) & \text{if } E' \ge E
\end{cases}
$$
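
As a quick illustration, the acceptance rule above can be written as a small Python function (a minimal sketch, not tied to any particular problem):

```python
import math

def acceptance_probability(E, E_new, T):
    """P(E, E', T): always accept an improvement; otherwise accept with exp(-(E' - E)/T)."""
    if E_new < E:
        return 1.0
    return math.exp(-(E_new - E) / T)
```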

This idea is similar to DNA annealing during PCR. As the temperature drops after the high-temperature denaturation step, the DNA gradually re-forms a double strand with its reverse-complement strand or with the primer as entropy decreases. Unlike DNA annealing, Simulated Annealing (SA) introduces random values to replace the original ones. After that, the “Energy” is calculated: the new value is accepted when the new score is lower than the previous one, and a worse value is accepted only with a temperature-dependent probability. The iteration continues until the lowest value is found. It works like this:

Calculate the initial E → randomly mutate the value and calculate the new E' → if E' ≤ E, accept the mutated element; otherwise, accept it only with probability exp(-(E' - E)/T) or try another value → if E' reaches the lowest possible E, stop; otherwise, continue until no smaller E' can be found.

As a result, you can expect it to spend a lot of resources on exploration and, especially if the temperature drops too quickly, still fall into local optima.
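
Below is a minimal sketch of this loop in Python, assuming a user-supplied `energy()` scoring function (lower is better) and a `random_neighbor()` move generator; both are hypothetical placeholders for a concrete problem such as Sudoku:

```python
import math
import random

def simulated_annealing(initial_state, energy, random_neighbor,
                        T_start=1.0, T_end=1e-3, cooling=0.99):
    """Generic SA loop: `energy` scores a state (lower is better),
    `random_neighbor` returns a randomly mutated copy of a state."""
    state, E = initial_state, energy(initial_state)
    best_state, best_E = state, E
    T = T_start
    while T > T_end:
        candidate = random_neighbor(state)
        E_new = energy(candidate)
        # Accept improvements always; accept worse candidates with probability exp(-(E' - E)/T)
        if E_new < E or random.random() < math.exp(-(E_new - E) / T):
            state, E = candidate, E_new
            if E < best_E:
                best_state, best_E = state, E
        T *= cooling  # cooling schedule: geometric temperature decay
    return best_state, best_E

# Toy usage: minimize f(x) = x^2 over the integers
# simulated_annealing(10, lambda x: x * x, lambda x: x + random.choice([-1, 1]))
```

The geometric cooling schedule (`T *= cooling`) is just one common choice; slower cooling explores more of the search space but costs more iterations.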

An example of applying SA to the Sudoku challenge

Simulated annealing in Sudoku (animation)
In this example, it actually failed to reach the solution because it fell into a local optimum. It is a very good example of both the capability and the limitations of SA.

SA and Stochastic Gradient Descent (SGD)

$$
w_{t+1} = w_t - \eta \nabla f(w_t; x_i)
$$

  • $w$ is the parameter (a weight matrix, for example); $w_t$ is the current parameter and $w_{t+1}$ is the updated one.
  • $\eta$ is the learning rate
  • $x_i$ is the input: a randomly selected training sample (or mini-batch)

Stochastic Gradient Descent (SGD) is an optimization algorithm used primarily for training machine learning models. It iteratively updates the model parameters by computing the gradient of the loss function using a randomly selected subset (mini-batch) of the training data, rather than the entire dataset. This randomness introduces variability in the updates, which helps escape local optima and speeds up convergence. The learning rate controls the step size of each update, determining how far the parameters move in the direction of the negative gradient. By continuously adjusting the parameters, SGD aims to minimize the loss function and improve the model’s performance.
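
As a sketch of how this looks in code, here is a plain SGD loop on a toy linear-regression problem (the synthetic data, squared-error loss, and learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 samples, 3 features (synthetic data)
y = X @ np.array([1.5, -2.0, 0.5])          # targets generated by a known linear model

w = np.zeros(3)                             # parameters w_t
eta = 0.01                                  # learning rate

for step in range(1000):
    i = rng.integers(len(X))                # pick one random sample (the "stochastic" part)
    x_i, y_i = X[i], y[i]
    grad = 2 * (x_i @ w - y_i) * x_i        # gradient of the squared error for this sample
    w = w - eta * grad                      # w_{t+1} = w_t - eta * grad
print(w)                                    # should approach [1.5, -2.0, 0.5]
```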

So, SA is very similar to Stochastic Gradient Descent (SGD). For SGD, however, there is a learning process driven by the data: SGD is primarily based on exploitation, following the gradient. In SA, exploration contributes more than in SGD, because SA relies on generating random candidates first and evaluating them afterward.

SA and Genetic Algorithm (GA)

What is GA?
Genetic Algorithms (GA) are evolutionary algorithms inspired by the principles of natural selection and genetics. They work by evolving a population of candidate solutions through successive generations. Each generation undergoes selection, where the fittest individuals are chosen based on their performance. These selected individuals then undergo crossover (recombination) to produce offspring that combine their parents’ characteristics. Additionally, mutation introduces random changes to some individuals to maintain genetic diversity within the population. This process of selection, crossover, and mutation allows GAs to explore a wide search space and balance between exploring new solutions and exploiting the best solutions found so far. This diversity helps GAs avoid getting trapped in local optima, making them effective for solving complex optimization problems, including those that are non-differentiable and non-convex.

From a personal point of view, GA is like an upgraded version of SA. SA works at the level of a single individual, while GA operates at the population level. Similar to SA, GA evaluates and selects the “fitness scores” of each individual. The next generation introduces many random mutations, just like SA. However, unlike SA, GA also includes “crossover” steps, which can help enrich the “better phenotypes”.
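
A minimal sketch of one GA generation on a toy bit-string problem (the one-max fitness, population size, and mutation rate are illustrative assumptions):

```python
import random

GENOME_LEN, POP_SIZE, MUTATION_RATE = 20, 30, 0.05

def fitness(genome):
    """Toy fitness: count of 1-bits (the 'one-max' problem)."""
    return sum(genome)

def evolve(population):
    """One generation: selection -> crossover -> mutation."""
    # Selection: keep the fitter half of the population as parents
    parents = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 2]
    offspring = []
    while len(offspring) < POP_SIZE:
        p1, p2 = random.sample(parents, 2)
        cut = random.randrange(1, GENOME_LEN)           # single-point crossover
        child = p1[:cut] + p2[cut:]
        child = [bit ^ 1 if random.random() < MUTATION_RATE else bit
                 for bit in child]                      # mutation keeps diversity
        offspring.append(child)
    return offspring

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(50):
    population = evolve(population)
print(max(fitness(g) for g in population))              # best fitness found
```

Selection keeps the fitter half as parents, crossover recombines them, and mutation flips a few bits to preserve diversity, mirroring the three operators described above.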

| Feature | Simulated Annealing (SA) | Stochastic Gradient Descent (SGD) | Genetic Algorithm (GA) |
|---|---|---|---|
| Approach | Probabilistic, accepts worse solutions occasionally | Deterministic, updates in the direction of the gradient | Evolutionary, uses selection, crossover, and mutation |
| Objective Function | Non-differentiable and non-convex functions | Differentiable functions | Non-differentiable and non-convex functions |
| Exploration vs. Exploitation | Balances both, reduces acceptance of worse solutions over time | Primarily exploitation with some exploration via mini-batches | Balances both, uses population diversity to explore the search space |
| Cooling Schedule / Learning Rate | Uses a cooling schedule to reduce probability of accepting worse solutions | Uses a learning rate to control step size of updates | Uses selection pressure to favor better solutions and mutation rate to introduce diversity |
| Population-Based | No | No | Yes, operates on a population of solutions |
| Escape Local Optima | Yes, by accepting worse solutions with a probability | Limited, may get stuck in local optima | Yes, by maintaining a diverse population |
| Gradient Requirement | No | Yes | No |
| Applications | Combinatorial and continuous optimization without gradients | Training machine learning models, especially in deep learning | Optimization problems, including scheduling, design, and artificial intelligence |
| Natural Inspiration | Annealing in metallurgy | Gradient descent in calculus | Natural selection and genetics |
| Operators | Acceptance probability based on temperature | Gradient-based updates | Selection, crossover (recombination), and mutation |

Table from: ChatGPT-4o

Author: Karobben
Posted on: 2024-05-30
Updated on: 2024-08-16