Vanishing gradient problem

Vanishing gradient problem

In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, neural network weights are updated proportional to their partial derivative of the loss function. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude might introduce instability in the training process, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. The gradients of this function are in range [0,1]. The product of repeated multiplication with such gradients decreases exponentially. The inverse problem, when weight gradients at earlier layers get exponentially larger, is called the exploding gradient problem. Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time). == Prototypical models == This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. === Recurrent network model === A generic recurrent network has hidden states h 1 , h 2 , … {\displaystyle h_{1},h_{2},\dots } , inputs u 1 , u 2 , … {\displaystyle u_{1},u_{2},\dots } , and outputs x 1 , x 2 , … {\displaystyle x_{1},x_{2},\dots } . Let it be parameterized by θ {\displaystyle \theta } , so that the system evolves as ( h t , x t ) = F ( h t − 1 , u t , θ ) {\displaystyle (h_{t},x_{t})=F(h_{t-1},u_{t},\theta )} Often, the output x t {\displaystyle x_{t}} is a function of h t {\displaystyle h_{t}} , as some x t = G ( h t ) {\displaystyle x_{t}=G(h_{t})} . The vanishing gradient problem already presents itself clearly when x t = h t {\displaystyle x_{t}=h_{t}} , so we simplify our notation to the special case with: x t = F ( x t − 1 , u t , θ ) {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )} Now, take its differential: d x t = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) d x t − 1 = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) [ ∇ θ F ( x t − 2 , u t − 1 , θ ) d θ + ∇ x F ( x t − 2 , u t − 1 , θ ) d x t − 2 ] ⋮ = [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] d θ {\displaystyle {\begin{aligned}dx_{t}&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )dx_{t-1}\\&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )\left[\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )d\theta +\nabla _{x}F(x_{t-2},u_{t-1},\theta )dx_{t-2}\right]\\&\;\;\vdots \\&=\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]d\theta \end{aligned}}} Training the network requires us to define a loss function to be minimized. Let it be L ( x T , u 1 , … , u T ) {\displaystyle L(x_{T},u_{1},\dots ,u_{T})} , then minimizing it by gradient descent gives Δ θ = − η ⋅ [ ∇ x L ( x T ) ( ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ) ] T {\displaystyle \Delta \theta =-\eta \cdot \left[\nabla _{x}L(x_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)\right]^{T}} where η {\displaystyle \eta } is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ∇ x F ( x t − 3 , u t − 2 , θ ) ⋯ {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\nabla _{x}F(x_{t-3},u_{t-2},\theta )\cdots } ==== Example: recurrent network with sigmoid activation ==== For a concrete example, consider a typical recurrent network defined by x t = F ( x t − 1 , u t , θ ) = W rec σ ( x t − 1 ) + W in u t + b {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\sigma (x_{t-1})+W_{\text{in}}u_{t}+b} where θ = ( W rec , W in ) {\displaystyle \theta =(W_{\text{rec}},W_{\text{in}})} is the network parameter, σ {\displaystyle \sigma } is the sigmoid activation function, applied to each vector coordinate separately, and b {\displaystyle b} is the bias vector. Then, ∇ x F ( x t − 1 , u t , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))} , and so ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ⋯ ∇ x F ( x t − k , u t − k + 1 , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) W rec diag ⁡ ( σ ′ ( x t − 2 ) ) ⋯ W rec diag ⁡ ( σ ′ ( x t − k ) ) {\displaystyle {\begin{aligned}&\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\cdots \nabla _{x}F(x_{t-k},u_{t-k+1},\theta )\\&=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-2}))\cdots W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-k}))\end{aligned}}} Since | σ ′ | ≤ 1 {\displaystyle \left|\sigma '\right|\leq 1} , the operator norm of the above multiplication is bounded above by ‖ W rec ‖ k {\displaystyle \left\|W_{\text{rec}}\right\|^{k}} . So if the spectral radius of W rec {\displaystyle W_{\text{rec}}} is γ < 1 {\displaystyle \gamma <1} , then at large k {\displaystyle k} , the above multiplication has operator norm bounded above by γ k → 0 {\displaystyle \gamma ^{k}\to 0} . This is the prototypical vanishing gradient problem. The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential): ∇ θ L = ∇ x L ( x T , u 1 , … , u T ) [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] {\displaystyle \nabla _{\theta }L=\nabla _{x}L(x_{T},u_{1},\dots ,u_{T})\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]} The components of ∇ θ F ( x , u , θ ) {\displaystyle \nabla _{\theta }F(x,u,\theta )} are just components of σ ( x ) {\displaystyle \sigma (x)} and u {\displaystyle u} , so if u t , u t − 1 , … {\displaystyle u_{t},u_{t-1},\dots } are bounded, then ‖ ∇ θ F ( x t − k − 1 , u t − k , θ ) ‖ {\displaystyle \left\|\nabla _{\theta }F(x_{t-k-1},u_{t-k},\theta )\right\|} is also bounded by some M > 0 {\displaystyle M>0} , and so the terms in ∇ θ L {\displaystyle \nabla _{\theta }L} decay as M γ k {\displaystyle M\gamma ^{k}} . This means that, effectively, ∇ θ L {\displaystyle \nabla _{\theta }L} is affected only by the first O ( γ − 1 ) {\displaystyle O(\gamma ^{-1})} terms in the sum. If γ ≥ 1 {\displaystyle \gamma \geq 1} , the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer. === Dynamical systems model === Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation: x t + 1 = ( 1 − ε ) x t + ε σ ( w x t + b ) + ε w ′ u t {\displaystyle x_{t+1}=(1-\varepsilon )x_{t}+\varepsilon \sigma (wx_{t}+b)+\varepsilon w'u_{t}} At the small ε {\displaystyle \varepsilon } limit, the dynamics of the network becomes d x d t = − x ( t ) + σ ( w x ( t ) + b ) + w ′ u ( t ) {\displaystyle {\frac {dx}{dt}}=-x(t)+\sigma (wx(t)+b)+w'u(t)} Consider first the autonomous case, with u = 0 {\displaystyle u=0} . Set w = 5.0 {\displaystyle w=5.0} , and vary b {\displaystyle b} in [ − 3 , − 2 ] {\displaystyle [-3,-2]} . As b {\displaystyle b} decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the stable points are ( x , b ) = ( x , ln ⁡ ( x 1 − x ) − 5 x ) {\displaystyle (x,b)=\left(x,\ln \left({\frac {x}{1-x}}\right)-5x\right)} . Now consider Δ x ( T ) Δ x ( 0 ) {\displaystyle {\frac {\Delta x(T)}{\Delta x(0)}}} and Δ x ( T ) Δ b {\displaystyle {\frac {\Delta x(T)}{\Delta b}}} , where T {\displaystyle T} is large enough that the system has settled into one of the stable points. If ( x ( 0 ) , b ) {\displaystyle (x(0),b)} puts the system very close to an unstable point, then a tiny variation in x ( 0 ) {\displaystyle x(0)} or b {\displaystyle b} wo

Caffe (software)

Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework, originally developed at University of California, Berkeley. It is open source, under a BSD license. It is written in C++, with a Python interface. == History == Yangqing Jia created the Caffe project during his PhD at UC Berkeley, while working the lab of Trevor Darrell. The first version, called "DeCAF", made its first appearance in Spring 2013 when it was used for the ILSVRC challenge (later called ImageNet). The library was named Caffe and released to the public in December 2013. It reached end-of-support in 2018. It is hosted on GitHub. == Features == Caffe supports many different types of deep learning architectures geared towards image classification and image segmentation. It supports CNN, RCNN, LSTM and fully-connected neural network designs. Caffe supports GPU- and CPU-based acceleration computational kernel libraries such as Nvidia cuDNN and Intel MKL. == Applications == Caffe is being used in academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Yahoo! has also integrated Caffe with Apache Spark to create CaffeOnSpark, a distributed deep learning framework. == Caffe2 == In April 2017, Facebook announced Caffe2, which included new features such as recurrent neural network (RNN). At the end of March 2018, Caffe2 was merged into PyTorch.

Separating words problem

In theoretical computer science, the separating words problem is the problem of finding the smallest deterministic finite automaton that behaves differently on two given strings, meaning that it accepts one of the two strings and rejects the other string. It is an open problem how large such an automaton must be, in the worst case, as a function of the length of the input strings. == Example == The two strings 0010 and 1000 may be distinguished from each other by a three-state automaton in which the transitions from the start state go to two different states, both of which are terminal in the sense that subsequent transitions from these two states always return to the same state. The state of this automaton records the first symbol of the input string. If one of the two terminal states is accepting and the other is rejecting, then the automaton will accept only one of the strings 0010 and 1000. However, these two strings cannot be distinguished by any automaton with fewer than three states. == Simplifying assumptions == For proving bounds on this problem, it may be assumed without loss of generality that the inputs are strings over a two-letter alphabet. For, if two strings over a larger alphabet differ then there exists a string homomorphism that maps them to binary strings of the same length that also differ. Any automaton that distinguishes the binary strings can be translated into an automaton that distinguishes the original strings, without any increase in the number of states. It may also be assumed that the two strings have equal length. For strings of unequal length, there always exists a prime number p whose value is logarithmic in the smaller of the two input lengths, such that the two lengths are different modulo p. An automaton that counts the length of its input modulo p can be used to distinguish the two strings from each other in this case. Therefore, strings of unequal lengths can always be distinguished from each other by automata with few states. == History and bounds == The problem of bounding the size of an automaton that distinguishes two given strings was first formulated by Goralčík & Koubek (1986), who showed that the automaton size is always sublinear. Later, Robson (1989) proved the upper bound O(n2/5(log n)3/5) on the automaton size that may be required. This was improved by Chase (2020) to O(n1/3(log n)7). There exist pairs of inputs that are both binary strings of length n for which any automaton that distinguishes the inputs must have size Ω(log n). Closing the gap between this lower bound and Chase's upper bound remains an open problem. Jeffrey Shallit has offered a prize of 100 British pounds for any improvement to Robson's upper bound. == Special cases == Several special cases of the separating words problem are known to be solvable using few states: If two binary words have differing numbers of zeros or ones, then they can be distinguished from each other by counting their Hamming weights modulo a prime of logarithmic size, using a logarithmic number of states. More generally, if a pattern of length k appears a different number of times in the two words, they can be distinguished from each other using O(k log n) states. If two binary words differ from each other within their first or last k positions, they can be distinguished from each other using k + O(1) states. This implies that almost all pairs of binary words can be distinguished from each other with a logarithmic number of states, because only a polynomially small fraction of pairs have no difference in their initial O(log n) positions. If two binary words have Hamming distance d, then there exists a prime p with p = O(d log n) and a position i at which the two strings differ, such that i is not equal modulo p to the position of any other difference. By computing the parity of the input symbols at positions congruent to i modulo p, it is possible to distinguish the words using an automaton with O(d log n) states.

Factored language model

The factored language model (FLM) is an extension of a conventional language model introduced by Jeff Bilmes and Katrin Kirchoff in 2003. In an FLM, each word is viewed as a vector of k factors: w i = { f i 1 , . . . , f i k } . {\displaystyle w_{i}=\{f_{i}^{1},...,f_{i}^{k}\}.} An FLM provides the probabilistic model P ( f | f 1 , . . . , f N ) {\displaystyle P(f|f_{1},...,f_{N})} where the prediction of a factor f {\displaystyle f} is based on N {\displaystyle N} parents { f 1 , . . . , f N } {\displaystyle \{f_{1},...,f_{N}\}} . For example, if w {\displaystyle w} represents a word token and t {\displaystyle t} represents a Part of speech tag for English, the expression P ( w i | w i − 2 , w i − 1 , t i − 1 ) {\displaystyle P(w_{i}|w_{i-2},w_{i-1},t_{i-1})} gives a model for predicting current word token based on a traditional Ngram model as well as the Part of speech tag of the previous word. A major advantage of factored language models is that they allow users to specify linguistic knowledge such as the relationship between word tokens and Part of speech in English, or morphological information (stems, root, etc.) in Arabic. Like N-gram models, smoothing techniques are necessary in parameter estimation. In particular, generalized back-off is used in training an FLM.

AI Photo Editors: Free vs Paid (2026)

Trying to pick the best AI photo editor? An AI photo editor is software that uses machine learning to help you get more done — it scales effortlessly from a single task to thousands. The best picks balance beginner-friendly simplicity with the depth power users need, and they ship updates often. Whether you are a beginner or a pro, the right AI photo editor slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

Screen space directional occlusion

Screen space directional occlusion (SSDO) is a computer graphics technique enhancing screen space ambient occlusion (SSAO) by taking direction into account to sample the ambient light (both the light coming directly at an object, as well as the light reflected off of the object directly behind it), to better approximate global illumination. SSDO was introduced by Tobias Ritschel, Thorsten Grosch, and Hans-Peter Seidel in their 2009 ACM Symposium on Interactive 3D Graphics and Games paper Approximating dynamic global illumination in image space, which describes it as extending SSAO to directional occlusion with one diffuse indirect bounce of light; later literature notes that SSDO still suffers from common screen-space artifacts such as noise and banding. == Method == The original SSDO paper describes a two-pass screen-space approach, with one pass for direct lighting and a second pass for indirect bounces. Later literature describes SSDO as assuming a general shadowing direction that allows color bleeding and a single light bounce.

Mark Keane (cognitive scientist)

Mark Thomas Gerard Keane (Irish: Marcus Ó Cathain, born 3 July 1961, Dublin, Ireland) is a cognitive scientist and author of several books on human cognition and artificial intelligence, including Cognitive Psychology: A Student's Handbook (8 editions, with Michael Eysenck), Advances in the Psychology of Thinking (1992, with Ken Gilhooly), Novice Programming Environments (1992/2018, with Marc Eisenstadt and Tim Rajan), Advances in Case-Based Reasoning (1995, with J-P Haton and Michel Manago)., Case-Based Reasoning: Research & Development (2022, with N Wiratunga). == Education == Keane received a B.A. in Psychology from University College Dublin in 1982. He then received a Ph.D. from Trinity College Dublin in 1987. He then moved to postdoctoral positions in Queen Mary University of London and the Open University. == Academic career == He was a Lecturer in Psychology at Cardiff University. He became a lecturer in Computer Science at Trinity College Dublin in 1990, and became a fellow in 1994. Keane moved to become Chair of Computer Science at University College Dublin in 1998. In 2006, he was seconded to Science Foundation Ireland as Director of ICT, overseeing on a $700m research investment. He advised the Irish Government on its 3.7B euro Strategy for Science, Technology & Innovation (SSTI). From 2006 to 2007, he was Director General of Science Foundation Ireland before returning to University College Dublin where he was appointed VP of Innovation & Partnerships (2007-2009). Keane's research has been split between cognitive science and computer science. His cognitive science research has been in analogy, metaphor, conceptual combination and similarity. His computer science research has been in natural language processing, machine learning, case-based reasoning, text analytics and explainable artificial intelligence. He has been a PI in the Science Foundation Ireland funded Insight Centre for Data Analytics working on digital journalism and digital humanities. More recently, he was deputy director of the VistaMilk SFI Research Centre that is exploring precision agriculture in the dairy sector.