Abstract neural network (ANN) models, also known variously as artificial neural networks, connectionism, parallel distributed processing (PDP), perceptrons, backpropagation networks, deep networks, AI (artificial intelligence) models, and ML (machine learning), represent a large class of models that include the core mechanism of distributed processing performed by interconnected neuron-like processing elements (units).
Perhaps the most fundamental shared feature of all ANN models is the distinction between activations and weights, where activations are dynamic state variables representing neural activity that are communicated to other neurons over weighted synaptic connections. Learning happens by changing the strengths of these synaptic weights, typically under the influence of the local activation state.
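This activation/weight distinction can be illustrated with a minimal sketch (in Python with NumPy; this is purely illustrative and not the Axon implementation): activations are computed dynamically from weighted inputs, while weights change slowly via a local learning rule that depends only on the pre- and post-synaptic activation state (a simple Hebbian rule is used here as a stand-in for any local rule).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
weights = rng.normal(0.0, 0.1, size=(n_out, n_in))  # synaptic strengths (slow)

def forward(x, w):
    """Activation: a nonlinear function of the weighted synaptic input (fast)."""
    net = w @ x                           # weighted sum over input activations
    return 1.0 / (1.0 + np.exp(-net))     # sigmoidal activation function

def hebbian_update(x, y, w, lr=0.1):
    """Learning: weight change is a local function of pre (x) and post (y) activity."""
    return w + lr * np.outer(y, x)

x = np.array([1.0, 0.0, 1.0, 0.0])        # input activations
y = forward(x, weights)                   # dynamic state, recomputed per input
weights = hebbian_update(x, y, weights)   # slow, activity-dependent weight change
```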
After providing an overview of these models, we consider how they relate to neuroscience in general and Axon more specifically. In summary, Axon has a number of neurobiologically-motivated properties that differ significantly from most ANN models, which have potentially important functional implications.
Historically, early developments included Frank Rosenblatt’s two-layer perceptron model (Rosenblatt, 1959; Rosenblatt, 1962), which was critiqued by Minsky & Papert (1969), leading to a widespread rejection of the framework (a rejection that, in retrospect, was clearly misguided). The ability to train hidden layers via the error-backpropagation framework developed by Rumelhart et al. (1986) reinvigorated interest in these models. However, after a subsequent decline in interest (and a shift of focus toward more rigorous statistically-based frameworks), the remarkable performance of the Krizhevsky et al. (2012) “AlexNet” deep neural network, running on a GPU, triggered a resurgence of interest that has grown exponentially ever since. See LeCun et al. (2015) and Schmidhuber (2015) for historical reviews, and Rumelhart & McClelland (1986) for the foundational insights that remain highly relevant to this day.
The current, widely used large language models (LLMs) represent the most impactful result of this line of work, used daily by millions of people in the form of ChatGPT and various other related models. Interestingly, the core mechanisms of these models can be directly traced back to the error-backpropagation models of the 1980s, with the major advances deriving from tremendous increases in the size of the models and the amount of data they are trained on, made possible by major improvements in GPU technology.
The ability to train an LLM on essentially the entire corpus of human-generated text on the internet (i.e., the big data approach) derives from the fact that the LLM is based on predictive learning, which is a core feature of the Axon model and, by hypothesis, the mammalian neocortex. By contrast, the AlexNet model required a large corpus of human-labeled images (the ImageNet corpus; Deng et al., 2009) to provide the category labels that drive the error signals in that and many other error backpropagation models. Specifically, an LLM learns from the error signals generated by trying to predict the next word in a stream of language input. It is remarkable that this simple principle can lead to the amazing abilities of such models.
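The next-word prediction principle can be sketched in a few lines (an illustrative toy, not an actual LLM): the error signal is simply the mismatch between the model’s predicted distribution over the next word and the word that actually occurs, so no human-provided labels are needed.

```python
import numpy as np

def softmax(z):
    """Convert raw scores (logits) into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["the", "cat", "sat", "mat"]
logits = np.array([0.1, 2.0, 0.3, 0.2])   # model's scores for the next word
probs = softmax(logits)

actual_next = vocab.index("cat")           # the word that actually came next
loss = -np.log(probs[actual_next])         # cross-entropy "surprise" at that word

# The gradient of this loss with respect to the logits is (probs - one_hot),
# i.e., the prediction error that backpropagation then sends through the network:
grad = probs.copy()
grad[actual_next] -= 1.0
```

The gradient expression makes the predictive-learning logic explicit: confident correct predictions produce near-zero error, while surprising words produce large error signals.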
In general, the progress of ANN models is consistent with the bitter lesson articulated by Rich Sutton: big data consumed by relatively simple, general-purpose models ultimately prevails over attempts to develop more complex, bespoke algorithms and representations. This is an instance of the bias-variance tradeoff, a basic statistical principle governing the tradeoff between building stronger biases into a learning system versus using more general-purpose, unbiased models. Stronger biases are beneficial when data is scarce, because they reduce the variance in learning outcomes (which directly affects the model’s ability to generalize to novel inputs). But when data is plentiful, biases are unnecessary and can even be harmful.
In this context, Axon is somewhere in between. The neocortical error-driven learning mechanisms in Axon are effectively a biologically-based version of the primary general-purpose learning mechanisms in most ANN models. However, the Rubicon model represents a strongly evolutionarily-shaped set of subcortical brain areas that drive a more strongly biased form of goal-driven learning. This system enables the simulated organism to learn new skills with significantly fewer learning trials, consistent with stronger biases.
One of the most important differences between the Axon framework and the vast majority of ANN models is that those models use strictly feedforward connections, such that information flows in only one direction through the network (“forward”). This constraint enables error gradients to be computed efficiently, and also greatly simplifies the activation dynamics of the models. Including bidirectional connectivity in a network, which is a core feature of Axon, creates positive feedback loops that pose significant problems for both error backpropagation and the overall behavior of the network.
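The runaway positive feedback problem can be seen in the simplest possible case (a hypothetical two-unit sketch, not an Axon model): two reciprocally connected excitatory units with a loop gain greater than one. With an unbounded linear activation function, activity explodes; a saturating nonlinearity (tanh is used here as a generic stand-in) caps the feedback and the network settles to a bounded state.

```python
import numpy as np

w = 1.5  # reciprocal excitatory weight; loop gain > 1 => positive feedback

def run(act_fn, steps=50):
    """Iterate two reciprocally connected units and return the final activity."""
    a, b = 0.1, 0.1
    for _ in range(steps):
        a, b = act_fn(w * b), act_fn(w * a)   # each unit drives the other
    return a

linear = run(lambda x: x)          # no saturation: activity grows without limit
saturating = run(np.tanh)          # saturation damps the loop to a fixed point
```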
Thus, a major scientific question motivating this work is to understand why the brain is based on bidirectional connectivity, and how that changes the nature of the computations performed relative to feedforward ANNs. One central hypothesis is that bidirectional connectivity is critical for conscious awareness, which allows the system to access its own state of knowledge in ways that a strictly feedforward model cannot. The computational benefits of consciousness thus represent one potential answer to this question, and potentially could limit the problems of confabulation and “unnatural” failures revealed by “adversarial attacks” that continue to plague LLMs and other models.
Key advances in ANNs in relation to neuroscience
The following are some of the key advances in modern ANN models relative to the original 1980s backprop nets. Some of these advances align with biological properties in Axon, but others do not.
• ReLU linearized activation function. A key advance in the AlexNet model was the use of the rectified linear unit (ReLU) activation function, in place of the saturating sigmoidal (S-shaped) activation functions used in earlier models. The linear nature of this function solves the problem of the exponential decay of error signals across multiple hidden layers in a deep neural network, and has the added bonus of giving each unit a much wider effective dynamic range, allowing one unit to do the work of many sigmoidal ones.
This is not consistent with the known biology, where individual neurons in the cortex have strong saturation properties and a relatively limited dynamic range of activation signaling. This saturating nonlinearity is critical for bidirectional excitatory networks, providing a built-in damping limit on potentially runaway positive feedback dynamics.
Interestingly, the exponential decay phenomenon is at least partially mitigated in the brain despite the presence of saturating nonlinearities, by virtue of robust inhibition that keeps most neurons well below the saturation levels of activity. Indeed, neurons in awake behaving neocortex are characterized as being precisely balanced between inhibition and excitation (Shadlen & Newsome, 1998; Okun & Lampl, 2008; Isaacson & Scanziani, 2011; Rubin et al., 2017). This balance right around the threshold of firing also makes them more chaotic and contributes to the Poisson noise observed in spiking neurons, which could also potentially amplify responses to the temporal differences that drive learning in the GeneRec and kinase algorithms.
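The exponential decay (vanishing gradient) problem described above can be demonstrated directly (an illustrative calculation, assuming a single active path through the network): the derivative of the sigmoid is at most 0.25, so the backpropagated error shrinks exponentially with depth, whereas the ReLU derivative is exactly 1 for active units.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 20
x = 0.5  # a representative pre-activation value at each layer

# Backpropagation multiplies per-layer activation derivatives along the path:
sig_deriv = sigmoid(x) * (1 - sigmoid(x))       # at most 0.25 for any x
sig_grad = sig_deriv ** depth                   # shrinks exponentially with depth
relu_grad = 1.0 ** depth                        # ReLU derivative is 1 when x > 0
```

After only 20 sigmoid layers the surviving gradient is on the order of 1e-13, while the ReLU path passes the error signal through undiminished.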
• Shortcut connections and the importance of residual error. The original insights of Schraudolph (1998) about the importance of focusing learning on residual error signals (e.g., by subtracting the mean) were amplified and elaborated in the ResNet architecture of He et al. (2015), which includes pervasive shortcut connections between deep layers. The key idea is that these shortcut connections allow higher layers to automatically benefit from all of the knowledge represented in the lower layers, so that they can focus on learning the residual information beyond this more “basic” level. An early influential idea along these lines was the cascade correlation algorithm of Fahlman & Lebiere (1989), which involved progressively adding new units as a function of residual error. Other widely-used normalization mechanisms (e.g., “batch norm”) also provide a centering, residual-error focus.
Shortcut connections are a prominent feature of the brain and are often used in Axon models. Furthermore, the ubiquitous pooled inhibition in the brain, which is also essential for Axon, provides a form of normalization and dynamic range centering.
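The residual idea can be sketched in one line of computation (a minimal illustration of a ResNet-style block, not of Axon’s connectivity): the layer’s output is its input plus a learned correction, so when the weights are near zero the block defaults to the identity and higher layers inherit the lower layers’ representation “for free.”

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(0, 0.01, size=(dim, dim))   # near-zero weights, as at initialization

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x):
    """Shortcut connection: output = input + learned residual correction."""
    return x + relu(W @ x)

x = rng.normal(size=dim)
y = residual_block(x)
# With near-zero weights the block is close to the identity mapping, so
# learning only has to capture the residual beyond what x already encodes.
```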
• Attentional mechanisms as in the transformer. Perhaps the single most dramatic recent innovation relative to the original backpropagation networks is the introduction of “attention” mechanisms in the transformer architecture used in LLMs. Certainly, attention is a critical aspect of human cognition, but the transformer architecture may be capturing more of the episodic memory functions of the hippocampus than of attention as it operates in the brain.
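The core transformer “attention” operation is scaled dot-product attention (sketched below in its standard form; the query/key/value framing itself suggests the content-addressable lookup character that invites the hippocampal comparison): each position forms a query that is matched against keys from all positions, and the resulting soft weights blend the corresponding values.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax: each row becomes a probability distribution."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a sequence of positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    weights = softmax(scores)         # rows sum to 1: a soft, content-based lookup
    return weights @ V                # weighted blend of the values

rng = np.random.default_rng(0)
seq_len, d_k = 5, 4
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)
```

Each output row is a convex combination of value vectors, retrieved by content similarity rather than by position, which is what gives the operation its memory-lookup flavor.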