Biological Computing Underhang
Human cortex can’t represent arbitrarily complex abstractions in a single forward pass. It’s depth-limited (capped by the number of sequential inferential steps it can execute per corticothalamic cycle), and that ceiling determines what kinds of reasoning are biologically possible at all, not merely how fast they happen.

A single cortical area can be simulated by ~14 ReLU transforms[1] at ~4ms/pass. GPT-3 has 192 such layers and can do a forward pass in <2ms. With sufficient data[2], you can train a model of this depth for every microcircuit[3] (cortical upper-bound estimate).

So human cortex, given identical inter-region scaffolding and reward signals[4], should be strictly computationally dominated by larger ANNs, even given arbitrary developmental time.

Beyond this, some otherwise embeddable concepts cannot be acquired outside of critical learning periods (see perfect pitch[5]). Synapses are physically encaged during maturation, especially in lower sensory/perceptual[6] areas, and uncaging them with psychedelics[7] indeed reopens critical windows. An adult acquiring a genuinely new conceptual domain, i.e. real new primitives, cannot embed those primitives in early sensory cortex the way childhood experience shapes visual processing.

Say you’re a young child who really loves plant aesthetics. You attend to leaf and branch geometries so often that concepts like phyllotaxis become primitives. Later, as an adult, you find it easy to understand network structures.

Or, more relevant to BCIs for alignment, suppose you’re a very smart toddler learning a linear algebra concept. The relevant representations are worn into early visual cortex; as an adult, you easily compose much more complex representations if they’re well-grounded by the LA primitive.

Contrast with using naive adult infra for the same struct.
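As an aside, the serial-depth and latency comparison above can be sketched with the post’s own figures. Everything here is an estimate quoted in the text (~14 ReLU-equivalent layers per area at ~4ms, GPT-3 at 192 layers in <2ms); the ~10-area sweep is my assumption, chosen to match the ~140-layer macroscopic pass claimed in footnote 3:

```python
# Illustrative arithmetic only; all constants are the post's estimates.
LAYERS_PER_AREA = 14     # ReLU-transform equivalents per cortical area
MS_PER_AREA_PASS = 4.0   # local pass latency, milliseconds
GPT3_LAYERS = 192
GPT3_MS_PER_PASS = 2.0

def cortical_depth(n_areas: int) -> int:
    """ReLU-equivalent depth of one serial sweep through n cortical areas."""
    return n_areas * LAYERS_PER_AREA

def depth_per_second(layers: int, ms_per_pass: float) -> float:
    """Sequential ReLU transforms executed per second of wall clock."""
    return layers * (1000.0 / ms_per_pass)

print(cortical_depth(10))                                   # 140
print(depth_per_second(LAYERS_PER_AREA, MS_PER_AREA_PASS))  # 3500.0
print(depth_per_second(GPT3_LAYERS, GPT3_MS_PER_PASS))      # 96000.0
```

The gap is in throughput as much as depth: on these numbers, the ANN executes roughly 27x more sequential transforms per second.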
It may introspectively feel similar[8] to holding a new primitive, but it takes far more computation and is less composable.

Reasoning with primitives feels like, in our example, rotating a solid object and intuitively feeling that its shape doesn’t change; therefore, even though you didn’t[9] read about Liouville’s Theorem as a kid, you can intuitively feel volume conservation and extend it to nonstandard geometries.

But that horizontal leap took extra effort. You may have already embedded an autonomic circuit which runs through this process so fluidly that it feels as clear as seeing blue in an open sky, but unless you did so as a young child, you’re going to be less efficient at stacking further representations than with true primitives.

So, then, if serial depth is limiting human cognition, why didn’t we evolve more compute-efficient software? “Add six extra cortical neuron layers” sounds like a genetically simple, non-personality-altering change, and brains’ heaviest energy expenses are interconnect, not soma. – hypothetical reader

I don’t know. All mammals I’ve studied have six cortical layers. Maybe jitter means six layers is Pareto-optimal in general, and humans are smarter because we use heavier inter-region stacking? Perhaps error propagation is harder in biological networks[10]?

In any case, provided you can get useful error signals to condition a less-lossy SGD model with, you can expand into more effective cortical columns while using human meta-learning to direct the model much better than training an LLM.

[1] Multiple local passes usually chain in series. One cortical area’s internal pass: L4 input (~3 equivalent DNN layers) + L2/3 (~3 layers) + L5 thick-tufted pyramidal (~8 layers, see Beniaguev et al. 2021) = ~14 equivalent layers per area. This can be substantially larger (60-140) given intralayer feedback (e.g. L5 → L5); however, such feedback takes so long (up to 30ms) that you’re limiting lateral association.
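The per-area accounting and the recurrence trade-off just described can be put in toy-model form. The 3 + 3 + 8 stage split, the 60-140 range, and the ~30ms feedback latency are the footnote’s figures; modeling each feedback loop as one additional full pass through the area, and picking 9 loops to hit the upper estimate, are my illustrative assumptions:

```python
# Toy accounting for per-area depth and the cost of intralayer feedback.
AREA_STAGES = {"L4": 3, "L2/3": 3, "L5_thick_tufted": 8}  # ReLU-equivalent layers
BASE_DEPTH = sum(AREA_STAGES.values())  # ~14 equivalent layers, feedforward

def recurrent_depth(base: int, n_loops: int) -> int:
    """Effective depth if each intralayer feedback loop (e.g. L5 -> L5)
    counts as another pass through the area — an assumed simplification."""
    return base * (1 + n_loops)

def loop_latency_ms(n_loops: int, ms_per_loop: float = 30.0) -> float:
    """Wall-clock cost of the feedback loops alone, at up to ~30ms each."""
    return n_loops * ms_per_loop

print(BASE_DEPTH)              # 14
print(recurrent_depth(14, 9))  # 140, the footnote's upper estimate...
print(loop_latency_ms(9))      # 270.0 ms of feedback latency to get it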
Also, this is more of an RNN than a feedforward MLP; more metabolically & data efficient but repeated autoassociation isn’t as expressive as “multiply depth bypasses” would suggest. ↩︎I will cover this in the next post. You can train anything with semantically coherent I/O, sufficient compute/data, and a large enough network. Cortex has an abundance of time-coded voltages which we can convert into embeddings. Differentiating these signals into something which would robustly inform SGD means extracting local updates from tissue. ↩︎Signals sweep serially through multiple cortical areas, so a full macroscopic pass in humans can be up to 140 ReLU-transform equivalents, which isn’t far from GPT-3’s 192. I’m claiming that we can stack an additional, say, 96 ReLUs for each microcircuit, using the brain as a meta-learning signal and router. ↩︎I’m hand-waving at the macroarchitectural implementation of human continual/meta-learning here, and I don’t claim to know what it actually is. BCIs can use enhanced ~gradients without knowing the process which enhanced them. ↩︎This paper says otherwise, but is wrong in the sense which I care about because relative pitch is learnable in the same way I’ve exposed so far. ↩︎Primary visual cortex consolidates around age 5 (peak plasticity 0-3, closure ~6-8); auditory cortex slightly later; language areas in early adolescence (Hartshorne et al. 2018: sharp decline after ~17); prefrontal regions not until the mid-twenties. See Hensch 2005 for review. ↩︎Critical periods can be pharmacologically reopened by destabilizing the extracellular matrix via TrkB/BDNF (see Moliner et al. 2023), the same mechanism behind moderate-dose psychedelic effects on learning. TrkB activation in parvalbumin interneurons transiently dissolves perineuronal nets, which reopens critical periods in rodents. 
The antidepressant effects of psychedelics are pretty tightly linked to TrkB and separate from hallucinations; 5HT-2a laterally disinhibits cortex, the knockout of which prevents hallucinations in mice. The head twitch suppression effects of 5HT-2a KO in mice (via volinanserin) correspond to hallucinatory suppression of psychedelics in me, an alleged human. Whether chronically pegging TrkB in adults would eventually downregulate plasticity via eg downstream MAPK-ERK shifts -> ECM tightening is unclear. I find it unlikely given the long-term ECM softening seen with fluoxetine (Castrén lab, Science 2008). ↩︎^I think it’s hard to reflect on one’s internal composition of primitives because the macroscopic inter-region computation which reflectivity (and related architectural prerequisites of human meta-learning) operate at is the inter-region / broadcast level. You can’t look directly at the subroutines of “blueness” because it’s connectomically inaccessible. I’ll cover this in a post about alignment of superintelligent humans.That is, unless you, like Terence Tao and John von Neumann, learned applicable constructs as a child. ↩︎From my simple simulations with predictive coding networks at varying depths & signal-to-noise ratios, it seems that canonical PC is not robust to settling-phase noise. In fact, noise is so disruptive that I’m doubting whether settling-based PC is biologically plausible. Alternatives like Saponati and Vincks’ rule might be stabler; if you know of other examples, please comment with even a quick description. ↩︎Discuss Read More