
Transfer Learning Slashes Cosmology AI Costs: Neutrino Mass Degeneracy Triggers Negative Transfer

A strategy borrowed from generative AI (train cheaply on the familiar, then fine-tune on the hard problem) can cut the number of expensive physics simulations needed by nearly a factor of ten. But a new paper published Wednesday in the Journal of Cosmology and Astroparticle Physics establishes that the same shortcut carries a structural risk that goes beyond cosmology: when an AI is pretrained on any established theoretical framework, it encodes that framework's parameter associations as biases in its learned weights, and those biases become a liability the moment the training objective is to detect something genuinely new. That failure mode, documented here for the first time at this level of technical specificity in physics inference, applies to any scientific domain where a foundation-model approach is used to search beyond a standard model, including particle physics at the Large Hadron Collider.

The paper, by Veena Krishnaraj, Adrian Bayer, Christian Kragh Jespersen, and Peter Melchior at Princeton University and the Flatiron Institute, asks whether transfer learning can accelerate the search for physics beyond ΛCDM, the standard cosmological model that describes a universe dominated by cold dark matter and a cosmological constant. The approach is to train a neural network first on cheaper standard-model simulations and only then expose it to more computationally demanding beyond-ΛCDM scenarios. The answer is mostly yes, with a precisely identified exception that researchers will need to guard against as next-generation surveys scale up.

Why AI Cosmology Simulation Costs Are a Real Bottleneck

ΛCDM, despite explaining the universe's large-scale structure, expansion history, and the cosmic microwave background with remarkable accuracy, is known to be incomplete.
Physicists suspect that neutrinos carry mass large enough to leave measurable imprints on cosmic structure, that gravity may deviate from general relativity at cosmological scales, and that dark energy may evolve rather than holding steady. Confirming any of those hypotheses requires running large suites of high-precision N-body simulations (virtual universes computed under different physical assumptions) and then training a machine-learning model on those simulations to infer cosmological parameters.

Each simulation in a beyond-ΛCDM suite costs significantly more compute time than its standard-model equivalent. Running the thousands of simulations a well-calibrated inference network needs can be prohibitive, particularly as surveys like the ESA Euclid mission and the Vera C. Rubin Observatory's Legacy Survey of Space and Time begin generating hundreds of millions of galaxy observations at precision levels that demand correspondingly precise theoretical models.

The Princeton-Flatiron team tested a two-stage pipeline using the Quijote simulations, an existing suite of more than 44,000 full N-body simulations covering thousands of cosmological models. In the first stage, the network was pretrained on 22,000 ΛCDM simulations. In the second stage, it was fine-tuned on a much smaller number of beyond-ΛCDM simulations, covering massive neutrinos, modified gravity, and primordial non-Gaussianities, to test whether pretraining could reduce the fine-tuning simulation budget. In most cases, it did: with only a few hundred fine-tuning simulations, the pretrained network matched the accuracy of a network trained from scratch on several thousand.

How the Dummy-Node Architecture Powers the Cost Reduction

The network is a fully connected neural net that takes the matter power spectrum, a summary of how much structure exists at each spatial scale in the universe, as its input.
The input is a 79-bin vector covering wavenumbers from 0.0089 to 0.5 h/Mpc, where h is the dimensionless Hubble parameter. The network outputs cosmological parameter estimates for each simulation.

The critical architectural innovation is the inclusion of "dummy nodes" in the output layer during pretraining. When the network is first trained on ΛCDM, it has only five parameters to predict: the matter density fraction, the baryon density fraction, the Hubble constant, the spectral index, and σ8, the amplitude of matter clustering on scales of 8 Mpc/h. But the network's output layer is built with extra nodes, dummy outputs that are not used during ΛCDM training, whose number matches the additional parameters in the eventual beyond-ΛCDM fine-tuning task. This dummy-node bottleneck structure provides extra representational capacity at fine-tuning time. When the network is subsequently trained on simulations that include, say, neutrino mass as a new parameter, those dummy nodes are activated as outputs for the new quantity. The pretrained weights do not need to be frozen or replaced: they serve as an initialization that encodes the physics shared between ΛCDM and the target cosmology, while the available node capacity allows the network to adapt to the new signal.

The Princeton team tested two alternative architectures, one without dummy nodes and one that froze the pretrained weights and attached a trainable inference head, and found that both performed worse, with the frozen-weight head architecture suffering severe negative transfer. The dummy-node approach is conceptually related to the modular "head" architectures in large language models, where a pretrained base network is adapted to downstream tasks by modifying only a small set of task-specific layers while inheriting the representations of the pretrained backbone.

σ8 and Neutrino Mass: Why Physical Degeneracy Breaks the Shortcut

The failure mode the paper documents is not random.
It occurs specifically when the new physics produces signatures that closely resemble patterns the network has already learned to associate with a standard-model parameter. Massive neutrinos suppress the growth of small-scale cosmic structure: they stream freely through gravitational wells rather than clumping, leaving behind a characteristic deficit in the matter power spectrum at high wavenumber. But that suppression signature is nearly indistinguishable, in the standard power spectrum, from a lower value of σ8. Because the pretrained network has spent 22,000 ΛCDM simulations learning to associate small-scale power deficits with lower σ8, it arrives at fine-tuning with a strong prior that those features mean σ8 is low, not that neutrino mass is present.

This degeneracy between σ8 and neutrino mass is a known challenge in cosmological inference, confirmed independently by multiple survey analyses including the Baryon Oscillation Spectroscopic Survey. What is new here is the demonstration that it manifests specifically as a transfer learning failure: the network's pretraining encodes the degeneracy as a rigid feature attribution, which fine-tuning on a small simulation budget cannot overcome.

The team used SHAP (Shapley Additive Explanations) analysis to visualize this directly. In the pretrained network, high-wavenumber (small-scale) power spectrum bins carry strong positive attribution for σ8; they are the model's primary evidence for inferring clustering amplitude. After fine-tuning on neutrino mass cosmologies, those same bins are reassigned to neutrino mass, forcing σ8's attribution to shift to larger scales with a reversed sign. The network is, in effect, unlearning its σ8 mapping at small scales and repurposing it for neutrino mass (Mν), a process that degrades the accuracy of both estimates and constitutes what the field calls negative transfer.

"The negative transfer is not random. It is driven by underlying physical degeneracies in the model," said Krishnaraj. "Different physical parameters can produce very similar observable effects, making it difficult for the AI to disentangle them correctly. So this is something we need to be aware of and try to mitigate."

The failure depends on the summary statistic. When the standard power spectrum is used, it carries little sensitivity to neutrino mass independently, so the pretrained network is not confused and fine-tuning proceeds normally. When the marked power spectrum is used, which weights galaxies by local density and is specifically designed to be more sensitive to neutrino mass, the degeneracy is exposed and negative transfer appears clearly.

Does Transfer Learning Work for Other Beyond-ΛCDM Physics?

For modified gravity (specifically the Hu-Sawicki f(R) model, which modifies the Einstein-Hilbert action by adding a function of the Ricci scalar), transfer learning provided significant gains similar to the neutrino mass power-spectrum case. For equilateral-type primordial non-Gaussianities, where parameter degeneracies are mild, transfer learning improved inference consistently across nearly all parameters. For local-type primordial non-Gaussianities with fixed ΛCDM parameters, transfer learning offered no advantage, because the source and target domains shared almost no structure to transfer.

The pattern that emerges is clear: the benefit of transfer learning scales with the overlap between the pretraining domain and the fine-tuning domain, and the risk of negative transfer scales with the strength of physical degeneracies between the new parameters and the standard-model parameters the network was trained to recognize. A researcher deploying this approach on real survey data needs to ask, before committing to a standard-model pretraining strategy: does my target beyond-standard-model parameter produce signatures that the pretrained network has already learned to explain away?
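
One rough way to put a number on that question is to compare how a summary statistic responds to an old parameter and to a new one. The sketch below is only an illustration and is not the procedure used in the paper: the toy statistic, the bin values, and the parameter names `amp` and `damp` are hypothetical stand-ins for a real simulated statistic, σ8, and a new-physics parameter.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def response(summary, theta, name, step=1e-3):
    """Central finite-difference derivative of the statistic w.r.t. one parameter."""
    up, down = dict(theta), dict(theta)
    up[name] += step
    down[name] -= step
    return [(u - d) / (2 * step) for u, d in zip(summary(up), summary(down))]

def degeneracy_score(summary, theta, old_param, new_param):
    """|cos| between the two response vectors.

    Values near 1 mean both parameters change the statistic with almost the
    same shape (up to sign and scale), so they are hard to tell apart.
    """
    return abs(cosine_similarity(response(summary, theta, old_param),
                                 response(summary, theta, new_param)))

# Hypothetical wavenumber bins and toy statistic, for illustration only.
K_BINS = [0.01 * (i + 1) for i in range(50)]

def toy_summary(theta):
    """Toy spectrum: 'amp' rescales every bin, 'damp' suppresses high-k bins."""
    return [theta["amp"] ** 2 * (1.0 - theta["damp"] * k / (k + 0.1)) / (1.0 + k)
            for k in K_BINS]

fiducial = {"amp": 0.8, "damp": 0.1}
score = degeneracy_score(toy_summary, fiducial, "amp", "damp")
print(f"degeneracy score (amp vs damp): {score:.3f}")
```

A score close to 1 would flag a parameter pair whose effects on this particular statistic have nearly the same shape, which is the situation the article associates with negative transfer. The score depends on the statistic, so repeating the check with a different summary statistic can give a different answer.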
What This Means for Foundation-Model Approaches in Physics

The parallel the authors draw to foundation models (large pretrained networks such as BERT and GPT that underpin modern language AI) is not merely illustrative. The structure of the problem is identical: a large corpus of "source domain" data trains a general-purpose representation, and a small set of "target domain" data fine-tunes it. The success of GPT-class models rests on the assumption that the source domain (internet text) and target domain (downstream tasks) share sufficient structure that the pretrained representations generalize. In physics inference, the assumption is that ΛCDM and beyond-ΛCDM cosmologies share sufficient structure that a ΛCDM-pretrained network generalizes to new physics.

The Princeton paper demonstrates that the assumption holds most of the time but breaks down precisely at the parameter degeneracies that are most scientifically interesting, because those degeneracies define the cases where new physics looks most like old physics. An AI system that has deeply internalized the standard model may, at those moments, produce what the authors describe as a hallucination: a confident standard-model explanation for a signal that is actually something new.

The authors explicitly note that the finding extends beyond cosmology. "Looking beyond cosmology," they write in the paper, "this analysis could inform other areas of fundamental physics, such as learning extensions beyond the Standard Model of particle physics." At the Large Hadron Collider, networks trained on Standard Model backgrounds face the same structural challenge: the most interesting new-physics signals are those that most closely resemble known processes.

The paper was first posted to arXiv in October 2025 and presented at the NeurIPS 2025 workshop on Machine Learning and the Physical Sciences before appearing in its peer-reviewed form today.
For cosmologists preparing pipelines for Euclid and the Rubin Observatory, the practical guidance is specific: auditing for negative transfer risk requires identifying, before training, which beyond-standard-model parameters are physically degenerate with the standard-model parameters being pretrained on, and choosing or designing summary statistics that are sufficiently sensitive to the new parameter to break those degeneracies before the network encodes them. Failing that, the paper recommends the dummy-node architecture as the best currently available mitigation: it consistently outperformed the alternatives tested, though it does not eliminate the risk in strongly degenerate cases.

Frequently Asked Questions

What is transfer learning in the context of cosmology?

Transfer learning is a machine-learning technique in which a neural network trained on one set of data is fine-tuned on a related but different dataset, rather than being trained from scratch. In cosmology, researchers have used it to pretrain networks on cheap standard-model simulations and then adapt them to test beyond-standard-model physics with far fewer expensive simulations; in this study, that reduced the required simulation count by roughly a factor of ten.

What causes negative transfer in cosmological AI simulations?

Negative transfer occurs when physical parameter degeneracies in the standard cosmological model cause the pretrained network to misattribute new-physics signatures to familiar parameters it already knows. In this study, neutrino mass suppresses small-scale cosmic structure in ways nearly identical to a lower value of σ8, the matter clustering amplitude. Because the pretrained network has learned to explain small-scale power deficits as low σ8, it resists recognizing the same pattern as neutrino mass during fine-tuning.

How does neutrino mass detection work in cosmological surveys?
Massive neutrinos suppress the growth of cosmic structure at small scales by streaming freely through gravitational potentials rather than clumping, leaving a characteristic deficit in the matter power spectrum. Surveys like Euclid and the Rubin Observatory measure galaxy clustering and weak lensing at unprecedented precision, and machine-learning models trained on simulations are used to extract the neutrino mass signal from those observational maps. The challenge is that this signal is nearly degenerate with other cosmological parameters, including σ8.

Does this failure mode apply beyond cosmology?

Yes. The study explicitly notes that any scientific domain where an AI is pretrained on a "standard model" and then fine-tuned on beyond-standard-model extensions faces the same risk. Particle physics inference at the Large Hadron Collider is a direct analog: networks trained on Standard Model backgrounds must detect new-physics signals that, in the most interesting cases, closely resemble known processes. The Princeton-Flatiron findings define a general diagnostic: wherever parameter degeneracies exist between the pretraining domain and the fine-tuning domain, negative transfer is a structural risk that must be audited before deployment.
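
For readers who want to see the pretrain-then-fine-tune recipe with dummy output nodes in concrete form, the following is a minimal sketch and not the authors' implementation. The 79 input bins and five ΛCDM outputs come from the article; everything else is an assumption made for illustration, including the single hidden layer, its width, the learning rate, the squared-error loss, the step counts, and the synthetic `fake_sample` data that stands in for real simulations.

```python
import math
import random

N_BINS = 79    # input bins, as in the article
N_HIDDEN = 8   # assumed hidden width (illustrative only)
N_LCDM = 5     # matter density, baryon density, Hubble constant, spectral index, sigma_8
N_DUMMY = 1    # one reserved output, e.g. for neutrino mass
N_OUT = N_LCDM + N_DUMMY

def init_net(rng):
    """Small fully connected net: input -> tanh hidden layer -> linear outputs."""
    return {
        "W1": [[rng.gauss(0.0, 0.1) for _ in range(N_BINS)] for _ in range(N_HIDDEN)],
        "W2": [[rng.gauss(0.0, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)],
    }

def forward(net, x):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in net["W1"]]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in net["W2"]]
    return hidden, out

def train_step(net, x, target, n_active, lr=0.01):
    """One SGD step on mean squared error over the first n_active outputs."""
    hidden, out = forward(net, x)
    # Inactive (dummy) outputs get zero error, so they send no gradient.
    err = [(out[j] - target[j]) if j < n_active else 0.0 for j in range(N_OUT)]
    loss = sum(e * e for e in err) / n_active
    # Backpropagate to the hidden layer before touching W2.
    grad_hidden = [
        sum(err[j] * net["W2"][j][k] for j in range(N_OUT)) * (1.0 - hidden[k] ** 2)
        for k in range(N_HIDDEN)
    ]
    scale = 2.0 * lr / n_active
    for j in range(N_OUT):
        for k in range(N_HIDDEN):
            net["W2"][j][k] -= scale * err[j] * hidden[k]
    for k in range(N_HIDDEN):
        for i in range(N_BINS):
            net["W1"][k][i] -= scale * grad_hidden[k] * x[i]
    return loss

def fake_sample(rng, n_params):
    """Synthetic (spectrum, parameters) pair; a placeholder, not simulation data."""
    params = [rng.uniform(-1.0, 1.0) for _ in range(n_params)]
    x = [sum(p * math.sin(0.05 * (i + 1) * (j + 1)) for j, p in enumerate(params)) / n_params
         for i in range(N_BINS)]
    return x, params

rng = random.Random(0)
net = init_net(rng)
dummy_row_before = list(net["W2"][N_LCDM])

# Stage 1: pretrain on many "LCDM" samples; the dummy output is left out of the loss.
for _ in range(500):
    x, params = fake_sample(rng, N_LCDM)
    train_step(net, x, params, n_active=N_LCDM)
dummy_untouched = net["W2"][N_LCDM] == dummy_row_before

# Stage 2: fine-tune every weight on a few "beyond-LCDM" samples, dummy output active.
for _ in range(50):
    x, params = fake_sample(rng, N_OUT)
    train_step(net, x, params, n_active=N_OUT)
dummy_changed = net["W2"][N_LCDM] != dummy_row_before

print("dummy output weights untouched by pretraining:", dummy_untouched)
print("dummy output weights updated by fine-tuning:", dummy_changed)
```

Because the dummy output contributes zero error in the first stage, its output weights stay at their initial values while the shared hidden layer is shaped by the five ΛCDM targets. The second stage starts from those pretrained weights, leaves nothing frozen, and trains the reserved output alongside the others.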

Source: Tech Times
