One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers":
The "dream machine" layer: LLMs are pre-trained on lots of slop from the internet, which creates an excellent "prior".
The "truth machine" layer: LLMs are trained to "reduce hallucinations" in a variety of ways, including RLHF and the more recent reasoning RL.
The "good machine" layer: The same
TL;DR: We made substantial progress in 2024:
We published a series of papers that verify key predictions of Singular Learning Theory (SLT) [1, 2, 3, 4, 5, 6].
We scaled key SLT-derived techniques to models with billions of parameters, eliminating our main concerns around tractability.
We have clarified our theory of change and diversified our research portfolio to pay off across a range of different timelines.
In 2025, we will accelerate our research
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.)
This is the second essay in a series that I’m calling “How do we solve the alignment problem?”.[1]
I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the
My goal as an AI safety researcher is to put myself out of a job.
I don’t worry too much about how planet-sized brains will shape galaxies in 100 years. That’s something for AI systems to figure out.
Instead, I worry about safely replacing human researchers with AI agents, at which point human researchers are “obsolete.” The situation is not necessarily fine after human obsolescence; however, the bulk of risks
With many thanks to Sasha Frangulov for comments and editing.
Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks which they had constructed. These included benchmarks which aimed to evaluate whether the model could help with the development of Chemical, Biological, Radiological, and Nuclear (CBRN) weapons. They concluded that the model could help experts develop some of these weapons, but
Shan Chen, Jack Gallifant, Kuleen Sasse, Danielle Bitterman[1]
Please read this as a work in progress; we are colleagues sharing it at a lab (https://www.bittermanlab.org) meeting to help motivate potential parallel research.
TL;DR:
Recent work has evaluated the generalizability of Sparse Autoencoder (SAE) features; this study examines their effectiveness in multimodal settings.
We evaluate feature extraction using a CIFAR-100-inspired explainable classification task, analyzing the impact of pooling strategies, binarization, and layer selection on
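As a rough illustration of the kind of pipeline this excerpt describes, here is a minimal sketch assuming mean pooling over patch tokens, a fixed firing threshold for binarization, and a logistic-regression probe as the classification readout; the array shapes, threshold value, and random placeholder data are assumptions for illustration, not details from the study:

```python
# Hypothetical sketch: pool SAE activations over image patches, binarize,
# then fit a linear probe on a CIFAR-100-style label set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for SAE feature activations with shape
# (images, patch_tokens, sae_features). In the real setting these would come
# from an SAE applied to a vision model's activations at some chosen layer.
acts = rng.exponential(scale=0.1, size=(1000, 49, 512))
labels = rng.integers(0, 100, size=1000)  # CIFAR-100-style class labels

# Pooling strategy: collapse the patch dimension (mean pooling here; max
# pooling would be the natural alternative to compare).
pooled = acts.mean(axis=1)  # (images, sae_features)

# Binarization: keep only whether each feature fired above a threshold.
binary = (pooled > 0.05).astype(np.float32)

# Linear probe on the binarized features.
X_tr, X_te, y_tr, y_te = train_test_split(binary, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```

Comparing pooling strategies or layers then amounts to re-running this readout with the pooling function or the source layer swapped out.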
Consider concepts such as "a vector", "a game-theoretic agent", or "a market". Intuitively, those are "purely theoretical" abstractions: they don't refer to any specific real-world system. Those abstractions would be useful even in universes very different from ours, and reasoning about them doesn't necessarily involve reasoning about our world.
Consider concepts such as "a tree", "my friend Alice", or "human governments". Intuitively, those are "real-world" abstractions. While "a tree" bundles
The AGI Safety & Alignment Team (ASAT) at Google DeepMind (GDM) is hiring! Please apply to the Research Scientist and Research Engineer roles. Strong software engineers with some ML background should also apply (to the Research Engineer role). Our initial batch of hiring will focus more on hiring engineers, but we expect to continue to use the applications we receive for future hiring this year, which we expect will be more
Navin Singh Khadka, Environment Correspondent, BBC World Service
[Image: Getty Images. Caption: Trump has said the US's oil and gas will be sold all over the world]
The UN climate summit in the United Arab Emirates in 2023 ended with a call to "transition away from fossil fuels". It was applauded as a historic milestone in global climate action. Barely a year later, however, there are fears that the global commitment may be losing momentum, as
I'm planning to organize a mentorship programme for people who want to become researchers working on the Learning-Theoretic Agenda (LTA). I'm still figuring out the detailed plan, the logistics, and the funding, but here's an outline of what it would look like. To express interest, submit this form.
I believe that the risk of a global catastrophe due to unaligned artificial superintelligence is the most pressing problem of our time.