Dream, Truth, & Good — AI Alignment Forum

One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers": The "dream machine" layer: LLMs are pre-trained on lots of slop from the internet, which creates an excellent "prior". The "truth machine": LLMs are trained to "reduce hallucinations" in a variety of ways, including RLHF and the more recent reasoning RL. The "good machine": The same

Timaeus in 2024 — AI Alignment Forum

TLDR: We made substantial progress in 2024: We published a series of papers that verify key predictions of Singular Learning Theory (SLT) [1, 2, 3, 4, 5, 6]. We scaled key SLT-derived techniques to models with billions of parameters, eliminating our main concerns around tractability. We have clarified our theory of change and diversified our research portfolio to pay off across a range of different timelines. In 2025, we will accelerate our research

When should we worry about AI power-seeking? — AI Alignment Forum

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.) This is the second essay in a series that I’m calling “How do we solve the alignment problem?”.[1] I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the

How might we safely pass the buck to AI? — AI Alignment Forum

My goal as an AI safety researcher is to put myself out of a job. I don’t worry too much about how planet-sized brains will shape galaxies in 100 years. That’s something for AI systems to figure out. Instead, I worry about safely replacing human researchers with AI agents, at which point human researchers are “obsolete.” The situation is not necessarily fine after human obsolescence; however, the bulk of risks

Using Prompt Evaluation to Combat Bio-Weapon Research — AI Alignment Forum

With many thanks to Sasha Frangulov for comments and editing. Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks which they had constructed. These included benchmarks which aimed to evaluate whether the model could help with the development of Chemical, Biological, Radiological, and Nuclear (CBRN) weapons. They concluded that the model could help experts develop some of these weapons, but

Are SAE features from the Base Model still meaningful to LLaVA? — AI Alignment Forum

Shan Chen, Jack Gallifant, Kuleen Sasse, Danielle Bitterman[1] Please read this as a work in progress that we are sharing as colleagues would at a lab meeting (https://www.bittermanlab.org), to help motivate potential parallel research. TL;DR: Recent work has evaluated the generalizability of Sparse Autoencoder (SAE) features; this study examines their effectiveness in multimodal settings. We evaluate feature extraction using a CIFAR-100-inspired explainable classification task, analyzing the impact of pooling strategies, binarization, and layer selection on
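To make the kind of pipeline the TL;DR describes concrete, here is a minimal sketch (not the authors' code) of pooling token-level SAE activations per image, optionally binarizing them, and fitting a linear probe on a CIFAR-100-style label set. The arrays `sae_acts` and `labels`, the helper names, and the choice of a logistic-regression probe are illustrative assumptions, not details taken from the study.

```python
# Sketch: pool per-token SAE activations, optionally binarize, fit a linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def pool_features(sae_acts: np.ndarray, strategy: str = "mean") -> np.ndarray:
    """Collapse per-token SAE activations (n_images, n_tokens, n_features)
    into one vector per image."""
    if strategy == "mean":
        return sae_acts.mean(axis=1)
    if strategy == "max":
        return sae_acts.max(axis=1)
    raise ValueError(f"unknown pooling strategy: {strategy}")

def binarize(features: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Keep only whether each feature fired, discarding its magnitude."""
    return (features > threshold).astype(np.float32)

# Toy stand-in data: 200 images, 16 tokens each, 512 SAE features
# (in practice these would come from a chosen LLaVA layer).
rng = np.random.default_rng(0)
sae_acts = rng.random((200, 16, 512)).astype(np.float32)
labels = rng.integers(0, 100, size=200)  # CIFAR-100-style class ids

X = binarize(pool_features(sae_acts, "mean"))
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

Mean pooling keeps graded activation strengths across tokens, while binarization tests whether the mere presence of a feature carries the class information; choices along exactly these axes (plus which layer to read from) are what the truncated sentence above refers to.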

Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems — AI Alignment Forum

Consider concepts such as "a vector", "a game-theoretic agent", or "a market". Intuitively, those are "purely theoretical" abstractions: they don't refer to any specific real-world system. Those abstractions would be useful even in universes very different from ours, and reasoning about them doesn't necessarily involve reasoning about our world. Consider concepts such as "a tree", "my friend Alice", or "human governments". Intuitively, those are "real-world" abstractions. While "a tree" bundles

AGI Safety & Alignment @ Google DeepMind is hiring — AI Alignment Forum

The AGI Safety & Alignment Team (ASAT) at Google DeepMind (GDM) is hiring! Please apply to the Research Scientist and Research Engineer roles. Strong software engineers with some ML background should also apply (to the Research Engineer role). Our initial batch of hiring will focus more on engineers, but we expect to continue to use the applications we receive for future hiring this year, which we expect will be more

How Trump’s ‘drill, baby, drill’ pledge is affecting other countries

Navin Singh Khadka, Environment Correspondent, BBC World Service
[Image: Trump has said the US's oil and gas will be sold all over the world]
The UN climate summit in the United Arab Emirates in 2023 ended with a call to "transition away from fossil fuels". It was applauded as a historic milestone in global climate action. Barely a year later, however, there are fears that the global commitment may be losing momentum, as

Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme — AI Alignment Forum

I'm planning to organize a mentorship programme for people who want to become researchers working on the Learning-Theoretic Agenda (LTA). I'm still figuring out the detailed plan, the logistics and the funding, but here's an outline of what it would look like. To express interest, submit this form. I believe that the risk of a global catastrophe due to unaligned artificial superintelligence is the most pressing problem of our time.