www.lesswrong.com | Bookmarks (720)
-
Auditing language models for hidden objectives — LessWrong
Published on March 13, 2025 7:18 PM GMT. We study alignment audits—systematic investigations into whether an AI...
-
Vacuum Decay: Expert Survey Results — LessWrong
Published on March 13, 2025 6:31 PM GMT.
-
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management — LessWrong
Published on March 13, 2025 6:29 PM GMT. We (SaferAI) propose a risk management framework which we...
-
Creating Complex Goals: A Model to Create Autonomous Agents — LessWrong
Published on March 13, 2025 6:17 PM GMT. Why do adults pursue long-term and complex goals? People...
-
Habermas Machine — LessWrong
Published on March 13, 2025 6:16 PM GMT. This post is a distillation of a recent work...
-
The "Reversal Curse": you still aren't antropomorphising enough. — LessWrong
Published on March 13, 2025 10:24 AM GMT. I scrutinise the so-called "reversal curse", wherein LLMs seem...
-
AI #107: The Misplaced Hype Machine — LessWrong
Published on March 13, 2025 2:40 PM GMT. The most hyped event of the week, by far,...
-
Intelsat as a Model for International AGI Governance — LessWrong
Published on March 13, 2025 12:58 PM GMT. If there is an international project to build artificial...
-
Stacity: a Lock-In Risk Benchmark for Large Language Models — LessWrong
Published on March 13, 2025 12:08 PM GMT. Intro: So far we have identified lock-in risk, defined lock-in,...
-
The prospect of accelerated AI safety progress, including philosophical progress — LessWrong
Published on March 13, 2025 10:52 AM GMT. This started life as a reaction to a post...
-
Formalizing Space-Faring Civilizations Saturation concepts and metrics — LessWrong
Published on March 13, 2025 9:40 AM GMT. Crossposted on the EA Forum. Displacement of other Space-Faring Civilizations...
-
Elon Musk May Be Transitioning to Bipolar Type I — LessWrong
Published on March 11, 2025 5:45 PM GMT. Epistemic status: Speculative pattern-matching based on public information. In 2023,...
-
How Language Models Understand Nullability — LessWrong
Published on March 11, 2025 3:57 PM GMT. TL;DR: Large language models have demonstrated an emergent ability...
-
Forethought: a new AI macrostrategy group — LessWrong
Published on March 11, 2025 3:39 PM GMT. Forethought[1] is a new AI macrostrategy research group cofounded by Max...
-
Preparing for the Intelligence Explosion — LessWrong
Published on March 11, 2025 3:38 PM GMT. This is a linkpost for a new paper called Preparing...
-
AI Control May Increase Existential Risk — LessWrong
Published on March 11, 2025 2:30 PM GMT. Epistemic status: The following isn't an airtight argument, but...
-
When is it Better to Train on the Alignment Proxy? — LessWrong
Published on March 11, 2025 1:35 PM GMT. This is a response to Matt's earlier post. If...
-
Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases — LessWrong
Published on March 11, 2025 11:52 AM GMT. TL;DR: We provide some evidence that Claude 3.7 Sonnet...
-
A Hogwarts Guide to Citizenship — LessWrong
Published on March 11, 2025 5:50 AM GMT. Those engaged with questions of how to make the...
-
Cognitive Reframing—How to Overcome Negative Thought Patterns and Behaviors — LessWrong
Published on March 11, 2025 4:56 AM GMT. Cognitive reframing is a powerful psychological technique that encourages...
-
Trojan Sky — LessWrong
Published on March 11, 2025 3:14 AM GMT. You learn the rules as soon as you’re old...
-
Have you actually tried raising the birth rate? — LessWrong
Published on March 10, 2025 6:06 PM GMT. I just saw on Twitter someone claiming that we...
-
Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens — LessWrong
Published on March 10, 2025 4:07 PM GMT. Produced as part of the ML Alignment & Theory...
-
We Have No Plan for Preventing Loss of Control in Open Models — LessWrong
Published on March 10, 2025 3:35 PM GMT. Note: This post is intended to be the first...