Research Alignment / Evals
published: 2026-05-09
updated: 2026-05-09
Sequence planning shell for AI alignment research, evals, personas, control, and benchmark interpretation.
Sequence: Research Alignment / Evals
Main Ideas And Sequence Order
Blank for collaborative planning.
References
- METR time horizons — METR post on task time horizons and AI autonomy evaluation.
- HCAST: Human-Calibrated Autonomy Software Tasks — Paper describing human-calibrated autonomy tasks and their benchmark construction.
- METR Mythos announcement thread — METR thread announcing or contextualizing Mythos/time-horizon results.
- Alex Albert Mythos / METR post — Lab-side reaction to the METR/Mythos result.
- Florian Brand caveat thread on METR hard-task scarcity — Caveat thread on benchmark sparsity and hard-task scarcity.
- Eric W. Tramel on benchmark usefulness — Thread about what makes benchmarks useful once frontier models saturate easier regions.
- Investigating the consequences of accidentally grading CoT during RL — OpenAI alignment post on reward processes accidentally optimizing chain-of-thought properties.
- How we monitor internal coding agents for misalignment — OpenAI post on monitoring internal coding agents for misalignment signals.
- Alignment Research Blog — OpenAI alignment blog index for adjacent research posts.
- Removing Sandbagging in LLMs by Training with Weak Supervision — Paper on reducing sandbagging behavior with weak-supervision training.
- Natural Language Autoencoders repository — Repository for natural-language autoencoder work on model internals.
- Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations — Transformer Circuits article explaining NLA results.
- Stephen Casper on Natural Language Autoencoders — Thread raising concerns about NLA-style methods and discussing their implications.
- Samuel Marks response on NLA concerns — Response thread discussing interpretation of NLA risks/benefits.
- Ryan Greenblatt training experiments thread — Thread on experiments labs should run to understand training and alignment behavior.
- Palisade self-replication test — Concrete autonomy-risk artifact involving self-replication testing.
- Ryan Greenblatt: My picture of the present in AI — Recent worldview summary on AI capabilities and alignment.
- Ryan Greenblatt: Current AIs seem pretty misaligned to me — Alignment Forum post arguing that current AIs already show meaningful misalignment.
- Ryan Greenblatt: Anthropic repeatedly accidentally trained against the CoT — Post treating accidental training against the CoT as evidence of inadequate lab process.
- Ryan Greenblatt: AIs can now often do massive easy-to-verify SWE tasks — Capabilities update focused on large verifiable software tasks.
- Ryan Greenblatt: How do we more safely defer to AIs? — Post on safer deference to AI systems.
- Redwood Research podcast with Buck Shlegeris and Ryan Greenblatt — Podcast source for Redwood/AI control framing.
- Buck Shlegeris: Announcing ControlConf 2026 — Announcement of a conference centered on AI control.
- ControlConf 2026 — Conference site for AI control work.
- How do LLMs generalize when training is compatible with two off-distribution behaviors? — Post on ambiguous training and off-distribution generalization.
- ASMR-Bench: Auditing for Sabotage in ML Research — Benchmark paper for sabotage auditing in ML research.
- Risk from fitness-seeking AIs: mechanisms and mitigations — Alignment Forum post on fitness-seeking threat models.
- Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers — Post on eliciting forecasts from potentially myopic/fitness-seeking systems.
- Owain Evans homepage — Researcher homepage for Owain Evans.
- Truthful AI — Organization page for Truthful AI.
- Truthful AI hiring page / current research orientation — Hiring page, used here as a signal of the current research priorities at Truthful AI.
- Language models transmit behavioural traits through hidden signals in data — Nature paper on behavioral traits being transmitted through hidden signals in training data.
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models — Paper on monitoring and steering model character traits.
- The Consciousness Cluster — Paper on emergent preferences in models claiming consciousness.
- Activation Oracles — Paper on training/evaluating LLMs as activation explainers.
- Sam Marks: The persona selection model — Alignment Forum post on persona selection as an alignment lens.
- Marius Hobbhahn / Apollo Research — Researcher/org reference for Apollo’s scheming and evaluations work.
- Apollo Research: building a science of scheming — Apollo Research homepage framing scheming as an empirical science.
- Joseph Bloom / Goodfire: Verbalized Eval Awareness Inflates Measured Safety — Goodfire research post on eval awareness and measured safety.
- Nicholas Carlini: Black-Hat LLMs — Security-oriented writing on LLM misuse and offensive capabilities.
- Raluca Ada Popa homepage — Researcher page for AI security, systems security, and cryptography.
- Sella Nevo / RAND-linked TMLR paper — Technical paper linked to AI risk/security work.
- Recursive speakers list — CSV list of Recursive event speakers used to identify relevant thinkers.
- Recursive site — Recursive event/site source.