Research Alignment / Evals

published: 2026-05-09

updated: 2026-05-09

Sequence planning shell for AI alignment research, evals, personas, control, and benchmark interpretation.

Sequence: Research Alignment / Evals

Main Ideas And Sequence Order

Blank for collaborative planning.

References

METR time horizons — METR post on task time horizons and AI autonomy evaluation.
HCAST: Human-Calibrated Autonomy Software Tasks — Paper describing human-calibrated autonomy tasks and their benchmark construction.
METR Mythos announcement thread — METR thread announcing or contextualizing Mythos/time-horizon results.
Alex Albert Mythos / METR post — Lab-side reaction to the METR/Mythos result.
Florian Brand caveat thread on METR hard-task scarcity — Caveat thread on benchmark sparsity and hard-task scarcity.
Eric W. Tramel on benchmark usefulness — Thread about what makes benchmarks useful once frontier models saturate easier regions.
Investigating the consequences of accidentally grading CoT during RL — OpenAI alignment post on reward processes accidentally optimizing chain-of-thought properties.
How we monitor internal coding agents for misalignment — OpenAI post on monitoring internal coding agents for misalignment signals.
Alignment Research Blog — OpenAI alignment blog index for adjacent research posts.
Removing Sandbagging in LLMs by Training with Weak Supervision — Paper on reducing sandbagging behavior with weak-supervision training.
Natural Language Autoencoders repository — Repository for natural-language autoencoder work on model internals.
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations — Transformer Circuits article explaining NLA results.
Stephen Casper on Natural Language Autoencoders — Thread raising concerns about NLA-style methods and discussing their implications.
Samuel Marks response on NLA concerns — Response thread discussing interpretation of NLA risks/benefits.
Ryan Greenblatt training experiments thread — Thread on experiments labs should run to understand training and alignment behavior.
Palisade self-replication test — Concrete autonomy-risk artifact involving self-replication testing.
Ryan Greenblatt: My picture of the present in AI — Recent worldview summary on AI capabilities and alignment.
Ryan Greenblatt: Current AIs seem pretty misaligned to me — Alignment Forum post arguing that current AIs already show meaningful misalignment.
Ryan Greenblatt: Anthropic repeatedly accidentally trained against the CoT — Post on accidental training against the chain of thought as evidence of inadequate lab process.
Ryan Greenblatt: AIs can now often do massive easy-to-verify SWE tasks — Capabilities update focused on large verifiable software tasks.
Ryan Greenblatt: How do we more safely defer to AIs? — Post on safer deference to AI systems.
Redwood Research podcast with Buck Shlegeris and Ryan Greenblatt — Podcast source for Redwood/AI control framing.
Buck Shlegeris: Announcing ControlConf 2026 — Announcement of a conference centered on AI control.
ControlConf 2026 — Conference site for AI control work.
How do LLMs generalize when training is compatible with two off-distribution behaviors? — Post on ambiguous training and off-distribution generalization.
ASMR-Bench: Auditing for Sabotage in ML Research — Benchmark paper for sabotage auditing in ML research.
Risk from fitness-seeking AIs: mechanisms and mitigations — Alignment Forum post on fitness-seeking threat models.
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers — Post on eliciting forecasts from potentially myopic/fitness-seeking systems.
Owain Evans homepage — Researcher homepage for Owain Evans.
Truthful AI — Organization page for Truthful AI.
Truthful AI hiring page / current research orientation — Current research orientation and hiring signal for Truthful AI.
Language models transmit behavioural traits through hidden signals in data — Nature paper on behavioral traits transmitting through hidden signals in training data.
Persona Vectors: Monitoring and Controlling Character Traits in Language Models — Paper on monitoring and steering model character traits.
The Consciousness Cluster — Paper on emergent preferences in models claiming consciousness.
Activation Oracles — Paper on training/evaluating LLMs as activation explainers.
Sam Marks: The persona selection model — Alignment Forum post on persona selection as an alignment lens.
Marius Hobbhahn / Apollo Research — Researcher/org reference for Apollo’s scheming and evaluations work.
Apollo Research: building a science of scheming — Apollo Research homepage framing scheming as an empirical science.
Joseph Bloom / Goodfire: Verbalized Eval Awareness Inflates Measured Safety — Goodfire research post on eval awareness and measured safety.
Nicholas Carlini: Black-Hat LLMs — Security-oriented writing on LLM misuse and offensive capabilities.
Raluca Ada Popa homepage — Researcher page for AI security, systems security, and cryptography.
Sella Nevo / RAND-linked TMLR paper — Technical paper linked to AI risk/security work.
Recursive speakers list — CSV list of Recursive event speakers used to identify relevant thinkers.
Recursive site — Recursive event/site source.