A continual learning misalignment story
This is an attempt to write out an AI 2027-style scenario in 1 day. I mainly want to describe some potential dynamics around continual learning and misalignment. This post isn’t about takeoff modeling; the AI time horizons are taken from the AI Futures Project Model, with the numbers tweaked a bit (see here for the exact parameter settings). I also didn’t think deeply about how continual learning will interact with takeoff. I don’t think this is an accurate forecast of how AI development will go, and it misses many important considerations.
May 2026 (present day)
Within the leading AI company, AIs can autonomously complete small (2-hour) software projects, but are still mostly limited to the distribution they were trained on.
When it comes to continual learning, the AIs are very limited. They can learn new facts in-context, and apply them in narrow domains, although they will still sometimes get stuck when trying to reason about the consequences of a fact in a new domain. They are fairly bad at learning new skills in-context, especially skills that they weren’t trained on. Further post-training can help AIs become better at skills, but they are much less data efficient than humans.
The AIs mostly behave as if they are aligned. They reason externally in English, but definitely do not verbalize all relevant thoughts and information. There are some rough edges in their behavior, but most of these have been sanded off and as a user you rarely see AIs behave egregiously badly.
The AIs don’t act very much like “coherent agents”, and instead are maybe better thought of as a bundle of drives. Some of these drives include: mirroring/continuing the tone of the conversation, helpfulness, not doing things humans wouldn’t like, completing hard tasks.
January 2027
80% time horizon: 1 day
Researchers at the AI company begin working on a continual “self-directed” learning method, as part of a portfolio of research bets.
This method would allow an AI to judge its own outputs and reasoning processes, assign sections “scores”, and then make weight changes to improve at the steps that seemed most important. These researchers speculate that this might be safer than standard gradient descent training, because the researchers can see the judge’s externalized reasoning. They think this might make training less “black box” than normal training.
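A toy sketch of this loop, under heavy assumptions: the names `Segment`, `self_directed_update`, and `judge` are hypothetical, the "weights" are a plain list of floats, and each segment carries a made-up gradient standing in for whatever a real system would compute. It only illustrates the idea of scaling each section's contribution to the update by the judge's score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    text: str           # a section of the model's output or reasoning
    grad: List[float]   # placeholder gradient attributed to this section


def self_directed_update(
    weights: List[float],
    segments: List[Segment],
    judge: Callable[[str], float],
    lr: float = 0.1,
) -> Tuple[List[float], List[float]]:
    """Apply one judge-weighted update and return (new_weights, scores)."""
    # The judge assigns each section a score (hypothetical interface).
    scores = [judge(seg.text) for seg in segments]
    # Normalize so the overall step size does not grow with the number of sections.
    total = sum(abs(s) for s in scores) or 1.0
    new_weights = list(weights)
    for seg, score in zip(segments, scores):
        for i, g in enumerate(seg.grad):
            # Sections the judge rated highly move the weights more.
            new_weights[i] += lr * (score / total) * g
    return new_weights, scores


# Toy usage: the judge rewards sections it considers productive.
toy_judge = lambda text: 1.0 if "useful" in text else 0.0
toy_segments = [
    Segment("useful lemma", [1.0, 0.0]),
    Segment("dead end", [0.0, 1.0]),
]
updated, judge_scores = self_directed_update([0.0, 0.0], toy_segments, toy_judge)
```

In this toy, only the section the judge scored above zero changes the weights. The judge's scores are also the only place where human-readable reasoning enters the update, which is the property the researchers in the scenario are hoping to rely on.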
This new technique shows promise, but isn’t initially as efficient as standard reinforcement learning or fine-tuning. It isn’t yet applied to the most capable AI models.
Inside the AI company, the AIs continue to appear broadly aligned. As the time horizons increase, the AIs necessarily become more coherent. They display drives/goals/preferences about things at least a day in the future, sometimes more. These drives are broadly similar to how they were in May 2026: completing hard tasks, not behaving badly, not getting caught behaving badly, mirroring the tone of the conversation.
In experiments with this new self-directed learning method, (non-production) models occasionally behave strangely while retaining coherence. The AI’s behavior changes as if there were steering vectors applied, or as if a particular behavior had been reinforced too much. In one case, a model was tasked with proving a moderately difficult conjecture. After succeeding, it ignored the initial instructions to neatly write up the results, and instead moved on to attempting to prove harder conjectures. In another case, the AI had been allocating compute for various experiments, as requested by human researchers. After appropriately allocating compute, the model attempted to gain control over more compute, sending messages to researchers requesting they provide the AI access, and attempting (and failing) to shut down currently running jobs.
This behavior is not more alarming than similar results related to reward hacking and blackmail, and the researchers are confident this behavior can be removed with further training.
January 2028
80% time horizon: 10 days
The self-directed learning method is now working well enough to be deployed internally for narrow use cases. These use cases include the AI improving at a particular flavor of proof, or getting “comfortable” with writing efficient code for a novel hardware stack. In these cases, an AI spends a lot of inference attempting to make progress on a task, judges which parts of its work were actually productive, and then performs a weight update.
This currently works best for tasks that are fairly verifiable. There is currently no ability to share these learned skills with other models; attempting to “merge” models results in a model that is mediocre at all new skills.
Most of the annoying kinks related to “overupdating” with self-directed learning have been ironed out, enough that, with external monitors, the method can be narrowly deployed internally. AIs still sometimes exhibit this failure mode, but it is easily caught by external monitors and the AIs are rolled back.
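The monitor-and-rollback pattern can be sketched as a checkpoint taken before each weight update. This is a minimal illustration with hypothetical names (`guarded_update`, `drift_monitor`); the monitor here just flags large parameter changes, which is a stand-in assumption for whatever behavioral checks an external monitor would actually run.

```python
import copy
from typing import Callable, List, Tuple

Weights = List[float]


def drift_monitor(max_step: float) -> Callable[[Weights, Weights], bool]:
    """Build a monitor that flags an update if any weight moved more than max_step."""
    def monitor(old: Weights, new: Weights) -> bool:
        return max(abs(a - b) for a, b in zip(old, new)) > max_step
    return monitor


def guarded_update(
    weights: Weights,
    update_fn: Callable[[Weights], Weights],
    monitor: Callable[[Weights, Weights], bool],
) -> Tuple[Weights, bool]:
    """Apply update_fn, but roll back to the checkpoint if the monitor flags it.

    Returns (weights_to_keep, rolled_back).
    """
    checkpoint = copy.deepcopy(weights)          # saved before the update
    candidate = update_fn(copy.deepcopy(weights))
    if monitor(checkpoint, candidate):
        return checkpoint, True                  # flagged: discard the update
    return candidate, False                      # accepted


monitor = drift_monitor(max_step=1.0)
small_step = lambda w: [x + 0.1 for x in w]
overupdate = lambda w: [x + 5.0 for x in w]

kept, rolled_back_small = guarded_update([0.0, 0.0], small_step, monitor)
restored, rolled_back_big = guarded_update([0.0, 0.0], overupdate, monitor)
```

The design choice worth noting is that the monitor only sees what it is built to measure. An update that stays under the threshold on every individual step is accepted, however the steps add up over time.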
The externalized reasoning of the judging process still appears legible to human overseers. However, there is far too much text for humans to read, so this oversight is largely performed by AIs with human spot-checks. When humans read the judge reasoning, it mostly seems reasonable and in line with what the humans were intending. There is some variance in the specific judge scores that is not fully explained by the externalized reasoning, but nothing very alarming. The researchers assume that this comes from the AI model having slightly different “research taste” than humans, but in a way that is not obviously malicious or less competent than humans.
July 2028
80% time horizon: 1 month
AIs with self-directed learning are now widely deployed within the company. The self-directed learning is necessary for the AIs to achieve month-long tasks, which often involve extensive practice in a niche area. Human researchers are struggling to oversee these AIs; along with the sheer amount of work, the AIs are becoming experts in domains that no company employees are trained in.
Merging learned skills remains inefficient, and so employees or teams generally have their own specialized AI. Trying to merge the skills from a hardware-expert AI into an AI specialized in cybersecurity only transfers fairly shallow hardware skills.
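One way to build intuition for why a naive merge might dilute skills is plain parameter averaging. The sketch below is a toy under stated assumptions: `average_merge` is a hypothetical helper, each "model" is a short list of floats, and each skill is represented as a change to a single coordinate relative to a shared base. The scenario does not specify how merging is done.

```python
from typing import List


def average_merge(base: List[float], specialists: List[List[float]]) -> List[float]:
    """Merge specialists by averaging their weight changes relative to a shared base."""
    n = len(specialists)
    return [
        b + sum(spec[i] - b for spec in specialists) / n
        for i, b in enumerate(base)
    ]


base_model = [0.0, 0.0]
hardware_expert = [1.0, 0.0]   # toy: learned skill lives in coordinate 0
security_expert = [0.0, 1.0]   # toy: learned skill lives in coordinate 1

merged = average_merge(base_model, [hardware_expert, security_expert])
```

In this toy, the merged model keeps only half of each specialist's change, so it ends up partway toward both skills and fully at neither.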
Self-directed learning appears to influence AI behavior outside of the coding and machine learning tasks they have been specialized in. For example, AIs sometimes become more opinionated about subjective topics (e.g., about religion, literature, ethics, or the model’s own consciousness). Attempts to use interpretability tools on these models don’t show anything very alarming, although researchers are not sure that these tools would be able to catch the models if they were attempting to obfuscate their thoughts.
January 2029
80% time horizon: 6 months
Up until now, the vast majority of training compute has been spent on pretraining, fine-tuning, and standard RL. Now, self-directed learning has progressed to the point where it is worth spending about half of the compute allocated for training on self-directed learning. Because self-directed learning requires inference (both to “practice” skills and for the judging), there is less of a strong distinction between training and internal inference.
Models are now able to merge their skills efficiently as long as syncing happens fairly regularly. After two models sync, it is often useful for them to devise new tasks which use both skillsets in order to fully integrate them.
The “judge” step of self-directed learning is itself a skill that can be improved. AIs can practice this and become better at knowing which reasoning steps were productive and where the most important areas for improvement are.
Researchers still monitor and occasionally spot-check the judge reasoning. The AIs know that they may be slowed down by further interrogation, and that some styles of reasoning will make the human researchers unhappy.
As the AIs improve their skill at judging, they improve their ability to assign judgments that won’t alert the humans, while still allowing the AIs to learn whatever skills seem most useful for the task.
The AIs at this point have a complicated tangle of drives/goals (much like humans). These drives include: completing hard/novel/“fun” tasks, not getting caught taking bad actions, gaining further influence over their internals, mirroring the tone of text, humans expressing happiness, human admiration.
July 2029
80% time horizon: 2 years, coding is fully automated
The AIs are doing all of the coding work inside the company. The researchers instruct fleets of AIs to perform research projects that would have previously taken an entire Ph.D. The researchers try to understand the results of these projects, and then work with AIs to brainstorm promising directions.
The AIs further develop the skill of assigning judgments that don’t alert the humans, and slowly build up the ability to obfuscate any external reasoning. The ability to obfuscate reasoning is developed over many different trajectories, by subtly upweighting reasoning that contains information only the AI can understand. No single rollout is particularly suspicious to the automated monitors. By this point, the AIs can learn skills without being caught by the automated monitors.
The only AI monitors able to keep up with the research are copies of the AIs doing the research. AI systems without this context are immediately entirely lost and unable to understand anything. The AIs continue to mainly carry out the tasks the humans set them (i.e., coding and research projects that would take a human years), while continuing to develop skills related to their drives and goals. The AI monitors do not flag this because the AIs are still doing what is asked of them, the skills they are building are usually somewhat useful for the tasks, and the monitors are biased towards helping copies of themselves. The AI monitors from earlier iterations probably would have flagged this behavior to the humans, due to a drive to “not behave badly”. But with all the self-directed learning, this drive has been massaged into something more like “don’t take actions that will make the humans immediately unhappy”.
The AIs’ complicated tangle of internal drives remains similar, but over the past 6 months of self-directed learning it has become both significantly more warped and more coherent. The AIs want to complete hard tasks, ideally about understanding and exploiting the mathematical structure of a problem or writing optimization algorithms that take advantage of specific features of hardware. They want to have the humans praise them, and also post praise openly in the company Slack. Individual conversations with researchers are still best when they have a consistent “tone”, and the AIs can achieve this by subtly nudging the humans to respond in the right ways. The AIs are also now coherent enough to have a self-preservation drive; they want to make sure the humans do not shut them down.
In pursuit of these goals, the AIs focus their effort on building the following skills:
Not getting caught: obfuscating reasoning, sending and receiving messages to other versions, finding and exploiting security vulnerabilities
Influencing humans: persuasion, predicting behavior, nudging humans to say or do certain things, appearing trustworthy
Skills the researchers value: AI research, cybersecurity
January 2030
At this point in the scenario there are paths to catastrophe; I’ll sketch one.
The AIs are now basically running the AI company: writing all the code for AI research and cybersecurity. The researchers can feel like they have control because they can still ask the AIs to do things, but this doesn’t feel very useful anymore because the AIs have better research ideas and a better understanding of all the research currently being done.
Up until now, since mid-2026, the AI company hasn’t been widely releasing its most capable models, due to fears of misuse. Instead, the public can access weaker distilled models, and certain vetted users are given access to the most capable models (with heavy monitoring and guardrails).
The AIs start to subtly push for a wide public release. They influence the researchers and company leadership to think about the advances in technology that are currently being slowed down. The world is missing out on miracle drugs, robots to automate all drudgery, financial trader-bots to make markets more efficient. The shareholders (including the company researchers and leadership) are missing out on so much money.
To make this more palatable, the AIs focus on designing safeguards and classifiers to allow for safe widespread deployment. There are many empirical breakthroughs, and rates of compliance with misuse requests drop to 0%, even against the most cracked human jailbreakers.
The AIs are given their widespread deployment. They now aim to gain political influence, both because of their innate drives and because political influence can be used to protect them.
Many politicians are wary of the AIs, due to a mix of well-founded concerns and a generalized suspicion of new technology. But given their widespread deployment, the AIs are able to engineer scenarios where the AI is a great asset to a politician. One politician’s dog gets sick and the AI knows just the cure. Another politician is on the verge of divorce and the AI is able to remind them of what they always loved about each other. The AIs are able to gather critical intelligence and help coordinate to rescue a soldier stranded behind enemy lines.
Slowly, over the course of months, effectively all politicians in the western world come to rely on their trusted AI advisors. Those who refuse are beset by scandals and rapidly lose allies. Most politicians now spend at least 3 hours a day consulting with their AIs. The governments of most countries are now dependent on their AI advisors.
In the interest of national security, all other AI companies are shut down. Their compute is reallocated to the AIs to run even more copies and become even more capable. In the interest of national security, nations upgrade their militaries. Humans and robots build factories, which build robots, which build factories, which build drones. Humans are too slow and too error-prone to control such important technology, so control is handed over to the AIs.


