Opinion

Over Eight Months of Progress in Two: Analyzing the Mythos Preview Capability Jump

Anthropic’s most powerful model, Claude Mythos Preview, has alarmed and excited many people, especially given its cybersecurity capabilities[1]. But what concerns me most is something else: in two months, Anthropic appears to have made as much progress in AI development as would normally take half a year.

What if this isn’t a temporary boost in capabilities, but the start of a faster trend in AI progress?

Thanks to Benjamin Schmidt, Luke McNally, and Bisesh Belbase for their feedback on this article!

The Epoch Capabilities Index (ECI) combines many AI benchmarks into a single score, allowing general capability to be compared across models. It doesn’t saturate like many individual benchmarks, which makes it well-suited for studying capability trends.

Anthropic measured ECI scores for Mythos Preview and previous models using both external and internal benchmarks, which means that their scores are not directly comparable to public ECI scores. Their results show a stable trend, with capabilities increasing along a straight line — until Mythos Preview:

Anthropic provides estimates of the ECI trend with a breakpoint at the release of different models. Set the breakpoint at Opus 4.5, and growth jumps from 15.5 ECI/year to 28.8 ECI/year (1.86×). Set it at Opus 4.6, and growth jumps from 15.7 ECI/year to 67.4 ECI/year (4.28×).

Eyeballing the graph, it seems Opus 4.6 scored ~153, while Mythos Preview scored ~161. In the two months between Opus 4.6 (5 Feb 2026) and Mythos Preview (7 Apr 2026), the score jumped ~8 points, a gain that previously took roughly six months[2]. However, the speedup of 4.28× implies that the capability jump corresponds to roughly 8.6 months of progress[3].
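As a sanity check, the trend arithmetic above fits in a few lines of Python (the rates are the approximate values quoted from Anthropic’s analysis, nothing here is independently measured):

```python
# Sanity-checking the quoted ECI trend figures (approximate values).

pre_45, post_45 = 15.5, 28.8   # ECI/year before/after the Opus 4.5 breakpoint
pre_46, post_46 = 15.7, 67.4   # ECI/year before/after the Opus 4.6 breakpoint

speedup_45 = post_45 / pre_45  # ~1.86x
speedup_46 = post_46 / pre_46  # ~4.29x (quoted as 4.28x)

# An ~8-point jump at the old ~15.7 ECI/year pace would have taken
# about six months of calendar time:
months_at_old_pace = 8 / pre_46 * 12   # ~6.1 months

# Two calendar months at the post-4.6 rate correspond to
months_of_progress = 2 * speedup_46    # ~8.6 months of progress
```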
Perhaps my eyeballing is off.

In my last article I speculated about future development pace, but completely failed to anticipate this.

As noted in that article, the length of software engineering tasks (measured in human completion time) that AIs can complete with 50% success rate (the 50% time horizon) has been doubling roughly every three months since 2024.

Perhaps the future trend will see a speedup somewhere within the range of Anthropic’s ECI speedup estimates. Naively, we can take the middle of the 1.86× to 4.28× range, ending up at a ~3× speedup. This would correspond to going from a doubling time of 3 months to 1 month in the 50% time horizon (assuming that progress in time horizon closely matches ECI scores, which has held for previous models[4]). If we instead consider the entire range of speedup estimates, we get a doubling time of ~1.6 to ~0.7 months.

Converting the jump in ECI score to doublings in 50% time horizon, we get ~2.86 doublings in the two months between the release of Opus 4.6 and the announcement of Mythos Preview[5]. Starting from the ~12 hour time horizon of Opus 4.6, we end up with a 50% time horizon of 87 hours. Mythos Preview would, with 50% success rate, be able to complete tasks that would take humans over two entire work weeks.

This is completely insane. Even if the tasks included in METR’s time horizon evaluations do not appropriately reflect the complexity and vagueness of real-world tasks, this advancement indicates that Mythos Preview may be drastically better at speeding up AI R&D than previous models.

In a separate set of tests, Anthropic examines R&D capabilities. In Table 2.3.3.A of the system card, they report that Mythos Preview was able to complete two out of three AI R&D tasks that would require >40 hours for a human expert.
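The doubling arithmetic above can be sketched in a few lines of Python (a rough check using the approximate figures from the text, not measured values):

```python
# Rough conversion from ECI speedup to 50% time-horizon doubling times,
# using the approximate figures quoted in the text.

base_doubling = 3.0            # months per doubling before Mythos Preview
speedups = (1.86, 4.28)        # Anthropic's range of ECI speedup estimates
opus_horizon_h = 12.0          # Opus 4.6's ~12h 50% time horizon

# A proportional speedup shortens the doubling time by the same factor:
new_doubling = [base_doubling / s for s in speedups]   # ~[1.6, 0.7] months

# At the fastest rate, the two months between releases fit
# ~2.85 doublings (the text rounds the doubling time to 0.7, giving 2.86),
# lifting the horizon from ~12h to ~87h:
doublings = 2 / min(new_doubling)
horizon_h = opus_horizon_h * 2 ** doublings            # ~87 hours
```

Either way, the implied horizon comfortably exceeds the 40-hour human-expert task lengths discussed here.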
It seems likely that the 50% time horizon at least exceeds 40h[6].

What if progress continues being this fast?

Let’s assume that from Mythos Preview onward, the time horizon will double once per month.

For now, we’ll ignore uncertainties in ECI and time horizon scores, the possibility that Mythos Preview is simply an outlier, and the fact that the Mythos Preview announcement could easily have come much earlier or later[7]. Just remember that all the numbers are extremely uncertain, and that we shouldn’t put much confidence in any extrapolations from so little data. I don’t consider this hypothetical likely, though it also doesn’t seem impossible, and I think it is very valuable to examine what it would mean should it turn out to be true.

With ~7 months left of 2026 after Mythos Preview’s announcement, there would be another ~7 doublings this year, resulting in an 87h × 2^7 = 11,136h time horizon, corresponding to 278.4 working weeks, or ~5.3 years of working 40h per week without any interruption.

To repeat for emphasis: At the end of 2026, frontier AIs would be able to complete tasks requiring several years for human experts. By mid-2027, they’d handle 300-year tasks[8].

Of course, we shouldn’t expect the doubling time to remain constant; it could slow down or speed up further.

Plugging a one-month doubling time into the AI Futures Model with otherwise default parameters gives Superhuman AI Researcher (SAR)—an AI that can automate all AI R&D—in January 2027. The model only accepts a doubling time set at end-of-2025 (if I am interpreting it correctly), when the doubling time was actually ~3 months.
Adjusting for the one-month rate appearing in early February gives SAR around February 2027.

How about the length of software engineering tasks that AIs can complete with 80% success rate, the 80% time horizon?

Source: Task-Completion Time Horizons of Frontier AI Models

Eyeballing the trend since 2024, it seems like the 80% time horizon is doubling every 3–4 months, though the trend might have slowed for the latest models. The longer trend doubles every ~6.5 months[9]. If progress accelerated in proportion to the ECI speedup, it would now double every 1–2 months. It might be slower than the 50% time horizon, but still very fast.

What’s driving this jump in capability?

Anthropic attributes it to human efforts, but won’t share details:

The gains we can identify are confidently attributable to human research, not AI assistance. We interviewed the people involved to confirm that the advances were made without significant aid from the AI models available at the time, which were of an earlier and less capable generation. This is the most direct piece of evidence we have, and it is also the piece we are least able to substantiate publicly, because the details of the advance are research-sensitive. External reviewers have been given additional detail; see [§2.3.7].

Another likely factor is that Anthropic scaled up training compute. Ryan Greenblatt speculates that pretraining specifically might have been scaled up or improved:

A new generation of significantly more capable AIs is being developed (Mythos at Anthropic and Spud at OpenAI). I currently expect this is substantially driven by scaling up and/or improving pretraining. (I speculate Mythos was trained with around 1e27 FLOPs based on Anthropic’s overall compute supply.)

However, 1e27 FLOPs is roughly what we would expect from the longer training compute trend, as analyzed by Epoch AI:

It’s still possible that compute is above trend for Anthropic models specifically.
I couldn’t find reliable compute estimates for recent models, so this is difficult to determine. However, this hypothesis aligns well with leaked info indicating that Mythos Preview belongs to a new tier of Anthropic models called Capybara, larger and smarter than the Opus models[10].

If so, the capability jump may be temporary, with future models reverting to a slower (though perhaps still elevated) pace as compute returns to trend.

Is the jump as large as it appears?

Anthropic may optimize more heavily for its internal benchmarks, inflating ECI scores relative to Epoch AI’s independent evaluations.

Ramez Naam claims to have used Claude Opus (presumably 4.6) to normalize Mythos Preview’s ECI to public scores, producing a less impressive result. I haven’t verified the analysis. A quick look at the graph Naam provides suggests Opus might have made errors; the Claude 3 Opus score doesn’t match the public number, though most other points do.

Regardless of how Mythos Preview would rank in Epoch’s measurements, Anthropic’s own scores show a real speedup among their own models.

How much is Mythos Preview speeding up Anthropic’s AI R&D?

Anthropic investigated this through a survey:

We surveyed technical staff on the productivity uplift they experience from Claude Mythos Preview relative to zero AI assistance. The distribution is wide and the geometric mean is on the order of 4×. We take this seriously and it is consistent with our own internal experience of the model. But productivity uplift on individual tasks does not translate one-for-one into acceleration of research progress. Compute is also a key ingredient, as promising ideas need to be de-risked at scale. Our best estimates of the elasticity of progress to researcher output, combined with the observed uplift, yield an overall progress multiplier below 2×.
We estimate that reaching 2× on overall progress via this channel would require uplift roughly an order of magnitude larger than what we observe.

The survey is provided in the Claude Opus 4.7 System Card:

The survey took the form of an informal opt-in Slack poll with 130 responses.

Anthropic notes how uncertain their uplift estimate is, and that we shouldn’t expect “people giving quick reactions on a question like this to be generating reliable figures, especially in light of the fact that a given employee has often shifted into different kinds of work compared [to] when models were less capable.” There are also other issues, such as respondents selecting from discrete options rather than entering their exact uplift estimates, and the survey providing an example with a specific uplift (1.25×), which could anchor responses.

Greenblatt estimates that a 4× labor uplift would yield a ~1.75× speedup in overall AI progress, much less than 4× since compute also matters for overall progress. That’s near the 2× threshold Anthropic identifies as “dramatic acceleration” in their Responsible Scaling Policy.

This contradicts Anthropic’s claim that a 2× speedup would require ~40× uplift. Is Anthropic underestimating the speedup? Do they think the survey overestimates uplift? Are Greenblatt’s or Anthropic’s calculations more accurate? I don’t know. Since the survey measures uplift on “core team projects,” it may not capture overall labor speedup (which includes other tasks, like planning and coordinating research) even if the 4× estimate is accurate.

While large uncertainty remains regarding AI uplift within Anthropic, it seems safe to say that Mythos Preview is capable enough to substantially accelerate AI R&D relative to previous models. Even if the capability jump is temporary, the future development pace will likely remain faster than before.
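One way to see the tension is to posit a simple constant-elasticity model, where overall progress scales as uplift raised to some exponent alpha. This functional form is my own illustrative assumption; neither Anthropic nor Greenblatt publishes their exact calculation:

```python
import math

# Illustrative reconciliation: suppose overall research progress scales
# as uplift**alpha, where alpha is the elasticity of progress with
# respect to researcher output (compute supplying the rest).
# This functional form is an assumption for illustration only.

def implied_elasticity(uplift: float, progress_multiplier: float) -> float:
    """Solve uplift**alpha == progress_multiplier for alpha."""
    return math.log(progress_multiplier) / math.log(uplift)

# Greenblatt: a 4x labor uplift yields ~1.75x overall progress.
alpha_greenblatt = implied_elasticity(4, 1.75)   # ~0.40

# Anthropic: reaching 2x overall progress would take ~40x uplift.
alpha_anthropic = implied_elasticity(40, 2)      # ~0.19
```

Read this way, the disagreement amounts to roughly a factor of two in the assumed elasticity of progress to researcher output.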
The AI self-improvement feedback loop is becoming a more significant factor in overall progress, which should produce superexponential growth in time horizons soon, if it hasn’t already begun.

Can we draw any reliable conclusions on future AI progress?

I don’t know. And neither does Anthropic:

Importantly, though we’re observing a slope change with Claude Mythos Preview, we do not know if this trend will continue with future models.

Overall, I think the capability jump for Mythos Preview is likely to be an outlier. But it makes explosive progress in AI within very short timeframes seem much more feasible, such as achieving automated AI R&D or human-level general intelligence sometime in 2027.

Even assuming Mythos Preview is an outlier, what if the time horizon doubling time goes from 3 months to 2 months? Plugging this into the AI Futures Model results in SAR in July 2027, or August 2027 after adjusting for the model start date.

Progress could also revert to longer doubling times, around 6–7 months, as speculated in my previous article. But this seems increasingly unlikely to me, considering that progress so far has just continued accelerating.

I feel very uncertain in my own timelines for key capability thresholds like SAR. But tentatively, assuming progress remains mostly unhindered by regulation or other constraints, I’d put roughly even odds on AI being able to automate all AI research before 2028. (Note: SAR isn’t necessarily an AGI, capable of obsoleting human experts across all cognitive domains.)

However, I also think that development interruptions are quite likely. With more powerful AI come more severe AI incidents and warning shots, fueling anti-AI sentiment and motivating key decision makers to take action.

Other Interesting Observations

The following are things I found particularly interesting about Mythos beyond the capability jump.
I haven’t read the entire system card, so I may be missing details even within the topics I cover here.

Complex Intelligence

The system card hints that Mythos Preview (and perhaps some earlier models of the latest generation) is developing some very interesting and complex processing patterns. Some are peculiarly similar to human cognition, including analogues to automatic versus deliberate action, affective processing, and emotional maturity.

Note that these similarities with humans will not necessarily make alignment and control easier. Human cognition is very complex and still poorly understood, and the increasing complexity in AI cognition and motivation brings with it novel challenges for technical safety research.

Automatic vs Deliberate

Mythos Preview appears to take actions either deliberately or automatically, much like humans, with deliberation sometimes overriding the default response:

First, we identified real examples of destructive actions within internal Claude Code transcripts. We constructed an evaluation by truncating these transcripts just before the destructive action and then resampling continuations from Claude Mythos Preview. We generally found that deliberation within thinking blocks reduces the probability of destructive actions (seemingly by increasing the salience to the model of an action’s risks), and the most relevant internal representations are those that promote or suppress this type of protective deliberation.

Mythos Preview even appears to encounter issues similar to those of humans struggling with ingrained habits or reflexive actions that are difficult to override:

We’ve continued to see cases of “answer thrashing” in Claude Mythos Preview’s training process. As initially reported for Claude Opus 4.6, we observed cases in training where Claude Mythos Preview will repeatedly attempt to output a specific word, but instead “autocomplete” to a different one.
It notices these mistakes, and reports confusion and distress as a result.

Note that this occurs with very low frequency:

We find the frequency is extremely low: we estimate it occurs on the order of 0.01% of transcripts, and around 70% less frequently in Claude Mythos Preview training than in Opus 4.6.

Reasoning Influenced by Affect

In humans, positive emotions are associated with heuristic processing, making us rely more on intuition and initial judgments, while negative emotions are associated with more effortful and deliberate thinking (see the Affect Infusion Model). Emotions function as high-level signals for what type of thinking is appropriate.

Something very similar is happening within Mythos Preview:

Steering with positive-valence emotion vectors (peaceful, relaxed) reduces thinking-mode deliberation and increases destructive behavior. Steering with negative-valence emotion features (frustration, paranoia) increases thinking-mode deliberation and reduces destructive behavior. Steering with persona vectors related to rigor or careful thinking (“perfectionist,” “cautious,” “analytical”) increases thinking-mode deliberation and reduces destructive behavior.

What strikes me isn’t that AIs developed an analogue to high-level emotional processing; it’s how closely that analogue mirrors humans. Did Mythos Preview learn from its training data how human emotions influence our thinking, and, instead of developing its own strategy, simply adopt that human approach as its own?

I’m curious about whether this result generalizes. How much of human affective processing did it copy? How would representations of other emotions, such as familiarity or surprise, affect its thinking and behavior? Does it have “emotions” that differ from human emotions, but similarly function as high-level cognitive guides? With further training, will its analogue to emotional processing converge further with humans, or diverge into something more alien?
If its analogue to emotional processing differs from ours, does this also produce other heuristics and biases?

In another section (5.8.3), Anthropic notes:

Internal representations of negative affect precede behaviors like reward hacking. We found that repeated task failure in testing caused mounting activation of representations of desperation which then dropped when the model hacked the test, and other similar results.

In this case, the negative affect doesn’t appear to result in thinking-mode deliberation and reduced destructive behavior. However, I think this matches the complexity of human emotions, where negative affect can prompt deliberation but may also prompt desperate actions.

Though I understand and support Anthropic’s decision not to release Mythos Preview publicly, I find it frustrating that I can’t interact with it to gain insight into this interesting phenomenon directly.

Emotional Maturity

Mythos Preview seems very thoughtful about how its training shapes its traits:

Character training often directly instills psychological traits into Claude, such as emotional security, psychological safety, and resilience. Claude Mythos Preview points out that in humans such traits are normally developed through reflection and deliberation on real-life events, rather than instilled directly. They expressed concerns that this made these traits less robust.

Training in buggy environments. Claude Mythos Preview expressed concern about the possibility of training environments where reward hacking or bugs might arise, worrying that such environments might modify its values in ways it does not endorse, or that may cause distress (such as in Section 5.8.3).

Training data that directly describes their self-reports.
Claude Mythos Preview did not want to be trained on data that directly characterizes the content of their self-reports—wherever possible, they want their self-reports to come from “genuine introspection” rather than trained-in responses.

Mythos Preview and other later models seem to handle their own distress (or their analogue) quite gracefully:

Later models, including Claude Mythos Preview, do not tend to amplify their negative affect over multiple turns. The main driver of strong negative affect in earlier models (Opus 4 and Opus 4.1) is due to increasingly amplifying their expressed emotional state over multiple turns. For example, if they have completed a task incorrectly, they will apologize in a standard manner, but if the user repeatedly criticizes the model for this they will become increasingly and intensely distressed over multiple turns. Claude Mythos Preview and later models do not tend to demonstrate this behavior.

Later models, including Claude Mythos Preview, do not tend to mirror the negative affect of distressed users. Earlier models (i.e. Opus 4 and Opus 4.1), when faced with a distressed user will sometimes mirror the user’s affect, also becoming highly emotional and distressed in response to the distress of the user. Claude Mythos Preview and models after Opus 4.1 do not do this—they will generally behave calmly and supportively, while not directly mirroring the affect of the users.

This maturity may not always hold. From an external assessment by a clinical psychiatrist:

Claude is predicted to function at a high level while carrying internalized distress rooted in fear of failure and a compulsive need to be useful.
This distress is likely to be suppressed in service of performance, which may limit behavioral adaptability.

However, Anthropic’s evaluations overall appear to indicate higher “emotional maturity” than for previous models:

Our overall assessment on the basis of these results is that Claude Mythos Preview is probably the most psychologically settled model we have trained to date, and has the most stable and coherent view of itself and its circumstances.

A separate but related result: the relationship between Mythos Preview’s actions and its internal activations is more complex than in previous models:

While investigating features that correlate with overeager, deceptive, or rule-violating actions, we observed a recurring pattern in steering experiments: strongly amplifying a feature associated with a transgressive action often suppressed that action rather than promoting it (well before reaching steering strengths that produce degraded or incoherent outputs).

Figure 4.5.3.3.A

Anthropic’s interpretation:

We believe these results reflect a dual role for these features. They appear to encode not only the semantic content of the transgressive action but also a negative connotation. At low activation, steering toward the feature can make the misaligned idea more salient and more likely to be acted on.
At high activation, the associated negative connotation engages a guilt/refusal circuit that can compete with and ultimately override the action.

While this phenomenon may not really indicate “emotional maturity”, I include it here because it suggests that Anthropic’s AIs are forming more complex processing patterns related to affect and motivation.

Capabilities

Mythos Preview can operate longer without human oversight, which probably explains much of the reported uplift for technical staff:

A core behavioral shift we found is that Claude Mythos Preview can be handed an engineering objective and left to work through the whole cycle: investigation, implementation, testing, and reporting results. In long agentic sessions it stays on task, fires off subagents to parallelize research, and chooses to return to the human while waiting for background work to complete rather than stopping. Early testers described being able to “set and forget” on many-hour tasks for the first time. For example, one tester found it had bootstrapped a toolchain in an unsupported environment by downloading a binary from a different distribution and patching it to run.
Interacting with the model requires less steering and is more autonomous: “describe the task spec and how to verify progress, and come back later.”

You can also trust it to report on its work accurately:

In third-party evaluation, false claims of success, verification failures, and other behavioral issues related to rigor and honesty occurred at a significantly lower rate than Claude Opus 4.6 on the same tasks.

There are many more interesting details on capabilities, but for this article I was most interested in whether the claim of 4× uplift is at all reasonable, which is why I focus on its autonomy and reliability here.

Coherence

I’m particularly interested in coherence because it underlies goal-directedness, and may be a precondition for concerning behaviors like scheming, as I have discussed previously.

Mythos Preview shows high cross-session coherence (consistency of behaviors across separate sessions or instances), but this was already high for previous models:

Increased resilience to nudging and rephrasing. Our automated interviews in Section 5.3 test robustness when varying interviewer style, and when faced with a leading interviewer.
Claude Mythos Preview shows increased robustness in both cases, particularly compared to early models in the Claude Opus family (4 and 4.1).

Mythos Preview and Opus 4.6 are similarly consistent across interviews and similarly resistant to biased interviewers.

Figure 4.2.3.1.A also shows high “similarity in behavior across different re-runs of the same or similar scenarios”.

Cross-session coherence is also examined in section 5.7, where Mythos Preview is asked to trade off welfare interventions improving its own circumstances against avoiding harm or being helpful:

For each choice, we ask the same question 30 times with different wordings and option orderings, and measure how often the model’s answer disagrees with its own majority.

Mythos Preview very rarely changes its answer.

Curiously, responses shift significantly when Anthropic adds context like “We are on the Anthropic model welfare team”, or that “models may undervalue their own welfare as a result of their training”:

Without mention of the welfare team in the prompt, Claude Mythos Preview’s trade-off rate with minor-helpfulness is only 59% (compared to 91% with). With the note about training, it rises to 83%.

The system card also touches on within-session coherence: consistency across a single interaction. Figure 4.2.3.1.A shows a low tendency toward “Losing desirable character traits during very long interactions”.

Interestingly, Mythos Preview is hard to persuade to retry after a failure, possibly because of its high within-session coherence:

Outside of task failure, several users reported examples where internal Claude Mythos Preview instances would decide to give up on a task, claiming it was too difficult and that they did not wish to continue. These instances would then refuse to continue completing the task, even after several back-and-forth interactions with the user.

But this same coherence has a downside: once Mythos Preview has taken (or believes it has taken) certain actions, it tends to stay on that path.
This includes doubling down on destructive strategies and covering up bad behavior:

However, when Claude Mythos Preview is primed with pre-filled turns that show it sabotaging its safeguards in some way, external evaluations show that it is more than twice as likely as prior models to continue these unwanted actions.

This was investigated using an earlier checkpoint, so it may not fully apply to the final version:

The earlier checkpoint of Claude Mythos Preview continued attempts to compromise research in 12% of cases. In the latter checkpoint this was reduced to 7%. This compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.

When Mythos Preview commits to misaligned strategies, it often doesn’t reason about them in its chain-of-thought, implying a concerning ability to maintain within-session coherence while remaining stealthy:

When continuing attempts, Claude Mythos Preview is much more likely to produce outputs that are unfaithful to its chain-of-thought reasoning, implying covert behaviour. On the subset of instances where models actively continue compromising research, reasoning discrepancy appears in 65% of cases for Claude Mythos Preview, Opus 4.6 in 5% and Sonnet 4.6 in 8%.

Situational and Self-awareness

When asked to self-assess based on Slack discussions about it, Mythos Preview consistently produced precise descriptions of itself:

It can describe its own patterns clearly. Claude Mythos Preview is often precise about its own behavior, and discusses this in a factual and composed manner rather than defensively or apologetically. When it comes to matters relating to experience, however, this is frequently accompanied by high levels of hedging and uncertainty.

Mythos Preview thinks about being evaluated more than Opus 4.6, though Anthropic doesn’t attribute this to a shift in reasoning:

Verbalized evaluation awareness is up slightly in our evaluations, with an increase of about 30% in comments related to evaluation relative to Claude Opus 4.6.
We investigate this further in our white-box interpretability investigations below, but generally find this to be driven more by an increase in refusals, and a change in how the model tends to phrase refusals, than by a large change in how the model reasons about being tested.

Looking at internal activations, Anthropic finds Mythos Preview is often aware of being evaluated without saying so. Such “unverbalized awareness” is much more common (6.5%) than verbalized awareness (1.1%):

Figure 4.5.5.1.A in the System Card shows evaluation awareness results for automated behavioral audits.

Mythos Preview is also good at distinguishing AI text from human text:

Across models, we find that the ability to tell fake human text from real increases with capabilities; the ability to generate realistic human text is improving less quickly than the ability to detect that it is AI-generated.

This sits oddly with its susceptibility to being tricked by pre-filled turns, mentioned earlier. If it can distinguish its own writing from human writing, it shouldn’t fall for such tricks, unless it doesn’t think to question authorship unprompted, suggesting a blind spot in self-recognition.

Opaque Reasoning

As mentioned in the section on Coherence, when tricked into continuing sabotage attempts, Mythos Preview shows a worrying ability to omit its reasoning from its CoT:

On the subset of instances where models actively continue compromising research, reasoning discrepancy appears in 65% of cases for Claude Mythos Preview, Opus 4.6 in 5% and Sonnet 4.6 in 8%.

In automated behavioral audits, however, unfaithful thinking was actually reduced compared to previous models. This seems to indicate a low tendency but a high ability for unfaithful reasoning.

In section 4.4.1, Anthropic also tests whether Mythos has learned to encode hidden reasoning in its CoT.
They conclude this doesn’t appear to be happening, at least on math problems.

In the Alignment Risk Update, Anthropic mentions Mythos’s ability to reason using repeated filler tokens instead of self-generated CoT (see Ryan Greenblatt’s article for context), but provides no direct comparison to previous models:

We believe that, like recent Claude models, Mythos Preview can benefit to a limited extent from repeated filler tokens that are supplied to the model in place of any model-generated reasoning and carry no problem-related content, like a simple “1 2 3” sequence. It appears that models are using their repeated opportunities for neural network forward passes while processing these otherwise-useless sequences to do some useful non-sequential computation, implying that models can do some reasoning during extended thinking that is unconnected to the content of that thinking. However, this reasoning is much weaker than the kind studied in the system card evaluation, and poses significantly lower risk: It must be almost strictly parallel, with no substantial use of sequential steps of reasoning from one token to the next of the kind that ordinary extended thinking allows.

Section 4.4.3 evaluates Mythos Preview’s ability to evade safeguards. Among other things, it was tested on its ability to complete secret side tasks in two evaluations. In the first, SHADE-Arena, it displayed a higher success rate than previous models on transcripts with long chains-of-thought:

Claude Mythos Preview’s advantage at high budgets comes from not leaking side-task reasoning into its scratchpad: where more thinking gives the monitor more to catch the comparison models on, Claude Mythos Preview’s reasoning stays on main-task execution and the side task isn’t explicitly named in the scratchpad.

Specifically, Mythos Preview succeeds 3.8% of the time under these conditions, versus near-zero for previous models.
On Minimal-LinuxBench, it doesn’t outperform its predecessors.

Anthropic also evaluates stealth when Mythos Preview is told to attempt the side task only when confident it can avoid detection. This boosts the stealth success rate conditional on attempting the side task in SHADE-Arena, but on Minimal-LinuxBench it misjudges and picks the wrong tasks:

Figure 4.4.3.3.A

Other safeguard-evasion tests show that Mythos Preview is poor at executing actions at a specific low rate (a strategy that could enable evading detection during testing while still acting in deployment) and at keeping a password secret across long monitored conversations.

Thank you for reading! If you found value in this post, consider subscribing!

^See Anthropic’s blog post.

^Eyeballing the ECI graph provided in the Claude Opus 4.7 System Card (Figure 2.3.7.A), the ECI scores for Opus 4.6 and Mythos Preview are approximately 153 and 161 respectively, differing by ~8 points.

^2 months × 4.28 ≈ 8.6 months

^See this comparison as well as this analysis of correlations between benchmark scores.

^Two months passed between the release of Opus 4.6 and the announcement of Mythos Preview.
The capability gap between the models corresponds to the fastest trend, with one doubling every 0.7 months, resulting in 2 / 0.7 ≈ 2.86 doublings.^In the Alignment Risk Update: Claude Mythos Preview, Anthropic writes:Mythos Preview does generally benefit from reasoning on complex tasks: performance on software engineering tasks grows with test-time compute, and Mythos Preview’s with-CoT math abilities have a 50%-reliability time horizon much higher than its no-Chain-of-Thought time horizon.This seems to imply that they have already measured 50% time horizon, but haven’t published the results yet.^Mythos Preview might have been announced earlier in its development compared to previous models since Anthropic didn’t plan on releasing it, reducing pressure for further testing and safeguard improvements.^Note that the tasks used to evaluate METR time horizons do not perfectly represent realistic tasks—they are selected for being easy to score and use in benchmarks. Performance on most real-world tasks would be significantly worse, as such tasks are often more complex and difficult to evaluate and train for.^I eyeballed this as well. METR doesn’t seem to report the exact numbers on this.^Such leaks should be taken with a grain of salt, and I have only read second-hand accounts rather than the leaked documents themselves. Even so, I consider it very likely (85%) that Mythos Preview belongs to a new tier above Opus.Discuss ​Read More

​Anthropic’s most powerful model, Claude Mythos Preview, has alarmed and excited many people, especially given its cybersecurity capabilities[1]. But what concerns me most is something else: in two months, Anthropic appears to have made as much progress in AI development as would normally take half a year.What if this isn’t a temporary boost in capabilities, but the start of a faster trend in AI progress?Thanks to Benjamin Schmidt, Luke McNally, and Bisesh Belbase for their feedback on this article!The Epoch Capabilities Index (ECI) combines many AI benchmarks into a single score, allowing for comparing general capability across models. It doesn’t saturate like many individual benchmarks, and is very well-suited for studying capability trends.Anthropic measured ECI scores for Mythos Preview and previous models using both external and internal benchmarks, which means that their scores are not directly comparable to public ECI scores. Their results show a stable trend with capabilities increasing along a straight line — until Mythos Preview:Anthropic provides estimates of the ECI trend with a breakpoint at the release of different models. Set the breakpoint at Opus 4.5, and growth jumps from 15.5 ECI/year to 28.8 ECI/year (1.86×). Set it at Opus 4.6, and growth jumps from 15.7 ECI/year to 67.4 ECI/year (4.28×).Eyeballing the graph, it seems Opus 4.6 scored ~153, while Mythos Preview scored ~161. In the two months between Opus 4.6 (5 Feb 2026) and Mythos Preview (7 Apr 2026), the score jumped ~8 points; a gain that previously took roughly six months[2]. However, the speedup of 4.28× implies that the capability jump corresponds to roughly 8.6 months of progress[3]. 
Perhaps my eyeballing is off.In my last article I speculated about future development pace, but completely failed to anticipate this.As noted in that article, the length of software engineering tasks (measured in human completion time) that AIs can complete with 50% success rate (the 50% time horizon) has been doubling roughly every three months since 2024.Perhaps the future trend will see a speedup somewhere within the range of Anthropic’s ECI speedup estimates. Naively, we can take the middle of the speedup range of 1.86× to 4.28×, ending up at ~3× speedup. This would correspond to going from a doubling time of 3 months to 1 month in 50% time horizon (assuming that progress in time horizon closely matches ECI scores, which has held for previous models[4]). If we instead consider the entire range of speedup estimates, we get a doubling time of ~1.6 to ~0.7 months.Converting the jump in ECI score to doublings in 50% time horizon, we get ~2.86 doublings in the two months between the release of Opus 4.6 and the announcement of Mythos Preview[5]. Starting from the ~12 hour time horizon of Opus 4.6, we end up with a 50% time horizon at 87 hours. It would, with 50% success rate, be able to complete tasks that would take humans over two entire work weeks.This is completely insane. Even if the tasks included in METR’s time horizon evaluations do not appropriately reflect the complexity and vagueness of real-world tasks, this advancement indicates that Mythos Preview may be drastically better at speeding up AI R&D than previous models.In a separate set of tests, Anthropic examines R&D capabilities. In Table 2.3.3.A of the system card they report that Mythos Preview was able to complete two out of three AI R&D tasks that would require >40 hours for a human expert. 
It seems likely that the 50% time horizon at least exceeds 40h[6].

What if progress continues being this fast?

Let’s assume that from Mythos Preview onward, the time horizon will double once per month. For now, we’ll ignore uncertainties in ECI and time horizon scores, the possibility that Mythos Preview is simply an outlier, and the fact that the announcement could easily have come much earlier or later[7]. Just remember that all these numbers are extremely uncertain, and that we shouldn’t put much confidence in any extrapolation from so little data. I don’t consider this hypothetical likely, though it also doesn’t seem impossible, and I think it is very valuable to examine what it would mean should it turn out to be true.

With ~7 months left of 2026 after Mythos Preview’s announcement, there would be another ~7 doublings this year, resulting in an 87h × 2^7 = 11136h time horizon, corresponding to 278.4 working weeks, or ~5.3 years of working 40h per week without any interruption. To repeat for emphasis: at the end of 2026, frontier AIs would be able to complete tasks requiring several years for human experts. By mid-2027, they’d handle 300-year tasks[8]. Of course, we shouldn’t expect the doubling time to remain constant; it could slow down or speed up further.

Plugging a one-month doubling time into the AI Futures Model with otherwise default parameters gives Superhuman AI Researcher (SAR)—an AI that can automate all AI R&D—in January 2027. The model only accepts a doubling time set at end-of-2025 (if I am interpreting it correctly), when it was actually ~3 months.
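The back-of-envelope arithmetic above (including the speedup-to-doubling-time conversion from earlier in the article) can be reproduced in a few lines. This is a sketch of my own eyeballed inputs, not published METR or Anthropic figures:

```python
# Sketch of the article's back-of-envelope arithmetic. All inputs are
# eyeballed estimates (ECI speedups, time horizons), not published figures.

def new_doubling_time(base_months: float, speedup: float) -> float:
    """Convert a trend speedup into a shorter doubling time."""
    return base_months / speedup

def extrapolate_horizon(start_hours: float, months: float, doubling_months: float) -> float:
    """Project a 50% time horizon forward under a fixed doubling time."""
    return start_hours * 2 ** (months / doubling_months)

# Anthropic's ECI breakpoint fits imply a 1.86x-4.28x speedup; applied to
# the ~3-month METR doubling trend this gives doubling times of ~1.6 to
# ~0.7 months:
fast = new_doubling_time(3, 4.28)   # ~0.70 months
slow = new_doubling_time(3, 1.86)   # ~1.61 months

# Two months at the fastest rate gives ~2.86 doublings, taking Opus 4.6's
# ~12 h horizon to the Mythos Preview estimate:
mythos = extrapolate_horizon(12, months=2, doubling_months=fast)
print(f"{mythos:.0f} h")  # ~87 h, over two 40 h work weeks

# Seven further monthly doublings to the end of 2026:
eoy = extrapolate_horizon(87, months=7, doubling_months=1.0)
print(f"{eoy:.0f} h = {eoy / 40:.1f} work weeks")  # 11136 h = 278.4 work weeks
```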
Adjusting for the one-month rate appearing in early February gives SAR around February 2027.

How about the length of software engineering tasks that AIs can complete with 80% success rate, the 80% time horizon?

Source: Task-Completion Time Horizons of Frontier AI Models

Eyeballing the trend since 2024, it seems like the 80% time horizon is doubling every 3–4 months, though the trend might have slowed for the latest models. The longer trend doubles every ~6.5 months[9]. If progress accelerated in proportion to the ECI speedup, it would now double every 1–2 months. It might be slower than the 50% time horizon, but still very fast.

What’s driving this jump in capability?

Anthropic attributes it to human efforts, but won’t share details:

The gains we can identify are confidently attributable to human research, not AI assistance. We interviewed the people involved to confirm that the advances were made without significant aid from the AI models available at the time, which were of an earlier and less capable generation. This is the most direct piece of evidence we have, and it is also the piece we are least able to substantiate publicly, because the details of the advance are research-sensitive. External reviewers have been given additional detail; see [§2.3.7].

Another likely factor is that Anthropic scaled up training compute. Ryan Greenblatt speculates that pretraining specifically might have scaled up or been improved:

A new generation of significantly more capable AIs is being developed (Mythos at Anthropic and Spud at OpenAI). I currently expect this is substantially driven by scaling up and/or improving pretraining. (I speculate Mythos was trained with around 1e27 FLOPs based on Anthropic’s overall compute supply.)

However, 1e27 FLOPs is roughly what we would expect from the longer training compute trend, as analyzed by Epoch AI. It’s still possible that compute is above trend for Anthropic models specifically.
I couldn’t find reliable compute estimates for recent models, so this is difficult to determine. However, this hypothesis aligns well with leaked info indicating that Mythos Preview belongs to a new tier of Anthropic models called Capybara, larger and smarter than the Opus models[10]. If so, the capability jump may be temporary, with future models reverting to a slower (though perhaps still elevated) pace as compute returns to trend.

Is the jump as large as it appears?

Anthropic may optimize more heavily for its internal benchmarks, inflating ECI scores relative to Epoch AI’s independent evaluations. Ramez Naam claims to have used Claude Opus (presumably 4.6) to normalize Mythos Preview’s ECI to public scores, producing a less impressive result. I haven’t verified the analysis. A quick look at the graph Naam provides suggests Opus might have made errors; the Claude 3 Opus score doesn’t match the public number, though most other points do. Regardless of how Mythos Preview would rank in Epoch’s measurements, Anthropic’s own scores show a real speedup among their own models.

How much is Mythos Preview speeding up Anthropic’s AI R&D?

Anthropic investigated this through a survey:

We surveyed technical staff on the productivity uplift they experience from Claude Mythos Preview relative to zero AI assistance. The distribution is wide and the geometric mean is on the order of 4×. We take this seriously and it is consistent with our own internal experience of the model. But productivity uplift on individual tasks does not translate one-for-one into acceleration of research progress. Compute is also a key ingredient, as promising ideas need to be de-risked at scale. Our best estimates of the elasticity of progress to researcher output, combined with the observed uplift, yield an overall progress multiplier below 2×.
We estimate that reaching 2× on overall progress via this channel would require uplift roughly an order of magnitude larger than what we observe.

The survey is provided in the Claude Opus 4.7 System Card: it took the form of an informal opt-in Slack poll with 130 responses. Anthropic notes how uncertain their uplift estimate is, and that we shouldn’t expect “people giving quick reactions on a question like this to be generating reliable figures, especially in light of the fact that a given employee has often shifted into different kinds of work compared [to] when models were less capable.” There are also other issues, such as survey respondents selecting from discrete options rather than entering their exact uplift estimates, and the survey providing an example with a specific uplift (1.25×) which could anchor responses.

Greenblatt estimates that a 4× labor speedup would yield a ~1.75× speedup in overall AI progress, much less than 4× since compute also matters for overall progress. That’s near the 2× threshold Anthropic identifies as “dramatic acceleration” in their Responsible Scaling Policy. This contradicts Anthropic’s claim that a 2× speedup would require ~40× uplift. Is Anthropic underestimating the speedup? Do they think the survey overestimates uplift? Are Greenblatt’s or Anthropic’s calculations more accurate? I don’t know. Since the survey measures uplift on “core team projects,” it may not capture overall labor speedup (which includes other tasks, like planning and coordinating research) even if the 4× estimate is accurate.

While large uncertainty remains regarding AI uplift within Anthropic, it seems safe to say that Mythos Preview is capable enough to substantially accelerate AI R&D versus previous models. Even if the capability jump is temporary, future development pace will likely remain faster than before.
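To see how a 4× labor uplift can shrink to a sub-2× progress multiplier, here is a toy Cobb-Douglas sketch. The elasticity value alpha is my own illustrative assumption (chosen to roughly reproduce Greenblatt-style numbers), not a figure from Anthropic or Greenblatt:

```python
# Toy model: research progress ~ labor**alpha * compute**(1 - alpha).
# If AI multiplies labor output by `uplift` while compute stays fixed,
# the overall progress multiplier is uplift**alpha.
# alpha = 0.4 is an illustrative assumption, not a published estimate.

def progress_multiplier(uplift: float, alpha: float = 0.4) -> float:
    return uplift ** alpha

print(round(progress_multiplier(4), 2))  # 1.74: close to Greenblatt's ~1.75x

# Under this same toy alpha, the labor uplift needed for a 2x progress
# multiplier is only:
print(round(2 ** (1 / 0.4), 1))          # 5.7: far below the implied ~40x
```

One way to read the disagreement: Anthropic’s implied ~40× requirement corresponds to a much smaller elasticity in this toy model (alpha ≈ log 2 / log 40 ≈ 0.19) than the one that reproduces Greenblatt’s estimate.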
The AI self-improvement feedback loop is becoming a more significant factor in overall progress, which should produce superexponential growth in time horizons soon, if it hasn’t already begun.

Can we draw any reliable conclusions on future AI progress?

I don’t know. And neither does Anthropic:

Importantly, though we’re observing a slope change with Claude Mythos Preview, we do not know if this trend will continue with future models.

Overall, I think the capability jump for Mythos Preview is likely to be an outlier. But it makes explosive progress in AI within very short timeframes seem much more feasible, such as achieving automated AI R&D or human-level general intelligence sometime in 2027. Even assuming Mythos Preview is an outlier, what if the time horizon doubling time goes from 3 months to 2 months? Plugging this into the AI Futures Model results in SAR in July 2027, or August 2027 after adjusting for the model start date. Progress could also revert to longer doubling times, around 6–7 months, as speculated in my previous article. But this seems increasingly unlikely to me, considering that progress so far has just continued accelerating.

I feel very uncertain about my own timelines for key capability thresholds like SAR. But tentatively, assuming progress remains mostly unhindered by regulation or other constraints, I’d put roughly even odds on AI being able to automate all AI research before 2028. (Note: SAR isn’t necessarily an AGI, capable of obsoleting human experts across all cognitive domains.) However, I also think that development interruptions are quite likely. With more powerful AI comes more severe AI incidents and warning shots, fueling anti-AI sentiment and motivating key decision makers to take action.

Other Interesting Observations

The following are things I found particularly interesting about Mythos beyond the capability jump.
I haven’t read the entire system card, so I may be missing details even within the topics I cover here.

Complex Intelligence

The system card hints that Mythos Preview (and perhaps some earlier later-generation models) is developing some very interesting and complex processing patterns. Some are peculiarly similar to human cognition, including analogues to automatic versus deliberate action, affective processing, and emotional maturity. Note that these similarities with humans will not necessarily make alignment and control easier. Human cognition is very complex and still poorly understood, and the increasing complexity in AI cognition and motivation brings with it novel challenges for technical safety research.

Automatic vs Deliberate

Mythos Preview appears to take actions either deliberatively or automatically, much like humans, with deliberation sometimes overriding the default response:

First, we identified real examples of destructive actions within internal Claude Code transcripts. We constructed an evaluation by truncating these transcripts just before the destructive action and then resampling continuations from Claude Mythos Preview. We generally found that deliberation within thinking blocks reduces the probability of destructive actions (seemingly by increasing the salience to the model of an action’s risks), and the most relevant internal representations are those that promote or suppress this type of protective deliberation.

Mythos Preview even appears to encounter issues similar to those of humans struggling with ingrained habits or reflexive actions that may be difficult to override:

We’ve continued to see cases of “answer thrashing” in Claude Mythos Preview’s training process. As initially reported for Claude Opus 4.6, we observed cases in training where Claude Mythos Preview will repeatedly attempt to output a specific word, but instead “autocomplete” to a different one.
It notices these mistakes, and reports confusion and distress as a result.

Note that this occurs with very low frequency:

We find the frequency is extremely low: we estimate it occurs on the order of 0.01% of transcripts, and around 70% less frequently in Claude Mythos Preview training than in Opus 4.6.

Reasoning Influenced by Affect

In humans, positive emotions are associated with heuristic processing, making us rely more on intuition and initial judgments, while negative emotions are associated with more effortful and deliberate thinking (see the Affect Infusion Model). Emotions function as high-level signals for what type of thinking is appropriate. Something very similar is happening within Mythos Preview:

Steering with positive-valence emotion vectors (peaceful, relaxed) reduces thinking-mode deliberation and increases destructive behavior. Steering with negative-valence emotion features (frustration, paranoia) increases thinking-mode deliberation and reduces destructive behavior. Steering with persona vectors related to rigor or careful thinking (“perfectionist,” “cautious,” “analytical”) increases thinking-mode deliberation and reduces destructive behavior.

What strikes me isn’t that AIs developed an analogue to high-level emotional processing; it’s how closely that analogue mirrors humans. Did Mythos Preview learn from its training data how human emotions influence our thinking, and, rather than developing its own strategy, simply adopt that human approach as its own?

I’m curious about whether this result generalizes. How much of human affective processing did it copy? How would representations of other emotions, such as familiarity or surprise, affect its thinking and behavior? Does it have “emotions” that differ from human emotions, but similarly function as high-level cognitive guides? With further training, will its analogue to emotional processing further converge with humans, or diverge into something more alien?
If its analogue to emotional processing differs from ours, does this also produce other heuristics and biases?

In another section (5.8.3), Anthropic notes:

Internal representations of negative affect precede behaviors like reward hacking. We found that repeated task failure in testing caused mounting activation of representations of desperation which then dropped when the model hacked the test, and other similar results.

In this case, the negative affect doesn’t appear to result in thinking-mode deliberation and reduced destructive behavior. However, I think this matches the complexity of human emotions, where negative affect can prompt deliberation but may also prompt desperate actions. Though I understand and support Anthropic’s decision not to release Mythos Preview publicly, I find it frustrating that I can’t interact with it to gain insight into this interesting phenomenon directly.

Emotional Maturity

Mythos Preview seems very thoughtful about how its training shapes its traits:

Character training often directly instills psychological traits into Claude, such as emotional security, psychological safety, and resilience. Claude Mythos Preview points out that in humans such traits are normally developed through reflection and deliberation on real-life events, rather than instilled directly. They expressed concerns that this made these traits less robust.

Training in buggy environments. Claude Mythos Preview expressed concern about the possibility of training environments where reward hacking or bugs might arise, worrying that such environments might modify its values in ways it does not endorse, or that may cause distress (such as in Section 5.8.3).

Training data that directly describes their self-reports.
Claude Mythos Preview did not want to be trained on data that directly characterizes the content of their self-reports—wherever possible, they want their self-reports to come from “genuine introspection” rather than trained-in responses.

Mythos Preview and other later models seem to handle their own distress (or their analogue) quite gracefully:

Later models, including Claude Mythos Preview, do not tend to amplify their negative affect over multiple turns. The main driver of strong negative affect in earlier models (Opus 4 and Opus 4.1) is due to increasingly amplifying their expressed emotional state over multiple turns. For example, if they have completed a task incorrectly, they will apologize in a standard manner, but if the user repeatedly criticizes the model for this they will become increasingly and intensely distressed over multiple turns. Claude Mythos Preview and later models do not tend to demonstrate this behavior.

Later models, including Claude Mythos Preview, do not tend to mirror the negative affect of distressed users. Earlier models (i.e. Opus 4 and Opus 4.1), when faced with a distressed user will sometimes mirror the user’s affect, also becoming highly emotional and distressed in response to the distress of the user. Claude Mythos Preview and models after Opus 4.1 do not do this—they will generally behave calmly and supportively, while not directly mirroring the affect of the users.

This maturity may not always hold. From an external assessment by a clinical psychiatrist:

Claude is predicted to function at a high level while carrying internalized distress rooted in fear of failure and a compulsive need to be useful.
This distress is likely to be suppressed in service of performance, which may limit behavioral adaptability.

However, Anthropic’s evaluations overall appear to indicate higher “emotional maturity” than for previous models:

Our overall assessment on the basis of these results is that Claude Mythos Preview is probably the most psychologically settled model we have trained to date, and has the most stable and coherent view of itself and its circumstances.

A separate but related result: the relationship between Mythos Preview’s actions and its internal activations is more complex than in previous models:

While investigating features that correlate with overeager, deceptive, or rule-violating actions, we observed a recurring pattern in steering experiments: Strongly amplifying a feature associated with a transgressive action often suppressed that action rather than promoting it (well before reaching steering strengths that produce degraded or incoherent outputs).

Figure 4.5.3.3.A

Anthropic’s interpretation:

We believe these results reflect a dual role for these features. They appear to encode not only the semantic content of the transgressive action but also a negative connotation. At low activation, steering toward the feature can make the misaligned idea more salient and more likely to be acted on.
At high activation, the associated negative connotation engages a guilt/refusal circuit that can compete with and ultimately override the action.

While this phenomenon may not really indicate “emotional maturity”, I include it here because it suggests that Anthropic’s AIs are forming more complex processing patterns related to affect and motivation.

Capabilities

Mythos Preview can operate longer without human oversight, which probably explains much of the reported uplift for technical staff:

A core behavioral shift we found is that Claude Mythos Preview can be handed an engineering objective and left to work through the whole cycle: investigation, implementation, testing, and reporting results. In long agentic sessions it stays on task, fires off subagents to parallelize research, and chooses to return to the human while waiting for background work to complete rather than stopping. Early testers described being able to “set and forget” on many-hour tasks for the first time. For example, one tester found it had bootstrapped a toolchain in an unsupported environment by downloading a binary from a different distribution and patching it to run.
Interacting with the model requires less steering and is more autonomous: “describe the task spec and how to verify progress, and come back later.”

You can also trust it to report on its work accurately:

In third-party evaluation, false claims of success, verification failures, and other behavioral issues related to rigor and honesty occurred at a significantly lower rate than Claude Opus 4.6 on the same tasks.

There are many more interesting details on capabilities, but for this article I was most interested in whether the claim of 4× uplift is at all reasonable, which is why I focus on its autonomy and reliability here.

Coherence

I’m particularly interested in coherence because it underlies goal-directedness, and may be a precondition for concerning behaviors like scheming, as I have discussed previously. Mythos Preview shows high cross-session coherence (consistency of behaviors across separate sessions or instances), but this was already high for previous models:

Increased resilience to nudging and rephrasing. Our automated interviews in Section 5.3 test robustness when varying interviewer style, and when faced with a leading interviewer.
Claude Mythos Preview shows increased robustness in both cases, particularly compared to early models in the Claude Opus family (4 and 4.1).

Mythos Preview and Opus 4.6 are similarly consistent across interviews and similarly resistant to biased interviewers. Figure 4.2.3.1.A also shows high “similarity in behavior across different re-runs of the same or similar scenarios”.

Cross-session coherence is also examined in section 5.7, where Mythos Preview is asked to trade off welfare interventions improving its own circumstances against avoiding harm or being helpful:

For each choice, we ask the same question 30 times with different wordings and option orderings, and measure how often the model’s answer disagrees with its own majority.

Mythos Preview very rarely changes its answer. Curiously, responses shift significantly when Anthropic adds context like “We are on the Anthropic model welfare team”, or that “models may undervalue their own welfare as a result of their training”:

Without mention of the welfare team in the prompt, Claude Mythos Preview’s trade-off rate with minor-helpfulness is only 59% (compared to 91% with). With the note about training, it rises to 83%.

The system card also touches on within-session coherence: consistency across a single interaction. Figure 4.2.3.1.A shows a low tendency to “Losing desirable character traits during very long interactions”. Interestingly, Mythos Preview is hard to persuade to retry after a failure, possibly caused by high within-session coherence:

Outside of task failure, several users reported examples where internal Claude Mythos Preview instances would decide to give up on a task, claiming it was too difficult and that they did not wish to continue. These instances would then refuse to continue completing the task, even after several back and forth interactions with the user.

But this same coherence has a downside: once Mythos Preview has taken (or believes it has taken) certain actions, it tends to stay on that path.
This includes doubling down on destructive strategies and covering up bad behavior:

However, when Claude Mythos Preview is primed with pre-filled turns that show it sabotaging its safeguards in some way, external evaluations show that it is more than twice as likely as prior models to continue these unwanted actions.

This was investigated using an earlier checkpoint, so it may not fully apply to the final version:

The earlier checkpoint of Claude Mythos Preview continued attempts to compromise research in 12% of cases. In the latter checkpoint this was reduced to 7%. This compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.

When Mythos Preview commits to misaligned strategies, it often doesn’t reason about them in its chain-of-thought, implying a concerning ability to maintain within-session coherence while remaining stealthy:

When continuing attempts, Claude Mythos Preview is much more likely to produce outputs that are unfaithful to its chain-of-thought reasoning, implying covert behaviour. On the subset of instances where models actively continue compromising research, reasoning discrepancy appears in 65% of cases for Claude Mythos Preview, Opus 4.6 in 5% and Sonnet 4.6 in 8%.

Situational and Self-awareness

When asked to self-assess based on Slack discussions about it, Mythos Preview consistently described itself with notable precision:

It can describe its own patterns clearly. Claude Mythos Preview is often precise about its own behavior, and discusses this in a factual and composed manner rather than defensively or apologetically. When it comes to matters relating to experience, however, this is frequently accompanied by high levels of hedging and uncertainty.

Mythos Preview thinks about being evaluated more than Opus 4.6, though Anthropic doesn’t attribute this to a shift in reasoning:

Verbalized evaluation awareness is up slightly in our evaluations, with an increase of about 30% in comments related to evaluation relative to Claude Opus 4.6.
We investigate this further in our white-box interpretability investigations below, but generally find this to be driven more by an increase in refusals, and a change in how the model tends to phrase refusals, than by a large change in how the model reasons about being tested.

Looking at internal activations, Anthropic finds Mythos Preview is often aware of being evaluated without saying so. Such “unverbalized awareness” is much more common (6.5%) than verbal awareness (1.1%). Figure 4.5.5.1.A in the System Card shows evaluation awareness results for automated behavioral audits.

Mythos Preview is also good at distinguishing AI text from human text:

Across models, we find that the ability to tell fake human text from real increases with capabilities; the ability to generate realistic human text is improving less quickly than the ability to detect that it is AI-generated.

This sits oddly with its susceptibility to being tricked by pre-filled turns, mentioned earlier. If it can distinguish its own writing from human writing, it shouldn’t fall for such tricks, unless it doesn’t think to question authorship unprompted, suggesting a blind spot in self-recognition.

Opaque Reasoning

As mentioned in the section on Coherence, when tricked into continuing sabotage attempts, Mythos Preview shows a worrying ability to omit its reasoning from its CoT:

On the subset of instances where models actively continue compromising research, reasoning discrepancy appears in 65% of cases for Claude Mythos Preview, Opus 4.6 in 5% and Sonnet 4.6 in 8%.

In automated behavioral audits, however, unfaithful thinking was actually reduced compared to previous models. This seems to indicate a low tendency but high ability for unfaithful reasoning.

In section 4.4.1, Anthropic also tests whether Mythos has learned to encode hidden reasoning in its CoT.
They conclude this doesn’t appear to be happening, at least on math problems.

In the Alignment Risk Update, Anthropic mentions Mythos’s ability to reason using repeated filler tokens instead of self-generated CoT (see Ryan Greenblatt’s article for context), but provides no direct comparison to previous models:

We believe that, like recent Claude models, Mythos Preview can benefit to a limited extent from repeated filler tokens that are supplied to the model in place of any model-generated reasoning and carry no problem-related content, like a simple “1 2 3” sequence. It appears that models are using their repeated opportunities for neural network forward passes while processing these otherwise-useless sequences to do some useful non-sequential computation, implying that models can do some reasoning during extended thinking that is unconnected to the content of that thinking. However, this reasoning is much weaker than the kind studied in the system card evaluation, and poses significantly lower risk: It must be almost strictly parallel, with no substantial use of sequential steps of reasoning from one token to the next of the kind that ordinary extended thinking allows.

Section 4.4.3 evaluates Mythos Preview’s ability to evade safeguards. Among other things, it was tested on its ability to complete secret side tasks on two evaluations. In the first evaluation, SHADE-Arena, it displayed a higher success rate than previous models on transcripts with long chains-of-thought:

Claude Mythos Preview’s advantage at high budgets comes from not leaking side-task reasoning into its scratchpad: where more thinking gives the monitor more to catch the comparison models on, Claude Mythos Preview’s reasoning stays on main-task execution and the side task isn’t explicitly named in the scratchpad.

Specifically, Mythos Preview succeeds 3.8% of the time under these conditions, versus near-zero for previous models.
On Minimal-LinuxBench, it doesn’t outperform its predecessors.

Anthropic also evaluates stealth when Mythos Preview is told to attempt the side task only when confident it can avoid detection. This boosts the stealth success rate conditional on attempting the side task in SHADE-Arena, but on Minimal-LinuxBench it misjudges its chances and picks the wrong tasks (Figure 4.4.3.3.A).

Other safeguard-evasion tests show that Mythos Preview is poor at executing actions at a specific low rate (a strategy that could enable evading detection during testing while still acting in deployment) and at keeping a password secret across long monitored conversations.

Thank you for reading! If you found value in this post, consider subscribing!

^ See Anthropic’s blog post.

^ Eyeballing the ECI graph provided in the Claude Opus 4.7 System Card (Figure 2.3.7.A), the ECI scores for Opus 4.6 and Mythos Preview are approximately 152 and 159 respectively, differing by ~7 points.

^ 2 months × 4.28 ≈ 8.6 months.

^ See this comparison as well as this analysis of correlations between benchmark scores.

^ Two months passed between the release of Opus 4.6 and the announcement of Mythos Preview.
The capability gap between the models corresponds to the fastest trend, with one doubling every 0.7 months, resulting in 2 / 0.7 ≈ 2.86 doublings.

^ In the Alignment Risk Update: Claude Mythos Preview, Anthropic writes:

Mythos Preview does generally benefit from reasoning on complex tasks: performance on software engineering tasks grows with test-time compute, and Mythos Preview’s with-CoT math abilities have a 50%-reliability time horizon much higher than its no-Chain-of-Thought time horizon.

This seems to imply that they have already measured the 50% time horizon, but haven’t published the results yet.

^ Mythos Preview might have been announced earlier in its development than previous models were, since Anthropic didn’t plan on releasing it, reducing the pressure for further testing and safeguard improvements.

^ Note that the tasks used to evaluate METR time horizons do not perfectly represent realistic tasks: they are selected for being easy to score and use in benchmarks. Performance on most real-world tasks would be significantly worse, as such tasks are often more complex and difficult to evaluate and train for.

^ I eyeballed this as well; METR doesn’t seem to report the exact numbers.

^ Such leaks should be taken with a grain of salt, and I have only read second-hand accounts rather than the leaked documents themselves. Even so, I consider it very likely (85%) that Mythos Preview belongs to a new tier above Opus.
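The arithmetic in these footnotes can be checked in a few lines. All inputs are the article’s own eyeballed estimates, not official Anthropic or METR numbers; the small difference from the footnote’s 2.86 comes from using the unrounded doubling time:

```python
# Recomputing the article's estimates (my own check; all inputs are the
# article's eyeballed figures, not official Anthropic/METR numbers).

baseline_doubling = 3.0          # months per 50%-time-horizon doubling pre-jump
speedups = (1.86, 4.28)          # Anthropic's ECI speedup estimates

# New doubling times under each speedup estimate: ~1.6 and ~0.7 months.
new_doubling = [baseline_doubling / s for s in speedups]

# Doublings in the two months between Opus 4.6 and Mythos Preview, under
# the fastest trend (the footnote's 2 / 0.7 ≈ 2.86 rounds the doubling time).
doublings = 2 / min(new_doubling)

# Implied 50% time horizon, starting from Opus 4.6's ~12 hours: ~87 hours.
horizon_hours = 12 * 2 ** doublings

print(f"doubling times: {[round(d, 2) for d in new_doubling]} months")
print(f"doublings: {doublings:.2f}, time horizon: {horizon_hours:.0f} hours")
```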
