
Probing CODI’s Latent Reasoning Chain with Logit Lens and Tuned Lens

As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools: flagging when a model's internal computation diverges from its stated reasoning, or enabling early exit once the answer has crystallized. This post explores whether those tools work on CODI's 6 latent steps and what they reveal about its internal computation.

TL;DR

I applied logit lens and tuned lens to probe CODI's latent reasoning chain on GSM8K arithmetic problems.

Key findings:

- The direct-trained tuned lens reveals the final answer at odd latent steps (1, 3, 5), while the plain logit lens cannot. A tuned lens trained on only latents 3 and 5 cannot decode the final answer.
- The final-answer detection rate is highest at even steps (2, 4), consistent with even steps acting as storage.
- Entropy peaks at odd steps 3 and 5, consistent with active computation happening there.
- Training translators directly on CODI latent hidden states surprisingly underperforms training them on text tokens, suggesting the latent vectors remain close to text-token geometry.

Experimental setup

CODI model: I use the publicly available CODI Llama 3.2 1B checkpoint from "Can we interpret latent reasoning using current mechanistic interpretability tools?".

Tuned lens: I used the training code from "Eliciting Latent Predictions from Transformers with the Tuned Lens".

Experiments

The model's intermediate calculations contain the final answer.

PROMPT = "A team starts with 3 members. They recruit 5 new members. Then each current member recruits 2 additional people. How many people are there now on the team? Give the answer only and nothing else."

Answer = 10

This is the same prompt used in the LessWrong post "Can we interpret latent reasoning using current mechanistic interpretability tools?". The figure above shows a tuned lens trained on GSM8K with 500 samples for 3 epochs.

An interesting feature of the tuned lens trained on non-CODI (text) tokens is that, unlike the plain logit lens, it shows the CODI latents storing information about the answer, 10. The 10 appeared to be absent from the even latent vector indices 2, 4, and 6. The plain logit lens, by contrast, only shows the intermediate values 8 and 16 and never the final answer 10.

Tuned lens on CODI latents

Motivated by the strong performance of the direct-trained tuned lens, I trained a second set of translators directly on CODI's latent hidden states, hypothesizing that a lens specialized to latent-space geometry would outperform one trained on text tokens. This, however, was not the case.

- Unlike the plain logit lens, or even the direct-trained tuned lens, the CODI tuned lens always output end-of-CoT tokens for latents 6 and beyond.
- There is a noticeable difference in tuned-lens behavior depending on which latents it was trained on, especially between even and odd latents.

Tuned lens on all CODI latents

Training translators on all CODI latents caused the lens to mirror the final layer, which suggests overfitting.

Tuned lens on CODI even latents (2, 4, 6)

Training translators on the even latents 2, 4, 6 caused the lens to output many more text or "\n" tokens.
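To make the two lens variants concrete, here is a minimal self-contained sketch of the plain logit lens versus a tuned-lens translator. The norm and unembedding modules and dimensions are dummy stand-ins, not the CODI checkpoint or the tuned-lens repo's actual API; with a real Llama checkpoint the corresponding modules would be the model's final norm and `lm_head`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab = 64, 100

# Stand-ins for the model's final norm and unembedding matrix.
final_norm = nn.LayerNorm(d_model)
unembed = nn.Linear(d_model, vocab, bias=False)

def logit_lens(hidden):
    """Plain logit lens: decode an intermediate hidden state by pushing
    it straight through the final norm and unembedding."""
    return unembed(final_norm(hidden))

class Translator(nn.Module):
    """Tuned-lens translator: a learned per-layer affine map applied
    before unembedding, initialized to the identity so training starts
    from the plain logit lens."""
    def __init__(self, d_model):
        super().__init__()
        self.affine = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.affine.weight)
        nn.init.zeros_(self.affine.bias)

    def forward(self, h):
        return h + self.affine(h)

translator = Translator(d_model)
hidden = torch.randn(1, d_model)  # one latent hidden state (dummy here)

plain = logit_lens(hidden)
tuned = logit_lens(translator(hidden))
# At identity initialization the two lenses agree exactly.
assert torch.allclose(plain, tuned)
```

In the tuned-lens setup one such translator is trained per layer (minimizing divergence from the final-layer logits); the experiments above vary only which hidden states supply the training data, i.e. text-token positions versus CODI latent positions.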
The final layer can be ignored, since the tuned lens does not train a translator for the last layer.

Tuned lens on CODI odd latents (1, 3, 5)

Training translators on the odd latents 1, 3, 5 caused the lens to do the opposite of the even-latent lens: it output only numbers, even before the latent reasoning begins, as seen at latent vector index 0. It was able to find the final answer 10, but it could not produce valid outputs for non-latent-reasoning activations, as seen in its failure to fully decode latent vector index 0.

Tuned lens on CODI latents (3, 5)

If the tuned lens is trained on only latents 3 and 5, the 10s do not show up; adding latent 1 (training on 1, 3, 5) lets the lens find the final answer.

Entropy over layers

To explore the differences in output between the lens variants, I looked at the entropy of their predicted distributions.

- The plain logit lens had the highest entropy, dropping only at the very last layers.
- The direct-trained tuned lens had the lowest entropy of the three variants (direct tuned lens, CODI tuned lens, and plain logit lens).
- Interestingly, latents 3 and 5 have the highest entropy across all three lens variants, excluding latent 1.

Intermediate answer detection

Seeing the final answer appear in the tuned lens made me curious: beyond the single top token, how often does the final answer appear in the top-k tokens for each lens variant, and which latents predict it at the highest rates?

- The direct-trained tuned lens did best, with the highest final-answer detection rate.
- Surprisingly, the latents with the highest final-answer detection rate in the top-k were the even latents 2 and 4.
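The two measurements above can be sketched as follows. This is a minimal sketch with dummy logits; `answer_id` is a hypothetical token id for the final answer, not taken from the actual tokenizer.

```python
import torch

torch.manual_seed(0)

def entropy(logits):
    """Shannon entropy (in nats) of the lens's predicted next-token
    distribution; high entropy means a diffuse, uncommitted prediction."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1)

def answer_in_topk(logits, answer_id, k=10):
    """Final-answer detection: does the answer token appear anywhere in
    the lens's top-k tokens, not just at rank 1?"""
    topk = logits.topk(k, dim=-1).indices
    return (topk == answer_id).any(-1)

vocab = 100
logits = torch.randn(6, vocab)  # one row of lens logits per CODI latent step (dummy)
answer_id = 10                  # hypothetical token id of the final answer

H = entropy(logits)                             # entropy per latent step
hits = answer_in_topk(logits, answer_id, k=10)  # detection flag per latent step
# Entropy is bounded above by log(vocab), attained by the uniform distribution.
assert H.max() <= torch.log(torch.tensor(float(vocab))) + 1e-4
```

Averaging `hits` over many GSM8K problems, grouped by latent step, gives the per-step detection rates compared above.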
Discussion

- The final-answer detection rate being highest at latents 2 and 4 is consistent with this LessWrong post.
- The high entropy at latents 3 and 5, together with the tuned lens showing the final answer only at the odd latents 1, 3, 5, is consistent with that post's results: the considerable drop in accuracy from overriding latents 3 and 5 could be explained by the high entropy at those steps.
- The even latents 2 and 4 containing the intermediate answers most often in the top-k can be explained by the Scratchpad Thinking paper's finding that even latent steps store and access numerical information.

Together, these findings provide converging evidence from logit lens and entropy analysis that CODI follows the compute/store alternation proposed in the Scratchpad Thinking paper.

Future work

- Measure the final-answer detection rate for different values of k and see whether latents 2 and 4 still have the highest accuracy.
- Try other methods, such as patchscopes and activation oracles, on the latent reasoning.
- Attempt to create thought anchors that work with CODI-style latent reasoning models.

