Using Base-LCM to Monitor LLMs
Epistemic status: experimental results. This is an exploratory work examining an alternative approach to the interpretation of language models.

Summary

We aim to determine whether the Large Concept Model (LCM) — which predicts sentence embeddings rather than token embeddings — can predict the outputs of an LLM. We compare four architectures; the most efficient model predicts the following paragraphs with a cosine similarity of 0.53.

The code is available here.

Motivation

This work is driven by the need to monitor and understand LLM outputs. By learning to predict an LLM's outputs, we can detect when the model produces harmful outputs and stop it.

A single token contains little information in itself, as its meaning depends heavily on context, whereas a paragraph conveys a more stable meaning. This could be used to detect content that token-level filters would miss.

Generating paragraph by paragraph is faster than token by token, which allows us to identify potentially harmful outputs sooner. We could also combine both methods to make the system even more robust.

This work is a baseline that can serve as a benchmark for embedding-space prediction methods.

Why Large Concept Model?

LLMs predict the next token based on the preceding tokens. The Large Concept Model is a Meta model that uses SONAR, a multilingual and multimodal fixed-size sentence embedding space. This makes it possible to predict an entire paragraph as a single vector, rather than as a sequence of tokens—and therefore a sequence of vectors.

Experimental Setup

Dataset

Each sample in the dataset contains one prompt and up to 19 paragraphs generated by Llama-3.2-3B-Instruct at temperature 0. Each text is then converted into a 1024-dimensional vector using SONAR. To ensure that every sample has the same size, samples with fewer than 19 generated paragraphs are padded with "End of text.", which is embedded by SONAR, as in the original LCM paper; a sketch of this step follows the list below.

The dataset has shape [1 million samples, 20 elements (the prompt plus 19 paragraphs), 1024 dimensions]:

- Training: 998,000 samples
- Validation: 1,000 samples
- Test: 1,000 samples
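As a concrete illustration of this preprocessing, here is a minimal sketch of embedding one sample and padding it to a fixed length. It assumes Meta's sonar package and its TextToEmbeddingModelPipeline; the paragraph splitting and the build_sample helper are simplifications for illustration, not our exact dataset code.

```python
# Minimal sketch: embed a prompt plus its generated paragraphs with SONAR
# and pad to a fixed length, as described above. Assumes Meta's `sonar`
# package; the splitting/padding details are simplified illustrations.
import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

MAX_PARAGRAPHS = 19        # up to 19 generated paragraphs per sample
PAD_TEXT = "End of text."  # padding sentence, embedded like any other text

encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

def build_sample(prompt: str, generation: str) -> torch.Tensor:
    """Return a [20, 1024] tensor: prompt embedding + 19 paragraph embeddings."""
    paragraphs = [p for p in generation.split("\n\n") if p.strip()]
    paragraphs = paragraphs[:MAX_PARAGRAPHS]
    # Pad short generations with the "End of text." sentence, as in the LCM paper.
    paragraphs += [PAD_TEXT] * (MAX_PARAGRAPHS - len(paragraphs))
    texts = [prompt] + paragraphs
    # SONAR maps each text to a fixed-size 1024-dimensional vector.
    return encoder.predict(texts, source_lang="eng_Latn")
```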
Models Tested

- Llama LoRA (0.51 cosine similarity): Llama-3.2-1B fine-tuned with LoRA.
- Base-LCM Re-implemented (0.525 cosine similarity): a from-scratch implementation of the Base-LCM architecture (5 layers, 2048 dimensions).
- Best Base-LCM (0.54 cosine similarity): the original Base-LCM model (24 layers, 1024 dimensions); our current best performer.
- Efficient Base-LCM (0.53 cosine similarity): the original Base-LCM model (2 layers, 512 dimensions); lightweight and highly efficient compared to our best model.

Weights & Biases sweeps were used to tune hyperparameters for the models. We did not have enough resources to train the LCM at its original configuration of 32 layers by 2048 dimensions, so we created lighter models. Meta has not released a pre-trained LCM, only the training code, so we had to train our own models rather than reuse Meta's; a simplified sketch of the architecture is shown below.
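To make the setup concrete, here is a minimal sketch of a Base-LCM-style next-concept predictor: a causal transformer over SONAR vectors, trained with MSE to regress the next paragraph embedding. The BaseLCMSketch module, its hyperparameters, and the input/output projections are illustrative assumptions, not the exact Base-LCM training code.

```python
# Minimal sketch of a Base-LCM-style next-concept predictor: a causal
# transformer over SONAR embeddings trained with MSE on the next embedding.
# Hyperparameters are illustrative, not the exact configurations above.
import torch
import torch.nn as nn

class BaseLCMSketch(nn.Module):
    def __init__(self, sonar_dim=1024, d_model=512, n_layers=2, n_heads=8):
        super().__init__()
        self.pre = nn.Linear(sonar_dim, d_model)   # project SONAR -> model space
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.post = nn.Linear(d_model, sonar_dim)  # project back to SONAR space

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: [batch, seq, 1024] = prompt + paragraph vectors.
        causal = nn.Transformer.generate_square_subsequent_mask(embeddings.size(1))
        h = self.blocks(self.pre(embeddings), mask=causal)
        return self.post(h)  # predicted *next* embedding at each position

# Training step: regress each position's output onto the following embedding.
model = BaseLCMSketch()
batch = torch.randn(4, 20, 1024)   # [samples, prompt + 19 paragraphs, dim]
pred = model(batch[:, :-1])        # predict from all but the last vector
loss = nn.functional.mse_loss(pred, batch[:, 1:])
loss.backward()
```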
Control Experiments

To enable comparison, we ran control experiments with two baselines: Llama-3.2-3B (the model used to generate the dataset) at temperature 1 (Llama_t1), and Gemma-3-270M at temperature 0 (Gemma_t0). These two baselines let us evaluate whether the LCM's predictions are superior to those of standard LLMs. For both models, Instruct variants are used when only the prompt is provided; otherwise, the outputs are incoherent.

Key Findings

Results Table Overview

[Table: Efficient Base-LCM evaluation scores, showing cosine similarity for different input lengths and target paragraph positions]

The table structure:

- Paragraph input column: the context length (prompt only, prompt + 1 paragraph, prompt + 2 paragraphs, etc.)
- Locations of future paragraphs columns: each sub-column corresponds to the position of the predicted paragraph
- Cell values: the cosine similarity (between -1 and 1) between the SONAR embedding of the model's output and that of the original paragraph

Main Results

- The model predicts nearby future paragraphs better than distant ones.
- Using the prompt alone, the model performs surprisingly well: the first paragraph is predicted with a cosine similarity of 0.78, and the average cosine similarity across all predicted paragraphs is 0.53.
- Model size and architecture appear to have little impact on performance: the best model achieves a score of 0.54, whilst a model 48 times smaller achieves 0.53. The small gap between Llama fine-tuned with LoRA (0.51), Base-LCM built from scratch (0.525) and the original LCM (0.53–0.54) suggests that the general approach matters more than the implementation details.

Comparison with control

[Table: Llama_t1 evaluation scores matrix]

[Table: Gemma_t0 evaluation scores matrix]

Base-LCM generally outperforms Llama_t1, which in turn surpasses Gemma_t0. The only exception is the first paragraph generated from the prompt alone: Base-LCM scores 0.78 vs. 0.91 for Llama_t1. This can be explained by the fact that the first paragraph is often a title specified in the prompt; working at the paragraph level with sentence embeddings is likely less effective at remembering such details than a traditional LLM operating at the token level.

[Figure: Comparison between Base-LCM and the control LLMs]

The Base-LCM score remains stable, whereas LLM performance drops when only the prompt and the first paragraph are provided, then improves as more context is added. The stability of the LCM likely reflects training data that let it internalise the structure of the documents; the initial drop followed by an improvement seen with the LLMs suggests an initial difficulty in inferring the document's structure, followed by gradual adaptation.

Limitations

Variable-length outputs: the LCM always generates 19 paragraphs, but the evaluation depends on the target size. If the target is only 5 paragraphs long after the prompt, only the first 5 predicted paragraphs are scored and the remaining 14 are ignored, so the model is not penalised for the quality of the ignored paragraphs. One way to obtain a more realistic score would be to penalise outputs that are longer than the target; however, the standard practice in LLM evaluation is to mask predictions that exceed the target size (sketched below).

Segmentation mismatch: the evaluation assumes that the paragraph boundaries of the reference target and of the produced output line up index by index; in reality they need to match at the level of semantic units (ideas). If an idea is split across two paragraphs, the naive index-wise cosine similarity measure will penalise outputs that are in fact correct.
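To make the scoring and masking discussion concrete, here is a minimal sketch of the index-wise evaluation: cosine similarity at each position, with predictions beyond the target length masked out. The evaluate helper and its length_penalty option are hypothetical; the penalty is one possible implementation of the fix suggested above, not something we used.

```python
# Minimal sketch of the index-wise evaluation discussed above: score each
# predicted paragraph embedding against the reference at the same position,
# masking positions beyond the target length. The `length_penalty` flag is
# a hypothetical implementation of the proposed fix, not our method.
import torch
import torch.nn.functional as F

def evaluate(pred: torch.Tensor, target: torch.Tensor, target_len: int,
             length_penalty: bool = False) -> float:
    """pred: [19, 1024] generated embeddings; target: [19, 1024] references."""
    # Standard practice: only the first `target_len` positions are scored.
    sims = F.cosine_similarity(pred[:target_len], target[:target_len], dim=-1)
    if not length_penalty:
        return sims.mean().item()
    # Alternative: count over-length predictions as zero-similarity positions,
    # so a model that generates past the target is penalised.
    return (sims.sum() / pred.size(0)).item()

# Example: a 5-paragraph target ignores the last 14 generated paragraphs.
pred, target = torch.randn(19, 1024), torch.randn(19, 1024)
print(evaluate(pred, target, target_len=5))
print(evaluate(pred, target, target_len=5, length_penalty=True))
```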
Next Steps

The version of the LCM used here is Base-LCM, but the Meta paper also examines a diffusion-based LCM, which outperforms Base-LCM.

Other promising directions:

- Vary the level of document segmentation (sentences, paragraphs, sections)
- Build a more diverse dataset that includes sources other than documents, to vary the structure of the outputs
- Develop a better evaluation method

What do you think about this approach? Are there other advantages or challenges to working in embedding space that we should consider?