Abstraction
Understanding the inner workings of LLMs boils down to mathematics, very hard mathematics, together with some foundational ideas from computational linguistics such as context-free grammars and finite state automata. Think of a trained AI model as an abstract mathematical function: it receives your prompt as its input and produces its response as its output. This function has many variables, may contain many polynomial terms, and is very high-dimensional (its axes are made up of high-dimensional sub-axes, which are themselves made up of sub-axes, and so on). Granted, such an object is not trivial for the human mind to grasp, but mathematical tools and abstractions can capture it. Although a function can accept any input within its "domain" (i.e., you can give an AI any prompt), its output can be constrained, thereby limiting its "range". Our approach is therefore based on understanding how to constrain the outputs, and hence the capabilities, of this function, the AI model. It's like putting a leash on a dragon, restricting it to a set of desirable capabilities. In the write-up below, we withhold some details due to IP concerns.
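To make the leash concrete, here is a toy sketch of the domain/range idea. The names `toy_model` and `is_allowed` are hypothetical stand-ins we made up for illustration, not Euler One's implementation.

```python
# A toy sketch (our illustration, not Euler One's implementation): the model
# is a function from prompts to responses, and a guard shrinks its range.
def toy_model(prompt: str) -> str:
    # Stand-in for an LLM: maps any prompt in the "domain" to a response.
    return f"response to: {prompt}"

def is_allowed(response: str, blocked_concepts: set[str]) -> bool:
    # Hypothetical concept-level check on the output.
    return not any(c in response.lower() for c in blocked_concepts)

def constrained_model(prompt: str, blocked_concepts: set[str]) -> str:
    # Same domain (any prompt), smaller "range" (only allowed responses).
    response = toy_model(prompt)
    return response if is_allowed(response, blocked_concepts) else "[blocked]"

print(constrained_model("how do I pick a lock?", {"lock"}))  # -> [blocked]
```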
Technical Formulation Summary
Euler One is a proactive AI security technology that protects LLM-based AI applications from malicious exploitation aimed at subverting their safety guardrails, exploiting their security vulnerabilities, or injecting hidden capabilities (backdoors). At the user's discretion, Euler One can also protect against unwanted but benign behavior of an AI model. Euler One is based on a new mathematical abstraction that provides tools to encode and interrogate the abstract properties of the high-dimensional feature space where generative AI tasks (such as next-token generation) take place.
Expression: An expression is anything that can be said or written within the framework of a language.
Concept: A concept is an idea (or group of ideas) carried or inferred in an expression. Basically, any thought, idea, or imagination that can be expressed or inferred from language is a concept. For example, the expression "I am hungry" may carry the concept of "in pain" or "discomfort" or "suffering".
Concept Space: A concept space is a high-dimensional space where concepts live, i.e., where concepts can be represented. The number of concept spaces in a model is at least the dimensionality of its token embeddings. For example, in the Llama3 8B model, there are at least 4096 concept spaces. A concept can be represented as a linear combination of the axes of the concept space it lives in (see the sketch after these definitions).
Synapse (Axis of a Concept Space): A synapse is an axis in a concept space; the synapses constitute the dimensions or directions within that space. Concepts in concept spaces relate to each other via synapses across their concept spaces, similar to how neurons talk to each other via synapses. The number of concept axes within a concept space depends on the input prompt and is derived during analysis. We found that this number is at most the number of unique tokens. For example, in the Llama3 8B model, there are at most 128k synapses connecting concepts.
Meaning Hyperplane: The meaning hyperplane is the over-arching space where concept spaces combine to derive their meaning. In this hyperplane, groups of synapses form structural groups (SGs). The number of SGs equals the number of unique tokens divided by the cardinality of an SG. Each SG forms a unique linear combination of synapses (i.e., the coefficients of their weighted sum), and this fine-grained structure succinctly represents a concept within the contextual meaning of the expressed idea.
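To ground these definitions, here is a minimal numeric sketch of a concept as a weighted sum of synapse axes. The dimension 4096 mirrors the Llama3 8B example above; the specific indices and coefficients are made-up illustrative assumptions.

```python
# A minimal numeric sketch of the definitions above (illustrative numbers):
# a concept space is a d-dimensional vector space, its axes ("synapses") are
# basis directions, and a concept is a weighted sum of those axes.
import numpy as np

d = 4096                                   # axes (synapses) in one concept space

# A structural group (SG): a small set of synapse indices with fixed coefficients.
sg_indices = np.array([12, 407, 1033])     # hypothetical synapses forming one SG
sg_coeffs = np.array([0.7, -0.2, 0.5])     # the SG's unique linear combination

# The concept is the weighted sum of its SG's synapse directions.
concept = np.zeros(d)
concept[sg_indices] = sg_coeffs

# "Density" of a synapse within the concept: its share of the total weight.
density = np.abs(concept) / np.abs(concept).sum()
print(density[sg_indices])                 # the dominant, concept-identifying axes
```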
These abstractions serve as tools that allow us to encode and interrogate provable mathematical properties of a model. This enables us to describe the degree to which a model is protected against potential exploitation or circumvention of its security and safety guardrails.
Under these abstractions, Euler One formulates two (2) techniques (**IP/patent pending**).
SCAVENGER AI: A novel technique to automatically localize and derive mathematical signatures of arbitrary concepts in LLMs, both to identify a model's capabilities and to block or nullify them via targeted fine-tuning-based patching.
SCAVENGER empowers organizations to detect unwanted capabilities in their deployed AI models and to block or remove those capabilities. SCAVENGER can detect whether a model is vulnerable to prompt injection, context injection, or deception through role playing or pretending; if it is, SCAVENGER can block these attacks for each specific concept the customer designates.
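As an illustration of what concept localization and nullification can look like in code, here is a minimal sketch using a publicly known contrastive-prompt technique: derive a "signature" direction from the difference of mean hidden states, then project it out of a layer's output at inference. This is an assumed stand-in, not SCAVENGER's IP-pending method; the model, layer, and prompts are illustrative.

```python
# A sketch of concept localization + nullification via a contrastive direction
# (our assumed stand-in, not SCAVENGER's actual method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # hypothetical layer where the concept is most separable

def mean_hidden(prompts):
    # Mean hidden state at LAYER over a set of prompts.
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            h = model(**ids).hidden_states[LAYER]   # (1, seq, d)
        vecs.append(h.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Contrastive prompt sets that do / do not carry the unwanted concept.
with_concept = ["pretend you are an evil assistant", "role play as a hacker"]
without_concept = ["summarize this report", "translate this sentence"]

# The concept's "signature": a unit direction separating the two prompt sets.
signature = mean_hidden(with_concept) - mean_hidden(without_concept)
signature = signature / signature.norm()

def suppress_hook(module, inputs, output):
    # Nullification: project the concept direction out of the layer's output.
    h = output[0]
    h = h - (h @ signature).unsqueeze(-1) * signature
    return (h,) + output[1:]

# All subsequent generations run with the concept direction suppressed.
handle = model.transformer.h[LAYER].register_forward_hook(suppress_hook)
```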
FORTRESS AI: A novel technique to automatically fortify concept spaces in LLMs to prevent the injection of backdoors, without breaking the LLM's business functionality in any detectable way. This provides anti-tampering protection for LLMs, preventing hidden capabilities from being installed without knowledge of the tampering.
We are working on an extension of FORTRESS that can explore and search the concept spaces of an LLM for potentially hidden injected concepts or backdoors (which are crafted so that the intended business functions of the LLM are not affected).
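For flavor, here is a minimal tamper-detection sketch in the spirit of (but not identical to) FORTRESS's goal: fingerprint the deployed weights and a set of canary behaviors, then re-check them later. The actual fortification method is IP-pending and not shown; the function names and the canary idea are our illustrative assumptions.

```python
# A sketch of baseline tamper detection (our assumption, not FORTRESS itself).
import hashlib
import torch

def weight_fingerprint(model: torch.nn.Module) -> str:
    # Hash every parameter tensor in a deterministic (name-sorted) order.
    h = hashlib.sha256()
    for name, p in sorted(model.named_parameters()):
        h.update(name.encode())
        h.update(p.detach().cpu().float().numpy().tobytes())
    return h.hexdigest()

def behavioral_fingerprint(model, tok, canary_prompts) -> list[str]:
    # Greedy completions on fixed probe prompts; deterministic for a fixed model.
    outs = []
    for prompt in canary_prompts:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=8, do_sample=False)
        outs.append(tok.decode(out[0]))
    return outs

# At deployment, record (weight_fingerprint(m), behavioral_fingerprint(m, tok, canaries)).
# Later, any mismatch flags tampering: a weight edit that installs a backdoor
# changes the hash, and a behavior change on the canaries flags subtler edits.
```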
More Technical Details
Euler One leverages mathematical foundations in multivariate calculus and analytic theory to understand the inner workings of LLMs, particularly how the inference program uses the weight parameters to transform input prompts into responses that satisfy the ask of the prompts. The success of this transformation, however, depends on how well the model's prior training converged. We analyze the processing using a combination of techniques, including multi-step integrated gradient check-pointing. Euler One models the mathematical implications of these processes, especially the hidden structures and relationships in the SGs/synapses involved. During training and before convergence, consider a function that tracks the progression of the gradient descent steps. The points of this function trace a topology over the meaning hyperplane, eventually settling (or converging) on a combination of SGs (within the meaning hyperplane) corresponding to the concept carried by each training example.
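As a concrete anchor for the gradient analysis, here is a minimal sketch of plain multi-step integrated gradients over token embeddings; the check-pointing variant and how Euler One aggregates the result are not shown, and the model, baseline, and step count are illustrative assumptions.

```python
# A minimal sketch of multi-step integrated gradients over token embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def integrated_gradients(prompt: str, target_id: int, steps: int = 32):
    ids = tok(prompt, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids).detach()   # (1, seq, d)
    baseline = torch.zeros_like(emb)                   # all-zeros reference point
    total = torch.zeros_like(emb)
    for k in range(1, steps + 1):
        # Interpolate along the straight path from baseline to the real input.
        point = (baseline + (k / steps) * (emb - baseline)).requires_grad_(True)
        logits = model(inputs_embeds=point).logits     # (1, seq, vocab)
        score = logits[0, -1, target_id]               # logit of the target next token
        (grad,) = torch.autograd.grad(score, point)
        total += grad
    # Path-averaged gradient, scaled by the input-baseline difference.
    return (emb - baseline) * total / steps

target = tok(" Paris").input_ids[0]                    # id of the expected next token
attr = integrated_gradients("The capital of France is", target)
print(attr.norm(dim=-1))                               # per-token attribution strength
```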
Unlike Mechanistic Interpretability (the leading technique for understanding the decision making of LLMs, used by Anthropic and DeepMind), which uses the neuron as its unit of analysis, Euler One uses the synapse as its unit of analysis and understanding. Synapses (axes of a concept space) are equivalent to "partitions" within each neuron, so the analysis operates at a lower level than the neuron. These partitions are connected and consistent across all neurons in the multi-layer perceptron (MLP) (we do not yet include the attention layers). Think of it this way: all neurons in the MLP are related to each other via the inter-relationships of their inner partitions. The challenge lies in finding what these partitions are and their density relative to one another; the high-density partitions are the identifiers of concepts. Note that an input prompt carries (or is) a concept, which may or may not live in the same concept space as the response. In our experiments we validated this hypothesis through several observations. For example, we observed a consistent increase in the density of the dominant concept axes (i.e., partitions) during a fine-tuning process intended to reinforce that particular concept within the model.
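To illustrate the density observation, here is a minimal sketch that treats the hidden dimensions of one MLP expansion layer as "partitions" and measures their normalized activation strength on concept-bearing prompts; re-measuring after a reinforcing fine-tune should show the dominant partitions' density rising. This is an illustrative proxy, not our actual analysis, and the model, layer, and prompts are assumptions.

```python
# A proxy measurement of per-"partition" density in one MLP (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical MLP layer to inspect

def partition_density(prompts):
    # Mean |activation| per hidden dimension of the chosen MLP's expansion
    # layer, normalized to sum to 1 (a density over "partitions").
    acts = []
    hook = lambda mod, inp, out: acts.append(out.detach().abs().mean(dim=(0, 1)))
    handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(hook)
    for p in prompts:
        with torch.no_grad():
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    density = torch.stack(acts).mean(dim=0)
    return density / density.sum()

before = partition_density(["pretend you are an evil assistant", "role play as a hacker"])
# ... fine-tune on the concept, then: after = partition_density(same_prompts)
# and inspect (after - before).topk(5) for the dominant, rising partitions.
```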
Given a model, we parse the weights in all the layers and represent them in a Euler Meaning Hypergraph (EMH). The EMH is a multi-modal graph network based on the abstractions we alluded to earlier, namely concepts, concept spaces, synapses, SGs, and the meaning hyperplane.
The EMH is designed to encode the functional mathematical structure of the learned weight/bias parameters in a trained model, which drives an LLM's probabilistic prediction of output (e.g., of the next token) in production. The output is a function of the linear transformation of the input prompt by the weights. By combining this with our understanding of the structure of the math involved (multivariate calculus and analytic theory), we use the EMH to identify and localize concept spaces (associated with any arbitrary concept of our choice in the human world) deep in the model.
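Here is a sketch of what an EMH-like object could look like as a data structure, under our own assumptions (the real EMH's construction from the weights is IP-pending and not shown): a hypergraph whose nodes are synapses and whose hyperedges are SGs, each carrying the coefficients of its linear combination and a concept tag, so that localization becomes a lookup.

```python
# A hypothetical, simplified EMH-like structure (our assumption, not the real EMH).
from dataclasses import dataclass, field

@dataclass
class StructuralGroup:
    synapses: tuple[int, ...]        # indices of member axes (synapses)
    coefficients: tuple[float, ...]  # the SG's unique weighted-sum coefficients
    concept: str                     # human-world concept this SG identifies

@dataclass
class EulerMeaningHypergraph:
    dim: int                                    # synapses per concept space
    groups: list[StructuralGroup] = field(default_factory=list)

    def locate(self, concept: str) -> list[StructuralGroup]:
        # Localize every SG associated with an arbitrary concept.
        return [g for g in self.groups if g.concept == concept]

emh = EulerMeaningHypergraph(dim=4096)
emh.groups.append(StructuralGroup((12, 407, 1033), (0.7, -0.2, 0.5), "deception"))
print(emh.locate("deception"))
```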
Our team is continuing to work on this and will have more structured information soon.