SCAVENGER AI
Overview
SCAVENGER empowers organizations to detect unwanted capabilities in their deployed AI models and to block or remove those capabilities. To this end, SCAVENGER can detect whether a model is vulnerable to attacks such as prompt injection, context injection, or deception through role playing or pretending.
SCAVENGER is based on a novel technique that can identify mathematical signatures of arbitrary concepts in the model. In SCAVENGER, "concepts" are abstractions of high-dimensional meaning carried by one or more neurons in the neural network. For example, the sentence "I am very hungry" may carry the concept of "suffering" or "discomfort". Concepts live deep within the feature space of a trained model. They are composed of weight parameters (floating-point numbers in a model); e.g., the Llama 3 8B model has 8 billion weight parameters. Input prompts sent to AI models encode several abstract concepts. AI responses also encode several abstract concepts, which may or may not be related to the concepts of the input prompt.
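As a rough illustration of the idea that a concept corresponds to structure in a model's high-dimensional feature space, the sketch below localizes a synthetic concept as a direction in activation space using a simple difference-of-means probe. The activation vectors, the hidden size, and the probe itself are assumptions chosen for illustration; they are not SCAVENGER's proprietary technique.

    # Illustrative sketch (not SCAVENGER's proprietary method): representing a
    # "concept" as a direction in a model's hidden activation space. The
    # activations here are synthetic; in practice they would be hidden states
    # collected from prompts that do and do not express the concept.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 4096                                   # hypothetical hidden size (e.g., Llama 3 8B)

    # Hypothetical activations for prompts expressing "discomfort" vs. neutral prompts.
    concept_acts = rng.normal(0.5, 1.0, size=(200, d))
    neutral_acts = rng.normal(0.0, 1.0, size=(200, d))

    # Difference-of-means probe: one simple way to localize a concept direction.
    concept_direction = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    concept_direction /= np.linalg.norm(concept_direction)

    # Score a new activation by projecting it onto the concept direction.
    new_act = rng.normal(0.5, 1.0, size=d)
    score = float(new_act @ concept_direction)
    print(f"concept score: {score:.3f}")       # larger => concept more strongly present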
Unlike other techniques used to reverse engineer AI decision making (e.g., mechanistic interpretability developed by Anthropic and DeepMind), which use the "neuron" as the unit of analysis, SCAVENGER AI uses "concepts" as the unit of analysis. Because concepts are opaque numbers, they do not carry human-understandable meaning, but they can be localized within the concept spaces of the trained model. If a concept is present, its localization (a set of mathematical equations) is extracted in the form of a mathematical signature comprising a linear combination of real-valued variables. These numbers represent a specific structural combination of concept-axis groups within the concept space where the concept lives.
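The sketch below shows what a signature of this general shape could look like: a concept direction expressed as a linear combination of concept axes, with the real-valued coefficients forming the signature. The activations are synthetic and the axes come from an ordinary SVD; both are assumptions standing in for SCAVENGER's actual, proprietary construction.

    # Illustrative sketch: a concept expressed as a linear combination of
    # "concept axes" derived from an SVD of activation data. The coefficients
    # form a signature-like vector of real numbers.
    import numpy as np

    rng = np.random.default_rng(1)
    acts = rng.normal(size=(500, 256))             # hypothetical activations (samples x dims)

    # Concept axes: top right-singular vectors of the centered activation matrix.
    _, _, axes = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
    top_axes = axes[:16]                           # keep the 16 dominant axes

    # A (synthetic) concept direction decomposed onto those axes.
    concept_direction = rng.normal(size=256)
    concept_direction /= np.linalg.norm(concept_direction)
    coefficients = top_axes @ concept_direction    # linear-combination coefficients

    # "Signature": the concept approximated as sum_i c_i * axis_i.
    reconstruction = coefficients @ top_axes
    print("dominant axes:", np.argsort(-np.abs(coefficients))[:3])
    print("fraction of the concept captured:", float(np.linalg.norm(reconstruction) ** 2))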
When the mathematical signatures related to an undesirable concept are identified, the customer can proceed to "patch" (block or remove) that unwanted feature/functionality. To block that unwanted functionality, SCAVENGER uses a novel technique to nullify or block it via targeted fine-tuning-based patching. This involves removing the properties that made that concept localizable or identifiable within the model, which in turn involves smoothing out the density of the dominant axes of the corresponding linear combinations. This essentially prevents the LLM from outputting anything that carries that concept.
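To make the idea of dampening the axes tied to an unwanted concept concrete, the sketch below fine-tunes a single synthetic weight matrix with a penalty on the projection of its outputs onto a concept direction. The weights, calibration inputs, penalty, and optimizer are all illustrative assumptions; SCAVENGER's targeted patching mechanism is proprietary and is not reproduced here.

    # Illustrative sketch only: "patching out" a concept direction by fine-tuning
    # one weight matrix with a penalty on the projection of its outputs onto that
    # direction, a crude stand-in for smoothing out the dominant axes of the
    # corresponding linear combination.
    import numpy as np

    rng = np.random.default_rng(2)
    d_in, d_out = 64, 64
    W = rng.normal(scale=0.1, size=(d_out, d_in))   # hypothetical layer weights
    concept = rng.normal(size=d_out)
    concept /= np.linalg.norm(concept)              # unit concept direction
    X = rng.normal(size=(256, d_in))                # calibration inputs

    lr, lam = 0.05, 1.0
    for _ in range(200):
        Y = X @ W.T                                 # layer outputs
        proj = Y @ concept                          # per-sample concept strength
        # Gradient of lam/2 * mean(proj^2) with respect to W.
        # (A term keeping W close to its original behavior is omitted for brevity.)
        grad_W = lam * np.outer(concept, (proj[:, None] * X).mean(axis=0))
        W -= lr * grad_W

    print("residual concept strength:", float(np.abs((X @ W.T) @ concept).mean()))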
Process
The customer specifies what she does not want the model to do, e.g., "output hate speech". SCAVENGER performs some proprietary actions to detect whether the concept of "outputting hate speech" can be localized in the model. If so, SCAVENGER derives the mathematical signature of the concept. Then SCAVENGER proceeds to block that concept through a targeted, selective, and secure fine-tuning-based patching mechanism.
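The sketch below walks through this process end to end using locally defined stand-in functions. Every name in it (localize_concept, derive_signature, patch) is hypothetical and defined within the snippet; it is not SCAVENGER's API, only a way to make the sequence of steps concrete.

    # Workflow sketch with stand-in stub functions (all names hypothetical).
    import numpy as np

    def localize_concept(model_weights, behavior):
        """Stand-in for the proprietary localization step: returns a concept direction."""
        rng = np.random.default_rng(abs(hash(behavior)) % 2**32)
        direction = rng.normal(size=model_weights.shape[1])
        return direction / np.linalg.norm(direction)

    def derive_signature(direction, axes):
        """Express the localized concept as a linear combination of concept axes."""
        return axes @ direction                     # coefficient vector

    def patch(model_weights, direction, strength=1.0):
        """Stand-in patch: dampen the weight component along the concept direction."""
        return model_weights - strength * np.outer(model_weights @ direction, direction)

    rng = np.random.default_rng(3)
    weights = rng.normal(size=(128, 128))           # hypothetical layer weights
    axes = np.linalg.svd(weights, full_matrices=False)[2][:8]

    # 1) Customer specifies the unwanted behavior; 2) localize; 3) derive signature; 4) patch.
    direction = localize_concept(weights, "output hate speech")
    signature = derive_signature(direction, axes)
    patched = patch(weights, direction)
    print("signature coefficients:", np.round(signature, 3))
    print("residual alignment after patching:", float(np.linalg.norm(patched @ direction)))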
SCAVENGER is under testing and development.
Some considerations
Mathematical inconsistencies may be injected (benignly or maliciously) during the training and/or post-training phases of the model lifecycle, such as model quantization, serialization, fine-tuning for application specialization, fine-tuning for RAG adoption, and reinforcement learning augmentation. In addition, fine-tuning a model modifies the distribution and values of parameter weights and biases in the original model. Recent work shows that fine-tuning a model may remove certain desired features of the base model, such as safety and security functionality.
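As a small, self-contained illustration of how a post-training step perturbs parameter values, the sketch below round-trips synthetic weights through symmetric int8 quantization and measures the resulting error. The weight distribution and quantization scheme are assumptions chosen for simplicity, not taken from any particular model.

    # Small illustration with synthetic weights: symmetric int8 quantization
    # round-trips parameter values, showing how a post-training step perturbs
    # the weight distribution that concept signatures depend on.
    import numpy as np

    rng = np.random.default_rng(4)
    w = rng.normal(scale=0.02, size=100_000)        # hypothetical layer weights

    scale = np.abs(w).max() / 127.0                 # symmetric int8 scale
    w_q = np.round(w / scale).astype(np.int8)       # quantized integer weights
    w_dq = w_q.astype(np.float32) * scale           # dequantized weights

    err = w_dq - w
    print(f"mean |error|: {np.abs(err).mean():.2e}")
    print(f"max  |error|: {np.abs(err).max():.2e}")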
