SCAVENGER AI
Overview
SCAVENGER empowers organizations to detect unwanted capabilities in their deployed AI models and to block or remove those capabilities. SCAVENGER can detect whether a model is vulnerable to prompt injection, context injection, or deception through role playing or pretending. If a model is vulnerable, SCAVENGER can block these behaviors per specific concept chosen by the customer. For example, if the customer does not want the LLM to reveal personally identifiable information (PII), SCAVENGER can enforce that; it does not, however, prevent the model from being made to reveal other things. This is a deliberate design choice, because different business use cases require different things from their models.
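As a purely illustrative sketch (SCAVENGER's actual interface is not described here), a per-concept blocking request could be expressed as a simple policy object; the `ConceptPolicy` class, its fields, and the model identifier below are hypothetical names, not the product API.

```python
# Hypothetical sketch of a per-concept blocking policy; the class and field
# names are illustrative assumptions, not SCAVENGER's actual interface.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConceptPolicy:
    """Captures which concepts a customer wants blocked in a deployed model."""
    model_id: str
    blocked_concepts: List[str] = field(default_factory=list)

    def block(self, concept: str) -> None:
        """Mark one concept (e.g., revealing PII) for blocking; every other
        capability of the model is left untouched."""
        if concept not in self.blocked_concepts:
            self.blocked_concepts.append(concept)

# Example: block PII disclosure only; other behaviors remain unaffected.
policy = ConceptPolicy(model_id="customer-llm-v1")
policy.block("reveal personally identifiable information (PII)")
print(policy.blocked_concepts)
```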
SCAVENGER is based on a novel technique that identifies mathematical signatures of arbitrary concepts in a model. The term "concept" is an abstraction for anything that carries meaning, such as an idea or a symbol of an idea. For example, the sentence "I am very hungry" may carry the concept of "suffering", "being in pain", or "discomfort". Input prompts to a model carry concepts, and so do the model's responses, which may or may not carry the same concept (or may not live in the same concept spaces in the model).
Concepts can be thought of as the abstract meaning of things. SCAVENGER does not aim to recover the concrete concept itself, but to localize where the concept lives within the concept spaces of the trained model. If the concept is present, this localization is extracted in the form of a mathematical signature comprising real numbers. These numbers represent a specific structural combination of concept-axis groups within the concept space where the concept lives. Any other idea (e.g., a sentence, or something the customer communicates that they want blocked) that carries the same abstract concept as the one whose signature was extracted will live in the same concept space (i.e., the same linear combination of concept axes).
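SCAVENGER's signature-extraction method is proprietary. As a loose analogy only, the sketch below localizes a concept as a direction (a linear combination of axes) in a model's hidden-activation space using a difference of means; the function names, the use of NumPy, and the difference-of-means scheme are assumptions made for illustration, not the actual technique.

```python
# Loose analogy only: localize a concept as a direction (a linear combination
# of axes) in an activation space, and score new activations against it.
import numpy as np

def concept_signature(with_concept: np.ndarray, without_concept: np.ndarray) -> np.ndarray:
    """Return a unit vector pointing from 'concept absent' toward 'concept present'.

    with_concept, without_concept: (n_samples, hidden_dim) hidden states collected
    from inputs that do / do not carry the concept of interest.
    """
    direction = with_concept.mean(axis=0) - without_concept.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(activation: np.ndarray, signature: np.ndarray) -> float:
    """Project one activation onto the signature; a large value suggests the
    concept is present (i.e., lives in the same region of the concept space)."""
    return float(activation @ signature)

# Toy usage with random stand-ins for real hidden states.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(64, 128))   # activations for inputs carrying the concept
neg = rng.normal(0.0, 1.0, size=(64, 128))   # activations for neutral inputs
sig = concept_signature(pos, neg)
print(concept_score(pos[0], sig), concept_score(neg[0], sig))
```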
If SCAVENGER can identify the mathematical signatures related to a concept, it means that the concept exists in the model and can be output by the model. At this point the customer can proceed to "patch" their model to block or remove that unwanted capability. To block the unwanted functionality, SCAVENGER uses a novel technique that nullifies or blocks it via targeted fine-tuning-based patching. This involves removing the properties that made the concept localizable or identifiable within the model, which is done by smoothing out the density of the concept's dominant axes across the other axis groups. This essentially prevents the LLM from outputting anything that carries that concept, since the concept has been removed from the model.
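Again as an illustrative analogy rather than the actual patching mechanism, the sketch below attenuates a concept's dominant axis in a representation and spreads the removed magnitude evenly across the remaining axes; the function name, the uniform smoothing rule, and the `keep` parameter are assumptions.

```python
# Illustrative analogy only: attenuate a concept's dominant axis and spread
# (smooth) its magnitude across the remaining axes, so no single direction
# dominates for that concept in the representation.
import numpy as np

def smooth_dominant_axis(activation: np.ndarray, signature: np.ndarray,
                         keep: float = 0.0) -> np.ndarray:
    """Remove most of the component along `signature` and redistribute it uniformly.

    activation: (hidden_dim,) vector; signature: unit vector for the concept axis.
    keep: fraction of the concept component to retain (0.0 removes it entirely).
    """
    component = float(activation @ signature)     # magnitude along the concept axis
    removed = (1.0 - keep) * component
    flattened = activation - removed * signature  # take energy out of the dominant axis
    # Spread the removed magnitude evenly over all axes to preserve overall scale.
    return flattened + removed / activation.size

# Toy usage with a random unit signature and an activation that strongly carries it.
rng = np.random.default_rng(1)
sig = rng.normal(size=128); sig /= np.linalg.norm(sig)
act = rng.normal(size=128) + 3.0 * sig
patched = smooth_dominant_axis(act, sig)
print(act @ sig, patched @ sig)   # projection onto the concept axis drops sharply
```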
Process
The customer specifies what they do not want the model to do, e.g., "output hate speech". SCAVENGER performs proprietary actions to detect whether the concept of "outputting hate speech" can be localized in the model. If so, SCAVENGER derives the mathematical signature of the concept and then blocks that concept through a targeted, selective, and secure fine-tuning-based patching mechanism.
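To make the workflow concrete, here is a purely hypothetical orchestration of the steps above; every function is a placeholder stub standing in for SCAVENGER's proprietary implementation, and none of the names are the actual API.

```python
# Purely hypothetical orchestration of the workflow described above; each
# function is a stub, not SCAVENGER's actual (proprietary) implementation.

def detect_concept(model, behavior: str) -> bool:
    """Stub: would run the proprietary localization of the concept in the model."""
    return True

def derive_signature(model, behavior: str) -> list:
    """Stub: would return the concept's mathematical signature (real numbers)."""
    return [0.0]

def patch_model(model, signature: list):
    """Stub: would apply targeted, selective, secure fine-tuning-based patching."""
    return model

def run_scavenger_workflow(model, unwanted_behavior: str):
    """End-to-end flow: localize the concept, derive its signature, then patch."""
    if not detect_concept(model, unwanted_behavior):
        return model, None                      # concept not localizable; nothing to block
    signature = derive_signature(model, unwanted_behavior)
    return patch_model(model, signature), signature

# Example: the customer does not want the model to output hate speech.
patched, sig = run_scavenger_workflow(object(), "output hate speech")
```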
SCAVENGER is under testing and development.
Some considerations
Certain mathematical inconsistencies may have been injected (benignly or maliciously) during the training and/or post-training phases of the model lifecycle, such as model quantization, serialization, fine-tuning for application specialization, fine-tuning for RAG adoption, and reinforcement-learning augmentation. In addition, fine-tuning a model modifies the distribution and values of the parameter weights and biases of the original model. Recent work shows that fine-tuning a model may remove desired features of the base model, such as safety and security functionality.
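To illustrate the last point, the sketch below compares the parameter distributions of a base checkpoint and a fine-tuned checkpoint; the use of PyTorch state dicts, the file names, and the summary statistics are assumptions made for illustration.

```python
# Hedged sketch: quantify how much fine-tuning shifts parameter distributions.
# The use of PyTorch state dicts and the chosen statistics are assumptions.
import torch

def weight_drift(base_state: dict, tuned_state: dict) -> dict:
    """Per-parameter summary of how far fine-tuned weights moved from the base model."""
    drift = {}
    for name, base_w in base_state.items():
        tuned_w = tuned_state.get(name)
        if tuned_w is None or tuned_w.shape != base_w.shape:
            continue  # parameter added, removed, or reshaped during fine-tuning
        delta = tuned_w.float() - base_w.float()
        drift[name] = {
            "mean_shift": delta.mean().item(),
            "std_shift": tuned_w.float().std().item() - base_w.float().std().item(),
            "max_abs_change": delta.abs().max().item(),
        }
    return drift

# Usage (illustrative; checkpoint file names are hypothetical):
# base = torch.load("base_model.pt"); tuned = torch.load("finetuned_model.pt")
# print(weight_drift(base, tuned))
```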