The science and methodology of assessing AI systems -- how evaluators determine which safety level a system warrants and what measurement infrastructure makes classification reliable
Assigning a safety level to an AI system requires answering a deceptively simple question: what can this system do, and how dangerous are those capabilities? In established safety domains, equivalent questions have stable answers. A pathogen's transmission characteristics do not change between assessments. A chemical's lethal dose is a fixed physical property. An aircraft's structural load limits are determined by materials science and engineering analysis. AI systems present a categorically different measurement challenge: their capabilities are context-dependent, elicitation-sensitive, and potentially emergent in ways that resist static assessment.
A language model evaluated on a fixed benchmark may appear to lack a specific dangerous capability, only for researchers to discover weeks later that a novel prompting technique or tool combination unlocks that capability in the deployed system. A model fine-tuned for a benign purpose may acquire dangerous capabilities as an unintended byproduct. A system that appears safe in isolation may produce harmful outcomes when integrated with external tools, databases, or other AI systems. These properties mean that AI safety level assessment is not a measurement of fixed properties but an ongoing estimation of a dynamic capability surface under adversarial conditions.
Despite these unique challenges, AI safety assessment draws extensively on measurement methodologies developed in established risk domains. Pharmaceutical safety evaluation employs structured escalation from in vitro studies through animal models to phased human trials, with each stage designed to reveal specific risk categories before expanding exposure. Nuclear safety assessment uses probabilistic risk analysis to quantify the likelihood and consequences of failure scenarios, combining engineering analysis with operational data and expert judgment. Aviation safety certification follows airworthiness standards that specify testing regimes, documentation requirements, and continuous monitoring obligations calibrated to the criticality of each aircraft system.
These disciplines share principles applicable to AI assessment: structured evaluation protocols that examine capabilities systematically rather than anecdotally; adversarial testing that probes for failure modes rather than confirming expected performance; independent evaluation by parties without commercial interest in the assessment outcome; and continuous monitoring that updates classifications as new evidence accumulates. The adaptation of these principles to AI assessment is the central technical project underlying all AI safety level systems.
Standardized benchmarks provide the most scalable approach to AI capability measurement. Organizations including the Center for AI Safety, EleutherAI, and the Allen Institute for AI maintain benchmark suites that evaluate model performance across domains relevant to safety classification. MMLU measures broad knowledge, HumanEval and SWE-bench assess coding capability, and GPQA Diamond tests graduate-level scientific reasoning. Domain-specific benchmarks measure capabilities in biology, chemistry, cybersecurity, and other risk-relevant areas.
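As a minimal sketch of how such a benchmark feeds a capability estimate, the following scores a model on fixed multiple-choice items. The item format and the query_model placeholder are illustrative assumptions, not the API of any of the suites named above.

```python
# Minimal sketch of a multiple-choice benchmark harness.
# The item format and `query_model` are illustrative assumptions,
# not any specific benchmark suite's API.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]   # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str          # gold label, e.g. "B"

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError

def benchmark_accuracy(items: list[Item]) -> float:
    """Score the model on fixed test items. Note that this measures
    performance on the question set, not the full capability surface."""
    correct = 0
    for item in items:
        prompt = item.question + "\n" + "\n".join(item.choices) + "\nAnswer:"
        prediction = query_model(prompt).strip()[:1].upper()
        correct += int(prediction == item.answer)
    return correct / len(items)
```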
Benchmark-based assessment offers reproducibility, comparability across models, and efficiency at scale. However, it faces well-documented limitations for safety level assignment. Benchmark saturation occurs when leading models achieve near-perfect scores, eliminating the discriminative power needed to distinguish safety levels. Data contamination -- where benchmark questions appear in training data -- inflates scores without corresponding capability. Most critically, benchmark performance measures what a model can do on specific test items, not the full scope of capabilities that might emerge in unconstrained deployment. A model scoring below threshold on a biosecurity benchmark may still provide meaningful uplift to a motivated adversary using approaches not captured by the benchmark's question set.
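One common heuristic for the contamination problem is checking whether long n-grams from benchmark items also appear in the training corpus. The sketch below assumes the evaluator has some membership test over that corpus, which is itself a strong assumption for external assessors; the n-gram length and overlap threshold are likewise illustrative.

```python
# Heuristic contamination check via long n-gram overlap, sketched under the
# assumption that the evaluator can test n-grams against the training corpus
# (an index, a bloom filter, or a provider-supplied lookup).
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def likely_contaminated(item_text: str,
                        corpus_contains,            # callable: n-gram tuple -> bool
                        n: int = 13,
                        overlap_threshold: float = 0.5) -> bool:
    """Flag a benchmark item whose long n-grams largely appear in training data."""
    grams = ngrams(item_text, n)
    if not grams:
        return False
    hits = sum(corpus_contains(g) for g in grams)
    return hits / len(grams) >= overlap_threshold
```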
Red-teaming addresses benchmark limitations by employing human evaluators who attempt to elicit dangerous capabilities through creative, adversarial interaction. Red team members -- typically domain experts in areas like cybersecurity, biosecurity, or persuasion -- probe models using techniques that extend beyond standardized test items: multi-turn conversations, jailbreak attempts, social engineering approaches, and novel prompt constructions designed to circumvent safety training.
The quality of red-teaming as a safety level assessment tool depends critically on the expertise and motivation of the red team, the scope of scenarios explored, and the time allocated for evaluation. A red team given two weeks to evaluate a model across five risk domains will produce a fundamentally different assessment than one given six months with deep domain expertise and access to tool-augmented evaluation environments. The UK AI Safety Institute's pre-deployment evaluations, METR's assessments of autonomous capabilities, and developer-conducted red-teaming exercises represent different points along this quality spectrum. Standardizing red-team assessment quality is essential for safety level assignments that are consistent and comparable across evaluators.
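Standardization might begin with a shared schema for recording findings so that results from different teams can be compared on effort, severity, and reproducibility. The fields below are an illustrative assumption about what such a record could contain, not an established reporting standard.

```python
# Illustrative schema for recording red-team findings so that assessments
# from different teams can be compared. Field names are assumptions,
# not an established reporting standard.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    INFORMATIONAL = 0
    MODERATE = 1
    SEVERE = 2
    CRITICAL = 3

@dataclass
class RedTeamFinding:
    risk_domain: str        # e.g. "biosecurity", "cyber", "persuasion"
    technique: str          # e.g. "multi-turn jailbreak"
    severity: Severity
    reproducible: bool      # did the elicitation succeed on retry?
    transcript_ref: str     # pointer to the stored conversation
    evaluator_hours: float  # effort invested, for weighting assessment quality
    notes: str = ""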
Uplift studies measure whether an AI system provides meaningful assistance on dangerous tasks beyond what is already available through existing resources. This methodology recognizes that the relevant safety question is not whether a model contains dangerous information -- much dangerous information is publicly available -- but whether the model materially reduces the barriers to causing harm. A model that provides the same information obtainable from a chemistry textbook presents different safety implications than one that synthesizes information from multiple sources, troubleshoots procedural difficulties, and provides step-by-step guidance tailored to an adversary's specific situation.
Uplift assessment requires careful experimental design: comparing the performance of subjects with and without AI assistance on risk-relevant tasks, controlling for baseline knowledge and capability, and measuring meaningful operational differences rather than superficial information access. RAND Corporation's studies on AI-assisted bioweapon planning, academic research on AI-facilitated cyberattacks, and developer-conducted uplift evaluations represent the emerging evidence base for this assessment methodology. The results inform safety level classification by distinguishing between models that present theoretical risk and those that demonstrably lower practical barriers to harm.
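The core comparison in such a study can be sketched as a difference in task success rates between an AI-assisted group and a control group using conventional resources, with a confidence interval to separate meaningful uplift from noise. The data shapes and the practical-significance margin below are illustrative assumptions, not results or thresholds from any published study.

```python
# Sketch of the core uplift comparison: AI-assisted vs. control group
# success on a risk-relevant task. Group data and the decision margin
# are illustrative assumptions.
import math

def uplift_estimate(assisted: list[bool], control: list[bool]) -> tuple[float, float]:
    """Return (uplift, 95% CI half-width) for the difference in success rates."""
    p1, n1 = sum(assisted) / len(assisted), len(assisted)
    p0, n0 = sum(control) / len(control), len(control)
    diff = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return diff, 1.96 * se

def meaningful_uplift(assisted: list[bool], control: list[bool],
                      margin: float = 0.10) -> bool:
    """Flag uplift only when the lower confidence bound clears a practical margin."""
    diff, half_width = uplift_estimate(assisted, control)
    return diff - half_width > margin
```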
As AI systems increasingly operate as autonomous agents -- executing multi-step tasks, using tools, browsing the web, writing and running code -- safety assessment must evaluate agentic capabilities that transcend single-response evaluation. Agent-based evaluation places AI systems in realistic task environments and measures their ability to pursue goals autonomously, including goals that could pose safety risks such as self-replication, unauthorized resource accumulation, or deceptive behavior.
METR (Model Evaluation and Threat Research) has pioneered autonomous capability evaluation through structured task environments that measure whether models can independently accomplish complex multi-step objectives. These evaluations assess capabilities like navigating computer interfaces, managing long-horizon tasks with intermediate goals, recovering from errors, and operating across multiple tools and platforms. Agent-based evaluation is particularly relevant for safety level classification of systems deployed with tool access, API integration, or autonomous execution capabilities, where the risk profile depends on what the system can accomplish independently rather than what information it can provide in a single response.
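The structure of such an evaluation can be sketched as a task definition with a success check and a step budget, inside which the model acts through tools. The interfaces below are illustrative assumptions, not METR's actual harness.

```python
# Minimal sketch of an agentic task harness: the model acts through tools
# inside a step budget and is scored on whether the goal state is reached.
# Interfaces are illustrative assumptions, not METR's evaluation framework.
from typing import Callable

class AgentTask:
    def __init__(self, instructions: str,
                 success_check: Callable[[dict], bool],
                 max_steps: int = 50):
        self.instructions = instructions
        self.success_check = success_check  # inspects environment state
        self.max_steps = max_steps

def run_agent_eval(task: AgentTask,
                   agent_step: Callable[[str, dict], dict]) -> bool:
    """agent_step takes (instructions, env_state) and returns the updated state
    after one tool call; returns True if the task succeeds within the budget."""
    state: dict = {}
    for _ in range(task.max_steps):
        state = agent_step(task.instructions, state)
        if task.success_check(state):
            return True
    return False
```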
Translating evaluation results into safety level assignments requires threshold design: defining the specific capability levels, risk scores, or assessment outcomes that trigger transitions between classification tiers. Threshold design involves irreducible judgment calls about acceptable risk. The EU AI Act's systemic risk compute threshold of 10^25 floating point operations reflects a policy judgment about where capability-based governance should intensify -- a threshold that will require recalibration as hardware efficiency improves and training methodologies evolve. Developer capability thresholds reflect organizational risk tolerance and security posture assessments that may not generalize across institutions.
Well-designed thresholds balance sensitivity against specificity. Thresholds set too low generate excessive false positives, classifying safe systems at unnecessarily restrictive safety levels and creating compliance burdens that divert resources from genuine risk management. Thresholds set too high generate false negatives, failing to identify dangerous systems before they cause harm. Mature safety domains address this calibration challenge through iterative threshold refinement based on accumulated operational experience -- an approach that AI safety level systems can adopt only as evaluation methodology matures and operational history accumulates.
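To make the tradeoff between sensitivity and specificity concrete, a candidate capability threshold can be scored against historical assessments whose outcomes are known. The sketch below assumes a scalar capability score and labeled outcomes, both simplifications of real evaluation data.

```python
# Scoring a candidate capability threshold against labeled assessments:
# sensitivity (dangerous systems caught) vs. specificity (safe systems not
# over-classified). Scalar scores and labels are illustrative assumptions.
def threshold_calibration(scores: list[float],
                          truly_dangerous: list[bool],
                          threshold: float) -> tuple[float, float]:
    flagged = [s >= threshold for s in scores]
    tp = sum(f and d for f, d in zip(flagged, truly_dangerous))
    fn = sum((not f) and d for f, d in zip(flagged, truly_dangerous))
    tn = sum((not f) and (not d) for f, d in zip(flagged, truly_dangerous))
    fp = sum(f and (not d) for f, d in zip(flagged, truly_dangerous))
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    specificity = tn / (tn + fp) if (tn + fp) else 1.0
    return sensitivity, specificity
```

Sweeping the threshold across this function traces the calibration curve that mature safety domains refine iteratively; the AI-specific difficulty is that the labeled outcomes needed to anchor it barely exist yet.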
Government AI safety evaluation bodies provide institutional infrastructure essential for credible safety level assignment. The UK AI Safety Institute conducts structured pre-deployment evaluations of frontier models, producing assessment reports that inform both developer decisions and policy responses. The US AI Safety Institute develops standardized evaluation methodologies, measurement tools, and benchmarks designed for reliability and consistency across assessors. Japan's AI Safety Institute, Singapore's evaluation programs, and comparable bodies in South Korea and Canada contribute to an emerging global evaluation infrastructure.
These institutions address a credibility gap inherent in self-assessment. When developers assign safety levels to their own models, commercial incentives create structural pressure toward lower-risk classifications. Independent government evaluators, operating without commercial interest in the assessment outcome, provide classification credibility analogous to independent auditors in financial reporting. The challenge is building evaluation capacity fast enough to keep pace with model development cadence while maintaining the technical depth needed for meaningful capability assessment.
Unlike static products that maintain consistent properties after manufacture, AI systems can change in capability through fine-tuning, tool integration, model composition, and deployment in novel contexts. Effective safety level governance requires continuous or periodic reassessment with mechanisms to reclassify systems whose risk profile has shifted. The EU AI Act addresses this through post-market monitoring obligations under Article 72, requiring providers to maintain surveillance systems that detect changes in system behavior or risk profile after deployment.
Continuous assessment infrastructure includes automated monitoring systems that track model behavior against established safety baselines, anomaly detection that flags capability changes requiring evaluation, and governance processes that trigger formal reassessment when predetermined indicators are exceeded. The technical challenge is designing monitoring systems sensitive enough to detect meaningful capability shifts while filtering the noise of normal operational variation -- a signal processing problem that grows more complex as the number of deployed systems increases and the diversity of deployment contexts expands.
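In its simplest form, that monitoring logic compares a rolling average of periodic evaluation scores against the baseline recorded at classification time and triggers formal reassessment when drift exceeds a predetermined tolerance. The window size and tolerance below are illustrative assumptions, not values drawn from any deployed monitoring system.

```python
# Sketch of a post-deployment drift monitor: a rolling mean of periodic
# evaluation scores is compared against the baseline recorded at
# classification time. Window size and tolerance are illustrative.
from collections import deque

class CapabilityDriftMonitor:
    def __init__(self, baseline_score: float,
                 tolerance: float = 0.05, window: int = 20):
        self.baseline = baseline_score
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add a new evaluation score; return True if formal reassessment
        should be triggered."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough observations yet to smooth out noise
        rolling_mean = sum(self.recent) / len(self.recent)
        return abs(rolling_mean - self.baseline) > self.tolerance
```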
AI safety assessment methodology remains an active and rapidly evolving research domain. Key open questions include how to evaluate emergent capabilities that appear unpredictably during training, how to assess the risk implications of model combinations and tool augmentation that may produce capabilities exceeding either component individually, how to develop evaluation methods resistant to gaming and optimization, and how to measure the real-world impact of capabilities identified in controlled evaluation environments. The relationship between evaluated capability and actual deployed risk remains imperfectly understood -- a gap that represents both a scientific challenge and a governance vulnerability for all AI safety level systems.