ISO/IEC DIS 24029-3
ISO/IEC JTC 1/SC 42
Secretariat: ANSI
Date: 2025-12-08
Artificial intelligence (AI) — Assessment of the robustness of neural networks —
Part 3:
Methodology for the use of statistical methods
DIS stage
Warning for WD’s and CD’s
This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard.
Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.
© ISO 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: + 41 22 749 01 11
E-mail: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Robustness assessment during the life cycle
Robustness properties measurable using statistical methods
In-domain robustness assessment
Specification of expected behaviour
Other considerations that contribute to robustness assessment
Statistical measures of robust behaviour
Robustness assessment based on performance drops
Targeted assessment of the system's response to a perturbation
Assessment through separability metrics
Assessment through statistical reliability engineering
Robustness assessment through curvature analysis
Selection of datasets representative of a robustness dimension
Datasets needed for robustness assessment
Acquisition of data for robustness assessment
Measuring adversarial robustness
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following URL: www.iso.org/iso/foreword.html.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 42, Artificial intelligence.
A list of all parts in the ISO/IEC 24029 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Neural networks are widely used AI systems capable of tackling a variety of tasks, such as detection, classification or interpolation, on diverse data types, such as text, speech, image, time series or tabular data. AI quality models of neural networks comprise certain characteristics, including robustness. For example, ISO/IEC 25059 considers in its quality model that robustness is a sub-characteristic of reliability. Assessing robustness is one key step in building trust in AI systems, and in particular in neural networks. This document focuses on AI systems that comprise neural networks.
Demonstrating the ability of a system to maintain its level of performance under locally varying conditions can be done using formal methods, as illustrated in ISO/IEC 24029-2. Statistical methods can be complementary to other methods in their ability to explore more widely these varying conditions.
As neural networks can be used to tackle high dimensionality data (e.g. real-world perception data, Internet of Things data), the input space they have to handle can be both variable and hard to explore exhaustively. In that regard statistical methods allow a wide exploration of their input space, which can be combined with any robustness assessment scheme to reinforce it.
Artificial intelligence (AI) — Assessment of the robustness of neural networks —
Part 3:
Methodology for the use of statistical methods
1.0 Scope
This document provides a methodology for the use of statistical methods to assess robustness properties of neural networks.
2.0 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 22989:2022, Information technology — Artificial intelligence — Artificial intelligence concepts and terminology
ISO/IEC 23053:2022, Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML)
ISO/IEC 24029-2:2023, Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of formal methods
3.0 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO/IEC 22989:2022 and ISO/IEC 23053:2022, as well as the following, apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
- IEC Electropedia: available at http://www.electropedia.org/
- ISO Online browsing platform: available at http://www.iso.org/obp
3.1
stability
ability of an AI system to produce invariant output under specified perturbations on input
4.0 Abbreviated terms
AI	artificial intelligence
OOD	out-of-distribution
5.0 Robustness assessment during the life cycle
5.1 Summary
Figure 1 — Example of AI system life cycle model stages and high-level processes
Robustness assessment spans across the entire life of an AI system. The life cycle of an AI system, drawn from ISO/IEC 22989, is described in Figure 1 and is composed of seven stages: inception, design and development, validation and verification, deployment, operation and monitoring, re-evaluation and retirement. Each stage requires a different outlook on the robustness assessment of a neural network. Evaluation of robustness requires a description of the domain of use of the system, and the set of perturbations the AI system is likely to encounter. This is essential to build a statistical robustness assessment scheme (as well as a formal one as described in ISO/IEC 24029-2, or an empirical one). More on the domain discussion can be found in subclause 6.2.
Evaluation of robustness uses two distinct approaches. The top-down process (subclauses 7.2 and 7.3) starts with the intended domain of the system and systematically derives perturbations through decomposition (e.g., analyzing an autonomous vehicle's domain by breaking down environmental conditions into weather, lighting and road conditions).
The bottom-up process (subclauses 7.4 to 7.6) begins with observed perturbations from real-world data and builds up to domain understanding (e.g., collecting field data on system failures to identify previously unknown perturbation categories).
Both approaches have merits and drawbacks, and their use depends on the stage of the life cycle the system is at. More on the suitability of these at each stage of the life cycle is available in the following subclauses.
To assess robustness, risks to robustness that can occur during the design, development and use of an AI system shall be identified, documented and evaluated. An organization can use ISO/IEC 23894 in conjunction with this document.
5.2 Inception
At inception stage, an AI system’s characteristics, including robustness objectives and the associated metrics for their measurement shall be specified and documented.
The first step to assess robustness is to specify the perturbations against which the system is intended to be robust, by accounting for its intended domain. At this stage a top-down approach should be used to identify the challenges that are going to impact the input of the neural network. In particular, a formal description of the type of challenges, their variability and their intensity should be provided, along with a justification.
Thresholds shall be defined in accordance with the thresholding process described in subclauses of Clause 7 (e.g. subclauses 7.2.3, 7.3.3). Their choices shall be justified in the documentation, with respect to the properties described in Clause 6. If no threshold is being set at this stage, then during design and development stage one shall be defined considering a baseline performance evaluation.
5.3 Design and development
The design and development stage allows the preparation of the robustness assessment by considering the risks identified and the domain of use of the neural network defined during the inception stage. Special care is needed regarding the data preparation and the scenario preparation that is used to assess robustness later on. Since neural networks can be used in a variety of contexts and on various data types, the description available for the data can vary. This does not prevent this descriptive work during the design and development stage, using attributes such as possible nominal ranges, expected distribution, available categories, data format, data missingness and noise. This description is used to produce the robustness evaluation scheme as well as the expected thresholds to be verified during the verification and validation stage. The design and development stage translates the requirements defined at inception to a technical level, combined with the means to evaluate them.
For a proper statistical assessment of robustness, the design and development process shall provide a description of the data using its appropriate characteristics to define a robustness assessment scheme and the appropriate thresholds.
At this stage, some risks or domain definitions can be adjusted to better reflect the development of the system. This process reflects a typical bottom-up approach.
In case new risks to robustness are identified during design and development, new thresholds shall be defined in accordance with subclauses of Clause 7 (e.g. subclauses 7.2.3, 7.3.3) and their choices shall be justified in the documentation, with respect to the properties described in Clause 6.
5.4 Verification and validation
The validation and verification stage aims at evaluating whether the neural networks’ performance meets the thresholds defined in the previous stages. A statistical evaluation of the robustness relies on measuring across the domain of use and against the different perturbations within this domain.
Robustness assessments applied to a high variety of data points may be broken down into regions, and such regions may be chosen with respect to the robustness objective to be verified. However, such breakdowns shall be justified so as not to skew the evaluation process.
EXAMPLE 1 For measuring robustness of an image-input neural network to random noise, one can augment the dataset with noise from different distributions, such as Gaussian and uniform. Each distribution will correspond to a split of the augmented dataset. The evaluation process is not skewed in this case as the underlying unperturbed data distribution is the same in all the splits.
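The construction of such splits can be sketched in Python; this is an illustrative fragment in which the data, the noise scale and the matching of noise magnitudes across distributions are hypothetical choices, not prescribed values:

```python
import random

def gaussian_split(data, sigma, rng):
    """Augment each sample with zero-mean Gaussian noise of scale sigma."""
    return [[x + rng.gauss(0.0, sigma) for x in sample] for sample in data]

def uniform_split(data, sigma, rng):
    """Augment with uniform noise whose standard deviation matches sigma:
    a U(-a, a) variable has standard deviation a / sqrt(3)."""
    a = sigma * 3 ** 0.5
    return [[x + rng.uniform(-a, a) for x in sample] for sample in data]

rng = random.Random(0)
clean = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]  # hypothetical unperturbed inputs
splits = {
    "gaussian": gaussian_split(clean, 0.1, rng),
    "uniform": uniform_split(clean, 0.1, rng),
}
# The splits share the same underlying unperturbed data, so a per-split
# evaluation compares noise distributions rather than data content.
for name, split in splits.items():
    print(name, len(split))
```

Matching the standard deviation of the two noise distributions keeps the comparison between splits about the shape of the noise rather than its magnitude.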
The domain of use shall be covered as much as possible by the verification and validation process with data sets that are complete and representative of the domain they represent. As far as is reasonably practical, care should be taken to ensure the content of verification and validation test sets is semantically independent of the training set(s).
The robustness in each part of the domain shall be evaluated against the respective defined threshold.
Inside the domain of use of a neural network, several perturbations can occur; these are identified during the inception stage and adjusted during design and development. Each can occur individually in the domain, but also in combination. Specific data sets can be made to be representative of these perturbations whether they occur independently or in combination.
EXAMPLE 2 Perturbations can be variations in lighting conditions for a computer vision system operating in daylight hours.
The validation and verification stage shall consider each perturbation separately, whenever possible, to statistically characterize the performance drop of the neural network with or without it.
A statistical evaluation shall also be conducted to evaluate the robustness of the neural network on meaningful combinations of perturbations.
5.5 Deployment
During deployment, a neural network can be tailored to a specific hardware architecture, requiring a change of its underlying numerical accuracy or a different technology stack. Doing so can impact the robustness of the neural network.
Statistical methods should be used to assess the impact on the neural network of a change of arithmetic or a change of the technology stack, and to verify that it does not violate the thresholds measured during the verification and validation stage (e.g., thresholds about stability or sensitivity). Such an assessment shall take into account, whenever applicable, the impact of the change of arithmetic of each operator, as well as the impact of the reordering of numerical expressions (more details on those transformations can be found in ISO/IEC 24029-2:2023, subclause 7.4).
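As a minimal sketch of such a verification, the following Python fragment compares the outputs of a hypothetical linear model before and after its weights are mapped to a coarser fixed-point grid, standing in for a lower-precision deployment target; the weights, probe inputs and threshold are illustrative only:

```python
def linear_model(weights, x):
    """Stand-in model (hypothetical): a plain dot product."""
    return sum(w * xi for w, xi in zip(weights, x))

def quantize(value, step=1 / 256):
    """Round to a fixed-point grid, mimicking a coarser deployment arithmetic."""
    return round(value / step) * step

weights = [0.7312, -0.2189, 0.5504]          # hypothetical trained weights
probe = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]   # probe inputs from the domain of use
quantized = [quantize(w) for w in weights]

# Largest output deviation introduced by the arithmetic change, to be
# compared against the stability threshold from verification and validation.
deviation = max(
    abs(linear_model(weights, x) - linear_model(quantized, x)) for x in probe
)
print(deviation)
```

In practice the probe set would be drawn from the domain of use, and the observed deviation checked against the thresholds retained at the verification and validation stage.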
5.6 Operation and monitoring
Once deployed and in operation, a neural network’s robustness can be monitored to detect defects or altered conditions of use that can lead to decreased robustness. In production a neural network can remain unchanged (i.e. its weights remain unchanged) or it can continue to learn (i.e. continuous learning causing weights to be adjusted continuously). This subclause applies to neural networks that do not perform continuous learning, whereas subclause 5.7 addresses neural networks that do. Statistical robustness assessment during operation uses data points from the domain of use to verify that the behaviour of the system does not violate the thresholds verified during the preceding stages.
When evaluating a neural network, the goal is to verify that the domain on which the system is used is still compatible with the one upon which it has been validated.
A data set representative of the use of the neural network shall be regularly collected and used to verify the robustness of the system using the same thresholds as in the validation and verification stage.
The frequency of the robustness assessment at this stage shall be chosen in accordance with the risk assessment.
EXAMPLE High-risk applications would need more frequent assessments.
5.7 Continuous validation
In the scenario of a neural network performing continuous learning, the robustness of the updated neural network shall be assessed on both the updated data distribution and the previous data distribution, and these assessments shall be recorded for future use.
The frequency of the robustness assessment at this stage shall be chosen in accordance with the risk assessment.
5.8 Re-evaluation
If one or more re-evaluations are part of the life cycle of the system, for example for a midlife evaluation, then robustness is also statistically evaluated. At this point several statistical evaluations have already occurred and can be used to perform a statistical assessment of the robustness over time.
The re-evaluation process shall reproduce the initial verification and validation, as well as the statistical assessment done during operation and monitoring stage and continuous validation stage (if applicable).
The re-evaluation process shall document the evolution of both the neural network and the domain. Similarly to the validation and verification stage, the re-evaluation stage should assess the robustness on different data sets that are representative of the domain of use by the time of the evaluation.
5.9 Retirement
Once retirement has been decided for a neural network, a final statistical assessment of robustness should be done to document the characteristics of the system from beginning to end. This can be especially valuable for the inception and design of future systems that replace the retired one.
6.0 Robustness properties measurable using statistical methods
6.1 General
Robustness is the ability of an AI system to maintain its level of performance under any circumstances. Behind this general goal are in fact multiple related properties.
To further understand this multi-faceted concept, it is important to first consider the associated concept of domain. The domain corresponds to the definition of a certain region, based on meaningful criteria, among all the possible data points that can theoretically be inputs to the AI system. This concept can be applied in different ways to datasets and to AI systems, and at various stages of the AI system life cycle. In particular, when designing and developing a neural network, it is designed for proper functioning within a certain domain: a certain delineation of the inputs that are meant to be processed. However, it can be relevant for stakeholders to also know how the neural network behaves when facing other domains, for instance to inform risk mitigation.
Robustness therefore encompasses two sub-concepts:
- In-domain robustness relates to how the neural network behaves when facing special conditions that can occur within the boundaries of its intended domain of use. This notably includes the presence of outliers, or any other rare event that is part of the domain;
EXAMPLE 1 For a speech recognition system designed for German (as spoken in Germany), in-domain robustness can pertain to the system’s behaviour in case of noise, such as distant speech in the background, recording noise due to a low-quality microphone, or corruption of the audio file.
EXAMPLE 2 For a speech recognition system, in-domain robustness can pertain to the neural network’s response to the presence of borrowed words, i.e. German words that are drawn from another language but have become fully part of the German language. For instance, technical terms originating from English, while being part of the German domain, are atypical compared to other German words. More generally, rare words can be of interest for in-domain robustness.
- OOD robustness relates to how the AI system behaves when facing input data, or any other condition, that typically occurs outside the intended domain. OOD robustness is not limited to data that does not exist in the intended domain, but also covers conditions where the data is distributed differently.
EXAMPLE 3 In the example of a (Germany’s) German speech recognition system, OOD robustness can pertain to the behaviour of the system in case of Austrian German inputs, as Austrian German has particular characteristics (e.g., words, pronunciation) which differ from Germany’s German. Assessing the robustness of the system to Austrian German does not mean that it is designed for Austrian German, but it can provide useful information to users, so that they understand the consequences of using the system on this different form of German. Depending on the stakeholder’s objectives, OOD robustness can focus on one specific alternate domain (Austrian German) or several of them (e.g. Austrian German and Swiss German) or it can target the entirety of the possible inputs that are out of the intended domain (all forms of German, including Belgium, Luxembourg, Liechtenstein, as well as any local German-speaking community in other countries).
EXAMPLE 4 If the intended domain includes German speakers that are equally distributed between male and female speakers, with a certain associated performance over that domain (which is the result of how it behaves on male and on female inputs), then using the system in particular conditions where 80% of speakers are male and 20% female can result in practice in a different observed performance, which is therefore a robustness concern.
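The arithmetic behind EXAMPLE 4 can be made explicit with a short Python fragment; the per-group accuracies and the mixes below are hypothetical numbers chosen for illustration:

```python
def expected_performance(acc_by_group, mix):
    """Performance observed under a given mix of subgroup proportions."""
    return sum(acc_by_group[g] * p for g, p in mix.items())

acc = {"male": 0.96, "female": 0.90}        # hypothetical per-group accuracies
design_mix = {"male": 0.5, "female": 0.5}   # intended domain: balanced speakers
field_mix = {"male": 0.8, "female": 0.2}    # observed deployment conditions

print(expected_performance(acc, design_mix))  # balanced mix
print(expected_performance(acc, field_mix))   # shifted mix
```

Under these numbers the balanced mix yields an expected performance of 0.93, while the 80%/20% mix yields 0.948: the same model, observed under a different input mix, exhibits a different performance, which is the robustness concern at stake.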
See subclause 6.2 for more information on domains, how to delineate them and how to use them for consideration of in-domain or OOD robustness assessment.
As the ability to maintain performance, robustness properties build on an existing property of performance. Performance corresponds to a quantitative measure of how well the AI system processes its inputs to produce outputs, which is termed ‘functional correctness’ in ISO/IEC 25059’s AI system quality model (see ISO/IEC 25059, Annex C, on the use of the term ‘performance’). While performance focuses on proper functioning of the AI system in typical conditions, robustness pertains to how this characteristic evolves when facing alternate, extreme, or any other unexpected or unwanted conditions.
While it can rely in practice on similar tools or methods, robustness assessment is different from the analysis of performance, which encompasses any further detailed measurement or test that can be conducted to better understand the performance of the AI system and the mechanisms it results from. For instance, it can be useful to analyze performance on subparts of the domain, such as measuring differentiated performance on inputs with and without a given feature, to understand how that feature affects performance and possibly to improve the design of the AI system for better performance. Robustness assessment, by contrast, pertains to examining the AI system’s response to the presence of a certain undesirable feature or condition. While both cases can include contrastive measurements of performance, the objective pursued by the assessment is not the same.
Maintaining performance can in practice correspond to a broad range of practical expectations on the AI system’s behaviour in response to the perturbation. For instance, is the AI system expected to keep its output unchanged in case of perturbation? Or is it expected to account for the change by producing a different output, but only within a given limit? Depending on how precise the robustness assessment is expected to be, and depending on the approach adopted for that assessment, it can be useful to further specify, ahead of the robustness assessment, the exact nature of responses to perturbations that are expected from the AI system to be considered as robust. See subclause 6.3 for possible criteria that can be used to formulate those expectations.
NOTE By contrast with formal methods for robustness assessment, for which this detailed specification of expected response is a prerequisite (see ISO/IEC 24029-2:2023, Clause 5 for more details), statistical assessment includes both methods that require this further detailed information and methods that do not, given that those expectations can also be captured implicitly through the data used.
Robustness can be achieved either through technical or organizational measures. As a result, the technical assessment of robustness can consist in a direct assessment of whether the AI system’s behaviour is itself robust, or it can focus on assessing whether the conditions to trigger organizational measures are appropriately met (e.g. the accurate detection of a perturbation, to hand over control to an operator).
Finally, the statistical assessment of robustness relies on information about the perturbations to which the system is intended to be robust. In some cases these can be expressed as particular, well-delineated conditions (e.g. the occurrence of a given event), or as a dimension to vary along (which can be one of the attributes defining the domain, or another dimension). Perturbations can consist of noise applied to the input data overall or to parts of it, or of other artefacts (e.g. a certain semantic change of an image). In the case of OOD robustness, information on the alternate domains of interest can already convey that information, or it can warrant further specification. It is not possible to draw an exhaustive list of possible ways to specify those circumstances, as it largely depends on the use case; however, it is crucial information for guiding the robustness assessment process through statistical means (e.g. the selection of data to be used, or the preferred assessment method).
When conducting a robustness assessment using statistical methods, the organization shall define and document:
- the intended domain of use of the AI system, including its relevant attributes, as described in subclause 6.2;
- whether the robustness property to assess pertains to in-domain robustness, including the definition of the criteria for atypical circumstances that are considered; or to OOD robustness, including either the identification of one or more alternate domains that are considered for robustness, or the definition of criteria for atypical circumstances;
- if applicable and needed for the application of the chosen assessment methods, a further detailed specification of the expected behaviour of the AI system in case of perturbations, as described in subclause 6.3.
6.2 In-domain robustness assessment
AI systems and neural networks are designed with a specific task and a particular environment of use in mind. To that effect the relevant stakeholders can evaluate their performance using data points that are drawn from it. This corresponds to the notion of domain, which can for instance be described as a set of bounded attributes that the system is intended to be used on. The choice of the attributes is often not straightforward, as certain domains can be harder to describe than others. Also, these attributes can be numerical or non-numerical (e.g. categories, texts, graphs). Describing a domain allows defining specific variations of the input that the AI system can face. These perturbations can also be bounded and their definition can be used to verify robustness properties of the system.
NOTE This concept is also used in ISO/IEC 24029-2, where more details can be found.
In-domain robustness includes adversarial robustness. This refers to the system's ability to maintain correct outputs when faced with adversarial examples, which are inputs with slight perturbations designed to cause incorrect response of the neural network.
Distribution-shift robustness is the statistical measure of the ability of the model to generalize on data that comes from a statistically different distribution than the training data, especially when the independent and identically distributed assumption between the model’s training data and the model’s production data is violated. Such a distribution shift can occur due to synthetic distribution shifts (e.g. image clarity reduction), natural distribution shifts (e.g. small changes in pixel intensity due to natural effects), or other real-world changes in the deployment environment.
6.3 Specification of expected behaviour
6.3.1 General
There can be different expectations on the behaviour of the AI system in response to perturbation, depending on how and how much its outputs are expected to change. This subclause describes possible ways to specify these expectations.
NOTE The concepts described in this subclause are partly aligned with concepts described in ISO/IEC 24029-2:2023, Clause 5, for robustness assessment through formal methods, but they are framed or formulated differently as a result of the different ways by which they are used in the context of statistical assessment methods.
6.3.2 Stability
Stability properties correspond to the ability of the system to produce invariant output when facing a certain perturbation, for instance within a neighborhood of a certain size considering the domain on which to assess the robustness.
EXAMPLE 1 A possible method for evaluating stability is to assess if the neural network can stay stable facing increasingly changing inputs (e.g., injection of random noise with increasing magnitude).
EXAMPLE 2 A possible method is to assess if the neural network is stable against group actions on the inputs (e.g. invariance against rotations or symmetries for an image input).
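EXAMPLE 1 above can be sketched in a few lines of Python; the stand-in model, the noise magnitudes and the number of trials are hypothetical choices for illustration:

```python
import random

def model(x):
    """Stand-in binary classifier (hypothetical): sign of the feature sum."""
    return 1 if sum(x) >= 0 else 0

def stability_rates(x, magnitudes, trials=200, seed=0):
    """For each noise magnitude, fraction of noisy copies of x whose
    output matches the output on the clean input."""
    rng = random.Random(seed)
    reference = model(x)
    rates = {}
    for m in magnitudes:
        kept = sum(
            model([xi + rng.gauss(0.0, m) for xi in x]) == reference
            for _ in range(trials)
        )
        rates[m] = kept / trials
    return rates

rates = stability_rates([0.4, 0.3], [0.01, 0.5, 2.0])
print(rates)  # stability degrades as the injected noise grows
```

The per-magnitude rates can then be compared against the stability thresholds defined for the assessment.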
6.3.3 Sensitivity
Sensitivity is the ability of the system, when facing variations of the inputs, to produce outputs that remain within a certain neighborhood, or whose confidence varies only within a certain limit. The variation of the input can be related to a certain perturbation, for instance within a neighborhood of a certain size considering the domain on which to assess the robustness.
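A crude empirical estimate of sensitivity is the largest observed ratio of output change to input change over sampled perturbations, as in the following illustrative Python fragment (the stand-in model and the neighborhood radius are hypothetical):

```python
import random

def model(x):
    """Stand-in regression model (hypothetical): a smooth non-linear map."""
    return (x[0] * x[1], x[0] + x[1] ** 2)

def empirical_sensitivity(x, radius, trials=500, seed=0):
    """Largest observed output change per unit of input change over
    random perturbations drawn within the given radius."""
    rng = random.Random(seed)
    y0 = model(x)
    worst = 0.0
    for _ in range(trials):
        delta = [rng.uniform(-radius, radius) for _ in x]
        d_in = sum(d * d for d in delta) ** 0.5
        if d_in == 0.0:
            continue
        y1 = model([xi + d for xi, d in zip(x, delta)])
        d_out = sum((a - b) ** 2 for a, b in zip(y1, y0)) ** 0.5
        worst = max(worst, d_out / d_in)
    return worst

worst_ratio = empirical_sensitivity([1.0, 2.0], 0.1)
print(worst_ratio)
```

Such a sampled estimate only lower-bounds the true worst-case ratio; formal methods (ISO/IEC 24029-2) can provide guaranteed bounds instead.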
6.3.4 Relevance
Relevance relates to the acceptability of an output through an analysis of the influence of the inputs on the outputs of the neural network. This is done to measure how much a system stays robust against variations of its input. It also helps human operators to understand subjectively the output of the system and assess its robustness.
Several techniques exist to assess, for a given input of a neural network, how much each dimension of the input influences the output (see ISO/IEC TS 6254 for more details). In a statistical evaluation of robustness, the analysis assesses each input of the data set using interpretability techniques. Therefore, this analysis is valid for one input at a time. Generalizing from these results across the domain, or part of it, can be difficult in cases of high domain dimensionality, since one interpretation can vary widely from one input to another. In particular, since the interpretation of each of the relevance results is done through a (subjective) human intervention, their generalization amplifies the bias introduced by the concerned stakeholder.
6.3.5 Reachability
Verifying a reachability property corresponds to checking whether an AI agent (in the sense of ISO/IEC 22989) can reach a set of states when using the neural network to control itself in a given environment. The robustness of a neural network helps ensure that it does not drift, which carries the risk of placing the AI agent in a predefined set of states. Reachability can be seen as a variation of the stability or sensitivity properties, where the AI agent is used in a loop. Stability or sensitivity is measured considering a data set where each input is considered independent. In a reachability property, the inputs tested construct a coherent sequence representing the action of an agent over time.
Assessing reachability through a statistical approach has some limitations given the number of sequences of outputs the neural network produces. Due to their non-linear behaviour, it is possible that some sequences leading to the predefined set of states are missed. This is especially the case if the environment reacting to the output of the neural network is also non-deterministic and has a non-linear response. Assessing the probability of those missed sequences is also difficult as their non-linearity is not always known in advance.
Statistical methods to assess the sequence space of a neural network should only be used when the dimensionality of the output space is small enough, and should rely on a sufficient number of sequences to ensure proper coverage of the possible sequence space.
6.2 Other considerations that contribute to robustness assessment
The assessment of robustness can involve the assessment of other goals that are not directly robustness, for instance when robustness is achieved through organizational means and not directly by the AI system itself. One example is robustness achieved by handing over control to an operator when certain circumstances are detected. In this case, the statistical assessment can pertain specifically to the capability of out-of-distribution (OOD) detection. OOD detection performance characterizes the ability of the model to flag inputs that are completely outside the system’s input data distribution domain.
7 Statistical measures of robust behaviour
7.1 General
This subclause introduces the two main types of statistical measures for robustness assessment: performance-based assessment methods and direct assessment methods. In both cases, these measures are applied to datasets, which shall be selected according to Clause 8 to be representative of the robustness property of interest. Different assessment methods utilize that data in different ways, and hence induce different constraints on the type of datasets to use:
- Some statistical assessment methods require data sampled from an in-domain or out-of-domain distribution. For instance, performance-based assessment involves both;
- Some statistical assessment methods require only raw data, while others require annotated data. For instance, metamorphic testing methods rely on comparison of outputs (analysis of the response to a change introduced in the input), so they can be applied without data annotations being available.
7.1.1 Robustness assessment based on performance drops
7.1.1.1 General
Since robustness consists in maintaining performance, one approach to assessing robustness is to explicitly measure that performance and quantify its drop. The same performance assessment procedure is applied under expected conditions and under unexpected conditions, and the difference between the results yields a robustness assessment. Performance assessment shall be conducted using evaluation metrics that are appropriate for the task of the system, as described in ISO/IEC TS 4213 for classification, regression, clustering and recommendation tasks or ISO/IEC AWI 23282 for NLP tasks. This approach can be especially useful when assessing OOD robustness.
7.1.1.2 Examples
This subclause presents some examples of how to construct the different data sets that can be used to evaluate the drop in performance of a neural network.
The data set used to measure the performance drop can be constructed using specific perturbations. For example, perturbation analysis introduces Gaussian noise into the input data and evaluates the model’s performance on the modified data. A model can be considered robust if its performance is relatively stable despite these changes, whereas a significant drop in performance indicates that the model is sensitive to small input changes. Generally, the noise applied to the input data can vary and represents perturbations that can realistically occur in the system environment.
EXAMPLE Perturbations can be power line noise for ECG signals, lighting changes or weather conditions for images, accent variations and background noise for speech, or sensor degradation in IoT data, so as to cover real-world data variability comprehensively.
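The perturbation analysis described above can be sketched as follows. This informative Python example uses a hypothetical one-dimensional threshold classifier and synthetic data (both illustrative assumptions, not part of this document) to measure the drop in accuracy when Gaussian noise is added to the inputs.

```python
import random

def accuracy(model, data):
    """Fraction of (input, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def add_gaussian_noise(data, sigma, rng):
    """Perturb every input feature with zero-mean Gaussian noise."""
    return [([xi + rng.gauss(0.0, sigma) for xi in x], y) for x, y in data]

# Hypothetical 1-D threshold classifier and synthetic labelled data.
model = lambda x: int(x[0] > 0.5)
rng = random.Random(0)
inputs = [[rng.random()] for _ in range(1000)]
data = [(x, model(x)) for x in inputs]      # labels agree with the clean model

clean_acc = accuracy(model, data)            # 1.0 by construction
noisy_acc = accuracy(model, add_gaussian_noise(data, sigma=0.1, rng=rng))
drop = clean_acc - noisy_acc                 # the statistic to threshold
```

In practice the noise level `sigma` would be chosen to reflect perturbations that realistically occur in the system environment.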
Another example is adversarial testing, which involves evaluating the model's performance on adversarial examples, which are inputs specifically designed to deceive the neural network, e.g. slight perturbations that cause misclassification. A significant drop in performance under adversarial conditions can signal a lack of robustness.
7.1.1.3 Thresholding process
A decision on whether the system is robust along the property of interest is made based on the difference between performance on in-domain data and on out-of-domain data. The lower the drop, the more robust the AI system is along the robustness property being assessed. A threshold shall be determined ahead of the assessment, based on the task, domain, specific use case and context of the AI system, and the existing body of knowledge and applicable good practices for those expected (in-domain) and unexpected (out-of-domain) conditions. The AI system shall be considered robust along that property if the measured performance drop is lower than that threshold.
7.2 Targeted assessment of the system's response to a perturbation
7.2.1 General
When the phenomena to assess robustness against are sufficiently understood to be explicitly modelled, it is possible to conduct direct assessment of robustness along the property of interest. In that approach, a robustness test is designed to assess whether the AI system’s behaviour is robust to those phenomena, and the percentage of test instances that pass or fail for a given test yields a robustness assessment.
This approach can be especially useful when assessing in-domain robustness of stability, sensitivity or reachability properties.
7.2.2 Examples
In the CheckList methodology [7], robustness tests are generated by applying automated perturbations (possibly involving in turn the use of other data, templates, or another AI system) to a broad set of input sentences.
EXAMPLE Examples of stability and sensitivity tests include:
- The stability property in presence of typos can be assessed by randomly swapping characters with their neighbour and observing whether the output changes;
- Stability in presence of URLs in social media texts can be assessed by adding randomly generated URLs at the end of the existing texts and checking that it does not change the original output;
- For a sentiment analysis system, sensitivity to vocabulary can be assessed by adding a positive phrase at the end of a text (regardless of whether it is positive or negative) and checking that the confidence in the positive sentiment class increases (or at least does not decrease more than a certain threshold).
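A metamorphic stability test of the kind described in the first list item can be sketched as follows. The sentiment model, the texts and the decision rule are hypothetical placeholders; only the test structure (perturb the input, re-run the model, compare outputs) reflects the methodology.

```python
import random

def swap_typo(text, rng):
    """Introduce a typo by swapping one character with its right neighbour."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def failure_rate(model, texts, rng):
    """Fraction of inputs whose output changes under the typo perturbation."""
    return sum(model(t) != model(swap_typo(t, rng)) for t in texts) / len(texts)

# Hypothetical sentiment model: positive iff the text contains "good".
model = lambda t: "positive" if "good" in t else "negative"
texts = ["this is a good movie", "a bad film", "good acting, weak plot"]
rate = failure_rate(model, texts, random.Random(1))
# The failure rate is then compared against a pre-set threshold (see 7.2.3).
```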
In [8], a statistical approach is used to evaluate the probability that a program reaches a particular state, which makes it possible to verify a reachability property. Different kinds of estimators can be used for this purpose; those proposed by the authors are part of a vast literature on the topic.
7.2.3 Thresholding process
A decision on whether the system is robust along the property of interest is made based on the percentage of test instances that fail for a given robustness test. The lower that percentage, the more robust the AI system is along the robustness property being assessed.
A threshold shall be determined ahead of the assessment, based on the task, domain, specific use case and context of the AI system, and existing body of knowledge and applicable good practices for that robustness test. The AI system shall be considered robust along that property if the measured failure rate is lower than that threshold.
7.3 Assessment through separability metrics
7.3.1 General
Recent work demonstrates the link between data separability and the robustness of a model, in particular in the case of adversarial robustness assessment [9]. This relates to the notion of stability presented in subclause 6.3. The goal of this type of assessment is to estimate a lower bound of the minimal perturbation needed to alter the decision of a neural network.
This approach based on direct assessment methods can be especially useful when assessing in-domain robustness of stability and sensitivity properties.
7.3.2 Example
In [10], the authors propose a method that transforms the robustness evaluation process into a local Lipschitz constant estimation problem and applies extreme value theory to solve it. This provides an estimate of the lower bound of the minimal perturbation needed to alter the decision of a given neural network.
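The following informative sketch illustrates the underlying idea of estimating a local Lipschitz constant by random sampling. It is not the extreme value theory estimator of [10]; the function, radius and sample count are illustrative assumptions.

```python
import math
import random

def local_lipschitz_estimate(f, x, radius, n_samples, rng):
    """Estimate a local Lipschitz constant of f around x as the largest
    observed ratio |f(x + d) - f(x)| / |d| over random perturbations d."""
    best = 0.0
    for _ in range(n_samples):
        d = rng.uniform(-radius, radius)
        if d != 0.0:
            best = max(best, abs(f(x + d) - f(x)) / abs(d))
    return best

f = math.tanh       # stand-in for a network output; its slope never exceeds 1
L = local_lipschitz_estimate(f, 0.0, radius=0.5, n_samples=2000,
                             rng=random.Random(0))
# Given a known decision margin m at x, m / L lower-bounds the minimal
# perturbation needed to alter the decision.
```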
7.3.3 Thresholding process
The method proposed in [10] yields results comparable to those of other statistical metrics. Nonetheless, such methods should be used to compare the robustness of different models and not in isolation as an absolute assessment of robustness.
7.4 Assessment through statistical reliability engineering
7.4.1 General
Evaluating local robustness can be achieved through the computation of uncertainties with a statistical approach. Contrary to direct assessment, where the perturbation is known in advance, this kind of assessment estimates the probability of a robustness property violation. The estimation can be done in several ways depending on the task performed by the neural network and the size of its input space. To make the estimation scalable, several strategies exist to split and explore the input space and efficiently assess the robustness of the neural network.
Unlike formal verification methods, which can be sound (providing guarantees) but incomplete (e.g. abstract interpretation, which is unable to check all properties), these statistical approaches are not sound but can explore the entire input space and provide counter-examples.
7.4.2 Example
Assessment through statistical reliability engineering can rely on a stochastic simulation inspired by that field [11] or on estimating the probability that a violation of robustness occurs [12]. In the first case, the assessment is conducted as the resolution of a statistical hypothesis test, where a neural network is locally robust if the estimated probability of failure is lower than a critical level. As a statistical approach based on sampling, it relies on an exploration of the input space using a simulation procedure that can generate samples of rare events. In the second case, a multi-level splitting of the input space is done to reduce the span that needs to be covered to find rare events that can violate the robustness property.
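As an informative illustration, the following sketch estimates a violation probability by crude Monte Carlo sampling; the classifier, input point and critical level are hypothetical. Crude sampling of this kind is precisely what the rare-event simulation of [11] and the multi-level splitting of [12] are designed to improve upon, since it cannot efficiently find very rare violations.

```python
import random

def violation_probability(model, x, label, radius, n_samples, rng):
    """Crude Monte Carlo estimate of the probability that a uniform random
    perturbation of the input (within the given radius) flips the decision."""
    flips = sum(
        model([xi + rng.uniform(-radius, radius) for xi in x]) != label
        for _ in range(n_samples)
    )
    return flips / n_samples

model = lambda x: int(x[0] + x[1] > 1.0)   # hypothetical linear classifier
x = [0.8, 0.8]                              # point well inside class 1
p = violation_probability(model, x, model(x), radius=0.1,
                          n_samples=5000, rng=random.Random(2))
critical = 0.01                             # pre-set critical level (see 7.4.3)
locally_robust = p < critical
```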
Another approach can be used to estimate the uncertainty of a neural network, for example using Shannon entropy [13]. Shannon entropy measures the uncertainty or randomness of a model’s outputs via its predicted probability distribution, with lower entropy values across all predictions suggesting a higher level of robustness. Low entropy indicates high confidence, whereas high entropy is associated with low confidence and potential model instability.
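The entropy-based measure can be sketched as follows; the example probability distributions are illustrative.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a predicted probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

confident = [0.97, 0.02, 0.01]   # low entropy: high confidence
uncertain = [0.34, 0.33, 0.33]   # near-uniform: high entropy, low confidence
low, high = shannon_entropy(confident), shannon_entropy(uncertain)
# The mean entropy across a dataset's predictions can then serve as the
# statistic compared against a pre-set threshold.
```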
Monte Carlo dropout [14] can also be used to assess the uncertainty of a neural network decision. This method can be used during the design and training stage or the operation stage of the neural network life cycle. It involves using dropout layers to approximate the uncertainty in the model's predictions by performing multiple forward passes with different dropout masks. The variance of the predictions across these forward passes gives an estimate of the uncertainty of the model decision. A model is considered robust if the variance estimate is low and the predictions are relatively stable for most inputs; conversely, if the output distribution varies widely for a given input, the model is less stable. It is noted that Monte Carlo dropout entails significant computational overhead due to the multiple forward passes; trade-offs and optimization strategies, such as reducing the number of samples, hardware acceleration, or selective application to certain layers, should therefore be considered.
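The Monte Carlo dropout procedure can be sketched as follows on a deliberately tiny one-layer model; the weights, dropout rate and number of passes are illustrative assumptions, and a real assessment would use the trained network’s own dropout layers.

```python
import random
import statistics

def dropout_forward(x, weights, p_drop, rng):
    """One stochastic forward pass of a tiny one-layer linear model, with
    inverted dropout applied to the inputs."""
    kept = [0.0 if rng.random() < p_drop else xi / (1.0 - p_drop) for xi in x]
    return sum(w * k for w, k in zip(weights, kept))

def mc_dropout(x, weights, p_drop, n_passes, rng):
    """Mean and variance of the output across stochastic forward passes."""
    outs = [dropout_forward(x, weights, p_drop, rng) for _ in range(n_passes)]
    return statistics.mean(outs), statistics.variance(outs)

mean, var = mc_dropout([1.0, 0.5, -0.2], [0.4, 0.1, 0.3],
                       p_drop=0.1, n_passes=200, rng=random.Random(3))
# A low variance across passes indicates a stable (more robust) decision.
```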
7.4.3 Thresholding process
A decision on whether the system is robust along the property of interest is made based on a critical probability set beforehand, by assessing whether the probability of the neural network violating the robustness property is lower or greater than this critical probability. This critical probability threshold shall be determined ahead of the assessment, based on the task, domain, specific use case and context of the AI system, and the existing body of knowledge and applicable good practices for that robustness test. The AI system shall be considered robust along that property if the violation probability is lower than that threshold.
7.5 Robustness assessment through curvature analysis
7.5.1 General
When the manifold assumption is valid (see [15] for more details), it can be argued that a robust model outputs a data manifold that is as simple as possible. Greater simplicity in the output manifold leads to higher robustness. In other words, as the output manifold becomes simpler (e.g. lower curvature), the impact of the input manifold on the network becomes lower, resulting in a more robust network. Therefore, measuring the output manifold curvature is a direct method to assess the robustness of the network.
7.5.2 Examples
One way to assess the curvature of the output space is presented in [15], where the method uses both the training manifold and the output manifold.
7.5.3 Thresholding process
The method proposed in [15] provides comparative means to estimate the robustness of different neural networks. Nonetheless, such methods should be used to compare the robustness of different models and not in isolation as an absolute assessment of robustness.
8 Selection of datasets representative of a robustness dimension
8.1 General
In statistical assessment, the robustness dimension being assessed is not formally defined, but captured implicitly through the use of data exhibiting that dimension. For that reason, appropriate selection of datasets that are representative of the targeted dimension is an important aspect of the robustness assessment process.
Relevant datasets shall be identified, built or acquired for the statistical assessment of robustness, in accordance with the characteristics of the use case, with the targeted robustness dimension and with any requirements of the chosen assessment method.
NOTE For detailed guidance on data analytics and statistical assessment methodologies, refer to ISO/IEC 5259 (all parts) and EN ISO/IEC 8183.
8.1.1 Datasets needed for robustness assessment
Datasets involved in the statistical assessment of robustness can be of three types:
- Common datasets: In that case, the dataset is a collection of data instances, representative of a given domain. It should be similar to production data that is expected according to the intended use of the AI system. Datasets of that type are commonly used when automated transforms are applied as part of the assessment method (e.g. in the case of metamorphic testing), which can be either performance-based or direct assessment of robustness;
- Instance-level contrastive data: In that case, the dataset is a collection of data point pairs. For each data instance, it includes both a data point considered as normal condition and a data point considered as abnormal condition, according to the targeted robustness dimension. The two data points should constitute a minimal pair: the only difference between both data points is characterized by the targeted dimension. The set of normal data points shall be representative of the intended domain of use of the AI system. Datasets of that type are more commonly used when performing direct assessment of robustness;
- Set-level contrastive data: In that case, the data consists of two separate datasets, one comprising data points considered as normal conditions and one comprising data points considered as abnormal conditions. The main difference between both datasets shall be the one expressed by the targeted dimension. The two datasets can have a different size and come from different sources. The dataset in normal conditions shall be representative of the intended domain of use of the AI system. Datasets of that type are more commonly used when applying performance-based assessment of robustness.
The appropriate type of dataset depends on the assessment method. The design choices of datasets used for that assessment shall be documented and justified.
8.1.2 Acquisition of data for robustness assessment
The data needed for robustness assessment can be acquired using one or more of the following processes:
- Selection of instances. This process is similar to selection of data for performance assessment. Data points are sourced and possibly filtered to fit a given domain. It can be involved in acquiring either normal or abnormal data points;
- Generation of new data. This process involves manual creation or automated generation to produce the desired data. It can be involved in acquiring either normal or abnormal data points;
- Selection of minimal pairs. This process aims at identifying minimal pairs (normal and abnormal data points differentiated only by the targeted dimension) within an existing repository of data. It can be useful for acquiring instance-level contrastive data, in which case it is applied after the selection of instances or the generation of new data;
- Transformation of existing data. This process applies manual or automated transforms to a given set of data points. The transforms are designed to express the phenomena associated to the targeted robustness dimension. This process is applied after the selection of instances or the generation of new data points. It can be useful for acquiring either common datasets (when the desired dataset is not found within naturally occurring data), instance-level contrastive data (to build the abnormal counterpart of each normal data point) or set-level contrastive data (to create an abnormal dataset).
Depending on the chosen assessment method, additional data preparation steps can be necessary, such as data annotation or using statistical methods to design the dataset.
EXAMPLE 1 Bootstrap resampling involves generating multiple new datasets by sampling with replacement from the original dataset and then retraining the model on each of these bootstrapped datasets. The variability in performance due to different training sets can be assessed via the standard deviation or confidence intervals of performance metrics across the different resampled datasets and provides an insight into how robust the model is to changes in the data distribution.
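The bootstrap procedure of EXAMPLE 1 can be sketched as follows; the "training" step is a deliberately trivial threshold fit on synthetic data, standing in for the retraining of a real model.

```python
import random
import statistics

def train(data):
    """Deliberately trivial 'training': fit a threshold at the input mean."""
    t = statistics.mean(x for x, _ in data)
    return lambda x: int(x > t)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

rng = random.Random(4)
data = [(x, int(x > 0.1)) for x in (rng.gauss(0.0, 1.0) for _ in range(200))]

scores = []
for _ in range(100):                        # 100 bootstrap replicates
    boot = [data[rng.randrange(len(data))] for _ in range(len(data))]
    scores.append(accuracy(train(boot), data))
spread = statistics.stdev(scores)
# A small spread suggests the model is robust to changes in the training data.
```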
EXAMPLE 2 Cross validation techniques involve splitting the dataset into several folds and training the neural network multiple times on different subsets of the data while evaluating it on the remaining folds. If the model shows low variability in performance across different folds, it suggests that the model is robust and generalizes well whereas high variance in performance indicates the model is sensitive to specific data splits.
Cross-validation gives only a rough estimate [18] of how performance can be maintained (for more details see ISO/IEC TS 4213). It can be used to compare the robustness of different algorithms by comparing how they perform on average.
Several dataset preparation strategies are available to perform a cross-validation assessment of robustness. Some of them are presented in this subclause.
- Holdout validation is the simplest approach: it splits the dataset in two, using the first part for training and the other for testing purposes. However, the test set can contain features and information that the training set does not cover, preventing the neural network from learning them and resulting in lower robustness of the model.
- Leave-one-out cross-validation involves training the model on all but one data point at a time, with performance evaluated on the left-out point; this is repeated for all data points. This method is computationally expensive but helps to assess the stability of the model across individual data points. If the model's performance fluctuates widely depending on which point is excluded, it suggests the model is not robust and can be overfitting to specific data points.
- Stratified cross-validation splits the dataset into k folds while keeping in each fold the same class distribution as the entire dataset. During each iteration, one fold is used for testing and the remaining folds are used for training. The process is repeated k times, with each fold serving as the test set exactly once. This method is useful to validate the robustness of classification neural networks trained on data sets with imbalanced classes.
- For k-fold cross-validation, the dataset is split into k folds, and training is done on all but one fold. The method is repeated k times, with a different fold reserved for testing each time.
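The k-fold procedure from the list above can be sketched as follows; the "training" step is a deliberately trivial threshold fit on synthetic data, used only to keep the sketch self-contained.

```python
import random
import statistics

def train(data):
    """Deliberately trivial 'training': fit a threshold at the input mean."""
    t = statistics.mean(x for x, _ in data)
    return lambda x: int(x > t)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

rng = random.Random(5)
data = [(x, int(x > 0.0)) for x in (rng.gauss(0.0, 1.0) for _ in range(200))]
rng.shuffle(data)

k = 5
folds = [data[i::k] for i in range(k)]      # k disjoint folds
scores = []
for i in range(k):
    train_set = [d for j, f in enumerate(folds) if j != i for d in f]
    scores.append(accuracy(train(train_set), folds[i]))
fold_spread = statistics.stdev(scores)
# Low variability across folds suggests robustness to the data split.
```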
The respective processes applied to acquire the data for robustness assessment shall be documented and justified with respect to the design choices of the datasets and to the constraints resulting from the chosen assessment method. Documentation includes describing selection criteria as well as any tool, algorithm or other automated process involved in the selection, generation or transformation of the data.
8.1.3 Measuring adversarial robustness
Adversarial robustness can be measured by computing the misclassification rate of the input samples when subjected to a fixed maximum input perturbation. Methods such as the Fast Gradient Sign Method, Projected Gradient Descent, or mechanistic interpretability techniques such as trace variation analysis can be used to introduce the adversarial perturbation.
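For a linear model the Fast Gradient Sign Method admits a closed-form gradient, which allows the misclassification-rate measurement to be sketched in a few lines; the weights, inputs and perturbation budget are illustrative assumptions.

```python
import math

# Hypothetical linear classifier: class 1 iff w.x + b > 0.
w, b = [0.9, -0.4], 0.05
predict = lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

def fgsm(x, label, epsilon):
    """Fast Gradient Sign Method for this linear model: the loss gradient
    with respect to the input is proportional to +/- w, so each feature is
    moved by epsilon in the direction that works against the current label."""
    sign = 1.0 if label == 0 else -1.0
    return [xi + sign * epsilon * math.copysign(1.0, wi)
            for xi, wi in zip(x, w)]

inputs = [[0.2, 0.1], [-0.1, 0.3], [0.5, -0.2], [-0.3, -0.1]]
labels = [predict(x) for x in inputs]
adversarial = [fgsm(x, y, epsilon=0.15) for x, y in zip(inputs, labels)]
rate = sum(predict(a) != y for a, y in zip(adversarial, labels)) / len(labels)
# `rate` is the adversarial misclassification rate at this perturbation budget.
```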
[1] ISO/IEC 25059:2023, Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model for AI systems
[2] ISO/IEC 24029-2:2023, Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of formal methods
[3] ISO/IEC 22989:2022, Information technology — Artificial intelligence — Artificial intelligence concepts and terminology
[4] ISO/IEC 23894:2023, Information technology — Artificial intelligence — Guidance on risk management
[5] ISO/IEC TS 6254:2025, Information technology — Artificial intelligence — Objectives and approaches for explainability and interpretability of machine learning (ML) models and artificial intelligence (AI) systems
[6] ISO/IEC TS 4213:2022, Information technology — Artificial intelligence — Assessment of machine learning classification performance
[7] Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.
[8] Lee, S., & Böhme, M. (2023). Statistical Reachability Analysis. ESEC/FSE 2023.
[9] Pal, A., Sulam, J., & Vidal, R. (2023). Adversarial examples might be avoidable: The role of data concentration in adversarial robustness. Advances in Neural Information Processing Systems, 36, 46989-47015.
[10] Xiong, Y., & Zhang, B. (2022). SMART: A Robustness Evaluation Framework for Neural Networks.
[11] Tit, K., Furon, T., & Rousset, M. (2021). Efficient Statistical Assessment of Neural Network Corruption Robustness. NeurIPS 2021.
[12] Webb, S., Rainforth, T., Teh, Y. W., & Kumar, P. (2019). A statistical approach to assessing neural network robustness. ICLR 2019.
[13] Zhou, M. (2020). A Theory on AI Uncertainty Based on Rademacher Complexity and Shannon Entropy. IICSPI 2020.
[14] Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. PMLR 48.
[15] Sekmen, A., & Bilgin, B. (2024). Manifold-based approach for neural network robustness analysis.
[16] ISO/IEC 5259, Data quality for analytics and machine learning (ML)
[17] EN ISO/IEC 8183:2024, Information technology - Artificial intelligence - Data life cycle framework (ISO/IEC 8183:2023)
[18] Lei, J. (2025). A Modern Theory of Cross-Validation through the Lens of Stability. arXiv:2505.23592v2.
