Data Preparation
Plain English Translation
ISO/IEC 42001 Annex A.7.6 requires organizations to define and document the criteria used to select AI data preparation methods. Because machine learning algorithms are often sensitive to missing entries, varying scales, and non-normal distributions, explicit steps such as data cleaning, normalization, scaling, and encoding must be planned deliberately. By standardizing and documenting these preprocessing requirements, organizations establish audit-ready data pipeline controls and significantly reduce the risk of AI system errors.
Technical Implementation
Required actions are listed below by organization size.
Required Actions (startup)
- Document basic data cleaning and imputation steps applied to datasets.
- Establish baseline criteria for handling missing or malformed entries in training data.
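For a startup, the documented criteria can be as simple as a small, reviewable script. The sketch below (plain Python; the median rule and field handling are illustrative assumptions, not prescribed by the standard) fills missing numeric values with the column median, drops malformed entries, and records which rule was applied so the step itself becomes evidence:

```python
import statistics

# Hypothetical documented criteria for one numeric column:
#  - missing entries (None) are imputed with the column median
#  - entries that cannot be parsed as numbers are dropped
IMPUTATION_RULE = "median"

def clean_column(values):
    """Apply the documented cleaning and imputation rules to one column."""
    parsed = []
    for v in values:
        if v is None:
            parsed.append(None)          # keep the gap so it can be imputed
            continue
        try:
            parsed.append(float(v))
        except (TypeError, ValueError):
            pass                          # malformed entry: dropped per criteria
    median = statistics.median([v for v in parsed if v is not None])
    cleaned = [median if v is None else v for v in parsed]
    # The returned record documents what was done, for audit purposes.
    return cleaned, {"rule": IMPUTATION_RULE, "imputed": parsed.count(None)}

cleaned, record = clean_column([3, None, "5", "bad", 10])
# cleaned -> [3.0, 5.0, 5.0, 10.0]; record notes one imputed value
```

Even at this scale, returning the applied rule alongside the data keeps the documentation and the implementation from drifting apart.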
Required Actions (scaleup)
- Implement automated, version-controlled data preprocessing pipelines.
- Maintain a registry of approved normalization, scaling, and encoding transforms tailored for specific AI tasks.
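A transform registry can be sketched as a lookup of approved, versioned functions that pipelines must go through. The structure below is a minimal illustration (the registry shape and transform names are assumptions); in practice the registry itself would live in version control:

```python
# Hypothetical registry of approved, versioned preprocessing transforms.
# Pipelines may only reference transforms registered here, so the set of
# allowed normalization/scaling/encoding steps is auditable in one place.
APPROVED_TRANSFORMS = {}

def register(name, version):
    """Decorator that records a transform under a (name, version) key."""
    def wrap(fn):
        APPROVED_TRANSFORMS[(name, version)] = fn
        return fn
    return wrap

@register("min_max_scale", "1.0")
def min_max_scale(values):
    """Scale numeric values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def apply_transform(name, version, values):
    """Fail closed: unregistered transforms cannot be applied."""
    if (name, version) not in APPROVED_TRANSFORMS:
        raise ValueError(f"transform {name}@{version} is not approved")
    return APPROVED_TRANSFORMS[(name, version)](values)
```

Failing closed on unregistered transforms is what turns the registry from documentation into an enforced control.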
Required Actions (enterprise)
- Integrate data preparation criteria natively into robust ML pipeline orchestration tools.
- Enforce statistical exploration and rigorous labeling standards, documenting all steps as automated audit evidence per AI system.
ISO/IEC 42001 Annex A.7.6 requires organizations to define and document the criteria for selecting data preparation methods, as well as the specific methods actually used. This ensures that data preprocessing steps are deliberate, repeatable, and appropriately aligned with the requirements of the specific AI task.
Criteria should be defined based on the specific AI task, the algorithms being utilized, and the inherent characteristics of the raw data. Organizations must consider how different models tolerate missing entries, non-normal distributions, and varying data scales when establishing these criteria.
According to ISO/IEC 42001 implementation guidance, relevant activities include statistical exploration of the data, data cleaning (handling missing or incorrect entries), imputation, normalization, scaling, target variable labeling, and encoding categorical variables.
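Two of these activities, normalization and categorical encoding, can be sketched in a few lines. This is an illustrative minimal version using only the standard library (the choice of z-score normalization and one-hot encoding is an assumption; the guidance does not mandate specific techniques):

```python
import statistics

def zscore(values):
    """Normalize to zero mean and unit variance (population std used here)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Encode categorical values as one-hot vectors with a stable,
    sorted category order so the encoding is reproducible across runs."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

normalized = zscore([2, 4, 6])            # centered around 0
encoded = one_hot(["dog", "cat", "cat"])  # [[0, 1], [1, 0], [1, 0]]
```

The documented criteria would state *which* of these methods applies to which column and why, for example that one-hot encoding was chosen because the downstream algorithm cannot consume ordinal codes.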
Auditors expect to see documented criteria for selecting methods, alongside concrete records of the specific transforms and preparations applied to a given AI system's data. This demonstrates that data pipeline controls and training-data preprocessing requirements are actively managed. Tools like WatchDog Security's Compliance Center can help link Annex A.7.6 to the specific SOPs, validation rules, and evidence packages used for each AI system.
Organizations should utilize detailed metadata, data provenance logs, and version-controlled transformation scripts within their ML pipelines. This documentation links the specific data preparation steps to the assigned personnel, ensuring accountability and traceability.
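One way to realize this traceability is a provenance record emitted by each preparation step, linking a content hash of the input and output data to the responsible owner and the script version. The record shape below is a hypothetical sketch, not a prescribed schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(rows):
    """Deterministic content hash of a dataset snapshot."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

def provenance_record(step, owner, script_version, input_rows, output_rows):
    """One provenance log entry tying a preparation step to its owner.

    Hypothetical fields: 'owner' is the assigned person (accountability),
    'script_version' would typically be a git tag or commit hash.
    """
    return {
        "step": step,
        "owner": owner,
        "script_version": script_version,
        "input_sha256": fingerprint(input_rows),
        "output_sha256": fingerprint(output_rows),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

rec = provenance_record("normalize", "jane.doe", "v1.2.0", [1, 2], [0.0, 1.0])
```

Chaining these records (the output hash of one step matching the input hash of the next) yields an end-to-end lineage trail for the pipeline.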
Data preparation (A.7.6) directly improves data quality (A.7.4) by resolving statistical errors and formatting issues. Concurrently, data provenance (A.7.5) tracks the origin and lineage of the data before and after these preparation methods are applied, forming a complete AI data governance lifecycle.
Data preparation criteria apply comprehensively to all data utilized by the AI system, encompassing training, validation, testing, and live production data. Consistent application of preparation methods is critical to ensure the AI system performs reliably across its entire lifecycle.
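A concrete way to enforce that consistency is the fit/transform pattern: scaling parameters are derived from the training data once, then the same fitted parameters are reused for validation, test, and production data. The sketch below shows the pattern in plain Python (the scaler itself is illustrative):

```python
class MinMaxScaler:
    """Fit scaling parameters on training data once, reuse them everywhere.

    Applying the same fitted parameters across validation, test, and live
    production data keeps the preparation method consistent over the
    lifecycle, and avoids leaking evaluation-data statistics into training.
    """

    def fit(self, train_values):
        self.lo, self.hi = min(train_values), max(train_values)
        return self

    def transform(self, values):
        return [(v - self.lo) / (self.hi - self.lo) for v in values]

# Parameters come from training data only ...
scaler = MinMaxScaler().fit([0, 10, 20])
# ... and are then applied unchanged to any later data, even out-of-range values.
production = scaler.transform([30])  # -> [1.5]
```

Persisting the fitted parameters (here `lo` and `hi`) alongside the model is what makes the production pipeline reproduce the training-time preparation exactly.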
Data preparation criteria should be reviewed whenever there are significant changes to the AI system, when new algorithms are deployed, or during routine management reviews. Continual improvement ensures the preprocessing methods remain effective against evolving data sources. Tools like WatchDog Security's Policy Management can support review cadences by tracking owners, approval workflows, and prior versions of the criteria and procedures.
Roles such as data scientists, data engineers, and domain experts should be explicitly assigned responsibility for defining, selecting, and applying data preparation methods. Clear role-based accountability is an essential element of responsible AI development.
Criteria should dictate precisely how missing data is imputed without introducing statistical bias, and how sensitive attributes are encoded or anonymized. Formally documented cleaning, normalization, and encoding strategies help ensure the prepared data does not reinforce unwanted historical social biases.
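Both concerns can be made concrete. The sketch below shows group-wise median imputation (so a global fill value does not drag a minority subgroup toward the majority's distribution) and salted hashing of a sensitive identifier; both the scheme and the field names are hypothetical, and a real deployment would follow the organization's anonymization policy:

```python
import hashlib
import statistics

def groupwise_impute(rows, group_key, value_key):
    """Impute missing values with the median of the row's own group,
    so one global fill value does not skew subgroups."""
    medians = {}
    for g in {r[group_key] for r in rows}:
        vals = [r[value_key] for r in rows
                if r[group_key] == g and r[value_key] is not None]
        medians[g] = statistics.median(vals)
    return [
        {**r, value_key: medians[r[group_key]] if r[value_key] is None else r[value_key]}
        for r in rows
    ]

def pseudonymize(value, salt):
    """Replace a sensitive attribute with a truncated salted hash
    (illustrative only; not a substitute for a vetted anonymization scheme)."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:16]
```

Documenting *why* group-wise imputation was chosen over a global statistic is exactly the kind of selection criterion Annex A.7.6 asks for.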
Standardizing data preparation requires consistent criteria, defined owners, and repeatable evidence across AI systems. Tools like WatchDog Security's Compliance Center can map Annex A.7.6 requirements to specific artifacts (e.g., SOPs and validation rules), track implementation status, and centralize audit evidence showing which preparation criteria were approved and when.
Data preparation criteria often change as models, features, or data sources evolve, so approvals and version history matter for auditability. Tools like WatchDog Security's Policy Management can maintain controlled versions of data preparation standards and SOPs, record stakeholder approvals, and track attestations so teams can demonstrate that the latest criteria were formally reviewed and adopted.
| Version | Date | Author | Description |
|---|---|---|---|
| 1.0.0 | 2026-02-23 | WatchDog Security GRC Team | Initial publication |