rapids/docs/developers/validation-schema-config.md

8.5 KiB

Validation schema of config.yaml

!!! hint "Why do we need to validate the config.yaml?" Most of the key/values in the config.yaml are constrained to a set of possible values or types. For example [TIME_SEGMENTS][TYPE] can only be one of ["FREQUENCY", "PERIODIC", "EVENT"], and [TIMEZONE] has to be a string.

We should show the user an error if that's not the case. We could validate this in Python or R but since we reuse scripts and keys in multiple places, tracking these validations can be time consuming and get out of control. Thus, we do these validations through a schema and check that schema before RAPIDS starts processing any data so the user can see the error right away.

Keep in mind these validations can only cover certain base cases. Some validations that require more complex logic should still be done in the respective script. For example, we can check that a CSV file path actually ends in `.csv` but we can only check that the file actually exists in a Python script.

The structure and values of the config.yaml file are validated using a YAML schema stored in tools/config.schema.yaml. Each key in config.yaml, for example PIDS, has a corresponding entry in the schema where we can validate its type, possible values, required properties, min and max values, among other things.

The config.yaml is validated against the schema every time RAPIDS runs (see the top of the Snakefile):

validate(config, "tools/config.schema.yaml")

Structure of the schema

The schema has three main sections required, definitions, and properties. All of them are just nested key/value YAML pairs, where the value can be a primitive type (integer, string, boolean, number) or can be another key/value pair (object).

required

required lists properties that should be present in the config.yaml. We will almost always add every config.yaml key to this list (meaning that the user cannot delete any of those keys like TIMEZONE or PIDS).

definitions

definitions lists key/values that are common to different properties so we can reuse them. You can define a key/value under definitions and use $ref to refer to it in any property.

For example, every sensor like [PHONE_ACCELEROMETER] has one or more providers like RAPIDS and PANDA, these providers have some common properties like the COMPUTE flag or the SRC_SCRIPT string. Therefore we define a shared provider "template" that is used by every provider and extended with properties exclusive to each one of them. For example:

=== "provider definition (template)" The PROVIDER definition will be used later on different properties.

```yaml
PROVIDER:
    type: object
    required: [COMPUTE, SRC_SCRIPT, FEATURES]
    properties:
    COMPUTE:
        type: boolean
    FEATURES:
        type: [array, object]
    SRC_SCRIPT:
        type: string
        pattern: "^.*\\.(py|R)$"
```

=== "provider reusing and extending the template" Notice that RAPIDS (a provider) uses and extends the PROVIDER template in this example. The FEATURES key is overriding the FEATURES key from the #/definitions/PROVIDER template but is keeping the validation for COMPUTE, and SRC_SCRIPT. For more details about reusing properties, go to this link

```yaml hl_lines="9 10"
PHONE_ACCELEROMETER:
    type: object
     # .. other properties
    PROVIDERS:
        type: ["null", object]
        properties:
        RAPIDS:
            allOf:
            - $ref: "#/definitions/PROVIDER"
            - properties:
                FEATURES: 
                    type: array
                    uniqueItems: True
                    items:
                    type: string
                    enum: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
```

properties

properties are nested key/values that describe the different components of our config.yaml file. Values can be of one or more primitive types like string, number, array, boolean and null. Values can also be another key/value pair (of type object) that are similar to a dictionary in Python.

For example, the following property validates the PIDS of our config.yaml. It checks that PIDS is an array with unique items of type string.

PIDS:
    type: array
    uniqueItems: True
    items:
      type: string

Modifying the schema

!!! hint "Validating the config.yaml during development" If you updated the schema and want to check the config.yaml is compliant, you can run the command snakemake --list-params-changes. You will see Building DAG of jobs... if there are no problems or an error message otherwise (try setting any COMPUTE flag to a string like test instead of False/True).

You can use this command without having to configure RAPIDS to process any participants or sensors.

You can validate different aspects of each key/value in our config.yaml file:

=== "number/integer" Including min and max values ```yaml MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: type: number minimum: 0 maximum: 1

FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD:
    type: integer
    exclusiveMinimum: 0
```

=== "string" Including valid values (enum) yaml items: type: string enum: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"] === "boolean" yaml MINUTES_DATA_USED: type: boolean === "array" Including whether or not it should have unique values, the type of the array's elements (strings, numbers) and valid values (enum). yaml MESSAGES_TYPES: type: array uniqueItems: True items: type: string enum: ["received", "sent"] === "object" PARENT is an object that has two properties. KID1 is one of those properties that are, in turn, another object that will reuse the "#/definitions/PROVIDER" definition AND also include (extend) two extra properties GRAND_KID1 of type array and GRAND_KID2 of type number. KID2 is another property of PARENT of type boolean.

The schema validation looks like this
```yaml
PARENT:
    type: object
    properties:
      KID1:
        allOf:
          - $ref: "#/definitions/PROVIDER"
          - properties:
              GRAND_KID1:
                type: array
                uniqueItems: True
              GRAND_KID2:
                type: number
      KID2:
        type: boolean
```

The `config.yaml` key that the previous schema validates looks like this:
```yaml
PARENT:
    KID1:
        # These four come from the `PROVIDER` definition (template)
        COMPUTE: False
        FEATURES: [x, y] # an array
        SRC_SCRIPT: "a path to a py or R script"

        # This two come from the extension
        GRAND_KID1: [a, b] # an array
        GRAND_KID2: 5.1 # an number
     KID2: True # a boolean
```

Verifying the schema is correct

We recommend that before you start modifying the schema you modify the config.yaml key that you want to validate with an invalid value. For example, if you want to validate that COMPUTE is boolean, you set COMPUTE: 123. Then create your validation, run snakemake --list-params-changes and make sure your validation fails (123 is not boolean), and then set the key to the correct value. In other words, make sure it's broken first so that you know that your validation works.

!!! warning Be careful. You can check that the schema config.schema.yaml has a valid format by running python tools/check_schema.py. You will see this message if its structure is correct: Schema is OK. However, we don't have a way to detect typos, for example allOf will work but allOF won't (capital F) and it won't show any error. That's why we recommend to start with an invalid key/value in your config.yaml so that you can be sure the schema validation finds the problem.

Useful resources

Read the following links to learn more about what we can validate with schemas. They are based on JSON instead of YAML schemas but the same concepts apply.