Add `config.yaml` validation documentation.

feature/plugin_sentimental
JulioV 2021-02-23 16:22:14 -05:00
parent 1a9bcf8e37
commit d73bfdde1f
4 changed files with 180 additions and 37 deletions

@ -4,6 +4,7 @@
- Add support for Empatica devices (all sensors)
- Add logo
- Move Citation page to the Setup section
- Add `config.yaml` validation schema and documentation.
## v0.4.3
- Fix bug when any of the rows from any sensor do not belong to a time segment
## v0.4.2

@ -1,36 +0,0 @@
# Configuration Schema
The configuration schema is a JSON schema for validating the structure of sensor data. By running the command `snakemake --list-params-changes`, RAPIDS will check whether all settings match the standard without computing the sensor data. The configuration can be divided into three parts: required, definitions, and properties.
### Required
This part specifies the required features that need to be defined under the properties. If there is any missing feature, it will throw an error message.
For example, there is a required feature called `Testing_feature` but it is not defined in the properties. When running `snakemake --list-params-changes`, here is the error message you will see, `ValidationError: 'Testing_feature' is a required property`.
### Definitions
Features might share some common properties. To reuse the structure of these common properties in other places, you can define the property under the definitions and then use `$ref` to refer to it.
For example, there is a defined property called `PROVIDERS` in `definitions` and a new feature called `phone_messages` has such property.
![picture](/img/provider_example.png)
To reuse the definition of `PROVIDERS`, all you need to do is add `$ref: "#/definitions/PROVIDER"` to the additional property of `phone_messages`.
![picture](/img/phone_messages_example.png)
You can also overwrite an existing property in `PROVIDERS` if its definition differs from the current `PROVIDER`, or extend the properties of `PROVIDERS`.
For more examples and details, please refer to http://json-schema.org/understanding-json-schema/structuring.html#reuse
### Properties
This is where features are defined. The definition of a feature usually contains three things: the type of the feature, a list of required properties, and the definition of each property. A feature is usually an object, which is like a dictionary in Python. Everything inside the object is a pair, meaning that it has a key and a value. In JSON schema, the `keys` must always be strings. Besides object, other common types are string, numeric, array, boolean and null.
For example,
![picture](/img/feature_example.png)
The sensor feature PHONE_AWARE_LOG is an object with two required properties, `TABLE` and `PROVIDERS`. The `TABLE` contains the name of the data table used for computation, which is a string. The `PROVIDERS` has two types, `null` and `object` meaning that PHONE_AWARE_LOG may or may not have a provider.
### References
- Understanding JSON Schema: http://json-schema.org/understanding-json-schema/index.html
- A Media Type for Describing JSON Documents: https://tools.ietf.org/html/draft-handrews-json-schema-01

@ -0,0 +1,178 @@
# Validation schema of `config.yaml`
!!! hint "Why do we need to validate the `config.yaml`?"
Most of the key/values in the `config.yaml` are constrained to a set of possible values or types. For example `[TIME_SEGMENTS][TYPE]` can only be one of `["FREQUENCY", "PERIODIC", "EVENT"]`, and `[TIMEZONE]` has to be a string.
We should show the user an error if that's not the case. We could validate this in Python or R but since we reuse scripts and keys in multiple places, tracking these validations can be time consuming and get out of control. Thus, we do these validations through a schema and check that schema before RAPIDS starts processing any data so the user can see the error right away.
Keep in mind these validations can only cover certain base cases. Some validations that require more complex logic should still be done in the respective script. For example, we can check that a CSV file path actually ends in `.csv` but we can only check that the file actually exists in a Python script.
The structure and values of the `config.yaml` file are validated using a YAML schema stored in `tools/config.schema.yaml`. Each key in `config.yaml`, for example `PIDS`, has a corresponding entry in the schema where we can validate its type, possible values, required properties, min and max values, among other things.
The `config.yaml` is validated against the schema every time RAPIDS runs (see the top of the `Snakefile`):
```python
validate(config, "tools/config.schema.yaml")
```
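To make the idea concrete, here is a minimal sketch of the kind of check the schema performs for the two keys mentioned in the hint above. It is a hand-written illustration that only uses the standard library; it is not how Snakemake's `validate` is implemented, and `check_config` and the sample dictionaries are hypothetical.

```python
# Toy illustration of schema-style checks; NOT Snakemake's implementation.
ALLOWED_SEGMENT_TYPES = ["FREQUENCY", "PERIODIC", "EVENT"]


def check_config(config):
    """Return a list of human-readable problems found in `config`."""
    errors = []
    # [TIMEZONE] has to be a string
    if not isinstance(config.get("TIMEZONE"), str):
        errors.append("[TIMEZONE] must be a string")
    # [TIME_SEGMENTS][TYPE] is constrained to a fixed set of values
    segment_type = (config.get("TIME_SEGMENTS") or {}).get("TYPE")
    if segment_type not in ALLOWED_SEGMENT_TYPES:
        errors.append(
            f"[TIME_SEGMENTS][TYPE] must be one of {ALLOWED_SEGMENT_TYPES}"
        )
    return errors


# Hypothetical excerpts of a config.yaml, already loaded as dictionaries
good = {"TIMEZONE": "America/New_York", "TIME_SEGMENTS": {"TYPE": "PERIODIC"}}
bad = {"TIMEZONE": 5, "TIME_SEGMENTS": {"TYPE": "DAILY"}}

print(check_config(good))  # no problems
print(check_config(bad))   # one problem per invalid key
```

Writing these checks by hand for every key is exactly the bookkeeping the schema avoids: the rules live in one file and are applied before any data is processed.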
## Structure of the schema
The schema has three main sections: `required`, `definitions`, and `properties`. All of them are just nested key/value YAML pairs, where the value can be a primitive type (`integer`, `string`, `boolean`, `number`) or another key/value pair (`object`).
### required
`required` lists `properties` that should be present in the `config.yaml`. We will almost always add every `config.yaml` key to this list (meaning that the user cannot delete any of those keys like `TIMEZONE` or `PIDS`).
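As a rough sketch of what the `required` check amounts to, the snippet below reports which listed keys are absent from a configuration. The helper `missing_required` and the two-key `required` list are illustrative assumptions, not code from RAPIDS.

```python
def missing_required(config, required):
    """Return the required top-level keys that are absent from `config`."""
    return [key for key in required if key not in config]


required = ["PIDS", "TIMEZONE"]    # hypothetical excerpt of the `required` list
config = {"PIDS": ["p01", "p02"]}  # the user deleted TIMEZONE

print(missing_required(config, required))  # ['TIMEZONE']
```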
### definitions
`definitions` lists key/values that are common to different `properties` so we can reuse them. You can define a key/value under `definitions` and use `$ref` to refer to it in any `property`.
For example, every sensor like `[PHONE_ACCELEROMETER]` has one or more providers like `RAPIDS` and `PANDA`. These providers share some properties, like the `COMPUTE` flag or the `SRC_FOLDER` string, so we define a common provider "template" that every provider uses and extends with properties exclusive to it:
=== "provider definition (template)"
The `PROVIDER` definition will be used later on different `properties`.
```yaml
PROVIDER:
    type: object
    required: [COMPUTE, SRC_FOLDER, SRC_LANGUAGE, FEATURES]
    properties:
        COMPUTE:
            type: boolean
        FEATURES:
            type: [array, object]
        SRC_FOLDER:
            type: string
        SRC_LANGUAGE:
            type: string
            enum: [python, r]
```
=== "provider reusing and extending the template"
Notice that in this example `RAPIDS` (a provider) is using and extending the `PROVIDER` template. The `FEATURES` key narrows the `FEATURES` key from the `#/definitions/PROVIDER` template (with `allOf` both sets of rules must hold, so the stricter one effectively wins) while keeping the validation for `COMPUTE`, `SRC_FOLDER`, and `SRC_LANGUAGE`. For more details about reusing properties go to this [link](http://json-schema.org/understanding-json-schema/structuring.html#reuse).
```yaml hl_lines="9 10"
PHONE_ACCELEROMETER:
    type: object
    # .. other properties
    PROVIDERS:
        type: ["null", object]
        properties:
            RAPIDS:
                allOf:
                    - $ref: "#/definitions/PROVIDER"
                    - properties:
                        FEATURES:
                            type: array
                            uniqueItems: True
                            items:
                                type: string
                                enum: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
```
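The sketch below illustrates how a `$ref` points back into `definitions` and how `allOf` stacks the rules from every branch. The schema is written as Python dictionaries so the example needs no YAML parser; `resolve_ref` and `rules_per_property` are hypothetical helpers that only handle local references, not the resolver used by the validation library.

```python
# Abbreviated copy of the PROVIDER template, expressed as a Python dict
SCHEMA = {
    "definitions": {
        "PROVIDER": {
            "type": "object",
            "properties": {
                "COMPUTE": {"type": "boolean"},
                "FEATURES": {"type": ["array", "object"]},
                "SRC_FOLDER": {"type": "string"},
                "SRC_LANGUAGE": {"type": "string", "enum": ["python", "r"]},
            },
        }
    }
}

# A provider that reuses the template and adds its own FEATURES rules
RAPIDS = {
    "allOf": [
        {"$ref": "#/definitions/PROVIDER"},
        {"properties": {"FEATURES": {"type": "array", "uniqueItems": True}}},
    ]
}


def resolve_ref(schema, ref):
    """Follow a local reference such as '#/definitions/PROVIDER'."""
    if not ref.startswith("#/"):
        raise ValueError("only local references are handled in this sketch")
    node = schema
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def rules_per_property(schema, node):
    """Collect, for each property, the rules contributed by every allOf branch."""
    rules = {}
    for branch in node.get("allOf", [node]):
        if "$ref" in branch:
            branch = resolve_ref(schema, branch["$ref"])
        for name, rule in branch.get("properties", {}).items():
            rules.setdefault(name, []).append(rule)
    return rules


rules = rules_per_property(SCHEMA, RAPIDS)
print(sorted(rules))           # properties validated for RAPIDS
print(len(rules["FEATURES"]))  # FEATURES is constrained by both branches
```

`FEATURES` ends up with two rule sets, one from the template and one from the extension, and a value has to satisfy both of them.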
### properties
`properties` are nested key/values that describe the different components of our `config.yaml` file. Values can be of one or more primitive types like `string`, `number`, `array`, `boolean` and `null`. Values can also be another key/value pair (of type `object`), which is similar to a dictionary in Python.
For example, the following property validates the `PIDS` of our `config.yaml`. It checks that `PIDS` is an `array` with unique items of type `string`.
```yaml
PIDS:
    type: array
    uniqueItems: True
    items:
        type: string
```
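In plain Python, the three rules in the `PIDS` property correspond roughly to the checks below. This is only a sketch to show what each keyword means; `check_pids` is a hypothetical helper and the real validation is done by the schema library.

```python
def check_pids(pids):
    """Mimic the rules of the PIDS property: array, string items, unique items."""
    # type: array
    if not isinstance(pids, list):
        return ["PIDS must be an array"]
    errors = []
    # items: type: string
    if not all(isinstance(pid, str) for pid in pids):
        errors.append("every PIDS item must be a string")
    # uniqueItems: True
    seen = []
    for pid in pids:
        if pid in seen:
            errors.append(f"duplicated PIDS item: {pid!r}")
            break
        seen.append(pid)
    return errors


print(check_pids(["p01", "p02"]))  # valid
print(check_pids(["p01", "p01"]))  # breaks uniqueItems
print(check_pids(["p01", 2]))      # breaks the item type
print(check_pids("p01"))           # not an array
```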
## Modifying the schema
!!! hint "Validating the `config.yaml` during development"
If you updated the schema and want to check the `config.yaml` is compliant, you can run the command `snakemake --list-params-changes`. You will see `Building DAG of jobs...` if there are no problems or an error message otherwise (try setting any `COMPUTE` flag to a string like `test` instead of `False/True`).
You can use this command without having to configure RAPIDS to process any participants or sensors.
You can validate different aspects of each key/value in our `config.yaml` file:
=== "number/integer"
Including min and max values
```yaml
MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS:
    type: number
    minimum: 0
    maximum: 1
FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD:
    type: integer
    exclusiveMinimum: 0
```
=== "string"
Including valid values (`enum`)
```yaml
items:
    type: string
    enum: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"]
```
=== "boolean"
```yaml
MINUTES_DATA_USED:
    type: boolean
```
=== "array"
Including whether or not it should have unique values, the type of the array's elements (`strings`, `numbers`) and valid values (`enum`).
```yaml
MESSAGES_TYPES:
    type: array
    uniqueItems: True
    items:
        type: string
        enum: ["received", "sent"]
```
=== "object"
`PARENT` is an object that has two properties. `KID1` is one of those properties that is in turn another object that will reuse the `"#/definitions/PROVIDER"` `definition` **AND** also include (extend) two extra properties `GRAND_KID1` of type `array` and `GRAND_KID2` of type `number`. `KID2` is another property of `PARENT` of type `boolean`.
The schema validation looks like this:
```yaml
PARENT:
    type: object
    properties:
        KID1:
            allOf:
                - $ref: "#/definitions/PROVIDER"
                - properties:
                    GRAND_KID1:
                        type: array
                        uniqueItems: True
                    GRAND_KID2:
                        type: number
        KID2:
            type: boolean
```
The `config.yaml` key that the previous schema validates looks like this:
```yaml
PARENT:
    KID1:
        # These four come from the `PROVIDER` definition (template)
        COMPUTE: False
        FEATURES: [x, y] # an array
        SRC_FOLDER: "any string"
        SRC_LANGUAGE: "python" # must be `python` or `r` (see the `enum` in the template)
        # These two come from the extension
        GRAND_KID1: [a, b] # an array
        GRAND_KID2: 5.1 # a number
    KID2: True # a boolean
```
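Going back to the numeric rules in the "number/integer" tab, the sketch below spells out the difference between `minimum`/`maximum` (inclusive) and `exclusiveMinimum` (strict), and between `number` and `integer`. `check_number` is a hypothetical helper that covers only these four keywords; it is not the validator RAPIDS uses.

```python
def is_json_number(value):
    # bool is a subclass of int in Python, but schemas treat booleans separately
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_number(value, rule):
    """Check `value` against the type and range keywords found in `rule`."""
    if not is_json_number(value):
        return [f"{value!r} is not a number"]
    if rule.get("type") == "integer":
        if isinstance(value, float) and not value.is_integer():
            return [f"{value!r} is not an integer"]
    errors = []
    if "minimum" in rule and value < rule["minimum"]:
        errors.append(f"{value!r} is less than minimum {rule['minimum']}")
    if "maximum" in rule and value > rule["maximum"]:
        errors.append(f"{value!r} is greater than maximum {rule['maximum']}")
    if "exclusiveMinimum" in rule and value <= rule["exclusiveMinimum"]:
        errors.append(
            f"{value!r} is not greater than exclusiveMinimum "
            f"{rule['exclusiveMinimum']}"
        )
    return errors


# The two rules from the "number/integer" tab, written as Python dicts
RATIO = {"type": "number", "minimum": 0, "maximum": 1}
THRESHOLD = {"type": "integer", "exclusiveMinimum": 0}

print(check_number(0.5, RATIO))       # inside [0, 1]
print(check_number(1.5, RATIO))       # above the maximum
print(check_number(0, THRESHOLD))     # 0 is excluded by exclusiveMinimum
print(check_number(True, THRESHOLD))  # a boolean is not a number
```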
## Verifying the schema is correct
We recommend that before you start modifying the schema you modify the `config.yaml` key that you want to validate with an invalid value. For example, if you want to validate that `COMPUTE` is boolean, you set `COMPUTE: 123`. Then create your validation, run `snakemake --list-params-changes` and make sure your validation fails (123 is not `boolean`), and then set the key to the correct value. In other words, make sure it's broken first so that you know that your validation works.
!!! warning
**Be careful**. You can check that the schema `config.schema.yaml` has a valid format by running `python tools/check_schema.py`. You will see this message if its structure is correct: `Schema is OK`. However, we don't have a way to detect typos: for example, `allOf` will work but `allOF` (capital `F`) won't, and no error will be shown because unknown keywords are silently ignored during validation. That's why we recommend starting with an invalid key/value in your `config.yaml`, so that you can be sure the schema validation finds the problem.
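One possible way to catch such typos is to walk the schema and flag any key in a keyword position that is not a known keyword. The sketch below is an assumption about how this could be done, not part of `tools/check_schema.py`; `find_unknown_keywords` is hypothetical and `KNOWN_KEYWORDS` is a deliberately partial list that would need extending for a real schema.

```python
# Partial list of schema keywords; extend it to match the keywords you use
KNOWN_KEYWORDS = {
    "type", "required", "properties", "definitions", "items", "enum",
    "allOf", "anyOf", "oneOf", "not", "$ref", "$schema", "uniqueItems",
    "minimum", "maximum", "exclusiveMinimum", "exclusiveMaximum",
    "additionalProperties", "description", "title", "default",
}
NAME_MAPS = {"properties", "definitions"}  # children are user-chosen names
SCHEMA_LISTS = {"allOf", "anyOf", "oneOf"}  # children are lists of subschemas
SINGLE_SCHEMAS = {"items", "not", "additionalProperties"}


def find_unknown_keywords(node, path="#"):
    """Return the paths of keys that sit in a keyword position but are unknown."""
    problems = []
    if not isinstance(node, dict):
        return problems
    for key, value in node.items():
        here = f"{path}/{key}"
        if key not in KNOWN_KEYWORDS:
            problems.append(here)
        elif key in NAME_MAPS and isinstance(value, dict):
            for name, sub in value.items():
                problems += find_unknown_keywords(sub, f"{here}/{name}")
        elif key in SCHEMA_LISTS and isinstance(value, list):
            for index, sub in enumerate(value):
                problems += find_unknown_keywords(sub, f"{here}/{index}")
        elif key in SINGLE_SCHEMAS:
            problems += find_unknown_keywords(value, here)
    return problems


# Hypothetical schema fragment with the `allOF` typo from the warning above
schema = {
    "type": "object",
    "properties": {
        "KID1": {"allOF": [{"$ref": "#/definitions/PROVIDER"}]},
        "KID2": {"type": "boolean"},
    },
}

print(find_unknown_keywords(schema))  # ['#/properties/KID1/allOF']
```

Names under `properties` and `definitions` are skipped on purpose, since those are chosen by us (`KID1`, `PROVIDER`) and are not keywords.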
## Useful resources
Read the following links to learn more about what we can validate with schemas. They are based on `JSON` instead of `YAML` schemas but the same concepts apply.
- [Understanding JSON Schemas](http://json-schema.org/understanding-json-schema/index.html)
- [Specification of the JSON schema we use](https://tools.ietf.org/html/draft-handrews-json-schema-01)

@ -126,7 +126,7 @@ nav:
- Documentation: developers/documentation.md
- Testing: developers/testing.md
- Test cases: developers/test-cases.md
- Configuration Schema: developers/config_schema.md
- Validation schema of config.yaml: developers/validation-schema-config.md
- Others:
- Migrating from beta: migrating-from-old-versions.md
- Code of Conduct: code_of_conduct.md