Add support for smartphone sources and schemas.
Initial support for accelerometer. Update docs for automatically creating participants. Update docs for initial multiple time zones. (pull/128/head)
parent
e417aa3a6a
commit
dc11cb593d
@@ -42,9 +42,8 @@ TIME_SEGMENTS: &time_segments

  # See https://www.rapids.science/latest/setup/configuration/#device-data-source-configuration
  PHONE_DATA_CONFIGURATION:
    SOURCE:
      TYPE: DATABASE
      TYPE: aware_mysql
      DATABASE_GROUP: *database_group
      DEVICE_ID_COLUMN: device_id # column name
    TIMEZONE:
      TYPE: SINGLE
      VALUE: *timezone
@@ -0,0 +1,296 @@

# Add New Data Streams

A data stream is a set of sensor data collected using a specific type of **device** with a specific **format** and stored in a specific **container**. RAPIDS is agnostic to data streams' formats and containers (see the [Data Streams Introduction](../data-streams-introduction) for a list of supported data streams).

In short, a format describes how raw data logged by a device maps to the data expected by RAPIDS, and a container is a script that connects to the database or file where that data is stored.

The most common cases when you would want to implement a new data stream are:

- You collected data with a mobile sensing app RAPIDS does not support yet. For example, [Beiwe](https://www.beiwe.org/) data stored in MySQL. You will need to define a new format and a new container.
- You collected data with a mobile sensing app RAPIDS supports, but this data is stored in a container that RAPIDS can't connect to yet. For example, AWARE data stored in PostgreSQL. In this case, you can reuse the format of the `aware_mysql` stream, but you will need to implement a new container script.

!!! hint
    RAPIDS supports smartphone, Fitbit, and Empatica devices; at the moment you can only add new data streams for the first two.

## Formats and Containers in RAPIDS

**CONTAINER**. The container of a data stream is queried using a `container.R` script. This script implements functions that will pull data from a database, file, etc.

**FORMAT**. The format of a data stream is described using a `format.yaml` file. A format file describes the mapping between your stream's raw data and the data that RAPIDS needs.

Both the `container.R` and the `format.yaml` are saved under `src/data/streams/[stream_name]`, where `[stream_name]` can be `aware_mysql`, for example.
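For instance, these are the two files that define the `aware_mysql` stream:

```bash
src/data/streams/aware_mysql/container.R
src/data/streams/aware_mysql/format.yaml
```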
## Implement a Container

The `container.R` script of a data stream should be implemented in R. This script must have two functions if you are implementing a stream for phone data, or one function otherwise.

=== "download_data"

    This function returns the data columns for a specific sensor and participant. It has the following parameters:

    | Param | Description |
    |--------------------|-------------------------------------------------------------------------------------------------------|
    | data_configuration | Any parameters (keys/values) set by the user in any `[DEVICE_DATA_STREAMS][stream_name]` key of `config.yaml`. For example, `[DATABASE_GROUP]` inside `[FITBIT_DATA_STREAMS][fitbitjson_mysql]` |
    | sensor_container | The value set by the user in any `[DEVICE_SENSOR][CONTAINER]` key of `config.yaml`. It can be a table, file path, or whatever data source you want to support that contains the **data from a single sensor for all participants**. For example, `[PHONE_ACCELEROMETER][CONTAINER]` |
    | device | The device id that you need to get the data for (this is set by the user in the [participant files](../../setup/configuration/#participant-files)). For example, in AWARE this device id is a uuid |
    | columns | A list of the columns that you need to get from `sensor_container`. You specify these columns in your stream's `format.yaml` |

    !!! example
        This is the `download_data` function we implemented for `aware_mysql`. Note that we can `message`, `warn`, or `stop` the user during execution.

        ```r
        download_data <- function(data_configuration, device, sensor_container, columns){
            group <- data_configuration$SOURCE$DATABASE_GROUP
            dbEngine <- dbConnect(MariaDB(), default.file = "./.env", group = group)

            query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM ", sensor_container, " WHERE device_id = '", device,"'")
            # Letting the user know what we are doing
            message(paste0("Executing the following query to download data: ", query))
            sensor_data <- dbGetQuery(dbEngine, query)

            dbDisconnect(dbEngine)

            if(nrow(sensor_data) == 0)
                warning(paste("The device '", device,"' did not have data in ", sensor_container))

            return(sensor_data)
        }
        ```

=== "infer_device_os"

    !!! warning
        This function is only necessary for phone data streams.

    RAPIDS allows users to use the keyword `infer` (previously `multiple`) to [automatically infer](../../setup/configuration/#structure-of-participants-files) the mobile operating system a device (phone) was running.

    If you have a way to infer the OS of a device id, implement this function. For example, for AWARE data we use the `aware_device` table.

    If you don't have a way to infer the OS, call `stop("Error Message")` so users know they can't use `infer` (or that the inference failed) and that they have to assign the OS manually in the participant file.

    This function returns the operating system (`android` or `ios`) for a specific device. It has the following parameters:

    | Param | Description |
    |--------------------|-------------------------------------------------------------------------------------------------------|
    | data_configuration | Any parameters (keys/values) set by the user in any `[DEVICE_DATA_STREAMS][stream_name]` key of `config.yaml`. For example, `[DATABASE_GROUP]` inside `[FITBIT_DATA_STREAMS][fitbitjson_mysql]` |
    | device | The device id that you need to infer the OS for (this is set by the user in the [participant files](../../setup/configuration/#participant-files)). For example, in AWARE this device id is a uuid |

    !!! example
        This is the `infer_device_os` function we implemented for `aware_mysql`. Note that we can `message`, `warn`, or `stop` the user during execution.

        ```r
        infer_device_os <- function(data_configuration, device){
            group <- data_configuration$SOURCE$DATABASE_GROUP # the DB credentials group specified in config.yaml

            dbEngine <- dbConnect(MariaDB(), default.file = "./.env", group = group)
            query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
            message(paste0("Executing the following query to infer phone OS: ", query))
            os <- dbGetQuery(dbEngine, query)
            dbDisconnect(dbEngine)

            if(nrow(os) > 0)
                return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
            else
                stop(paste("We cannot infer the OS of the following device id because it does not exist in the aware_device table:", device))
        }
        ```
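
If your data is stored in files instead of a database, `download_data` follows the same interface. The sketch below is an illustration only (not RAPIDS code) of what a CSV-based container like `aware_csv`'s could look like; it assumes the stream's `config.yaml` keys (e.g. `[FOLDER]`) are accessible through `data_configuration`, and that `sensor_container` names a CSV file inside that folder — adjust both assumptions to your actual stream.

```r
library(readr)
library("dplyr", warn.conflicts = F)

download_data <- function(data_configuration, device, sensor_container, columns){
    # Assumption: FOLDER is this stream's key in config.yaml (as in [PHONE_DATA_STREAMS][aware_csv][FOLDER])
    # Assumption: sensor_container is the name of a CSV file inside that folder
    sensor_file <- file.path(data_configuration$FOLDER, sensor_container)
    message(paste0("Reading data from: ", sensor_file))

    sensor_data <- read_csv(sensor_file, col_types = cols()) %>%
        filter(device_id == !!device) %>%   # keep only this participant's device
        select(all_of(columns))             # keep only the columns format.yaml asks for

    if(nrow(sensor_data) == 0)
        warning(paste("The device", device, "did not have data in", sensor_file))

    return(sensor_data)
}
```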

## Implement a Format

A format describes the mapping between your stream's raw data and the data that RAPIDS needs. This file has a section per sensor (e.g. `PHONE_ACCELEROMETER`), and each section has two keys (attributes):

1. `COLUMN_MAPPINGS` is a mapping between the columns RAPIDS needs and the columns your raw data has.
2. `MUTATION_SCRIPTS` is a collection of R or Python scripts that transform your raw data into the format RAPIDS needs.

Let's explain these keys with examples.

### Name mapping

The mapping for some sensors is straightforward. For example, accelerometer data most of the time has a timestamp, three axes (x, y, z), and the id of the device that produced it. It is likely that AWARE and a different sensing app like Beiwe logged accelerometer data in the same way but with different column names. In this case we only need to match Beiwe data columns to RAPIDS columns one-to-one:

```yaml hl_lines="4 5 6 7 8"
PHONE_ACCELEROMETER:
  ANDROID:
    COLUMN_MAPPINGS:
      TIMESTAMP: beiwe_timestamp
      DEVICE_ID: beiwe_deviceID
      DOUBLE_VALUES_0: beiwe_x
      DOUBLE_VALUES_1: beiwe_y
      DOUBLE_VALUES_2: beiwe_z
    MUTATION_SCRIPTS: # it's ok if this is empty
```

### Value mapping
For some sensors we need to map column names and values. For example, screen data has ON and OFF events; let's suppose Beiwe represents an ON event with the number `1` but RAPIDS identifies ON events with the number `2`. In this case we need to mutate the raw data coming from Beiwe and replace all `1`s with `2`s.

We do this by listing one or more R or Python scripts in `MUTATION_SCRIPTS` that will be executed in order:

```yaml hl_lines="8"
PHONE_SCREEN:
  ANDROID:
    COLUMN_MAPPINGS:
      TIMESTAMP: beiwe_timestamp
      DEVICE_ID: beiwe_deviceID
      EVENT: beiwe_event
    MUTATION_SCRIPTS:
      - src/data/streams/mutations/phone/beiwe/beiwe_screen_map.py
```

Every `MUTATION_SCRIPT` has a `main` function that receives a data frame with your raw sensor data and should return the mutated data. We usually store all mutation scripts under `src/data/streams/mutations/[device]/[platform]/`, and they can be reused across data streams.

!!! hint
    A `MUTATION_SCRIPT` can also be used to clean/preprocess your data before extracting behavioral features.

=== "python"
    Example of a Python mutation script:
    ```python
    import pandas as pd

    def main(data):
        # mutate data
        return(data)
    ```
=== "R"
    Example of an R mutation script:
    ```r
    source("renv/activate.R") # needed to use RAPIDS renv environment
    library(dplyr)

    main <- function(data){
        # mutate data
        return(data)
    }
    ```
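
Following the screen example above, the value mapping that a script like `beiwe_screen_map.py` performs can be sketched as follows. This is an illustration only (the shipped script is the Python file listed in `MUTATION_SCRIPTS` above, and the `1` → `2` recoding is the hypothetical example from this section), written here in R:

```r
source("renv/activate.R") # needed to use RAPIDS renv environment
library(dplyr)

main <- function(data){
    # Recode the hypothetical Beiwe ON events (1) to the value RAPIDS expects (2)
    data <- data %>% mutate(event = ifelse(event == 1, 2, event))
    return(data)
}
```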

### Complex mapping
Sometimes your raw data doesn't even have the same columns RAPIDS expects for a sensor. For example, let's pretend Beiwe stores `PHONE_ACCELEROMETER` axis data in a single column called `acc_col` instead of three columns (x, y, z). You need to create a `MUTATION_SCRIPT` to split `acc_col` into the three columns `x`, `y`, and `z`.

For this, you mark the missing `COLUMN_MAPPINGS` with the word `FLAG_TO_MUTATE`, map `acc_col` to `FLAG_AS_EXTRA`, and list a Python script under `MUTATION_SCRIPTS` with the code to split `acc_col`.

Every column mapped with `FLAG_AS_EXTRA` will be included in the data frame you receive in your mutation script, and we recommend deleting such columns from the returned data frame once they are not needed anymore.

!!! hint
    Note that although `COLUMN_MAPPINGS` keys are in capital letters for readability (e.g. `DOUBLE_VALUES_0`), the names of the final columns you mutate in your scripts should be lower case.

```yaml hl_lines="6 7 8 9 11"
PHONE_ACCELEROMETER:
  ANDROID:
    COLUMN_MAPPINGS:
      TIMESTAMP: beiwe_timestamp
      DEVICE_ID: beiwe_deviceID
      DOUBLE_VALUES_0: FLAG_TO_MUTATE
      DOUBLE_VALUES_1: FLAG_TO_MUTATE
      DOUBLE_VALUES_2: FLAG_TO_MUTATE
      FLAG_AS_EXTRA: acc_col
    MUTATION_SCRIPTS:
      - src/data/streams/mutations/phone/beiwe/beiwe_split_acc.py
```

This is a draft of the `beiwe_split_acc.py` `MUTATION_SCRIPT` (it assumes `acc_col` stores the three axes as a single string like `"x;y;z"`; adjust the separator to your raw data):
```python
import pandas as pd

def main(data):
    # data includes the acc_col column flagged as extra
    # split acc_col into three columns named after the (lower case) RAPIDS columns
    data[["double_values_0", "double_values_1", "double_values_2"]] = (
        data["acc_col"].str.split(";", expand=True).astype(float)
    )
    # remove acc_col since we don't need it anymore
    return data.drop(columns=["acc_col"])
```

### OS complex mapping
There is a special case of the complex mapping scenario for smartphone data streams: the Android and iOS sensor APIs return data in different formats for certain sensors (like screen, activity recognition, and battery, among others).

In case you didn't notice, the examples we have used so far are grouped under an `ANDROID` key, which means they will be applied to data collected by Android phones. Additionally, each sensor has an `IOS` key for a similar purpose. We use the complex mapping described above to transform iOS data into an Android format (it's always iOS to Android, and any new phone data stream must do the same).

For example, this is the `format.yaml` key for `PHONE_ACTIVITY_RECOGNITION`. Note that the `ANDROID` mapping is simple (one-to-one) but the `IOS` mapping is complex, with two `FLAG_TO_MUTATE` columns, one `FLAG_AS_EXTRA` column, and one `MUTATION_SCRIPT`.

```yaml hl_lines="14 15 17 19"
PHONE_ACTIVITY_RECOGNITION:
  ANDROID:
    COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      ACTIVITY_TYPE: activity_type
      ACTIVITY_NAME: activity_name
      CONFIDENCE: confidence
    MUTATION_SCRIPTS:
  IOS:
    COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      ACTIVITY_TYPE: FLAG_TO_MUTATE
      ACTIVITY_NAME: FLAG_TO_MUTATE
      CONFIDENCE: confidence
      FLAG_AS_EXTRA: activities
    MUTATION_SCRIPTS:
      - "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
```

??? "Example activity_recogniton_ios_unification.R"
    In this `MUTATION_SCRIPT` we create `ACTIVITY_NAME` and `ACTIVITY_TYPE` based on `activities`, and map iOS `confidence` values to Android values.
    ```r
    source("renv/activate.R")
    library("dplyr", warn.conflicts = F)
    library(stringr)

    clean_ios_activity_column <- function(ios_gar){
        ios_gar <- ios_gar %>%
            mutate(activities = str_replace_all(activities, pattern = '("|\\[|\\])', replacement = ""))

        existent_multiple_activities <- ios_gar %>%
            filter(str_detect(activities, ",")) %>%
            group_by(activities) %>%
            summarise(mutiple_activities = unique(activities), .groups = "drop_last") %>%
            pull(mutiple_activities)

        known_multiple_activities <- c("stationary,automotive")
        unkown_multiple_actvities <- setdiff(existent_multiple_activities, known_multiple_activities)
        if(length(unkown_multiple_actvities) > 0){
            stop(paste0("There are unknown combinations of ios activities, you need to implement the decision of the ones to keep: ", unkown_multiple_actvities))
        }

        ios_gar <- ios_gar %>%
            mutate(activities = str_replace_all(activities, pattern = "stationary,automotive", replacement = "automotive"))

        return(ios_gar)
    }

    unify_ios_activity_recognition <- function(ios_gar){
        # We only need to unify Google Activity Recognition data for iOS
        # discard rows where the activities column is blank
        ios_gar <- ios_gar %>% filter(activities != "")
        # clean the "activities" column of ios_gar
        ios_gar <- clean_ios_activity_column(ios_gar)

        # make it compatible with the android version: generate "activity_name" and "activity_type" columns
        ios_gar <- ios_gar %>%
            mutate(activity_name = case_when(activities == "automotive" ~ "in_vehicle",
                                             activities == "cycling" ~ "on_bicycle",
                                             activities == "walking" ~ "walking",
                                             activities == "running" ~ "running",
                                             activities == "stationary" ~ "still"),
                   activity_type = case_when(activities == "automotive" ~ 0,
                                             activities == "cycling" ~ 1,
                                             activities == "walking" ~ 7,
                                             activities == "running" ~ 8,
                                             activities == "stationary" ~ 3,
                                             activities == "unknown" ~ 4),
                   confidence = case_when(confidence == 0 ~ 0,
                                          confidence == 1 ~ 50,
                                          confidence == 2 ~ 100)
                   ) %>%
            select(-activities)

        return(ios_gar)
    }

    main <- function(data){
        return(unify_ios_activity_recognition(data))
    }
    ```

@@ -0,0 +1,46 @@

# `aware_mysql`

This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework](https://awareframework.com/) and stored in a MySQL database.

## Container
A MySQL database with a table per sensor, each containing the data for all participants. This is the default database created by the old PHP AWARE server (as opposed to the new JavaScript Micro server).

The script to connect and download data from this container is at:
```bash
src/data/streams/aware_mysql/container.R
```

## Format
If you collected sensor data with the vanilla (original) AWARE mobile clients, you shouldn't need to modify this format (described below).

Remember that a format maps and transforms columns in your raw data stream to the [mandatory columns RAPIDS needs](../mandatory-phone-format).

The yaml file that describes the format of this data stream is at:
```bash
src/data/streams/aware_mysql/format.yaml
```

!!! hint
    The RAPIDS and stream columns in this stream's mappings have the same names because AWARE data was the first stream RAPIDS supported; RAPIDS treats AWARE column names as the default.

??? info "PHONE_ACCELEROMETER"

    === "ANDROID"

        **COLUMN_MAPPINGS**

        | RAPIDS column   | Stream column   |
        |-----------------|-----------------|
        | TIMESTAMP       | timestamp       |
        | DEVICE_ID       | device_id       |
        | DOUBLE_VALUES_0 | double_values_0 |
        | DOUBLE_VALUES_1 | double_values_1 |
        | DOUBLE_VALUES_2 | double_values_2 |

        **MUTATION_SCRIPTS**

        None

    === "IOS"

        Same as ANDROID

@@ -0,0 +1,28 @@

# Data Streams Introduction

A data stream is a set of sensor data collected using a specific type of **device** with a specific **format** and stored in a specific **container**.

For example, the `aware_mysql` data stream handles smartphone data (**device**) collected with the [AWARE Framework](https://awareframework.com/) (**format**) stored in a MySQL database (**container**). Similarly, smartphone data collected with [Beiwe](https://www.beiwe.org/) will have a different format and could be stored in a container like a PostgreSQL database or a CSV file.

If you want to process a data stream using RAPIDS, make sure that your data is stored in a supported **format** and **container** (see the table below).

If RAPIDS doesn't support your data stream yet (e.g. Beiwe data stored in PostgreSQL, or AWARE data stored in InfluxDB), you can always [implement a new data stream](../add-new-data-streams). If it's something you think other people might be interested in, we will be happy to include your new data stream in RAPIDS, so get in touch!

!!! hint
    You can only add new data streams for smartphone or Fitbit data. If you need RAPIDS to process data from **different devices**, like Oura rings or ActiGraph wearables, get in touch. It is a more complex process that could take a few days to implement for someone familiar with R or Python, but we would be happy to work on it together.

For reference, these are the data streams we currently support:

| Data Stream | Device | Format | Container | Docs |
|--|--|--|--|--|
| `aware_mysql` | Phone | AWARE app | MySQL | [link]() |
| `aware_csv` | Phone | AWARE app | CSV files | [link]() |
| `fitbitjson_mysql` | Fitbit | JSON (per Fitbit's API) | MySQL | [link]() |
| `fitbitjson_csv` | Fitbit | JSON (per Fitbit's API) | CSV files | [link]() |
| `fitbitparsed_mysql` | Fitbit | Parsed (parsed API data) | MySQL | [link]() |
| `fitbitparsed_csv` | Fitbit | Parsed (parsed API data) | CSV files | [link]() |
| `empatica_zip` | Empatica | E4 Connect | ZIP files | [link]() |

!!! hint
    - Fitbit data can be processed from the JSON object produced by Fitbit's API (recommended) or from parsed tabular data (if you only have access to parsed data).
    - Empatica data can only be accessed through the [E4 Connect website](https://support.empatica.com/hc/en-us/articles/201608896-Data-export-and-formatting-from-E4-connect-), which produces zip files with a CSV file per sensor that can be processed directly in RAPIDS.

@@ -0,0 +1,55 @@

# `fitbitjson_mysql`
This [data stream](../../datastreams/data-streams-introduction) handles Fitbit sensor data downloaded using the [Fitbit Web API](https://dev.fitbit.com/build/reference/web-api/) and stored in a MySQL database. Please note that RAPIDS cannot query the API directly; you need to use other available tools or implement your own. Once you have your sensor data in a MySQL database, RAPIDS can process it.

## Container
A MySQL database with a table per sensor, each containing the data for all participants.

The script to connect and download data from this container is at:
```bash
src/data/streams/fitbitjson_mysql/container.R
```

## Format

The `format.yaml` maps and transforms columns in your raw data stream to the [mandatory columns RAPIDS needs for Fitbit sensors](../mandatory-fitbit-format). This file is at:

```bash
src/data/streams/fitbitjson_mysql/format.yaml
```

If you want RAPIDS to process Fitbit sensor data using this stream, you will need to replace the following `COLUMN_MAPPINGS` inside **each sensor** section in `format.yaml` to match your raw data column names:

| Column | Description |
|-----------------|-----------------|
| device_id | A string that uniquely identifies a device |
| fitbit_data | A string column that contains the JSON objects downloaded from Fitbit's API |
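
For example, if your database stored the JSON objects in a column called `json_payload` (a hypothetical name), every sensor section's `FLAG_AS_EXTRA` mapping would point to that column instead, along the lines of this sketch:

```yaml
FITBIT_HEARTRATE_SUMMARY:
  COLUMN_MAPPINGS:
    LOCAL_DATE_TIME: FLAG_TO_MUTATE
    DEVICE_ID: device_id
    FLAG_AS_EXTRA: json_payload # hypothetical raw column name
```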

??? info "FITBIT_HEARTRATE_SUMMARY section"

    **COLUMN_MAPPINGS**

    | RAPIDS column | Stream column |
    |-----------------|-----------------|
    | LOCAL_DATE_TIME | FLAG_TO_MUTATE |
    | DEVICE_ID | device_id |
    | HEARTRATE_DAILY_RESTINGHR | FLAG_TO_MUTATE |
    | HEARTRATE_DAILY_CALORIESOUTOFRANGE | FLAG_TO_MUTATE |
    | HEARTRATE_DAILY_CALORIESFATBURN | FLAG_TO_MUTATE |
    | HEARTRATE_DAILY_CALORIESCARDIO | FLAG_TO_MUTATE |
    | HEARTRATE_DAILY_CALORIESPEAK | FLAG_TO_MUTATE |
    | FLAG_AS_EXTRA | fitbit_data |

    **MUTATION_SCRIPTS**

    TODO list our parsing script

??? "Example of the raw data RAPIDS expects for this data stream"

    |device_id |fitbit_data |
    |---------------------------------------- |--------------------------------------------------------- |
    |a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |{"activities-heart":[{"dateTime":"2020-10-07","value":{"customHeartRateZones":[],"heartRateZones":[{"caloriesOut":1200.6102,"max":88,"min":31,"minutes":1058,"name":"Out of Range"},{"caloriesOut":760.3020,"max":120,"min":86,"minutes":366,"name":"Fat Burn"},{"caloriesOut":15.2048,"max":146,"min":120,"minutes":2,"name":"Cardio"},{"caloriesOut":0,"max":221,"min":148,"minutes":0,"name":"Peak"}],"restingHeartRate":72}}],"activities-heart-intraday":{"dataset":[{"time":"00:00:00","value":68},{"time":"00:01:00","value":67},{"time":"00:02:00","value":67},...],"datasetInterval":1,"datasetType":"minute"}}
    |a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |{"activities-heart":[{"dateTime":"2020-10-08","value":{"customHeartRateZones":[],"heartRateZones":[{"caloriesOut":1100.1120,"max":89,"min":30,"minutes":921,"name":"Out of Range"},{"caloriesOut":660.0012,"max":118,"min":82,"minutes":361,"name":"Fat Burn"},{"caloriesOut":23.7088,"max":142,"min":108,"minutes":3,"name":"Cardio"},{"caloriesOut":0,"max":221,"min":148,"minutes":0,"name":"Peak"}],"restingHeartRate":70}}],"activities-heart-intraday":{"dataset":[{"time":"00:00:00","value":77},{"time":"00:01:00","value":75},{"time":"00:02:00","value":73},...],"datasetInterval":1,"datasetType":"minute"}}
    |a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |{"activities-heart":[{"dateTime":"2020-10-09","value":{"customHeartRateZones":[],"heartRateZones":[{"caloriesOut":750.3615,"max":77,"min":30,"minutes":851,"name":"Out of Range"},{"caloriesOut":734.1516,"max":107,"min":77,"minutes":550,"name":"Fat Burn"},{"caloriesOut":131.8579,"max":130,"min":107,"minutes":29,"name":"Cardio"},{"caloriesOut":0,"max":220,"min":130,"minutes":0,"name":"Peak"}],"restingHeartRate":69}}],"activities-heart-intraday":{"dataset":[{"time":"00:00:00","value":90},{"time":"00:01:00","value":89},{"time":"00:02:00","value":88},...],"datasetInterval":1,"datasetType":"minute"}}

@@ -0,0 +1,52 @@

# `fitbitparsed_mysql`
This [data stream](../../datastreams/data-streams-introduction) handles Fitbit sensor data downloaded using the [Fitbit Web API](https://dev.fitbit.com/build/reference/web-api/), **parsed**, and stored in a MySQL database. Please note that RAPIDS cannot query the API directly; you need to use other available tools or implement your own. Once you have your sensor data in a MySQL database, RAPIDS can process it.

!!! info "What is the difference between JSON and plain data streams?"
    Most people will only need `fitbitjson_mysql` because they downloaded and stored their data directly from Fitbit's API. However, if for some reason you don't have access to that JSON data and instead only have the parsed data (columns and rows), you can use this data stream.

## Container
A MySQL database with a table per sensor, each containing the data for all participants.

The script to connect and download data from this container is at:
```bash
src/data/streams/fitbitparsed_mysql/container.R
```

## Format

The `format.yaml` maps and transforms columns in your raw data stream to the [mandatory columns RAPIDS needs for Fitbit sensors](../mandatory-fitbit-format). This file is at:

```bash
src/data/streams/fitbitparsed_mysql/format.yaml
```

If you want RAPIDS to process Fitbit sensor data using this stream, you will need to replace any `COLUMN_MAPPINGS` inside **each sensor** section in `format.yaml` to match your raw data column names.

All columns are mandatory; however, all except `device_id` and `local_date_time` can be empty if you don't have that data. Just keep in mind that some features will be empty if some of these columns are empty.

??? info "FITBIT_HEARTRATE_SUMMARY section"

    **COLUMN_MAPPINGS**

    | RAPIDS column | Stream column |
    |-----------------|-----------------|
    | LOCAL_DATE_TIME | local_date_time |
    | DEVICE_ID | device_id |
    | HEARTRATE_DAILY_RESTINGHR | heartrate_daily_restinghr |
    | HEARTRATE_DAILY_CALORIESOUTOFRANGE | heartrate_daily_caloriesoutofrange |
    | HEARTRATE_DAILY_CALORIESFATBURN | heartrate_daily_caloriesfatburn |
    | HEARTRATE_DAILY_CALORIESCARDIO | heartrate_daily_caloriescardio |
    | HEARTRATE_DAILY_CALORIESPEAK | heartrate_daily_caloriespeak |

    **MUTATION_SCRIPTS**

    TODO list our parsing script

??? "Example of the raw data RAPIDS expects for this data stream"

    |device_id |local_date_time |heartrate_daily_restinghr |heartrate_daily_caloriesoutofrange |heartrate_daily_caloriesfatburn |heartrate_daily_caloriescardio |heartrate_daily_caloriespeak |
    |-------------------------------------- |----------------- |------- |-------------- |------------- |------------ |-------|
    |a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-07 |72 |1200.6102 |760.3020 |15.2048 |0 |
    |a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-08 |70 |1100.1120 |660.0012 |23.7088 |0 |
    |a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-09 |69 |750.3615 |734.1516 |131.8579 |0 |

@@ -0,0 +1,15 @@

# Mandatory Fitbit Format

This is a description of the format RAPIDS needs to process data for the following Fitbit sensors.

??? info "FITBIT_HEARTRATE_SUMMARY"

    | RAPIDS column | Description |
    |-----------------|-----------------|
    | LOCAL_DATE_TIME | TODO |
    | DEVICE_ID | TODO |
    | HEARTRATE_DAILY_RESTINGHR | TODO |
    | HEARTRATE_DAILY_CALORIESOUTOFRANGE | TODO |
    | HEARTRATE_DAILY_CALORIESFATBURN | TODO |
    | HEARTRATE_DAILY_CALORIESCARDIO | TODO |
    | HEARTRATE_DAILY_CALORIESPEAK | TODO |

@@ -0,0 +1,19 @@

# Mandatory Phone Format

This is a description of the format RAPIDS needs to process data for the following PHONE sensors.

??? info "PHONE_ACCELEROMETER"

    === "ANDROID"

        | RAPIDS column | Description |
        |-----------------|-----------------|
        | TIMESTAMP | A UNIX timestamp (13 digits) when a row of data was logged |
        | DEVICE_ID | A string that uniquely identifies a device |
        | DOUBLE_VALUES_0 | x axis of acceleration |
        | DOUBLE_VALUES_1 | y axis of acceleration |
        | DOUBLE_VALUES_2 | z axis of acceleration |

    === "IOS"
        Same as ANDROID
@@ -2,7 +2,9 @@

Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](workflow-examples/analysis.md) your analysis into reproducible workflows.

RAPIDS is open source, documented, modular, tested, and reproducible. At the moment we support smartphone data collected with [AWARE](https://awareframework.com/), wearable data from Fitbit devices, and wearable data from Empatica devices (in collaboration with the [DBDP](https://dbdp.org/)).
RAPIDS is open source, documented, modular, tested, and reproducible. At the moment we support smartphone data, wearable data from Fitbit devices, and wearable data from Empatica devices (the latter in collaboration with the [DBDP](https://dbdp.org/)).

Read the [introduction to data streams](../../datastreams/data-streams-introduction) for more information on what data streams we support, and read this tutorial to [add support for new data streams](../../datastreams/add-new-data-streams) for smartphones or Fitbits (formats/containers).

!!! tip
    :material-slack: Questions or feedback can be posted on the #rapids channel in AWARE Framework's [slack](http://awareframework.com:3000/).
@@ -3,11 +3,11 @@

You need to follow these steps to configure your RAPIDS deployment before you can extract behavioral features:

1. Add your [database credentials](#database-credentials)
0. Verify RAPIDS can process your [data streams](#supported-data-streams)
2. Choose the [timezone of your study](#timezone-of-your-study)
3. Create your [participants files](#participant-files)
4. Select what [time segments](#time-segments) you want to extract features on
5. Modify your [device data source configuration](#device-data-source-configuration)
5. Configure your [data streams](#data-stream-configuration)
6. Select what [sensors and features](#sensor-and-features-to-process) you want to process

When you are done with this configuration, go to [executing RAPIDS](../execution).

@@ -16,59 +16,37 @@ When you are done with this configuration, go to [executing RAPIDS](../execution
Every time you see `config["KEY"]` or `[KEY]` in these docs, we are referring to the corresponding key in the `config.yaml` file.

---
## Database credentials

Only follow this step if you are processing smartphone or Fitbit data stored in a database. For reference, we list below the data sources RAPIDS supports for each type of device.
## Supported data streams

1. Create an empty file called `#!bash .env` in your RAPIDS root directory
2. Add the following lines and replace your database-specific credentials (user, password, host, and database):
A data stream refers to sensor data collected using a specific type of **device** with a specific **format** and stored in a specific **container**. For example, the `aware_mysql` data stream handles smartphone data (**device**) collected with the [AWARE Framework](https://awareframework.com/) (**format**) stored in a MySQL database (**container**).

``` yaml
[MY_GROUP]
user=MY_USER
password=MY_PASSWORD
host=MY_HOST
port=3306
database=MY_DATABASE
```

??? warning "What is `[MY_GROUP]`?"
    The label `[MY_GROUP]` is arbitrary but it has to match the following `config.yaml` key:
    ```yaml
    DATABASE_GROUP: &database_group
      MY_GROUP
    ```

??? hint "Connecting to localhost (host machine) from inside our docker container"
    If you are using RAPIDS' docker container and Docker-for-mac or Docker-for-Windows 18.03+, you can connect to a MySQL database in your host machine using `host.docker.internal` instead of `127.0.0.1` or `localhost`. In a Linux host you need to run our docker container using `docker run --network="host" -d moshiresearch/rapids:latest` and then `127.0.0.1` will point to your host machine.

??? hint "Data sources supported for each device type"
    | Device | Database | CSV Files | Zip files |
    |--|--|--|--|
    | Smartphone | Yes (MySQL) | No | No |
    | Fitbit | Yes (MySQL) | Yes | No |
    | Empatica | No | No | Yes |

    - RAPIDS only supports MySQL/MariaDB databases. If you would like to add support for a different database engine, get in touch and we can discuss how to implement it.
    - Fitbit data can be processed as the JSON object produced by Fitbit's API (recommended) or in a parsed tabular fashion.
    - Empatica devices produce a zip file with a CSV file per sensor, which can be processed directly in RAPIDS.

---
Check the table in the [introduction to data streams](../../datastreams/data-streams-introduction) to know what data streams we support. If your data stream is supported, continue to the next configuration section. If you want to implement a new data stream, follow this tutorial to [add support for new data streams](../../datastreams/add-new-data-streams). If you have read the tutorial but have questions, get in touch by email or in Slack.

## Timezone of your study

### Single timezone

If your study only happened in a single time zone, select the appropriate code from this [list](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) and change the following config key. Double-check your timezone code choice; for example, US Eastern Time is `America/New_York`, not `EST`.
If your study only happened in a single time zone or you want to ignore short trips of your participants to different time zones, select the appropriate code from this [list](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) and change the following config key. Double-check your timezone code choice; for example, US Eastern Time is `America/New_York`, not `EST`.

``` yaml
TIMEZONE: &timezone
  America/New_York
TIMEZONE:
  TYPE: SINGLE
  TZCODE: America/New_York
```

### Multiple timezones

Support coming soon.
If you have the timestamps when participants' devices changed to a new time zone, follow these instructions:

TODO more info

``` yaml
TIMEZONE:
  TYPE: MULTIPLE
  TZCODE: America/New_York
  MULTIPLE_TZCODES_FILE: path_to/csv.file
```

---
@@ -120,7 +98,7 @@ Participant files link together multiple devices (smartphones and wearables) to
| Key | Description |
|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------|
| `[DEVICE_IDS]` | An array of the strings that uniquely identify each smartphone. You can have more than one for when participants changed phones in the middle of the study; in this case, data from all their devices will be joined and relabeled with the last one on this list. |
| `[PLATFORMS]` | An array that specifies the OS of each smartphone in `[DEVICE_IDS]`; use a combination of `android` or `ios` (we support participants that changed platforms in the middle of your study!). If you have an `aware_device` table in your database you can set `[PLATFORMS]: [multiple]` and RAPIDS will infer them automatically. |
| `[PLATFORMS]` | An array that specifies the OS of each smartphone in `[DEVICE_IDS]`; use a combination of `android` or `ios` (we support participants that changed platforms in the middle of your study!). You can set `[PLATFORMS]: [infer]` and RAPIDS will infer them automatically (each phone data stream infers this differently, e.g. `aware_mysql` uses the `aware_device` table). |
| `[LABEL]` | A string that is used in reports and visualizations. |
| `[START_DATE]` | A string with format `YYYY-MM-DD`. Only data collected *after* this date will be included in the analysis. |
| `[END_DATE]` | A string with format `YYYY-MM-DD`. Only data collected *before* this date will be included in the analysis. |

@@ -143,48 +121,24 @@ Participant files link together multiple devices (smartphones and wearables) to
| `[END_DATE]` | A string with format `YYYY-MM-DD`. Only data collected *before* this date will be included in the analysis |
### Automatic creation of participant files

You have two options: a) use the `aware_device` table in your database, or b) use a CSV file. In either case, in your `config.yaml`, set the devices (`PHONE`, `FITBIT`, `EMPATICA`) `[ADD]` flag to `TRUE` depending on what devices you used in your study. Set `[DEVICE_ID_COLUMN]` to the name of the column that uniquely identifies each device (only for `PHONE` and `FITBIT`).
You can use a CSV file with a row per participant to automatically create participant files.

=== "aware_device table"
??? "`AWARE_DEVICE_TABLE` was deprecated"
    In previous versions of RAPIDS, you could create participant files automatically using the `aware_device` table. We deprecated this option, but you can still achieve the same results if you export the output of the following SQL query as a CSV file and follow the instructions below:

    Set the following keys in your `config.yaml`
    ```sql
    SELECT device_id, device_id as fitbit_id, CONCAT("p", _id) as pid, if(brand = "iPhone", "ios", "android") as platform, CONCAT("p", _id) as label, DATE_FORMAT(FROM_UNIXTIME((timestamp/1000)- 86400), "%Y-%m-%d") as start_date, CURRENT_DATE as end_date from aware_device order by _id;
    ```

In your `config.yaml`:

1. Set `CSV_FILE_PATH` to a CSV file path that complies with the specs described below
2. Set the devices (`PHONE`, `FITBIT`, `EMPATICA`) `[ADD]` flag to `TRUE` depending on what devices you used in your study.
3. Set `[DEVICE_ID_COLUMN]` to the name of the column in your CSV file that uniquely identifies each device (only for `PHONE` and `FITBIT`).

```yaml
CREATE_PARTICIPANT_FILES:
  SOURCE:
    TYPE: AWARE_DEVICE_TABLE
    DATABASE_GROUP: *database_group
    CSV_FILE_PATH: ""
    TIMEZONE: *timezone
  PHONE_SECTION:
    ADD: TRUE # or FALSE
    DEVICE_ID_COLUMN: device_id # column name
    IGNORED_DEVICE_IDS: []
  FITBIT_SECTION:
    ADD: FALSE # or TRUE
    DEVICE_ID_COLUMN: fitbit_id # column name
    IGNORED_DEVICE_IDS: []
  EMPATICA_SECTION: # Empatica doesn't have a device_id column because the devices produce zip files per participant
    ADD: FALSE # or TRUE
```

Then run

```bash
snakemake -j1 create_participants_files
```

=== "CSV file"

Set the following keys in your `config.yaml`.

```yaml
CREATE_PARTICIPANT_FILES:
  SOURCE:
    TYPE: CSV_FILE
    DATABASE_GROUP: ""
    CSV_FILE_PATH: "your_path/to_your.csv"
    TIMEZONE: *timezone
  PHONE_SECTION:
    ADD: TRUE # or FALSE
    DEVICE_ID_COLUMN: device_id # column name
@@ -196,24 +150,26 @@ You have two options a) use the aware_device table in your database or b) use
  EMPATICA_SECTION: # Empatica doesn't have a device_id column because the devices produce zip files per participant
    ADD: FALSE # or TRUE
```
Your CSV file (`[SOURCE][CSV_FILE_PATH]`) should have the following columns but you can omit any values you don't have on each column:
Your CSV file (`[CSV_FILE_PATH]`) should have the following columns (headers), but the values within each column can be empty:

| Column | Description |
|------------------|-----------------------------------------------------------------------------------------------------------|
| phone device id | The name of this column has to match `[PHONE_SECTION][DEVICE_ID_COLUMN]`. Separate multiple ids with `;` |
| fitbit device id | The name of this column has to match `[FITBIT_SECTION][DEVICE_ID_COLUMN]`. Separate multiple ids with `;` |
| pid | Unique identifiers with the format pXXX (your participant files will be named with this string) |
| platform | Use `android`, `ios` or `multiple` as explained above, separate values with `;` |
| pid | Unique identifiers with the format pXXX (your participant files will be named with this string) |
| platform | Use `android`, `ios` or `infer` as explained above, separate values with `;` |
| label | A human-readable string that is used in reports and visualizations. |
| start_date | A string with format `YYYY-MM-DD`. |
| end_date | A string with format `YYYY-MM-DD`. |

!!! example
    We added white spaces to this example to make it easy to read, but you don't have to.

    ```csv
    device_id,pid,label,platform,start_date,end_date,fitbit_id
    a748ee1a-1d0b-4ae9-9074-279a2b6ba524;dsadas-2324-fgsf-sdwr-gdfgs4rfsdf43,p01,julio,android;ios,2020-01-01,2021-01-01,fitbit1
    4c4cf7a1-0340-44bc-be0f-d5053bf7390c,p02,meng,ios,2021-01-01,2022-01-01,fitbit2
    device_id ,fitbit_id ,pid ,label ,platform ,start_date ,end_date
    a748ee1a-1d0b-4ae9-9074-279a2b6ba524;dsadas-2324-fgsf-sdwr-gdfgs4rfsdf43 ,fitbit1 ,p01 ,julio ,android;ios ,2020-01-01 ,2021-01-01
    4c4cf7a1-0340-44bc-be0f-d5053bf7390c ,fitbit2 ,p02 ,meng ,ios ,2021-01-01 ,2022-01-01
    ```

Then run

@@ -394,63 +350,163 @@ Time segments (or epochs) are the time windows on which you want to extract behavioral
survey2,1584291600000,2H,1H,-1,klj34oi2-8frk-2343-21kk-324ljklewlr3
```
---
## Device Data Source Configuration
## Data Stream Configuration

You might need to modify the following config keys in your `config.yaml` depending on what devices your participants used and where you are storing your data (ignore the sections of devices you did not use).
Modify the following keys in your `config.yaml` depending on the [data stream](../../datastreams/data-streams-introduction) you want to process.

=== "Phone"

    The relevant `config.yaml` section looks like this by default:
    Set `[PHONE_DATA_STREAMS][TYPE]` to the smartphone data stream you want to process (e.g. `aware_mysql`) and configure its parameters (e.g. `[DATABASE_GROUP]`). Ignore the parameters of streams you are not using (e.g. `[FOLDER]` of `aware_csv`).

    ```yaml
    PHONE_DATA_CONFIGURATION:
      SOURCE:
        TYPE: DATABASE
        DATABASE_GROUP: *database_group
        DEVICE_ID_COLUMN: device_id # column name
      TIMEZONE:
        TYPE: SINGLE # SINGLE (MULTIPLE support coming soon)
        VALUE: *timezone

    PHONE_DATA_STREAMS:
      TYPE: aware_mysql
      aware_mysql:
        DATABASE_GROUP: MY_GROUP
      aware_csv:
        FOLDER: data/external/aware_csv
    ```

    **Parameters for `[PHONE_DATA_CONFIGURATION]`**
    === "aware_mysql"

        | Key | Description |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[SOURCE] [TYPE]` | Only `DATABASE` is supported (phone data will be pulled from a database) |
        | `[SOURCE] [DATABASE_GROUP]` | `*database_group` points to the value defined before in [Database credentials](#database-credentials) |
        | `[SOURCE] [DEVICE_ID_COLUMN]` | A column that contains strings that uniquely identify smartphones. For data collected with AWARE this is usually `device_id` |
        | `[TIMEZONE] [TYPE]` | Only `SINGLE` is supported for now |
        | `[TIMEZONE] [VALUE]` | `*timezone` points to the value defined before in [Timezone of your study](#timezone-of-your-study) |
        | `[DATABASE_GROUP]` | A database credentials group. Read the instructions below to set it up |

        ??? info "Setting up a DATABASE_GROUP and its connection credentials"

            1. If you haven't done so, create an empty file called `#!bash .env` in your RAPIDS root directory: `./.env`
            2. Add the following lines to `./.env` and replace your database-specific credentials (user, password, host, and database):
                1. Note that the label `[MY_GROUP]` is arbitrary but it has to match `[PHONE_DATA_STREAMS][aware_mysql][DATABASE_GROUP]`

            ``` yaml
            [MY_GROUP]
            user=MY_USER
            password=MY_PASSWORD
            host=MY_HOST
            port=3306
            database=MY_DATABASE
            ```

            ??? hint "Connecting to localhost (host machine) from inside our docker container"
                If you are using RAPIDS' docker container and Docker-for-mac or Docker-for-Windows 18.03+, you can connect to a MySQL database in your host machine using `host.docker.internal` instead of `127.0.0.1` or `localhost`. In a Linux host you need to run our docker container using `docker run --network="host" -d moshiresearch/rapids:latest` and then `127.0.0.1` will point to your host machine.
        ---

    === "aware_csv"

        | Key | Description |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[FOLDER]` | Folder where you have to place a CSV file **per** phone sensor. Each file has to contain all the data from every participant you want to process. |
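
        For instance, if `[PHONE_ACCELEROMETER][CONTAINER]` were set to a file name like `phone_accelerometer.csv` (a hypothetical name; use whatever your `[CONTAINER]` keys point to), the folder could look like:

        ```bash
        data/external/aware_csv/phone_accelerometer.csv
        data/external/aware_csv/phone_screen.csv
        ```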

=== "Fitbit"

    The relevant `config.yaml` section looks like this by default:
    Set `[FITBIT_DATA_STREAMS][TYPE]` to the Fitbit data stream you want to process (e.g. `fitbitjson_mysql`) and configure its parameters (e.g. `[DATABASE_GROUP]`).

    Ignore the parameters of streams you are not using (e.g. `[FOLDER]` of `fitbitjson_csv`).

    ```yaml
    FITBIT_DATA_CONFIGURATION:
      SOURCE:
        TYPE: DATABASE # DATABASE or FILES (set each [FITBIT_SENSOR][TABLE] attribute with a table name or a file path accordingly)
        COLUMN_FORMAT: JSON # JSON or PLAIN_TEXT
        DATABASE_GROUP: *database_group
        DEVICE_ID_COLUMN: device_id # column name
      TIMEZONE:
        TYPE: SINGLE # Fitbit devices don't support time zones so we read this data in the timezone indicated by VALUE
        VALUE: *timezone
    FITBIT_DATA_STREAMS:
      TYPE: fitbitjson_mysql

      fitbitjson_mysql:
        DATABASE_GROUP: MY_GROUP
        COLUMN_MAPPINGS_READY: False

      fitbitjson_csv:
        FOLDER: data/external/fitbit_csv
        COLUMN_MAPPINGS_READY: False

      fitbitparsed_mysql:
        DATABASE_GROUP: MY_GROUP
        COLUMN_MAPPINGS_READY: False

      fitbitparsed_csv:
        FOLDER: data/external/fitbit_csv
        COLUMN_MAPPINGS_READY: False
    ```

    **Parameters for `[FITBIT_DATA_CONFIGURATION]`**
    === "fitbitjson_mysql"

        This data stream processes Fitbit data inside a JSON column as obtained from the Fitbit API and stored in a MySQL database.

        | Key | Description |
        |------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
        | `[SOURCE]` `[TYPE]` | `DATABASE` or `FILES` (set each `[FITBIT_SENSOR]` `[TABLE]` attribute accordingly with a table name or a file path) |
        | `[SOURCE]` `[COLUMN_FORMAT]` | `JSON` or `PLAIN_TEXT`. Column format of the source data. If you pulled your data directly from the Fitbit API, the column containing the sensor data will be in `JSON` format |
        | `[SOURCE]` `[DATABASE_GROUP]` | `*database_group` points to the value defined before in [Database credentials](#database-credentials). Only used if `[TYPE]` is `DATABASE`. |
        | `[SOURCE]` `[DEVICE_ID_COLUMN]` | A column that contains strings that uniquely identify Fitbit devices. |
        | `[TIMEZONE]` `[TYPE]` | Only `SINGLE` is supported (Fitbit devices always store data in local time). |
        | `[TIMEZONE]` `[VALUE]` | `*timezone` points to the value defined before in [Timezone of your study](#timezone-of-your-study) |
        | `[DATABASE_GROUP]` | A database credentials group. Read the instructions below to set it up |
        | `[COLUMN_MAPPINGS_READY]` | Set this to `True` after you have modified this stream's `format.yaml` column mappings to match your raw data column names: [`fitbitjson_mysql`](../../datastreams/fitbitjson-mysql#format) |

        ??? info "Setting up a DATABASE_GROUP and its connection credentials"

            1. If you haven't done so, create an empty file called `#!bash .env` in your RAPIDS root directory: `./.env`
            2. Add the following lines to `./.env` and replace your database-specific credentials (user, password, host, and database):
                1. Note that the label `[MY_GROUP]` is arbitrary but it has to match `[FITBIT_DATA_STREAMS][fitbitjson_mysql][DATABASE_GROUP]`

            ``` yaml
            [MY_GROUP]
            user=MY_USER
            password=MY_PASSWORD
            host=MY_HOST
            port=3306
            database=MY_DATABASE
            ```

            ??? hint "Connecting to localhost (host machine) from inside our docker container"
                If you are using RAPIDS' docker container and Docker-for-mac or Docker-for-Windows 18.03+, you can connect to a MySQL database in your host machine using `host.docker.internal` instead of `127.0.0.1` or `localhost`. In a Linux host you need to run our docker container using `docker run --network="host" -d moshiresearch/rapids:latest` and then `127.0.0.1` will point to your host machine.
        ---

    === "fitbitjson_csv"

        This data stream processes Fitbit data inside a JSON column as obtained from the Fitbit API and stored in a CSV file.

        | Key | Description |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[FOLDER]` | Folder where you have to place a CSV file **per** Fitbit sensor. Each file has to contain all the data from every participant you want to process. |
        | `[COLUMN_MAPPINGS_READY]` | Set this to `True` after you have modified this stream's `format.yaml` column mappings to match your raw data column names: [`fitbitjson_csv`](../../datastreams/fitbitjson-csv#format) |

    === "fitbitparsed_mysql"

        This data stream processes Fitbit data stored in multiple columns after being parsed from the JSON column returned by the Fitbit API and stored in a MySQL database.

        | Key | Description |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[DATABASE_GROUP]` | A database credentials group. Read the instructions below to set it up |
        | `[COLUMN_MAPPINGS_READY]` | Set this to `True` after you have modified this stream's `format.yaml` column mappings to match your raw data column names: [`fitbitparsed_mysql`](../../datastreams/fitbitparsed-mysql#format) |

        ??? info "Setting up a DATABASE_GROUP and its connection credentials"

            1. If you haven't done so, create an empty file called `#!bash .env` in your RAPIDS root directory: `./.env`
            2. Add the following lines to `./.env` and replace your database-specific credentials (user, password, host, and database):
                1. Note that the label `[MY_GROUP]` is arbitrary but it has to match `[FITBIT_DATA_STREAMS][fitbitparsed_mysql][DATABASE_GROUP]`

            ``` yaml
            [MY_GROUP]
            user=MY_USER
            password=MY_PASSWORD
            host=MY_HOST
            port=3306
            database=MY_DATABASE
            ```

            ??? hint "Connecting to localhost (host machine) from inside our docker container"
                If you are using RAPIDS' docker container and Docker-for-mac or Docker-for-Windows 18.03+, you can connect to a MySQL database in your host machine using `host.docker.internal` instead of `127.0.0.1` or `localhost`. In a Linux host you need to run our docker container using `docker run --network="host" -d moshiresearch/rapids:latest` and then `127.0.0.1` will point to your host machine.
        ---

    === "fitbitparsed_csv"

        This data stream processes Fitbit data stored in multiple columns (plain text) after being parsed from the JSON column returned by the Fitbit API and stored in a CSV file.

        | Key | Description |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[FOLDER]` | Folder where you have to place a CSV file **per** Fitbit sensor. Each file has to contain all the data from every participant you want to process. |
        | `[COLUMN_MAPPINGS_READY]` | Set this to `True` after you have modified this stream's `format.yaml` column mappings to match your raw data column names: [`fitbitparsed_csv`](../../datastreams/fitbitparsed-csv#format) |

=== "Empatica"
mkdocs.yml

@@ -79,6 +79,18 @@ nav:
    - Example Workflows:
      - Minimal: workflow-examples/minimal.md
      - Analysis: workflow-examples/analysis.md
    - Data Streams:
      - Introduction: datastreams/data-streams-introduction.md
      - Phone:
        - aware_mysql: datastreams/aware-mysql.md
      - Mandatory Phone Format: datastreams/mandatory-phone-format.md
      - Fitbit:
        - fitbitjson_mysql: datastreams/fitbitjson-mysql.md
        - fitbitparsed_mysql: datastreams/fitbitparsed-mysql.md
        - fitbitjson_csv: datastreams/fitbitjson-csv.md
        - fitbitparsed_csv: datastreams/fitbitparsed-csv.md
      - Mandatory Fitbit Format: datastreams/mandatory-fitbit-format.md
      - Add New Data Streams: datastreams/add-new-data-streams.md
    - Behavioral Features:
      - Introduction: features/feature-introduction.md
      - Phone:
@@ -30,8 +30,9 @@ def get_phone_sensor_names():
            phone_sensor_names.append(config_key)
    return phone_sensor_names

from pathlib import Path
def get_zip_suffixes(pid):
    from pathlib import Path
    zipfiles = list((Path("data/external/empatica/") / Path(pid)).rglob("*.zip"))
    suffixes = []
    for zipfile in zipfiles:

@@ -42,3 +43,27 @@ def get_all_raw_empatica_sensor_files(wildcards):
    suffixes = get_zip_suffixes(wildcards.pid)
    files = ["data/raw/{}/empatica_{}_raw_{}.csv".format(wildcards.pid, wildcards.sensor, suffix) for suffix in suffixes]
    return(files)

def download_phone_data_input_with_mutation_scripts(wildcards):
    import yaml
    input = dict()
    phone_source_type = config["PHONE_DATA_CONFIGURATION"]["SOURCE"]["TYPE"]

    input["participant_file"] = "data/external/participant_files/{pid}.yaml"
    input["rapids_schema_file"] = "src/data/streams/rapids_columns.yaml"
    input["source_schema_file"] = "src/data/streams/" + phone_source_type + "/format.yaml"
    input["source_download_file"] = "src/data/streams/" + phone_source_type + "/container.R"

    schema = yaml.load(open(input.get("source_schema_file"), 'r'), Loader=yaml.FullLoader)
    sensor = ("phone_" + wildcards.sensor).upper()
    if sensor not in schema:
        raise ValueError("{sensor} is not defined in the schema {schema}".format(sensor=sensor, schema=input.get("source_schema_file")))
    for device_os in ["ANDROID", "IOS"]:
        scripts = schema[sensor][device_os]["MUTATION_SCRIPTS"]
        if isinstance(scripts, list):
            for idx, script in enumerate(scripts):
                if not script.lower().endswith((".py", ".r")):
                    raise ValueError("Mutation scripts can only be Python or R scripts (.py, .R).\n Instead we got {script} in \n [{sensor}][{device_os}] of {schema}".format(script=script, sensor=sensor, device_os=device_os, schema=input.get("source_schema_file")))
                input["mutationscript" + str(idx)] = script
    return input

@@ -24,14 +24,11 @@ rule create_participants_files:
        "../src/data/create_participants_files.R"

rule download_phone_data:
    input:
        "data/external/participant_files/{pid}.yaml"
    input: unpack(download_phone_data_input_with_mutation_scripts)
    params:
        source = config["PHONE_DATA_CONFIGURATION"]["SOURCE"],
        data_configuration = config["PHONE_DATA_CONFIGURATION"],
        sensor = "phone_" + "{sensor}",
        table = lambda wildcards: config["PHONE_" + str(wildcards.sensor).upper()]["TABLE"],
        timezone = config["PHONE_DATA_CONFIGURATION"]["TIMEZONE"]["VALUE"],
        aware_multiplatform_tables = config["PHONE_ACTIVITY_RECOGNITION"]["TABLE"]["ANDROID"] + "," + config["PHONE_ACTIVITY_RECOGNITION"]["TABLE"]["IOS"] + "," + config["PHONE_CONVERSATION"]["TABLE"]["ANDROID"] + "," + config["PHONE_CONVERSATION"]["TABLE"]["IOS"],
        tables = lambda wildcards: config["PHONE_" + str(wildcards.sensor).upper()]["TABLE"],
    output:
        "data/raw/{pid}/phone_{sensor}_raw.csv"
    script:

@@ -1,20 +1,14 @@
source("renv/activate.R")
source("src/data/unify_utils.R")
library(RMariaDB)
library(stringr)
library("dplyr", warn.conflicts = F)
library(readr)
library(yaml)
library(lubridate)
options(scipen=999)

validate_deviceid_platforms <- function(device_ids, platforms){
  if(length(device_ids) == 1){
    if(length(platforms) > 1 || (platforms != "android" && platforms != "ios"))
      stop(paste0("If you have 1 device_id, its platform should be 'android' or 'ios' but you typed: '", paste0(platforms, collapse = ","), "'. Participant file: ", participant))
  } else if(length(device_ids) > 1 && length(platforms) == 1){
    if(platforms != "android" && platforms != "ios" && platforms != "multiple")
      stop(paste0("If you have more than 1 device_id, platform should be 'android', 'ios' OR 'multiple' but you typed: '", paste0(platforms, collapse = ","), "'. Participant file: ", participant))
library(yaml)
library(dplyr)
library(readr)
# we use reticulate but only load it when we need it, to minimize the cases where old RAPIDS deployments have to update their renv

validate_deviceid_platforms <- function(device_ids, platforms, participant){
  if(length(device_ids) > 1 && length(platforms) == 1){
    if(platforms != "android" && platforms != "ios" && platforms != "infer")
      stop(paste0("If you have more than 1 device_id, platform should be 'android', 'ios' OR 'infer' but you typed: '", paste0(platforms, collapse = ","), "'. Participant file: ", participant))
  } else if(length(device_ids) > 1 && length(platforms) > 1){
    if(length(device_ids) != length(platforms))
      stop(paste0("The number of device_ids should match the number of platforms. Participant file:", participant))

@@ -23,85 +17,123 @@ validate_deviceid_platforms <- function(device_ids, platforms){
  }
}

is_multiplatform_participant <- function(dbEngine, device_ids, platforms){
  # Multiple android and ios platforms or the same platform (android, ios) for multiple devices
  if((length(device_ids) > 1 && length(platforms) > 1) || (length(device_ids) > 1 && length(platforms) == 1 && (platforms == "android" || platforms == "ios"))){
    return(TRUE)
  }
  # Multiple platforms for multiple devices, we search the platform for every device in the aware_device table
  if(length(device_ids) > 1 && length(platforms) == 1 && platforms == "multiple"){
    devices_platforms <- dbGetQuery(dbEngine, paste0("SELECT device_id,brand FROM aware_device WHERE device_id IN ('", paste0(device_ids, collapse = "','"), "')"))
    platforms <- devices_platforms %>% distinct(brand) %>% pull(brand)
    # Android phones have different brands so we check that we got at least two different platforms and one of them is iPhone
    if(length(platforms) > 1 && "iPhone" %in% platforms){
      return(TRUE)
    }
  }
  return(FALSE)
validate_inferred_os <- function(source_download_file, participant_file, device, device_os){
  if(!is.na(device_os) && device_os != "android" && device_os != "ios")
    stop(paste0("We tried to infer the OS for ", device, " but the 'infer_device_os' function inside '", source_download_file, "' returned '", device_os, "' instead of 'android' or 'ios'. You can assign the OS manually in the participant file or report this bug on GitHub.\nParticipant file ", participant_file))
}

get_timestamp_filter <- function(device_ids, participant, timezone){
  # Read start and end date from the participant file to filter data within that range
  start_date <- ymd_hms(paste(participant$PHONE$START_DATE, "00:00:00"), tz=timezone, quiet=TRUE)
  end_date <- ymd_hms(paste(participant$PHONE$END_DATE, "23:59:59"), tz=timezone, quiet=TRUE)
  start_timestamp = as.numeric(start_date) * 1000
  end_timestamp = as.numeric(end_date) * 1000
  if(is.na(start_timestamp)){
    message(paste("PHONE[START_DATE] was not provided or failed to parse (", participant$PHONE$START_DATE, "), all data for", paste0(device_ids, collapse=","), "is returned"))
    return("")
  }else if(is.na(end_timestamp)){
    message(paste("PHONE[END_DATE] was not provided or failed to parse (", participant$PHONE$END_DATE, "), all data for", paste0(device_ids, collapse=","), "is returned"))
    return("")
  } else if(start_timestamp > end_timestamp){
    stop(paste("Start date has to be before end date in PHONE[TIME_SPAN] (", start_date, ",", date(end_date), "), all data for", paste0(device_ids, collapse=","), "is returned"))
    return("")
mutate_data <- function(scripts, data){
  for(script in scripts){
    if(grepl("\\.(R)$", script)){
      myEnv <- new.env()
      source(script, local=myEnv)
      attach(myEnv, name="sourced_scripts_rapids")
      if(exists("main", myEnv)){
        message(paste("Applying mutation script", script))
        data <- main(data)
      } else{
        message(paste("Filtering data between", start_date, "and", end_date, "in", timezone, "for", paste0(device_ids, collapse=",")))
        return(paste0("AND timestamp BETWEEN ", start_timestamp, " AND ", end_timestamp))
        stop(paste0("The following mutation script does not have a main function: ", script))
      }
    }

participant_file <- snakemake@input[[1]]
source <- snakemake@params[["source"]]
group <- source$DATABASE_GROUP
table <- snakemake@params[["table"]]
sensor <- snakemake@params[["sensor"]]
timezone <- snakemake@params[["timezone"]]
aware_multiplatform_tables <- str_split(snakemake@params[["aware_multiplatform_tables"]], ",")[[1]]
sensor_file <- snakemake@output[[1]]

participant <- read_yaml(participant_file)
if(! "PHONE" %in% names(participant)){
  stop(paste("The following participant file does not have a PHONE section, create one manually or automatically (see the docs):", participant_file))
}
device_ids <- participant$PHONE$DEVICE_IDS
unified_device_id <- tail(device_ids, 1)
platforms <- participant$PHONE$PLATFORMS
validate_deviceid_platforms(device_ids, platforms)
timestamp_filter <- get_timestamp_filter(device_ids, participant, timezone)

dbEngine <- dbConnect(MariaDB(), default.file = "./.env", group = group)

if(is_multiplatform_participant(dbEngine, device_ids, platforms)){
  sensor_data <- unify_raw_data(dbEngine, table, sensor, timestamp_filter, aware_multiplatform_tables, device_ids, platforms)
      # rm(list = ls(envir = myEnv), envir = myEnv, inherits = FALSE)
      detach("sourced_scripts_rapids")
    } else{ # python
      library(reticulate)
      module <- gsub(pattern = "\\.py$", "", basename(script))
      script_functions <- import_from_path(module, path = dirname(script))
      if(py_has_attr(script_functions, "main")){
        message(paste("Applying mutation script", script))
        data <- script_functions$main(data)
      } else{
  # table has two elements for conversation and activity recognition (they store data in a different table for ios and android)
  if(length(table) > 1)
    table <- table[[toupper(platforms[1])]]
  query <- paste0("SELECT * FROM ", table, " WHERE ", source$DEVICE_ID_COLUMN, " IN ('", paste0(device_ids, collapse = "','"), "')", timestamp_filter)
  sensor_data <- dbGetQuery(dbEngine, query) %>%
    rename(device_id = source$DEVICE_ID_COLUMN)
        stop(paste0("The following mutation script does not have a main function: ", script))
      }
    }
  }

sensor_data <- sensor_data %>% arrange(timestamp)
  return(data)
}
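
To make the mutation contract concrete, here is a minimal sketch of an R mutation script (the file path is hypothetical); `mutate_data` above sources the script and hands the renamed sensor data frame to its `main` function:

``` r
# src/data/streams/mutations/example_mutation.R (hypothetical path)
library(dplyr)

# Every R mutation script must expose main(data); it receives the sensor data
# after column renaming and must return it, with any FLAG_TO_MUTATE columns filled in.
main <- function(data){
  data %>% mutate(double_values_0 = as.numeric(double_values_0))
}
```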

# Unify device_id
sensor_data <- sensor_data %>% mutate(device_id = unified_device_id)
rename_columns <- function(name_maps, data){
  for(name in names(name_maps))
    data <- data %>% rename(!!tolower(name) := name_maps[[name]])
  return(data)
}

# Removing the blob_feature conversation column (it's loaded as a list column that crashes write_csv)
sensor_data <- sensor_data %>% select(-any_of("blob_feature"))
# Dropping duplicates on all columns except for _id or id
sensor_data <- sensor_data %>% distinct(!!!syms(setdiff(names(sensor_data), c("_id", "id"))))
validate_expected_columns_mapping <- function(schema, rapids_schema, sensor, rapids_schema_file){
  android_columns <- names(schema[[sensor]][["ANDROID"]][["COLUMN_MAPPINGS"]])
  android_columns <- android_columns[(android_columns != "FLAG_AS_EXTRA")]

write_csv(sensor_data, sensor_file)
dbDisconnect(dbEngine)
  ios_columns <- names(schema[[sensor]][["IOS"]][["COLUMN_MAPPINGS"]])
  ios_columns <- ios_columns[(ios_columns != "FLAG_AS_EXTRA")]
  rapids_columns <- rapids_schema[[sensor]]

  if(is.null(rapids_columns))
    stop(paste(sensor, "columns are not listed in RAPIDS' column specification. If you are adding support for a new phone sensor, add any mandatory columns in", rapids_schema_file))
  if(length(setdiff(rapids_columns, android_columns)) > 0)
    stop(paste(sensor, "mappings are missing one or more mandatory columns for ANDROID. The missing column mappings are for", paste(setdiff(rapids_columns, android_columns), collapse=","), "in", rapids_schema_file))
  if(length(setdiff(rapids_columns, ios_columns)) > 0)
    stop(paste(sensor, "mappings are missing one or more mandatory columns for IOS. The missing column mappings are for", paste(setdiff(rapids_columns, ios_columns), collapse=","), "in", rapids_schema_file))
  if(length(setdiff(android_columns, rapids_columns)) > 0)
    stop(paste(sensor, "mappings have one or more columns more than required for ANDROID; add them as FLAG_AS_EXTRA instead. The extra column mappings are for", paste(setdiff(android_columns, rapids_columns), collapse=","), "in", rapids_schema_file))
  if(length(setdiff(ios_columns, rapids_columns)) > 0)
    stop(paste(sensor, "mappings have one or more columns more than required for IOS; add them as FLAG_AS_EXTRA instead. The extra column mappings are for", paste(setdiff(ios_columns, rapids_columns), collapse=","), "in", rapids_schema_file))
}

download_phone_data <- function(){
  participant_file <- snakemake@input[["participant_file"]]
  source_schema_file <- snakemake@input[["source_schema_file"]]
  rapids_schema_file <- snakemake@input[["rapids_schema_file"]]
  source_download_file <- snakemake@input[["source_download_file"]]
  data_configuration <- snakemake@params[["data_configuration"]]
  tables <- snakemake@params[["tables"]]
  sensor <- toupper(snakemake@params[["sensor"]])
  output_data_file <- snakemake@output[[1]]

  source(source_download_file)

  participant_data <- read_yaml(participant_file)
  schema <- read_yaml(source_schema_file)
  rapids_schema <- read_yaml(rapids_schema_file)
  devices <- participant_data$PHONE$DEVICE_IDS
  device_oss <- participant_data$PHONE$PLATFORMS
  device_oss <- replace(device_oss, device_oss == "multiple", "infer") # support "multiple" for backwards compatibility
  validate_deviceid_platforms(devices, device_oss, participant_file)

  if(length(device_oss) == 1)
    device_oss <- rep(device_oss, length(devices))

  validate_expected_columns_mapping(schema, rapids_schema, sensor, rapids_schema_file)
  # ANDROID and IOS COLUMN_MAPPINGS are guaranteed to be the same at this point (see the validate_expected_columns_mapping function)
  expected_columns <- tolower(names(schema[[sensor]][["ANDROID"]][["COLUMN_MAPPINGS"]]))
  expected_columns <- expected_columns[(expected_columns != "flag_as_extra")]
  participant_data <- setNames(data.frame(matrix(ncol = length(expected_columns), nrow = 0)), expected_columns)

  for(idx in seq_along(devices)){ # TODO remove length

    device <- devices[idx]
    message(paste0("\nProcessing ", sensor, " for ", device))
    device_os <- ifelse(device_oss[idx] == "infer", infer_device_os(data_configuration, device), device_oss[idx])
    validate_inferred_os(basename(source_download_file), participant_file, device, device_os)
    os_table <- ifelse(length(tables) > 1, tables[[toupper(device_os)]], tables) # some sensor tables have a different name for android and ios

    columns_to_download <- schema[[sensor]][[toupper(device_os)]][["COLUMN_MAPPINGS"]]
    columns_to_download <- columns_to_download[(columns_to_download != "FLAG_TO_MUTATE")]
    data <- download_data(data_configuration, device, os_table, columns_to_download)

    # Rename all COLUMN_MAPPINGS except those mapped as FLAG_AS_EXTRA or FLAG_TO_MUTATE
    columns_to_rename <- schema[[sensor]][[toupper(device_os)]][["COLUMN_MAPPINGS"]]
    columns_to_rename <- (columns_to_rename[(columns_to_rename != "FLAG_TO_MUTATE" & names(columns_to_rename) != "FLAG_AS_EXTRA")])
    renamed_data <- rename_columns(columns_to_rename, data)

    mutation_scripts <- schema[[sensor]][[toupper(device_os)]][["MUTATION_SCRIPTS"]]
    mutated_data <- mutate_data(mutation_scripts, renamed_data)

    if(length(setdiff(expected_columns, colnames(mutated_data))) > 0)
      stop(paste("The mutated data for", device, "is missing these columns expected by RAPIDS: [", paste(setdiff(expected_columns, colnames(mutated_data)), collapse=","), "]. One or more mutation scripts in [", sensor, "][", toupper(device_os), "][MUTATION_SCRIPTS] are removing or not adding these columns"))
    participant_data <- rbind(participant_data, mutated_data)

  }

  write_csv(participant_data, output_data_file)
}

download_phone_data()
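
For reference, `download_phone_data()` reads only the `DEVICE_IDS` and `PLATFORMS` keys from the participant file's `PHONE` section; a minimal sketch of what it consumes (ids are examples):

``` r
# Sketch: the PHONE keys consumed above, parsed from an inline participant file
library(yaml)

participant_data <- yaml.load("
PHONE:
  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524, 4b62a655-cbf0-4ac0-a448-06726f45b56a]
  PLATFORMS: [android, ios] # or a single value: android, ios, or infer
")
validate_deviceid_platforms(participant_data$PHONE$DEVICE_IDS,
                            participant_data$PHONE$PLATFORMS, "example.yaml")
```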

@@ -0,0 +1,63 @@
# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
library(RMariaDB)

# This file gets executed for each PHONE_SENSOR of each participant
# If you are connecting to a database, the env file containing its credentials is available at "./.env"
# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path set in config.yaml
# You are not bound to databases or files; you can query a web API or whatever data source you need.

#' @description
#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
#'
#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed,
#' and they have to assign the OS manually in the participant file.
#'
#' @param data_configuration The PHONE_DATA_CONFIGURATION key in config.yaml. If you need specific parameters, add them there.
#' @param device A device ID string
#' @return The OS the device ran, "android" or "ios"

infer_device_os <- function(data_configuration, device){
  group <- data_configuration$SOURCE$DATABASE_GROUP # DB credentials group specified in config.yaml

  dbEngine <- dbConnect(MariaDB(), default.file = "./.env", group = group)
  query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
  message(paste0("Executing the following query to infer phone OS: ", query))
  os <- dbGetQuery(dbEngine, query)
  dbDisconnect(dbEngine)

  if(nrow(os) > 0)
    return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
  else
    stop(paste("We cannot infer the OS of the following device id because it does not exist in the aware_device table:", device))
}

#' @description
#' Gets the sensor data for a specific device id from a database table, file, or whatever source you want to query
#'
#' @param data_configuration The PHONE_DATA_CONFIGURATION key in config.yaml. If you need specific parameters, add them there.
#' @param device A device ID string
#' @param sensor_container The database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[TABLE] key in config.yaml
#' @param columns The columns needed from this sensor (we recommend returning only these columns instead of every column in sensor_container)
#' @return A data frame with the sensor data for device

download_data <- function(data_configuration, device, sensor_container, columns){
  group <- data_configuration$SOURCE$DATABASE_GROUP
  dbEngine <- dbConnect(MariaDB(), default.file = "./.env", group = group)

  query <- paste0("SELECT ", paste(columns, collapse = ","), " FROM ", sensor_container, " WHERE device_id = '", device, "'")
  # Letting the user know what we are doing
  message(paste0("Executing the following query to download data: ", query))
  sensor_data <- dbGetQuery(dbEngine, query)

  dbDisconnect(dbEngine)

  if(nrow(sensor_data) == 0)
    warning(paste("The device '", device, "' did not have data in ", sensor_container))

  return(sensor_data)
}
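
For orientation (a sketch, not RAPIDS code), this is roughly how `download_phone_data.R` above ends up driving these two functions for one device; the table name and device id are examples:

``` r
# Sketch: exercising a container by hand from the RAPIDS root
library(yaml)

data_configuration <- read_yaml("config.yaml")$PHONE_DATA_CONFIGURATION
device <- "a748ee1a-1d0b-4ae9-9074-279a2b6ba524" # example id

os <- infer_device_os(data_configuration, device) # "android" or "ios"
columns <- c("timestamp", "device_id", "double_values_0",
             "double_values_1", "double_values_2") # from COLUMN_MAPPINGS
accel <- download_data(data_configuration, device, "accelerometer", columns)
```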

@@ -0,0 +1,18 @@
PHONE_ACCELEROMETER:
  ANDROID:
    COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_VALUES_0: double_values_0
      DOUBLE_VALUES_1: double_values_1
      DOUBLE_VALUES_2: double_values_2
    MUTATION_SCRIPTS: # List any Python or R scripts that mutate your raw data
  IOS:
    COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_VALUES_0: double_values_0
      DOUBLE_VALUES_1: double_values_1
      DOUBLE_VALUES_2: double_values_2
    MUTATION_SCRIPTS: # List any Python or R scripts that mutate your raw data
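
To see what these mappings do at runtime, here is a sketch of the `rename_columns` loop from `download_phone_data.R` applied to a hypothetical stream whose raw accelerometer axis is logged as `accel_x` (so its mapping would be `DOUBLE_VALUES_0: accel_x`):

``` r
# Sketch: COLUMN_MAPPINGS keys are RAPIDS names, values are your raw column names
library(dplyr)

name_maps <- list(TIMESTAMP = "timestamp", DEVICE_ID = "device_id",
                  DOUBLE_VALUES_0 = "accel_x") # hypothetical raw name
raw <- data.frame(timestamp = 1605559997000,
                  device_id = "a748ee1a", accel_x = 0.42)

for(name in names(name_maps))
  raw <- raw %>% rename(!!tolower(name) := name_maps[[name]])
colnames(raw) # "timestamp" "device_id" "double_values_0"
```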

@@ -0,0 +1,6 @@
PHONE_ACCELEROMETER:
  - TIMESTAMP
  - DEVICE_ID
  - DOUBLE_VALUES_0
  - DOUBLE_VALUES_1
  - DOUBLE_VALUES_2
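
A quick way to check a new stream's `format.yaml` against this specification, mirroring what `validate_expected_columns_mapping` does (paths follow the layout set up in the Snakefile hunk above):

``` r
# Sketch: confirm a stream maps every mandatory accelerometer column
library(yaml)

schema <- read_yaml("src/data/streams/aware_mysql/format.yaml")
rapids_schema <- read_yaml("src/data/streams/rapids_columns.yaml")

mapped <- names(schema$PHONE_ACCELEROMETER$ANDROID$COLUMN_MAPPINGS)
mapped <- mapped[mapped != "FLAG_AS_EXTRA"]
setdiff(rapids_schema$PHONE_ACCELEROMETER, mapped) # character(0) when complete
```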

@@ -97,14 +97,12 @@ definitions:
  properties:
    SOURCE:
      type: object
      required: [TYPE, DATABASE_GROUP, DEVICE_ID_COLUMN]
      required: [TYPE, DATABASE_GROUP]
      properties:
        TYPE:
          type: string
        DATABASE_GROUP:
          type: string
        DEVICE_ID_COLUMN:
          type: string
    TIMEZONE:
      type: object
      required: [TYPE, VALUE]

@@ -189,7 +187,7 @@ properties:
      properties:
        TYPE:
          type: string
          enum: [DATABASE]
          enum: [aware_mysql]

PHONE_ACCELEROMETER:
  type: object