Bring back requested fields in config.yaml.

Update coding files based on 7e565c34db98265afcda922a337493781fdd8ed5 in supermodule.
Completely remove PACKAGE_NAMES_HASHED and instead provide a differently structured file.
2023-04-19 11:07:58 +02:00 · 2023-04-18 22:58:42 +02:00 · 2023-04-18 22:45:12 +02:00 · 2023-04-18 22:40:11 +02:00 · 2023-04-18 21:34:59 +02:00 · 2023-04-18 21:23:26 +02:00
728 changed files with 33066 additions and 1967 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,14 @@
+# We'll let Git's auto-detection algorithm infer if a file is text. If it is,
+# enforce LF line endings regardless of OS or git configurations.
+* text=auto eol=lf
+
+# Isolate binary files in case the auto-detection algorithm fails and
+# marks them as text files (which could brick them).
+# gitattributes patterns do not support brace expansion, so list each extension.
+*.png binary
+*.jpg binary
+*.jpeg binary
+*.gif binary
+*.webp binary
+*.woff binary
+*.woff2 binary
--- a/.gitignore
+++ b/.gitignore
@@ -93,10 +93,17 @@ packrat/*

 # exclude data from source control by default
 data/external/*
+!/data/external/empatica/empatica1/E4 Data.zip
 !/data/external/.gitkeep
 !/data/external/stachl_application_genre_catalogue.csv
 !/data/external/timesegments*.csv
 !/data/external/wiki_tz.csv
+!/data/external/main_study_usernames.csv
+!/data/external/timezone.csv
+!/data/external/play_store_application_genre_catalogue.csv
+!/data/external/play_store_categories_count.csv
+
+
 data/raw/*
 !/data/raw/.gitkeep
 data/interim/*
@@ -114,3 +121,12 @@ settings.dcf
 tests/fakedata_generation/
 site/
 credentials.yaml
+
+# Docker container and other files
+.devcontainer
+
+# Calculating features module
+calculatingfeatures/
+
+# Temp folder for rapids data/external
+rapids_temp_data/
--- a/README.md
+++ b/README.md
@@ -11,3 +11,191 @@
 For more information refer to our [documentation](http://www.rapids.science)

 By [MoSHI](https://www.moshi.pitt.edu/), [University of Pittsburgh](https://www.pitt.edu/)
+
+## Installation 
+
+For RAPIDS installation, refer to the [documentation](https://www.rapids.science/1.8/setup/installation/).
+
+### For the installation of the Docker version
+
+1. Follow the [instructions](https://www.rapids.science/1.8/setup/installation/) to set up RAPIDS via Docker (from scratch).
+
+2. Delete the current contents of the /rapids/ folder while in a container session.
+    ```
+    cd ..
+    rm -rf rapids/{*,.*}
+    cd rapids
+    ```
+
+3. Clone the RAPIDS workspace from Git and check out a specific branch.
+    ```
+    git clone "https://repo.ijs.si/junoslukan/rapids.git" .
+    git checkout <branch_name>
+    ```
+
+4. Install the missing `libpq-dev` dependency with bash.
+    ```
+    apt-get update -y
+    apt-get install -y libpq-dev
+    ```
+
+5. Restore the R virtual environment.
+Type `R` to start an interactive R session, then run:
+    ```
+    renv::restore()
+    ```
+
+6. Install the cr-features module
+from https://repo.ijs.si/matjazbostic/calculatingfeatures.git (branch master),
+then follow the "cr-features module" section below.
+
+7. Install all required packages from environment.yml. The `--prune` option also removes conda packages that are not listed in the environment file.
+    ```
+    conda env update --file environment.yml --prune
+    ```
+
+8. If you wish to update your R or Python virtual environments:
+    ```
+    # R, in an interactive session:
+    renv::snapshot()
+    # Python:
+    conda env export --no-builds | sed 's/^.*libgfortran.*$/  - libgfortran/' | sed 's/^.*mkl=.*$/  - mkl/' >  environment.yml
+    ```
+
+### cr-features module 
+
+This RAPIDS extension uses the cr-features library, available [here](https://repo.ijs.si/matjazbostic/calculatingfeatures).
+
+To use the cr-features library:
+
+- Follow the installation instructions in the [README.md](https://repo.ijs.si/matjazbostic/calculatingfeatures/-/blob/master/README.md).
+
+- Copy the built `calculatingfeatures` folder into the RAPIDS workspace.
+
+- Install the cr-features package by:
+    ```
+    pip install path/to/the/calculatingfeatures/folder
+    # e.g. pip install ./calculatingfeatures if the folder was copied to the root of the RAPIDS workspace
+    # The cr-features package has to be rebuilt and reinstalled every time to get the newest version.
+    # Alternatively, use the newest version of the Docker image.
+    ```
+
+## Updating RAPIDS
+
+To update RAPIDS, first pull and merge [origin](https://github.com/carissalow/rapids), for example with:
+
+```commandline
+git fetch --progress "origin" refs/heads/master
+git merge --no-ff origin/master
+```
+
+Next, update the conda and R virtual environments.
+
+```bash
+R -e 'renv::restore(repos = c(CRAN = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))'
+```
+
+## Custom configuration
+### Credentials
+
+As mentioned under [Database in RAPIDS documentation](https://www.rapids.science/1.6/snippets/database/), a `credentials.yaml` file is needed to connect to a database.
+It should contain:
+
+```yaml
+PSQL_STRAW:
+  database: staw
+  host: 212.235.208.113
+  password: password
+  port: 5432
+  user: staw_db
+```
+
+where `password` must be replaced with the actual password.
+
+## Possible installation issues
+### Missing dependencies for RPostgres
+
+Installing the `RPostgres` R package (used to connect to the PostgreSQL database) might fail with the following error:
+
+```text
+------------------------- ANTICONF ERROR ---------------------------
+Configuration failed because libpq was not found. Try installing:
+   * deb: libpq-dev (Debian, Ubuntu, etc)
+   * rpm: postgresql-devel (Fedora, EPEL)
+   * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
+   * csw: postgresql_dev (Solaris)
+   * brew: libpq (OSX)
+If libpq is already installed, check that either:
+  (i)  'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains a libpq.pc file; or
+  (ii) 'pg_config' is in your PATH.
+If neither can detect , you can set INCLUDE_DIR
+and LIB_DIR manually via:
+  R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
+--------------------------[ ERROR MESSAGE ]----------------------------
+  <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
+compilation terminated.
+```
+
+The package requires `libpq` to compile from source, so install the matching development package for your system (see the list in the error message).
+
+### Timezone environment variable for tidyverse (relevant for WSL2)
+
+One of the R packages, `tidyverse`, might need access to the `TZ` environment variable during installation.
+On Ubuntu 20.04 on WSL2 this triggers the following error:
+
+```text
+> install.packages('tidyverse')
+
+ERROR: configuration failed for package ‘xml2’
+System has not been booted with systemd as init system (PID 1). Can't operate.
+Failed to create bus connection: Host is down
+Warning in system("timedatectl", intern = TRUE) :
+  running command 'timedatectl' had status 1
+Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
+  namespace ‘xml2’ 1.3.1 is already loaded, but >= 1.3.2 is required
+Calls: <Anonymous> ... namespaceImportFrom -> asNamespace -> loadNamespace
+Execution halted
+ERROR: lazy loading failed for package ‘tidyverse’
+```
+
+This happens because WSL2 does not use the `timedatectl` service, which provides this variable.
+
+```bash
+~$ timedatectl
+System has not been booted with systemd as init system (PID 1). Can't operate.
+Failed to create bus connection: Host is down
+```
+
+and later 
+
+```bash 
+Warning message:
+In system("timedatectl", intern = TRUE) :
+  running command 'timedatectl' had status 1
+Execution halted
+```
+
+This can be amended by setting the environment variable manually before attempting to install `tidyverse`:
+
+```bash
+export TZ='Europe/Ljubljana'
+```
+
+Note: if this is needed to avoid runtime issues, you need to either define this environment variable in each new terminal window or (better) define it in your `~/.bashrc` or `~/.bash_profile`.
+
+## Possible runtime issues
+### Unix end of line characters
+
+Upon running rapids, an error might occur:
+
+```bash
+/usr/bin/env: ‘python3\r’: No such file or directory
+```
+
+This is due to Windows style end of line characters. 
+To amend this, I added a `.gitattributes` file to force `git` to check out `rapids` using Unix EOL characters.
+If this still fails, `dos2unix` can be used to change them.
+
+### System has not been booted with systemd as init system (PID 1)
+
+See [the installation issue above](#timezone-environment-variable-for-tidyverse-relevant-for-wsl2).
--- a/Snakefile
+++ b/Snakefile
@@ -5,6 +5,7 @@ include: "rules/common.smk"
 include: "rules/renv.smk"
 include: "rules/preprocessing.smk"
 include: "rules/features.smk"
+include: "rules/models.smk"
 include: "rules/reports.smk"

 import itertools
@@ -45,7 +46,12 @@ for provider in config["PHONE_MESSAGES"]["PROVIDERS"].keys():
 for provider in config["PHONE_CALLS"]["PROVIDERS"].keys():
    if config["PHONE_CALLS"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/phone_calls_raw.csv", pid=config["PIDS"]))
-        files_to_compute.extend(expand("data/raw/{pid}/phone_calls_with_datetime.csv", pid=config["PIDS"]))
+        if (provider == "RAPIDS") and (config["PHONE_CALLS"]["PROVIDERS"][provider]["FEATURES_TYPE"] == "EPISODES"):
+            files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes_resampled.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
+        else:
+            files_to_compute.extend(expand("data/raw/{pid}/phone_calls_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_calls_features/phone_calls_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_CALLS"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_calls.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
@@ -122,6 +128,10 @@ for provider in config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv", pid=config["PIDS"]))
+        if config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["INCLUDE_EPISODE_FEATURES"]:
+            files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes_resampled.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_applications_foreground_features/phone_applications_foreground_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_applications_foreground.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
@@ -154,6 +164,25 @@ for provider in config["PHONE_CONVERSATION"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")

+for provider in config["PHONE_ESM"]["PROVIDERS"].keys():
+    if config["PHONE_ESM"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/raw/{pid}/phone_esm_raw.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/raw/{pid}/phone_esm_with_datetime.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/phone_esm_clean.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/phone_esm_features/phone_esm_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_ESM"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
+        files_to_compute.extend(expand("data/processed/features/{pid}/phone_esm.csv", pid=config["PIDS"]))
+        # files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv",pid=config["PIDS"]))
+        # files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
+
+for provider in config["PHONE_SPEECH"]["PROVIDERS"].keys():
+    if config["PHONE_SPEECH"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/raw/{pid}/phone_speech_raw.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/raw/{pid}/phone_speech_with_datetime.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/phone_speech_features/phone_speech_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_SPEECH"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
+        files_to_compute.extend(expand("data/processed/features/{pid}/phone_speech.csv", pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
+        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
+
 # We can delete these if's as soon as we add feature PROVIDERS to any of these sensors
 if isinstance(config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"], dict):
    for provider in config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"].keys():
@@ -208,7 +237,8 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys():
        if provider == "BARNETT":
            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_barnett_daily.csv", pid=config["PIDS"]))
        if provider == "DORYAB":
-            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))

        files_to_compute.extend(expand("data/raw/{pid}/phone_locations_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed.csv", pid=config["PIDS"]))
@@ -307,7 +337,7 @@ for provider in config["EMPATICA_ACCELEROMETER"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/empatica_accelerometer.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
-
+     
 for provider in config["EMPATICA_HEARTRATE"]["PROVIDERS"].keys():
    if config["EMPATICA_HEARTRATE"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/empatica_heartrate_raw.csv", pid=config["PIDS"]))
@@ -353,7 +383,7 @@ for provider in config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/empatica_inter_beat_interval.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
-
+     
 if isinstance(config["EMPATICA_TAGS"]["PROVIDERS"], dict):
    for provider in config["EMPATICA_TAGS"]["PROVIDERS"].keys():
        if config["EMPATICA_TAGS"]["PROVIDERS"][provider]["COMPUTE"]:
@@ -377,11 +407,41 @@ if config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_sensor_row_count_per_time_segment.html")

 if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
+    if not config["PHONE_DATA_YIELD"]["PROVIDERS"]["RAPIDS"]["COMPUTE"]:
+        raise ValueError("Error: [PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] must be True in config.yaml to get heatmaps of overall data yield.")
    files_to_compute.append("reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html")

 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")

+# Data Cleaning
+for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
+        if provider == "STRAW":
+            files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"]))
+        else:
+            files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"]))
+
+for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
+        if provider == "STRAW":
+            for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
+                files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_py_(" + target + ").csv"))
+        else:
+            files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv"))     
+
+# Baseline features
+if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]:
+    files_to_compute.extend(expand("data/raw/baseline_merged.csv"))
+    files_to_compute.extend(expand("data/raw/{pid}/participant_baseline_raw.csv", pid=config["PIDS"]))
+    files_to_compute.extend(expand("data/interim/{pid}/baseline_questionnaires.csv", pid=config["PIDS"]))
+    files_to_compute.extend(expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]))
+
+# Targets (labels)
+if config["PARAMS_FOR_ANALYSIS"]["TARGET"]["COMPUTE"]:
+    files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
+    for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
+        files_to_compute.extend(expand("data/processed/models/population_model/input_" + target + ".csv"))

 rule all:
    input:
--- a/init.py
+++ b/init.py
--- a/automl_test.py
+++ b/automl_test.py
@@ -0,0 +1,57 @@
+from pprint import pprint
+import sklearn.metrics
+import autosklearn.regression
+
+import datetime
+import importlib
+import os
+import sys
+
+import numpy as np
+import matplotlib.pyplot as plt
+import pandas as pd
+import seaborn as sns
+import yaml
+
+from sklearn import linear_model, svm, kernel_ridge, gaussian_process
+from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split
+from sklearn.metrics import mean_squared_error, r2_score
+from sklearn.impute import SimpleImputer
+
+model_input = pd.read_csv("data/processed/models/population_model/input_PANAS_negative_affect_mean.csv") # Standardized data
+
+model_input.dropna(axis=1, how="all", inplace=True)
+model_input.dropna(axis=0, how="any", subset=["target"], inplace=True)
+
+categorical_feature_colnames = ["gender", "startlanguage"]
+categorical_feature_colnames += [col for col in model_input.columns if "mostcommonactivity" in col or "homelabel" in col]
+categorical_features = model_input[categorical_feature_colnames].copy()
+mode_categorical_features = categorical_features.mode().iloc[0]
+categorical_features = categorical_features.fillna(mode_categorical_features)
+categorical_features = categorical_features.apply(lambda col: col.astype("category"))
+if not categorical_features.empty:
+    categorical_features = pd.get_dummies(categorical_features)
+numerical_features = model_input.drop(categorical_feature_colnames, axis=1)
+model_in = pd.concat([numerical_features, categorical_features], axis=1)
+
+index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]
+model_in.set_index(index_columns, inplace=True)
+
+X_train, X_test, y_train, y_test = train_test_split(model_in.drop(["target", "pid"], axis=1), model_in["target"], test_size=0.30)
+
+automl = autosklearn.regression.AutoSklearnRegressor(
+    time_left_for_this_task=7200,
+    per_run_time_limit=120
+)
+automl.fit(X_train, y_train, dataset_name='straw')
+
+print(automl.leaderboard())
+pprint(automl.show_models(), indent=4)
+
+train_predictions = automl.predict(X_train)
+print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
+test_predictions = automl.predict(X_test)
+print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
+
+import sys
+sys.exit()
--- a/config.yaml
+++ b/config.yaml
@@ -3,16 +3,17 @@
 ########################################################################################################################

 # See https://www.rapids.science/latest/setup/configuration/#participant-files
-PIDS: [test01]
+PIDS: ['p031', 'p032', 'p033', 'p034', 'p035', 'p036', 'p037', 'p038', 'p039', 'p040', 'p042', 'p043', 'p044', 'p045', 'p046', 'p049', 'p050', 'p052', 'p053', 'p054', 'p055', 'p057', 'p058', 'p059', 'p060', 'p061', 'p062', 'p064', 'p067', 'p068', 'p069', 'p070', 'p071', 'p072', 'p073', 'p074', 'p075', 'p076', 'p077', 'p078', 'p079', 'p080', 'p081', 'p082', 'p083', 'p084', 'p085', 'p086', 'p088', 'p089', 'p090', 'p091', 'p092', 'p093', 'p106', 'p107']

 # See https://www.rapids.science/latest/setup/configuration/#automatic-creation-of-participant-files
 CREATE_PARTICIPANT_FILES:
-  CSV_FILE_PATH: "data/external/example_participants.csv" # see docs for required format
+  USERNAMES_CSV: "data/external/main_study_usernames.csv"
+  CSV_FILE_PATH: "data/external/main_study_participants.csv" # see docs for required format
  PHONE_SECTION:
    ADD: True
    IGNORED_DEVICE_IDS: []
  FITBIT_SECTION:
-    ADD: True
+    ADD: False
    IGNORED_DEVICE_IDS: []
  EMPATICA_SECTION:
    ADD: True
@@ -20,19 +21,25 @@ CREATE_PARTICIPANT_FILES:

 # See https://www.rapids.science/latest/setup/configuration/#time-segments
 TIME_SEGMENTS: &time_segments
-  TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT
-  FILE: "data/external/timesegments_periodic.csv"
-  INCLUDE_PAST_PERIODIC_SEGMENTS: FALSE # Only relevant if TYPE=PERIODIC, see docs
+  TYPE: EVENT # FREQUENCY, PERIODIC, EVENT
+  FILE: "data/external/straw_events.csv"
+  INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs
+  TAILORED_EVENTS: # Only relevant if TYPE=EVENT
+    COMPUTE: True
+    SEGMENTING_METHOD: "30_before" # 30_before, 90_before, stress_event
+    INTERVAL_OF_INTEREST: 10 # duration of event of interest [minutes]
+    IOI_ERROR_TOLERANCE: 5 # interval of interest error tolerance (before and after IOI) [minutes]

 # See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study
 TIMEZONE: 
-    TYPE: SINGLE
+    TYPE: MULTIPLE
    SINGLE:
-      TZCODE: America/New_York
+      TZCODE: Europe/Ljubljana
    MULTIPLE:
-      TZCODES_FILE: data/external/multiple_timezones_example.csv
-      IF_MISSING_TZCODE: STOP
-      DEFAULT_TZCODE: America/New_York
+      TZ_FILE: data/external/timezone.csv
+      TZCODES_FILE: data/external/multiple_timezones.csv
+      IF_MISSING_TZCODE: USE_DEFAULT
+      DEFAULT_TZCODE: Europe/Ljubljana
      FITBIT: 
        ALLOW_MULTIPLE_TZ_PER_DEVICE: False
        INFER_FROM_SMARTPHONE_TZ: False
@@ -43,12 +50,15 @@ TIMEZONE:

 # See https://www.rapids.science/latest/setup/configuration/#data-stream-configuration
 PHONE_DATA_STREAMS:
-  USE: aware_mysql
+  USE: aware_postgresql
  
  # AVAILABLE:
  aware_mysql: 
    DATABASE_GROUP: MY_GROUP

+  aware_postgresql:
+    DATABASE_GROUP: PSQL_STRAW
+  
  aware_csv:
    FOLDER: data/external/aware_csv
  
@@ -65,7 +75,6 @@ PHONE_ACCELEROMETER:
      COMPUTE: False
      FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
      SRC_SCRIPT: src/features/phone_accelerometer/rapids/main.py
-    
    PANDA:
      COMPUTE: False
      VALID_SENSED_MINUTES: False
@@ -77,12 +86,12 @@ PHONE_ACCELEROMETER:
 # See https://www.rapids.science/latest/features/phone-activity-recognition/
 PHONE_ACTIVITY_RECOGNITION:
  CONTAINER: 
-    ANDROID: plugin_google_activity_recognition
+    ANDROID: google_ar
    IOS: plugin_ios_activity_recognition
  EPISODE_THRESHOLD_BETWEEN_ROWS: 5 # minutes. Max time difference for two consecutive rows to be considered within the same AR episode.
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["count", "mostcommonactivity", "countuniqueactivities", "durationstationary", "durationmobile", "durationvehicle"]
      ACTIVITY_CLASSES:
        STATIONARY: ["still", "tilting"]
@@ -95,35 +104,52 @@ PHONE_APPLICATIONS_CRASHES:
  CONTAINER: applications_crashes
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
-    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
-    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
-    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
+    CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
+    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD

 # See https://www.rapids.science/latest/features/phone-applications-foreground/
 PHONE_APPLICATIONS_FOREGROUND:
-  CONTAINER: applications_foreground
+  CONTAINER: applications
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
-    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
-    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
-    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
+    CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
+    # Refer to data/external/play_store_categories_count.csv for a list of categories (genres) and their frequency.
+    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
-      SINGLE_CATEGORIES: ["all", "email"]
+      COMPUTE: True
+      INCLUDE_EPISODE_FEATURES: True
+      SINGLE_CATEGORIES: ["Productivity", "Tools", "Communication", "Education", "Social"]
      MULTIPLE_CATEGORIES:
-        social: ["socialnetworks", "socialmediatools"]
-        entertainment: ["entertainment", "gamingknowledge", "gamingcasual", "gamingadventure", "gamingstrategy", "gamingtoolscommunity", "gamingroleplaying", "gamingaction", "gaminglogic", "gamingsports", "gamingsimulation"]
-      SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
-      EXCLUDED_CATEGORIES: []
-      EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
-      FEATURES: ["count", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
+        games: ["Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing"]
+        social: ["Communication", "Social", "Dating"]
+        productivity: ["Tools", "Productivity", "Finance", "Education", "News & Magazines", "Business", "Books & Reference"]
+        health: ["Health & Fitness", "Lifestyle", "Food & Drink", "Sports", "Medical", "Parenting"]
+        entertainment: ["Shopping", "Music & Audio", "Entertainment", "Travel & Local", "Photography", "Video Players & Editors", "Personalization", "House & Home", "Art & Design", "Auto & Vehicles", "Entertainment,Music & Video",
+                        "Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing" # Add all games.
+        ]
+        maps_weather: ["Maps & Navigation", "Weather"]
+      CUSTOM_CATEGORIES:
+      SINGLE_APPS: []
+      EXCLUDED_CATEGORIES: ["System", "STRAW"]
+      # Note: A special option here is "is_system_app".
+      # This excludes applications that have is_system_app = TRUE, which is a separate column in the table.
+      # However, all of these applications have been assigned System category.
+      # I will therefore filter by that category, which is a superset and is more complete. JL
+      EXCLUDED_APPS: []
+      FEATURES: 
+        APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
+        APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
+      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
+      IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable
      SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py

 # See https://www.rapids.science/latest/features/phone-applications-notifications/
 PHONE_APPLICATIONS_NOTIFICATIONS:
-  CONTAINER: applications_notifications
+  CONTAINER: notifications
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
@@ -137,7 +163,7 @@ PHONE_BATTERY:
  EPISODE_THRESHOLD_BETWEEN_ROWS: 30 # minutes. Max time difference for two consecutive rows to be considered within the same battery episode.
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["countdischarge", "sumdurationdischarge", "countcharge", "sumdurationcharge", "avgconsumptionrate", "maxconsumptionrate"]
      SRC_SCRIPT: src/features/phone_battery/rapids/main.py

@@ -151,7 +177,7 @@ PHONE_BLUETOOTH:
      SRC_SCRIPT: src/features/phone_bluetooth/rapids/main.R

    DORYAB:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: 
        ALL: 
            DEVICES: ["countscans", "uniquedevices", "meanscans", "stdscans"]
@ -169,10 +195,11 @@ PHONE_BLUETOOTH:

 # See https://www.rapids.science/latest/features/phone-calls/
 PHONE_CALLS:
-  CONTAINER: calls
+  CONTAINER: call
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
+      FEATURES_TYPE: EPISODES # EVENTS or EPISODES
      CALL_TYPES: [missed, incoming, outgoing]
      FEATURES:
        missed:  [count, distinctcontacts, timefirstcall, timelastcall, countmostfrequentcontact]
@ -181,7 +208,7 @@ PHONE_CALLS:
      SRC_SCRIPT: src/features/phone_calls/rapids/main.R

 # See https://www.rapids.science/latest/features/phone-conversation/
-PHONE_CONVERSATION:
+PHONE_CONVERSATION: # TODO Adapt for speech
  CONTAINER: 
    ANDROID: plugin_studentlife_audio_android
    IOS: plugin_studentlife_audio
@ -200,14 +227,35 @@ PHONE_CONVERSATION:

 # See https://www.rapids.science/latest/features/phone-data-yield/
 PHONE_DATA_YIELD:
-  SENSORS: []
+  SENSORS: [#PHONE_ACCELEROMETER,
+            PHONE_ACTIVITY_RECOGNITION,
+            PHONE_APPLICATIONS_FOREGROUND,
+            PHONE_APPLICATIONS_NOTIFICATIONS,
+            PHONE_BATTERY,
+            PHONE_BLUETOOTH,
+            PHONE_CALLS,
+            PHONE_LIGHT,
+            PHONE_LOCATIONS,
+            PHONE_MESSAGES,
+            PHONE_SCREEN,
+            PHONE_WIFI_VISIBLE]
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: [ratiovalidyieldedminutes, ratiovalidyieldedhours]
      MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: 0.5 # 0 to 1, minimum percentage of valid minutes in an hour to be considered valid.
      SRC_SCRIPT: src/features/phone_data_yield/rapids/main.R

+PHONE_ESM:
+  CONTAINER: esm
+  PROVIDERS:
+    STRAW:
+      COMPUTE: True
+      SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support", 
+              "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
+      FEATURES: [mean]
+      SRC_SCRIPT: src/features/phone_esm/straw/main.py
+
 # See https://www.rapids.science/latest/features/phone-keyboard/
 PHONE_KEYBOARD:
  CONTAINER: keyboard
@ -219,10 +267,10 @@ PHONE_KEYBOARD:

 # See https://www.rapids.science/latest/features/phone-light/
 PHONE_LIGHT:
-  CONTAINER: light
+  CONTAINER: light_sensor
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"]
      SRC_SCRIPT: src/features/phone_light/rapids/main.py

@ -232,12 +280,12 @@ PHONE_LOCATIONS:
  LOCATIONS_TO_USE: ALL_RESAMPLED # ALL, GPS, ALL_RESAMPLED, OR FUSED_RESAMPLED
  FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD: 30 # minutes, only replicate location samples to the next sensed bin if the phone did not stop collecting data for more than this threshold
  FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION: 720 # minutes, only replicate location samples to consecutive sensed bins if they were logged within this threshold after a valid location row
-  
+  ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy equal to or higher than this. This number means there's a 68% probability the true location is within this radius
+
  PROVIDERS:
    DORYAB:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome", "homelabel"]
-      ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      DBSCAN_EPS: 100 # meters
      DBSCAN_MINSAMPLES: 5
      THRESHOLD_STATIC : 1 # km/h
@ -251,9 +299,8 @@ PHONE_LOCATIONS:
      SRC_SCRIPT: src/features/phone_locations/doryab/main.py

    BARNETT:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
-      ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
      MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates grouped by minute) were used to compute features
      SRC_SCRIPT: src/features/phone_locations/barnett/main.R
@ -267,10 +314,10 @@ PHONE_LOG:

 # See https://www.rapids.science/latest/features/phone-messages/
 PHONE_MESSAGES:
-  CONTAINER: messages
+  CONTAINER: sms
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      MESSAGES_TYPES : [received, sent]
      FEATURES: 
        received: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact]
@ -282,14 +329,23 @@ PHONE_SCREEN:
  CONTAINER: screen
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      REFERENCE_HOUR_FIRST_USE: 0
      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
-      IGNORE_EPISODES_LONGER_THAN: 0 # in minutes, set to 0 to disable
+      IGNORE_EPISODES_LONGER_THAN: 360 # in minutes, set to 0 to disable
      FEATURES: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration", "firstuseafter"] # "episodepersensedminutes" needs to be added later
      EPISODE_TYPES: ["unlock"]
      SRC_SCRIPT: src/features/phone_screen/rapids/main.py

+# Custom added sensor
+PHONE_SPEECH:
+  CONTAINER: speech
+  PROVIDERS:
+    STRAW:
+      COMPUTE: True
+      FEATURES: ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
+      SRC_SCRIPT: src/features/phone_speech/straw/main.py
+
 # See https://www.rapids.science/latest/features/phone-wifi-connected/
 PHONE_WIFI_CONNECTED:
  CONTAINER: sensor_wifi
@ -304,7 +360,7 @@ PHONE_WIFI_VISIBLE:
  CONTAINER: wifi
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
      SRC_SCRIPT: src/features/phone_wifi_visible/rapids/main.R

@ -407,7 +463,6 @@ FITBIT_SLEEP_INTRADAY:
        UNIFIED: [awake, asleep]
      SLEEP_TYPES: [main, nap, all]
      SRC_SCRIPT: src/features/fitbit_sleep_intraday/rapids/main.py
-  
    PRICE:
      COMPUTE: False
      FEATURES: [avgduration, avgratioduration, avgstarttimeofepisodemain, avgendtimeofepisodemain, avgmidpointofepisodemain, stdstarttimeofepisodemain, stdendtimeofepisodemain, stdmidpointofepisodemain, socialjetlag, rmssdmeanstarttimeofepisodemain, rmssdmeanendtimeofepisodemain, rmssdmeanmidpointofepisodemain, rmssdmedianstarttimeofepisodemain, rmssdmedianendtimeofepisodemain, rmssdmedianmidpointofepisodemain]
@ -443,13 +498,15 @@ FITBIT_STEPS_INTRADAY:
    RAPIDS:
      COMPUTE: False
      FEATURES:
-        STEPS: ["sum", "max", "min", "avg", "std"]
+        STEPS: ["sum", "max", "min", "avg", "std", "firststeptime", "laststeptime"]
        SEDENTARY_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
        ACTIVE_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
+      REFERENCE_HOUR: 0
      THRESHOLD_ACTIVE_BOUT: 10 # steps
      INCLUDE_ZERO_STEP_ROWS: False
      SRC_SCRIPT: src/features/fitbit_steps_intraday/rapids/main.py

+
 ########################################################################################################################
 #                                                 EMPATICA                                                             #
 ########################################################################################################################
@ -471,6 +528,15 @@ EMPATICA_ACCELEROMETER:
      COMPUTE: False
      FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
      SRC_SCRIPT: src/features/empatica_accelerometer/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ["totalMagnitudeBand", "absoluteMeanBand", "varianceBand"] # Acc features
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 15 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
+      SRC_SCRIPT: src/features/empatica_accelerometer/cr/main.py
+

 # See https://www.rapids.science/latest/features/empatica-heartrate/
 EMPATICA_HEARTRATE:
@ -489,6 +555,15 @@ EMPATICA_TEMPERATURE:
      COMPUTE: False
      FEATURES: ["maxtemp", "mintemp", "avgtemp", "mediantemp", "modetemp", "stdtemp", "diffmaxmodetemp", "diffminmodetemp", "entropytemp"]
      SRC_SCRIPT: src/features/empatica_temperature/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ["maximum", "minimum", "meanAbsChange", "longestStrikeAboveMean", "longestStrikeBelowMean", 
+                  "stdDev", "median", "meanChange", "sumSquared", "squareSumOfComponent", "sumOfSquareComponents"]
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 300 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
+      SRC_SCRIPT: src/features/empatica_temperature/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-electrodermal-activity/
 EMPATICA_ELECTRODERMAL_ACTIVITY:
@ -498,6 +573,19 @@ EMPATICA_ELECTRODERMAL_ACTIVITY:
      COMPUTE: False
      FEATURES: ["maxeda", "mineda", "avgeda", "medianeda", "modeeda", "stdeda", "diffmaxmodeeda", "diffminmodeeda", "entropyeda"]
      SRC_SCRIPT: src/features/empatica_electrodermal_activity/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ['mean', 'std', 'q25', 'q75', 'qd', 'deriv', 'power', 'numPeaks', 'ratePeaks', 'powerPeaks', 'sumPosDeriv', 'propPosDeriv', 'derivTonic', 
+                  'sigTonicDifference', 'freqFeats','maxPeakAmplitudeChangeBefore', 'maxPeakAmplitudeChangeAfter', 'avgPeakAmplitudeChangeBefore', 
+                  'avgPeakAmplitudeChangeAfter', 'avgPeakChangeRatio', 'maxPeakIncreaseTime', 'maxPeakDecreaseTime', 'maxPeakDuration', 'maxPeakChangeRatio',
+                  'avgPeakIncreaseTime', 'avgPeakDecreaseTime', 'avgPeakDuration', 'signalOverallChange', 'changeDuration', 'changeRate', 'significantIncrease', 
+                  'significantDecrease']
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 60 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', count_windows, eda_num_peaks_non_zero]
+        IMPUTE_NANS: True
+      SRC_SCRIPT: src/features/empatica_electrodermal_activity/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-blood-volume-pulse/
 EMPATICA_BLOOD_VOLUME_PULSE:
@ -507,6 +595,15 @@ EMPATICA_BLOOD_VOLUME_PULSE:
      COMPUTE: False
      FEATURES: ["maxbvp", "minbvp", "avgbvp", "medianbvp", "modebvp", "stdbvp", "diffmaxmodebvp", "diffminmodebvp", "entropybvp"]
      SRC_SCRIPT: src/features/empatica_blood_volume_pulse/dbdp/main.py
+    CR:
+      COMPUTE: False
+      FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
+                  'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 300 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
+      SRC_SCRIPT: src/features/empatica_blood_volume_pulse/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-inter-beat-interval/
 EMPATICA_INTER_BEAT_INTERVAL:
@ -516,6 +613,16 @@ EMPATICA_INTER_BEAT_INTERVAL:
      COMPUTE: False
      FEATURES: ["maxibi", "minibi", "avgibi", "medianibi", "modeibi", "stdibi", "diffmaxmodeibi", "diffminmodeibi", "entropyibi"]
      SRC_SCRIPT: src/features/empatica_inter_beat_interval/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
+                  'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features            
+      PATCH_WITH_BVP: True
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 300 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
+      SRC_SCRIPT: src/features/empatica_inter_beat_interval/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-tags/
 EMPATICA_TAGS:
@ -556,3 +663,96 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
  CORR_THRESHOLD: 0.1
  CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}

+
+########################################################################################################################
+#                                                    Data Cleaning                                                     #
+########################################################################################################################
+
+ALL_CLEANING_INDIVIDUAL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 1 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
+    STRAW:
+      COMPUTE: True
+      PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
+      COLS_NAN_THRESHOLD: 0.9 # set to 1 to remove only columns that contain 100% NaN values
+      COLS_VAR_THRESHOLD: True
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      STANDARDIZATION: True
+      SRC_SCRIPT: src/features/all_cleaning_individual/straw/main.py
+
+ALL_CLEANING_OVERALL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 1 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
+    STRAW:
+      COMPUTE: True
+      PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
+      COLS_NAN_THRESHOLD: 0.8 # set to 1 to remove only columns that contain 100% NaN values
+      COLS_VAR_THRESHOLD: True
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      STANDARDIZATION: True
+      TARGET_STANDARDIZATION: False
+      SRC_SCRIPT: src/features/all_cleaning_overall/straw/main.py
+
+
+########################################################################################################################
+#                                                      Baseline                                                        #
+########################################################################################################################
+
+PARAMS_FOR_ANALYSIS:
+  BASELINE:
+    COMPUTE: True
+    FOLDER: data/external/baseline
+    CONTAINER: [results-survey637813_final.csv,  # Slovenia
+                results-survey358134_final.csv,  # Belgium 1
+                results-survey413767_final.csv  # Belgium 2
+    ]
+    QUESTION_LIST: survey637813+question_text.csv
+    FEATURES: [age, gender, startlanguage, limesurvey_demand, limesurvey_control, limesurvey_demand_control_ratio, limesurvey_demand_control_ratio_quartile]
+    CATEGORICAL_FEATURES: [gender]
+
+  TARGET:
+    COMPUTE: True
+    LABEL: appraisal_stressfulness_event_mean
+    ALL_LABELS: [PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, JCQ_coworker_support_mean, appraisal_stressfulness_period_mean]
+                # PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, 
+                # JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean
--- a/data/external/aware_csv/calls.csv
+++ b/data/external/aware_csv/calls.csv
@ -0,0 +1,9 @@
+"_id","timestamp","device_id","call_type","call_duration","trace"
+1,1587663260695,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,14,"d5e84f8af01b2728021d4f43f53a163c0c90000c"
+2,1587739118007,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"47c125dc7bd163b8612cdea13724a814917b6e93"
+5,1587746544891,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,95,"9cc793ffd6e88b1d850ce540b5d7e000ef5650d4"
+6,1587911379859,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,63,"51fb9344e988049a3fec774c7ca622358bf80264"
+7,1587992647361,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"2a862a7730cfdfaf103a9487afe3e02935fd6e02"
+8,1588020039448,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",1,11,"a2c53f6a086d98622c06107780980cf1bb4e37bd"
+11,1588176189024,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,65,"56589df8c830c70e330b644921ed38e08d8fd1f3"
+12,1588197745079,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"cab458018a8ed3b626515e794c70b6f415318adc"
--- a/data/external/empatica/empatica1/E4
+++ b/data/external/empatica/empatica1/E4
--- a/data/external/main_study_usernames.csv
+++ b/data/external/main_study_usernames.csv
@ -0,0 +1,57 @@
+label,empatica_id
+uploader_79170,A0245B
+uploader_89788,A02731
+uploader_68294,A02705
+uploader_92856,A024AF
+uploader_23726,A0231C
+uploader_66620,A02305
+uploader_58435,A026B5
+uploader_87801,A022A8
+uploader_96055,A027BA
+uploader_69549,A0226C
+uploader_26363,A0263D
+uploader_72010,A023FA
+uploader_13997,A024AF
+uploader_31156,A02305
+uploader_63187,A027BA
+uploader_94821,A022A8
+uploader_65413,A023F1;A023FA
+uploader_36488,A02713
+uploader_91087,A0231C
+uploader_35174,A025D1
+uploader_73880,A02705
+uploader_78650,A02731
+uploader_70578,A0245B
+uploader_88313,A02736
+uploader_58482,A0261A
+uploader_80601,A027BA
+uploader_93729,A0226C
+uploader_61663,A0245B
+uploader_80848,A025D1
+uploader_57312,A023F9;A02361;A027A0
+uploader_52087,A02666
+uploader_98770,A02953
+uploader_51327,A0245F
+uploader_11737,A02732
+uploader_77440,A0264E
+uploader_57277,A02422
+uploader_13098,A026E5
+uploader_80719,A023C8
+uploader_54698,A02953
+uploader_95571,A02853
+uploader_21880,A024DC
+uploader_92905,A02920
+uploader_12108,A023F4
+uploader_17436,A026E5
+uploader_58440,A0273F
+uploader_22172,A0245F
+uploader_39250,A02422
+uploader_15311,A023F9
+uploader_45766,A02920
+uploader_23096,A02361
+uploader_78243,A02422
+uploader_58777,A0245F
+uploader_82941,A02666
+uploader_89606,A023F4
+uploader_82969,A023C8
+uploader_53573,A024DC;A02361
--- a/data/external/participant_files/p01.yaml
+++ b/data/external/participant_files/p01.yaml
@ -0,0 +1,11 @@
+PHONE:
+  DEVICE_IDS: [4b62a655-cbf0-4ac0-a448-06726f45b56a]
+  PLATFORMS: [android]
+  LABEL: uploader_53573
+  START_DATE: 2021-05-21 09:21:24
+  END_DATE: 2021-07-12 17:32:07
+EMPATICA:
+  DEVICE_IDS: [uploader_53573]
+  LABEL: uploader_53573
+  START_DATE: 2021-05-21 09:21:24
+  END_DATE: 2021-07-12 17:32:07
--- a/data/external/play_store_application_genre_catalogue.csv
+++ b/data/external/play_store_application_genre_catalogue.csv
--- a/data/external/play_store_categories_count.csv
+++ b/data/external/play_store_categories_count.csv
@ -0,0 +1,45 @@
+genre,n
+System,261
+Tools,96
+Productivity,71
+Health & Fitness,60
+Finance,54
+Communication,39
+Music & Audio,39
+Shopping,38
+Lifestyle,33
+Education,28
+News & Magazines,24
+Maps & Navigation,23
+Entertainment,21
+Business,18
+Travel & Local,18
+Books & Reference,16
+Social,16
+Weather,16
+Food & Drink,14
+Sports,14
+Other,13
+Photography,13
+Puzzle,13
+Video Players & Editors,12
+Card,9
+Casual,9
+Personalization,8
+Medical,7
+Board,5
+Strategy,4
+House & Home,3
+Trivia,3
+Word,3
+Adventure,2
+Art & Design,2
+Auto & Vehicles,2
+Dating,2
+Role Playing,2
+STRAW,2
+Simulation,2
+"Board,Brain Games",1
+"Entertainment,Music & Video",1
+Parenting,1
+Racing,1
--- a/data/external/timesegments_daily.csv
+++ b/data/external/timesegments_daily.csv
@ -0,0 +1,3 @@
+label,start_time,length,repeats_on,repeats_value
+daily,04:00:00,23H 59M 59S,every_day,0
+working_day,04:00:00,18H 00M 00S,every_day,0
--- a/data/external/timesegments_frequency.csv
+++ b/data/external/timesegments_frequency.csv
@ -1,2 +1,2 @@
 label,length
-thirtyminutes,30
+fiveminutes,5
--- a/data/external/timesegments_periodic.csv
+++ b/data/external/timesegments_periodic.csv
@ -1,9 +1,2 @@
 label,start_time,length,repeats_on,repeats_value
-threeday,00:00:00,2D 23H 59M 59S,every_day,0
-daily, 00:00:00,23H 59M 59S, every_day, 0
-morning,06:00:00,5H 59M 59S,every_day,0
-afternoon,12:00:00,5H 59M 59S,every_day,0
-evening,18:00:00,5H 59M 59S,every_day,0
-night,00:00:00,5H 59M 59S,every_day,0
-two_weeks_overlapping,00:00:00,13D 23H 59M 59S,every_day,0
-weekends,00:00:00,2D 23H 59M 59S,wday,5
+daily,00:00:00,23H 59M 59S,every_day,0
--- a/data/external/timezone.csv
+++ b/data/external/timezone.csv
--- a/docs/analysis/complete-workflow-example.md
+++ b/docs/analysis/complete-workflow-example.md
@ -1,8 +1,8 @@
 # Analysis Workflow Example

 !!! info "TL;DR"
-    - In addition to using RAPIDS to extract behavioral features and create plots, you can structure your data analysis within RAPIDS (i.e. cleaning your features and creating ML/statistical models)
-    - We include an analysis example in RAPIDS that covers raw data processing, cleaning, feature extraction, machine learning modeling, and evaluation
+    - In addition to using RAPIDS to extract behavioral features, create plots, and clean sensor features, you can structure your data analysis within RAPIDS (i.e. creating ML/statistical models and evaluating your models)
+    - We include an analysis example in RAPIDS that covers raw data processing, feature extraction, cleaning, machine learning modeling, and evaluation
    - Use this example as a guide to structure your own analysis within RAPIDS
    - RAPIDS analysis workflows are compatible with your favorite data science tools and libraries
    - RAPIDS analysis workflows are reproducible and we encourage you to publish them along with your research papers
@ -52,7 +52,7 @@ Note you will see a lot of warning messages, you can ignore them since they happ
 ## Modules of our analysis workflow example

 ??? info "1. Feature extraction"
-    We extract daily behavioral features for data yield, received and sent messages, missed, incoming and outgoing calls, resample fused location data using Doryab provider, activity recognition, battery, Bluetooth, screen, light, applications foreground, conversations, Wi-Fi connected, Wi-Fi visible, Fitbit heart rate summary and intraday data, Fitbit sleep summary data, and Fitbit step summary and intraday data without excluding sleep periods with an active bout threshold of 10 steps. In total, we obtained 237 daily sensor features over 12 days per participant. 
+    We extract daily behavioral features for data yield, received and sent messages, missed, incoming and outgoing calls, resample fused location data using Doryab provider, activity recognition, battery, Bluetooth, screen, light, applications foreground, conversations, Wi-Fi connected, Wi-Fi visible, Fitbit heart rate summary and intraday data, Fitbit sleep summary data, and Fitbit step summary and intraday data without excluding sleep periods with an active bout threshold of 10 steps. In total, we obtained 245 daily sensor features over 12 days per participant. 

 ??? info "2. Extract demographic data."
    It is common to have demographic data in addition to mobile and target (ground truth) data. In this example we include participants’ age, gender and the number of days they spent in hospital after their surgery as features in our model. We extract these three columns from the `data/external/example_workflow/participant_info.csv` file. As these three features remain the same within participants, they are used only on the population model. Refer to the `demographic_features` rule in `rules/models.smk`.
@ -69,12 +69,12 @@ Note you will see a lot of warning messages, you can ignore them since they happ
 ??? info "6. Feature cleaning."
    In this stage we perform four steps to clean our sensor feature file. First, we discard days with a data yield hour ratio less than or equal to 0.75, i.e. we include days with at least 18 hours of data. Second, we drop columns (features) with more than 30% of missing rows. Third, we drop columns with zero variance. Fourth, we drop rows (days) with more than 30% of missing columns (features). In this cleaning stage several parameters are created and exposed in `example_profile/example_config.yaml`. 

-    After this step, we kept 158 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 106 features over 20 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stops researchers from collecting the same number of sensors than in Android phones. 
+    After this step, we kept 173 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 117 features over 22 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stop researchers from collecting data from as many sensors as on Android phones.
    
    Feature cleaning for the individual models is done in the `clean_sensor_features_for_individual_participants` rule and for the population model in the `clean_sensor_features_for_all_participants` rule in `rules/models.smk`.

 ??? info "7. Merge features and targets."
-    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/models.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/models.smk`. These two merged files are the input for our individual and population models. 
+    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/features.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/features.smk`. These two merged files are the input for our individual and population models. 

 ??? info "8. Modelling."
    This stage has three phases: model building, training and evaluation. 
--- a/docs/analysis/data-cleaning.md
+++ b/docs/analysis/data-cleaning.md
@ -0,0 +1,92 @@
+Data Cleaning
+=============
+
+The goal of this module is to perform basic cleaning tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. If you are interested in creating individual models (using each participant's features independently of the others), use [`ALL_CLEANING_INDIVIDUAL`]. If you are interested in creating population models (using everyone's data in the same model), use [`ALL_CLEANING_OVERALL`].
+    
+## Clean sensor features for individual participants
+
+!!! info "File Sequence"
+    ```bash
+    - data/processed/features/{pid}/all_sensor_features.csv
+    - data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}.csv
+    ```
+
+### RAPIDS provider
+
+Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
+
+|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
+|----------------|-----------------------------------------------------------------------------------------------------------------------------------
+|`[COMPUTE]` | Set to `True` to execute the cleaning tasks described below. You can use the parameters of each task to tweak them or deactivate them|
+|`[IMPUTE_SELECTED_EVENT_FEATURES]`     | Fill NAs with 0 only for event-based features, see table below
+|`[COLS_NAN_THRESHOLD]`                 | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable
+|`[COLS_VAR_THRESHOLD]`                 | Set to `True` to discard columns with zero variance
+|`[ROWS_NAN_THRESHOLD]`                 | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable
+|`[DATA_YIELD_FEATURE]`                 | `RATIO_VALID_YIELDED_HOURS` or `RATIO_VALID_YIELDED_MINUTES`
+|`[DATA_YIELD_RATIO_THRESHOLD]`         | Discard rows whose `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature is less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature used is determined by the `[DATA_YIELD_FEATURE]` parameter. Set to 0 to disable
+|`[DROP_HIGHLY_CORRELATED_FEATURES]`    | Discard highly correlated features, see table below
+
+Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
+
+|Parameters                             | Description                                                    |
+|-------------------------------------- |----------------------------------------------------------------|
+|`[COMPUTE]`                            | Set to `True` to fill NAs with 0 for phone event-based features
+|`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Any missing (NA) event-based feature value in a time segment instance with phone data yield > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` will be replaced with zero. See below for an explanation. |
+
+Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][DROP_HIGHLY_CORRELATED_FEATURES]`:
+
+|Parameters                             | Description                                                    |
+|-------------------------------------- |----------------------------------------------------------------|
+|`[COMPUTE]`                            | Set to `True` to drop highly correlated features
+|`[MIN_OVERLAP_FOR_CORR_THRESHOLD]`     | Minimum ratio of observations required per pair of columns (features) to be considered as a valid correlation. 
+|`[CORR_THRESHOLD]` | The absolute values of pair-wise correlations are calculated. If two variables have a valid correlation higher than `[CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation.
+
+Steps to clean sensor features for individual participants. It only considers the **phone sensors** currently.
+
+??? info "1. Fill NA with 0 for the selected event features."
+    Some event features should be zero instead of NA. In this step, we fill those missing features with 0 when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column is higher than the `[IMPUTE_SELECTED_EVENT_FEATURES][MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` parameter. Plugins such as the Activity Recognition sensor are not considered. You can skip this step by setting `[IMPUTE_SELECTED_EVENT_FEATURES][COMPUTE]` to `False`.
+    
+    Take the phone calls sensor as an example. If there are no call records during a time segment for a participant, then either (1) the calls sensor was not working during that time segment, or (2) the calls sensor was working and the participant did not have any calls during that time segment. To differentiate these two situations, we assume the selected sensors are working when `phone_data_yield_rapids_ratiovalidyieldedminutes > [MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`.
+
+    The following phone event-based features are considered currently:
+
+      - Application foreground: countevent, countepisode, minduration, maxduration, meanduration, sumduration.
+      - Battery: all features.
+      - Calls: count, distinctcontacts, sumduration, minduration, maxduration, meanduration, modeduration.
+      - Keyboard: sessioncount, averagesessionlength, changeintextlengthlessthanminusone, changeintextlengthequaltominusone, changeintextlengthequaltoone, changeintextlengthmorethanone, maxtextlength, totalkeyboardtouches.
+      - Messages: count, distinctcontacts.
+      - Screen: sumduration, maxduration, minduration, avgduration, countepisode.
+      - WiFi: all connected and visible features.
+
+??? info "2. Discard unreliable rows."
+    Extracted features might not be reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using the `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_MINUTES`) on time segments that are shorter than two or three hours and the `phone_data_yield_rapids_ratiovalidyieldedhours` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_HOURS`) for longer segments. We do not recommend skipping this step, but you can do so by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
+
+??? info "3. Discard columns (features) with too many missing values."
+    In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[COLS_NAN_THRESHOLD]` to 1.
+
+??? info "4. Discard columns (features) with zero variance."
+    In this step, we discard columns with zero variance. We do not recommend skipping this step, but you can do so by setting `[COLS_VAR_THRESHOLD]` to `False`.
+
+??? info "5. Drop highly correlated features."
+    As highly correlated features might not bring additional information and will increase the complexity of a model, we drop them in this step. The absolute values of pair-wise correlations are calculated. Each correlation between two variables is regarded as valid only if the ratio of valid value pairs (i.e. non-NA pairs) is greater than or equal to `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a valid correlation coefficient higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. This step can be skipped by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
+
+??? info "6. Discard rows with too many missing values."
+    In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. In other words, we discard time segments (e.g. days) that did not have enough data to be considered reliable. This step is similar to step 2, except that the ratio is computed based on NA values instead of a phone data yield threshold. We do not recommend skipping this step, but you can do so by setting `[ROWS_NAN_THRESHOLD]` to 1.
+
+
+
+
+## Clean sensor features for all participants
+
+!!! info "File Sequence"
+    ```bash
+    - data/processed/features/all_participants/all_sensor_features.csv
+    - data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}.csv
+    ```
+
+
+### RAPIDS provider
+
+Parameters description and the steps are the same as the above [RAPIDS provider](#rapids-provider) section for individual participants.
+
+
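Note on the cleaning steps documented in the file above: the sketch below illustrates steps 1-4 and 6 in plain Python, under stated assumptions. It is not the RAPIDS implementation. The helper names (`is_na`, `na_ratio`, `clean_features`), the feature columns other than the data yield column, and all threshold values are made up for illustration. Step 5 (dropping highly correlated features) is left out to keep the example short.

```python
import math

# Column name taken from the documentation above; the other columns are made up.
YIELD = "phone_data_yield_rapids_ratiovalidyieldedminutes"


def is_na(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def na_ratio(values):
    """Share of missing entries; an empty sequence counts as fully present."""
    values = list(values)
    return sum(is_na(v) for v in values) / len(values) if values else 0.0


def clean_features(rows, feature_cols, event_cols, params):
    """Hypothetical helper: apply steps 1-4 and 6 to a list of row dicts."""
    rows = [dict(r) for r in rows]  # work on copies
    cols = list(feature_cols)

    # Step 1: fill NA with 0 for event features when the data yield is high enough.
    impute = params["IMPUTE_SELECTED_EVENT_FEATURES"]
    if impute["COMPUTE"]:
        for r in rows:
            if r[YIELD] > impute["MIN_DATA_YIELDED_MINUTES_TO_IMPUTE"]:
                for c in event_cols:
                    if is_na(r[c]):
                        r[c] = 0

    # Step 2: discard rows whose data yield is below the threshold.
    rows = [r for r in rows if r[YIELD] >= params["DATA_YIELD_RATIO_THRESHOLD"]]

    # Step 3: discard columns with too many missing values.
    cols = [c for c in cols
            if na_ratio(r[c] for r in rows) <= params["COLS_NAN_THRESHOLD"]]

    # Step 4: discard columns whose non-missing values are all identical.
    if params["COLS_VAR_THRESHOLD"]:
        cols = [c for c in cols
                if len({r[c] for r in rows if not is_na(r[c])}) > 1]

    # Step 6: discard rows with too many missing values in the remaining columns.
    rows = [r for r in rows
            if na_ratio(r[c] for c in cols) <= params["ROWS_NAN_THRESHOLD"]]

    return [{YIELD: r[YIELD], **{c: r[c] for c in cols}} for r in rows]


# Made-up parameter values and data, for illustration only.
params = {
    "IMPUTE_SELECTED_EVENT_FEATURES": {
        "COMPUTE": True,
        "MIN_DATA_YIELDED_MINUTES_TO_IMPUTE": 0.33,
    },
    "DATA_YIELD_RATIO_THRESHOLD": 0.3,
    "COLS_NAN_THRESHOLD": 0.5,
    "COLS_VAR_THRESHOLD": True,
    "ROWS_NAN_THRESHOLD": 0.3,
}
rows = [
    {YIELD: 0.9, "calls_count": None, "light_mean": 10.0, "constant": 1},
    {YIELD: 0.8, "calls_count": 3, "light_mean": 12.0, "constant": 1},
    {YIELD: 0.1, "calls_count": None, "light_mean": None, "constant": 1},
    {YIELD: 0.5, "calls_count": 2, "light_mean": None, "constant": 1},
]
cleaned = clean_features(
    rows, ["calls_count", "light_mean", "constant"], ["calls_count"], params
)
for row in cleaned:
    print(row)
```

In this toy data the first row's missing call count is imputed with 0 because its data yield is high, the third row is dropped for low data yield, the `constant` column is dropped for zero variance, and the fourth row is dropped because half of its remaining features are missing.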
--- a/docs/workflow-examples/minimal.md
+++ b/docs/workflow-examples/minimal.md
--- a/docs/change-log.md
+++ b/docs/change-log.md
@ -1,4 +1,42 @@
 # Change Log
+## v1.8.0
+- Add data stream for AWARE Micro server
+- Fix the NA bug in PHONE_LOCATIONS BARNETT provider
+- Fix the bug of data type for call_duration field
+- Fix the index bug of heatmap_sensors_per_minute_per_time_segment
+## v1.7.1
+- Update docs for Git Flow section
+- Update RAPIDS paper information
+## v1.7.0
+- Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider
+- Update tests for Fitbit steps intraday features
+- Add tests for phone battery features
+- Add a data cleaning module to replace NAs with 0 in selected event-based features, discard unreliable rows and columns, discard columns with zero variance, and discard highly correlated columns
+## v1.6.0
+- Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events
+- Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes
+- Temporarily revert PHONE_LOCATIONS BARNETT provider to use R script
+- Update the default IGNORE_EPISODES_LONGER_THAN to be 6 hours for screen RAPIDS provider
+- Fix the bug of step intraday features when INCLUDE_ZERO_STEP_ROWS is False
+## v1.5.0
+- Update Barnett location features with faster Python implementation
+- Fix rounding bug in data yield features
+- Add tests for data yield, Fitbit and accelerometer features
+- Small fixes of documentation
+## v1.4.1
+- Update home page
+- Add PHONE_MESSAGES tests
+## v1.4.0
+- Add new Application Foreground episode features and tests
+- Update VSCode setup instructions for our Docker container
+- Add tests for phone calls features
+- Add tests for WiFi features and fix a bug that incorrectly counted the most scanned device within the current time segment instances instead of globally
+- Add tests for phone conversation features
+- Add tests for Bluetooth features and choose the most scanned device alphabetically when ties exist
+- Add tests for Activity Recognition features and fix iOS unknown activity parsing
+- Fix Fitbit bug that parsed date-times with the current time zone in rare cases
+- Update the visualizations to be more precise and robust with different time segments.
+- Fix regression crash of the example analysis workflow
 ## v1.3.0
 - Refactor PHONE_LOCATIONS DORYAB provider. Fix bugs and faster execution up to 30x
 - New PHONE_KEYBOARD features
@ -124,4 +162,4 @@
 - Update [virtual environment](../developers/virtual-environments) guide
 - Update analysis workflow [example](../workflow-examples/analysis)
 - Add a [Code of Conduct](../code_of_conduct)
- Update [Team](../team) page
+- Update [Team](../team) page
--- a/docs/citation.md
+++ b/docs/citation.md
@ -5,14 +5,10 @@

 ## RAPIDS

-If you used RAPIDS, please cite [this paper](https://preprints.jmir.org/preprint/23246).
+If you used RAPIDS, please cite [this paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823).

 !!! cite "RAPIDS et al. citation"
-    Vega J, Li M, Aguillera K, Goel N, Joshi E, Durica KC, Kunta AR, Low CA
-    RAPIDS: Reproducible Analysis Pipeline for Data Streams Collected with Mobile Devices
-    JMIR Preprints. 18/08/2020:23246
-    DOI: 10.2196/preprints.23246
-    URL: https://preprints.jmir.org/preprint/23246
+    Vega, J., Li, M., Aguillera, K., Goel, N., Joshi, E., Khandekar, K., ... & Low, C. A. (2021). Reproducible Analysis Pipeline for Data Streams (RAPIDS): Open-Source Software to Process Data Collected with Mobile Devices. Frontiers in Digital Health, 168.

 ## DBDP (all Empatica sensors)

--- a/docs/datastreams/aware-micro-mysql.md
+++ b/docs/datastreams/aware-micro-mysql.md
@ -0,0 +1,15 @@
+# `aware_micro_mysql`
+
+This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework's](https://awareframework.com/) [AWARE Micro](https://github.com/denzilferreira/aware-micro) server and stored in a MySQL database.
+
+## Container
+A MySQL database with a table per sensor, each containing the data for all participants. Within each table, sensor data is stored in a JSON field called `data`.
+
+The script to connect and download data from this container is at:
+```bash
+src/data/streams/aware_micro_mysql/container.R
+```
+
+## Format
+
+--8<---- "docs/snippets/aware_format.md"
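Note on the `data` field described in the file above: the sketch below shows, in Python, one way a row with a JSON `data` field could be flattened into ordinary columns. It is only an illustration. The actual download script referenced above is written in R, and the column names and values used here (`device_id`, `timestamp`, `battery_level`, `battery_status`) are assumptions rather than a confirmed schema.

```python
import json


def flatten_row(row):
    """Merge the keys of the JSON `data` field into the row's other columns."""
    flat = {key: value for key, value in row.items() if key != "data"}
    flat.update(json.loads(row["data"]))
    return flat


# Hypothetical row, as it might come back from a per-sensor table.
row = {
    "device_id": "abc-123",
    "timestamp": 1600000000000,
    "data": '{"battery_level": 87, "battery_status": 2}',
}
flat = flatten_row(row)
print(flat)
```

Keys inside `data` overwrite same-named outer columns in this sketch; a real pipeline would need to decide which source takes precedence.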
--- a/docs/datastreams/data-streams-introduction.md
+++ b/docs/datastreams/data-streams-introduction.md
@ -16,6 +16,7 @@ For reference, these are the data streams we currently support:
 | Data Stream | Device | Format | Container | Docs
 |--|--|--|--|--|
 | `aware_mysql`| Phone | AWARE app | MySQL | [link](../aware-mysql)
+| `aware_micro_mysql`| Phone | AWARE Micro server | MySQL | [link](../aware-micro-mysql)
 | `aware_csv`| Phone | AWARE app | CSV files | [link](../aware-csv)
 | `aware_influxdb` (beta)| Phone | AWARE app | InfluxDB | [link](../aware-influxdb)
 | `fitbitjson_mysql`| Fitbit | JSON (per [Fitbit's API](https://dev.fitbit.com/build/reference/web-api/)) | MySQL | [link](../fitbitjson-mysql)
--- a/docs/developers/git-flow.md
+++ b/docs/developers/git-flow.md
@ -127,9 +127,9 @@ git branch -d release/v[NEW_RELEASE]
 ```
 git checkout master
 git merge --ff-only develop
-git push
+git push # Unlock the master branch before merging
 ```
-1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_RELEASE]` (remember to add the change log)
+1. Release happens automatically after passing the tests

 ## Release a Hotfix
 1. Pull the latest master
@ -156,6 +156,6 @@ git branch -d hotfix/v[NEW_HOTFIX]
 ```
 git checkout master
 git merge --ff-only v[NEW_HOTFIX]
-git push
+git push # Unlock the master branch before merging
 ```
-1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_HOTFIX]` (remember to add the change log)
+1. Release happens automatically after passing the tests
--- a/docs/developers/test-cases.md
+++ b/docs/developers/test-cases.md
@ -7,158 +7,260 @@ The following is a list of the sensors that testing is currently available.

 | Sensor                        | Provider | Periodic | Frequency | Event |
 |-------------------------------|----------|----------|-----------|-------|
-| Phone Accelerometer           | Panda    | N        | N         | N     |
-| Phone Accelerometer           | RAPIDS   | N        | N         | N     |
-| Phone Activity Recognition    | RAPIDS   | N        | N         | N     |
-| Phone Applications Foreground | RAPIDS   | N        | N         | N     |
-| Phone Battery                 | RAPIDS   | Y        | Y         | N     |
-| Phone Bluetooth               | Doryab   | N        | N         | N     |
+| Phone Accelerometer           | Panda    | Y        | Y         | Y     |
+| Phone Accelerometer           | RAPIDS   | Y        | Y         | Y     |
+| Phone Activity Recognition    | RAPIDS   | Y        | Y         | Y     |
+| Phone Applications Foreground | RAPIDS   | Y        | Y         | Y     |
+| Phone Battery                 | RAPIDS   | Y        | Y         | Y     |
+| Phone Bluetooth               | Doryab   | Y        | Y         | Y     |
 | Phone Bluetooth               | RAPIDS   | Y        | Y         | Y     |
-| Phone Calls                   | RAPIDS   | Y        | Y         | N     |
-| Phone Conversation            | RAPIDS   | Y        | Y         | N     |
-| Phone Data Yield              | RAPIDS   | N        | N         | N     |
-| Phone Light                   | RAPIDS   | Y        | Y         | N     |
-| Phone Locations               | Doryab   | N        | N         | N     |
+| Phone Calls                   | RAPIDS   | Y        | Y         | Y     |
+| Phone Conversation            | RAPIDS   | Y        | Y         | Y     |
+| Phone Data Yield              | RAPIDS   | Y        | Y         | Y     |
+| Phone Light                   | RAPIDS   | Y        | Y         | Y     |
+| Phone Locations               | Doryab   | Y        | Y         | Y     |
 | Phone Locations               | Barnett  | N        | N         | N     |
-| Phone Messages                | RAPIDS   | Y        | Y         | N     |
-| Phone Screen                  | RAPIDS   | Y        | N         | N     |
-| Phone WiFi Connected          | RAPIDS   | Y        | Y         | N     |
-| Phone WiFi Visible            | RAPIDS   | Y        | Y         | N     |
+| Phone Messages                | RAPIDS   | Y        | Y         | Y     |
+| Phone Screen                  | RAPIDS   | Y        | Y         | Y     |
+| Phone WiFi Connected          | RAPIDS   | Y        | Y         | Y     |
+| Phone WiFi Visible            | RAPIDS   | Y        | Y         | Y     |
 | Fitbit Calories Intraday      | RAPIDS   | Y        | Y         | Y     |
-| Fitbit Data Yield             | RAPIDS   | N        | N         | N     |
-| Fitbit Heart Rate Summary     | RAPIDS   | N        | N         | N     |
-| Fitbit Heart Rate Intraday    | RAPIDS   | N        | N         | N     |
-| Fitbit Sleep Summary          | RAPIDS   | N        | N         | N     |
+| Fitbit Data Yield             | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Heart Rate Summary     | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Heart Rate Intraday    | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Sleep Summary          | RAPIDS   | Y        | Y         | Y     |
 | Fitbit Sleep Intraday         | RAPIDS   | Y        | Y         | Y     |
 | Fitbit Sleep Intraday         | PRICE    | Y        | Y         | Y     |
-| Fitbit Steps Summary          | RAPIDS   | N        | N         | N     |
-| Fitbit Steps Intraday         | RAPIDS   | N        | N         | N     |
+| Fitbit Steps Summary          | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Steps Intraday         | RAPIDS   | Y        | Y         | Y     |


+## Accelerometer
+
+Description
+
+- The raw accelerometer data file, `phone_accelerometer_raw.csv`, contains data for 4 separate days
+- One episode for each daily segment (night, morning, afternoon and evening)
+- Two episodes located in the same 30-min segment (`Fri 00:15:00` and `Fri 00:21:21`)
+- Two episodes located in the same daily segment (`Fri 00:15:00` and `Fri 18:12:00`)
+- One episode before the time switch (`Sun 00:02:00`) and one episode after the time switch (`Sun 04:18:00`)
+- Multiple episodes within one minute, which causes variance in magnitude (`Fri 00:10:25`, `Fri 00:10:27` and `Fri 00:10:46`)
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android, ios|
+|morning|OK|OK|android, ios|
+|daily|OK|OK|android, ios|
+|threeday|OK|OK|android, ios|
+|weekend|OK|OK|android, ios|
+|beforeMarchEvent|OK|OK|android, ios|
+|beforeNovemberEvent|OK|OK|android, ios|
+
 ## Messages (SMS)

-   The raw message data file contains data for 2 separate days.
-   The data for the first day contains records 5 records for every
-    `epoch`.
-   The second day\'s data contains 6 records for each of only 2
-    `epoch` (currently `morning` and `evening`)
-   The raw message data contains records for both `message_types`
-    (i.e. `recieved` and `sent`) in both days in all epochs. The
-    number records with each `message_types` per epoch is randomly
-    distributed There is at least one records with each
-    `message_types` per epoch.
-   There is one raw message data file each, as described above, for
-    testing both iOS and Android data.
-   There is also an additional empty data file for both android and
-    iOS for testing empty data files
+Description
+
+- The raw message data file, `phone_messages_raw.csv`, contains data for 4 separate days
+- One episode for each daily segment (night, morning, afternoon and evening)
+- Two `sent` episodes located in the same 30-min segment (`Fri 16:08:03.000` and `Fri 16:19:35.000`)
+- Two `received` episodes located in the same 30-min segment (`Sat 06:45:05.000` and `Fri 06:45:05.000`)
+- Two episodes located in the same daily segment (`Fri 11:57:56.385` and `Sat 10:54:10.000`)
+- One episode before the time switch (`Sun 00:48:01.000`) and one episode after the time switch (`Sun 06:21:01.000`)
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## Calls

-Due to the difference in the format of the raw call data for iOS and Android the following is the expected results the `calls_with_datetime_unified.csv`. This would give a better idea of the use cases being tested since the `calls_with_datetime_unified.csv` would make both the iOS and Android data comparable.
+Because the raw data format differs between iOS and Android, the following describes the expected results in
+`phone_calls.csv`.

-   The call data would contain data for 2 days.
-   The data for the first day contains 6 records for every `epoch`.
-   The second day\'s data contains 6 records for each of only 2
-    `epoch` (currently `morning` and `evening`)
-   The call data contains records for all `call_types` (i.e.
-    `incoming`, `outgoing` and `missed`) in both days in all epochs.
-    The number records with each of the `call_types` per epoch is
-    randomly distributed. There is at least one records with each
-    `call_types` per epoch.
-   There is one call data file each, as described above, for testing
-    both iOS and Android data.
-   There is also an additional empty data file for both android and
-    iOS for testing empty data files
+Description
+
+- One missed episode, one outgoing episode and one incoming episode on Friday night, morning, afternoon and evening
+- There is at least one episode of each type of phone calls on each day
+- One incoming episode crossing two 30-min segments
+- One outgoing episode crossing two 30-min segments
+- One missed episode before, during and after the `event`
+- There is one incoming episode before, during or after the `event`
+- There is one outgoing episode before, during or after the `event`
+- There is one missed episode before, during or after the `event`
+
+Data format
+
+| Device | Missed | Outgoing | Incoming |
+|-|-|-|-|
+|android| 3 | 2 | 1 |
+|ios| 1,4 or 3,4 | 3,2,4 | 1,2,4 |
+
+Note
+When generating test data, all traces for an iOS device need to be unique; otherwise, the episode with the duplicate trace will be dropped.
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android, iOS|
+|morning|OK|OK|android, iOS|
+|daily|OK|OK|android, iOS|
+|threeday|OK|OK|android, iOS|
+|weekend|OK|OK|android, iOS|
+|beforeMarchEvent|OK|OK|android, iOS|
+|beforeNovemberEvent|OK|OK|android, iOS|

 ## Screen

-Due to the difference in the format of the raw screen data for iOS and Android the following is the expected results the `screen_deltas.csv`. This would give a better idea of the use cases being tested since the `screen_eltas.csv` would make both the iOS and Android data comparable These files are used to calculate the features for the screen sensor
+Because the raw screen data format differs between iOS and Android, the following describes the expected results in `phone_screen.csv`.

-   The screen delta data file contains data for 1 day.
-   The screen delta data contains 1 record to represent an `unlock`
+Description
+
+- The screen data file contains data for 4 days.
+- The screen data contains 1 record to represent an `unlock`
    episode that falls within an `epoch` for every `epoch`.
-   The screen delta data contains 1 record to represent an `unlock`
+- The screen data contains 1 record to represent an `unlock`
    episode that falls across the boundary of 2 epochs. Namely the
    `unlock` episode starts in one epoch and ends in the next, thus
    there is a record for `unlock` episodes that fall across `night`
    to `morning`, `morning` to `afternoon` and finally `afternoon` to
    `night`
-   The testing is done for `unlock` episode\_type.
-   There is one screen data file each for testing both iOS and
-    Android data formats.
-   There is also an additional empty data file for both android and
-    iOS for testing empty data files
+- One episode that crosses two `30-min` segments
+
+Data format
+
+| Device | unlock |
+|-|-|
+| Android | 3, 0|
+| iOS | 3, 2|
+
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android, iOS|
+|morning|OK|OK|android, iOS|
+|daily|OK|OK|android, iOS|
+|threeday|OK|OK|android, iOS|
+|weekend|OK|OK|android, iOS|
+|beforeMarchEvent|OK|OK|android, iOS|
+|beforeNovemberEvent|OK|OK|android, iOS|

 ## Battery

-Due to the difference in the format of the raw battery data for iOS and Android as well as versions of iOS the following is the expected results the `battery_deltas.csv`. This would give a better idea of the use cases being tested since the `battery_deltas.csv` would make both the iOS and Android data comparable. These files are used to calculate the features for the battery sensor.
+Description

-   The battery delta data file contains data for 1 day.
-   The battery delta data contains 1 record each for a `charging` and
-    `discharging` episode that falls within an `epoch` for every
-    `epoch`. Thus, for the `daily` epoch there would be multiple
-    `charging` and `discharging` episodes
-   Since either a `charging` episode or a `discharging` episode and
-    not both can occur across epochs, in order to test episodes that
-    occur across epochs alternating episodes of `charging` and
-    `discharging` episodes that fall across `night` to `morning`,
-    `morning` to `afternoon` and finally `afternoon` to `night` are
-    present in the battery delta data. This starts with a
-    `discharging` episode that begins in `night` and end in `morning`.
-   There is one battery data file each, for testing both iOS and
-    Android data formats.
-   There is also an additional empty data file for both android and
-    iOS for testing empty data files
+- The 4-day raw data is contained in `phone_battery_raw.csv`
+- One discharge episode crossing two 30-min time segments (`Fri 05:57:30.123` to `Fri 06:04:32.456`)
+- One charging episode crossing two 30-min time segments (`Fri 11:55:58.416` to `Fri 12:08:07.876`)
+- One discharge episode and one charging episode located within the same 30-min time segment (`Fri 21:30:00` to `Fri 22:00:00`)
+- One episode before the time switch (`Sun 00:24:00.000`) and one episode after the time switch (`Sun 21:58:00`)
+- Two episodes located in the same daily segment
+  
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## Bluetooth

-   The raw Bluetooth data file contains data for 1 day.
-   The raw Bluetooth data contains at least 2 records for each
-    `epoch`. Each `epoch` has a record with a `timestamp` for the
-    beginning boundary for that `epoch` and a record with a
-    `timestamp` for the ending boundary for that `epoch`. (e.g. For
-    the `morning` epoch there is a record with a `timestamp` for
-    `6:00AM` and another record with a `timestamp` for `11:59:59AM`.
-    These are to test edge cases)
-   An option of 5 Bluetooth devices are randomly distributed
-    throughout the data records.
-   There is one raw Bluetooth data file each, for testing both iOS
-    and Android data formats.
-   There is also an additional empty data file for both android and
-    iOS for testing empty data files.
+Description 
+
+- The 4-day raw data is contained in `phone_bluetooth_raw.csv`
+- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+- Two episodes located in the same 30-min segment (`Fri 23:38:45.789` and `Fri 23:59:59.465`)
+- Two episodes located in the same daily segment (`Fri 00:00:00.798` and `Fri 00:49:04.132`)
+- One episode before the time switch (`Sun 00:24:00.000`) and one episode after the time switch (`Sun 17:32:00.000`)
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## WIFI

-   There are 2 data files (`wifi_raw.csv` and `sensor_wifi_raw.csv`)
-    for each fake participant for each phone platform. 
-   The raw WIFI data files contain data for 1 day.
-   The `sensor_wifi_raw.csv` data contains at least 2 records for
-    each `epoch`. Each `epoch` has a record with a `timestamp` for the
-    beginning boundary for that `epoch` and a record with a
-    `timestamp` for the ending boundary for that `epoch`. (e.g. For
-    the `morning` epoch there is a record with a `timestamp` for
-    `6:00AM` and another record with a `timestamp` for `11:59:59AM`.
-    These are to test edge cases)
-   The `wifi_raw.csv` data contains 3 records with random timestamps
-    for each `epoch` to represent visible broadcasting WIFI network.
-    This file is empty for the iOS phone testing data.
-   An option of 10 access point devices is randomly distributed
-    throughout the data records. 5 each for `sensor_wifi_raw.csv` and
-    `wifi_raw.csv`.
-   There data files for testing both iOS and Android data formats.
-   There are also additional empty data files for both android and
-    iOS for testing empty data files.
+There are two WiFi features (`phone wifi connected` and `phone wifi visible`). The raw test data are stored separately in `phone_wifi_connected_raw.csv` and `phone_wifi_visible_raw.csv`.
+
+Description 
+
+- One episode for each `epoch` (`night`, `morning`, `afternoon` and `evening`)
+- Two episodes in the same time segment (`daily` and `30-min`)
+- Two episodes around the transition of `epochs` (e.g. one at the end of `night` and one at the beginning of `morning`) 
+- One episode before and after the time switch on Sunday
+
+phone wifi connected
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android, iOS|
+|morning|OK|OK|android, iOS|
+|daily|OK|OK|android, iOS|
+|threeday|OK|OK|android, iOS|
+|weekend|OK|OK|android, iOS|
+|beforeMarchEvent|OK|OK|android, iOS|
+|beforeNovemberEvent|OK|OK|android, iOS|
+
+phone wifi visible
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## Light

-   The raw light data file contains data for 1 day.
-   The raw light data contains 3 or 4 rows of data for each `epoch`
-    except `night`. The single row of data for `night` is for testing
-    features for single values inputs. (Example testing the standard
-    deviation of one input value)
-   Since light is only available for Android there is only one file
-    that contains data for Android. All other files (i.e. for iPhone)
-    are empty data files.
+Description
+
+- The 4-day raw light data is contained in `phone_light_raw.csv`
+- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+- Two episodes located in the same 30-min segment (`Fri 00:07:27.000` and `Fri 00:12:00.000`)
+- Two episodes located in the same daily segment (`Fri 01:00:00` and `Fri 03:59:59.654`)
+- One episode before the time switch (`Sun 00:08:00.000`) and one episode after the time switch (`Sun 05:36:00.000`)
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## Locations

@ -171,58 +273,81 @@ Description

 ## Application Foreground

-   The raw application foreground data file contains data for 1 day.
-   The raw application foreground data contains 7 - 9 rows of data
-    for each `epoch`. The records for each `epoch` contains apps that
-    are randomly selected from a list of apps that are from the
-    `MULTIPLE_CATEGORIES` and `SINGLE_CATEGORIES` (See
-    [testing\_config.yaml]()). There are also records in each epoch
-    that have apps randomly selected from a list of apps that are from
-    the `EXCLUDED_CATEGORIES` and `EXCLUDED_APPS`. This is to test
-    that these apps are actually being excluded from the calculations
-    of features. There are also records to test `SINGLE_APPS`
-    calculations.
-   Since application foreground is only available for Android there
-    is only one file that contains data for Android. All other files
-    (i.e. for iPhone) are empty data files.
+- The 4-day raw application data is contained in `phone_applications_foreground_raw.csv`
+- One episode for each daily segment (night, morning, afternoon and evening)
+- Two episodes located in the same 30-min segment (`Fri 10:12:56.385` and `Fri 10:18:48.895`)
+- Two episodes located in the same daily segment (`Fri 11:57:56.385` and `Fri 12:02:56.385`)
+- One episode before the time switch (`Sun 00:07:48.001`) and one episode after the time switch (`Sun 05:10:30.001`)
+- Two custom category (`Dating`) episodes, one at `Fri 06:05:10.385` and another at `Fri 11:53:00.385`
+
+Checklist:
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## Activity Recognition

-   The raw Activity Recognition data file contains data for 1 day.
-   The raw Activity Recognition data each `epoch` period contains
-    rows that records 2 - 5 different `activity_types`. The is such
-    that durations of activities can be tested. Additionally, there
-    are records that mimic the duration of an activity over the time
-    boundary of neighboring epochs. (For example, there a set of
-    records that mimic the participant `in_vehicle` from `afternoon`
-    into `evening`)
-   There is one file each with raw Activity Recognition data for
-    testing both iOS and Android data formats.
-    (plugin\_google\_activity\_recognition\_raw.csv for android and
-    plugin\_ios\_activity\_recognition\_raw.csv for iOS)
-   There is also an additional empty data file for both android and
-    iOS for testing empty data files.
+Description
+
+- The 4-day raw activity data is contained in `plugin_google_activity_recognition_raw.csv` and `plugin_ios_activity_recognition_raw.csv`.
+- Two episodes located in the same 30-min segment (`Fri 04:01:54` and `Fri 04:13:52`)
+- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+- Two episodes located in the same daily segment (`Fri 05:03:09` and `Fri 05:50:36`)
+- Two episodes with a time difference below the `5 mins` threshold (`Fri 07:14:21` and `Fri 07:18:50`)
+- One episode before the time switch (`Sun 00:46:00`) and one episode after the time switch (`Sun 03:42:00`)
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android, iOS|
+|morning|OK|OK|android, iOS|
+|daily|OK|OK|android, iOS|
+|threeday|OK|OK|android, iOS|
+|weekend|OK|OK|android, iOS|
+|beforeMarchEvent|OK|OK|android, iOS|
+|beforeNovemberEvent|OK|OK|android, iOS|

 ## Conversation

-   The raw conversation data file contains data for 2 day.
-   The raw conversation data contains records with a sample of both
-    `datatypes` (i.e. `voice/noise` = `0`, and `conversation` = `2` )
-    as well as rows with for samples of each of the `inference` values
-    (i.e. `silence` = `0`, `noise` = `1`, `voice` = `2`, and `unknown`
-    = `3`) for each `epoch`. The different `datatype` and `inference`
-    records are randomly distributed throughout the `epoch`.
-   Additionally there are 2 - 5 records for conversations (`datatype`
-    = 2, and `inference` = -1) in each `epoch` and for each `epoch`
-    except night, there is a conversation record that has a
-    `double_convo_start` `timestamp` that is from the previous
-    `epoch`. This is to test the calculations of features across
-    `epochs`.
-   There is a raw conversation data file for both android and iOS
-    platforms (`plugin_studentlife_audio_android_raw.csv` and
-    `plugin_studentlife_audio_raw.csv` respectively).
-   Finally, there are also additional empty data files for both
-    android and iOS for testing empty data files
+The 4-day raw conversation data is contained in `phone_conversation_raw.csv`. The different `inference` records are 
+randomly distributed throughout the `epoch`. 
+
+Description
+
+- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`) on each day
+- Two episodes near the transition of the daily segment, one starts at the end of the afternoon, `Fri 17:10:00` and another one starts at the beginning of the evening, `Fri 18:01:00`
+- One episode across two segments, `daily` and `30-mins`, (from `Fri 05:55:00` to `Fri 06:00:41`)
+- Two episodes located in the same daily segment (`Sat 12:45:36` and `Sat 16:48:22`)
+- One episode before the time switch, `Sun 00:15:06`, and one episode after the time switch, `Sun 06:01:00`
+
+Data format
+
+| inference | type |
+| - | - |
+| 0 | silence |
+| 1 | noise | 
+| 2 | voice |
+| 3 | unknown | 
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android|
+|morning|OK|OK|android|
+|daily|OK|OK|android|
+|threeday|OK|OK|android|
+|weekend|OK|OK|android|
+|beforeMarchEvent|OK|OK|android|
+|beforeNovemberEvent|OK|OK|android|

 ## Keyboard

@ -243,6 +368,40 @@ Description

 - One three-minute episode with a 1-minute row on Sun 08:59:54.65 and 09:00:00,another on Sun 12:01:02 that are considering a single episode in multi-timezone event segments to showcase how
 inferring time zone data for Keyboard from phone data can produce inaccurate results around the tz change. This happens because the device was on LA time until 11:59 and switched to NY time at 12pm, in terms of actual time 09 am LA and 12 pm NY represent the same moment in time so 09:00 LA and 12:01 NY are consecutive minutes.
+## Application Episodes
+
+-   The feature requires the raw application foreground data file and the raw phone screen data file.
+-   The raw data files contain data for 4 days.
+-   The raw data contains records with differences in `timestamp` ranging from milliseconds to minutes.
+-   An app episode starts when an app is launched and ends when another app is launched, marking the episode end of the first one,
+or when the screen locks. Thus, we are taking into account the screen unlock episodes.
+-   There are multiple app usages within each screen unlock episode to verify the creation of different app episodes in each 
+screen unlock session. The screen unlock episodes starting from Fri 05:56:51, Fri 10:00:24, Sat 17:48:01, Sun 22:02:00, and Mon 21:05:00 contain multiple apps, both system and non-system, to check this.
+-   The 22-minute chunk starting from Fri 10:03:56 checks app episodes for system apps only.
+-   The screen unlock episodes starting from Mon 21:05:00 and Sat 17:48:01 check whether the screen lock marks the end of the episode for an app that was launched between a few milliseconds and 8 mins before the screen lock.
+-   Finally, since application foreground is only available for Android devices, this feature is also for Android devices only. All other files are empty data files.
+
+
+## Data Yield
+
+Description
+
+- Two sensors were picked for testing, `phone_screen` and `phone_light`. `phone_screen` is event based and `phone_light` is sampled at a regular frequency
+- A 31-min episode (from `Fri 01:00:00` to `Fri 01:30:00`) in `phone_light` data, which is considered a `validyieldedhours`
+
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|android, ios|
+|morning|OK|OK|android, ios|
+|daily|OK|OK|android, ios|
+|threeday|OK|OK|android, ios|
+|weekend|OK|OK|android, ios|
+|beforeMarchEvent|OK|OK|android, ios|
+|beforeNovemberEvent|OK|OK|android, ios|
+

 ## Fitbit Calories Intraday

@ -263,6 +422,31 @@ Description
 - A four-minute sedentary episode on Sun 10:01 that will be ignored for Novembers's multi-timezone event segments since the test segment ends at 10am on that weekend.
 - A three-minute very active episode on Sat 16:03. This episode and the one at 16:00 are counted as one for lowmet episodes

+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|fitbit|
+|morning|OK|OK|fitbit|
+|daily|OK|OK|fitbit|
+|threeday|OK|OK|fitbit|
+|weekend|OK|OK|fitbit|
+|beforeMarchEvent|OK|OK|fitbit|
+|beforeNovemberEvent|OK|OK|fitbit|
+
+
+## Fitbit Heartrate Intraday
+
+Description
+
+- The 4-day raw heartrate data is contained in `fitbit_heartrate_intraday_raw.csv`
+- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+- Two episodes located in the same 30-min segment (`Fri 00:49:00` and `Fri 00:52:00`)
+- Two different types of heartrate zone episodes located in the same 30-min segment (`Fri 05:49:00 outofrange` and `Fri 05:57:00 fatburn`)
+- Two episodes located in the same daily segment (`Fri 12:02:00` and `Fri 19:38:00`)
+- One episode before the time switch, `Sun 00:08:00`, and one episode after the time switch, `Sun 07:28:00`
+
+
 Checklist

 |time segment| single tz | multi tz|platform|
@ -322,3 +506,82 @@ Checklist
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
+
+
+## Fitbit Heartrate Summary
+
+Description
+
+- The 4-day raw heartrate summary data is contained in `fitbit_heartrate_summary_raw.csv`.
+- Because heartrate summary data is periodic, it only generates results for periodic time segments; there are no results for frequency and event segments.
+
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|fitbit|
+|morning|OK|OK|fitbit|
+|daily|OK|OK|fitbit|
+|threeday|OK|OK|fitbit|
+|weekend|OK|OK|fitbit|
+|beforeMarchEvent|OK|OK|fitbit|
+|beforeNovemberEvent|OK|OK|fitbit|
+
+## Fitbit Step Intraday
+
+Description
+
+- The 4-day raw steps intraday data is contained in `fitbit_steps_intraday_raw.csv`
+- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`) on each day
+- Two episodes within the same 30-min segment (`Fri 05:58:00` and `Fri 05:59:00`)
+- A one-min episode at `2020-03-07 09:00:00` that will be converted to New York time `2020-03-07 12:00:00`
+- One episode before the time switch, `Sun 00:19:00`, and one episode after the time switch, `Sun 09:01:00`
+- Episodes crossing two 30-min segments (`Fri 11:59:00` and `Fri 12:00:00`)
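
The time zone conversion in the one-min episode above can be checked with a short sketch. It assumes the source time zone is Los Angeles (as in the Keyboard section's LA/NY example); the test data description itself only names New York as the target, so treat the source zone as an assumption.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

# Assumption: the raw timestamp is expressed in Los Angeles local time.
la = datetime(2020, 3, 7, 9, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

# Convert the same instant to New York local time.
ny = la.astimezone(ZoneInfo("America/New_York"))

print(ny.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-03-07 12:00:00
```

On that date both zones are still on standard time (UTC-8 and UTC-5), so the offset between them is three hours.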
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|fitbit|
+|morning|OK|OK|fitbit|
+|daily|OK|OK|fitbit|
+|threeday|OK|OK|fitbit|
+|weekend|OK|OK|fitbit|
+|beforeMarchEvent|OK|OK|fitbit|
+|beforeNovemberEvent|OK|OK|fitbit|
+
+
+## Fitbit Step Summary
+
+Description
+
+- The 4-day raw steps summary data is contained in `fitbit_steps_summary_raw.csv`.
+- Because steps summary data is periodic, it only generates results for periodic time segments; there are no results for frequency and event segments.
+
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|fitbit|
+|morning|OK|OK|fitbit|
+|daily|OK|OK|fitbit|
+|threeday|OK|OK|fitbit|
+|weekend|OK|OK|fitbit|
+|beforeMarchEvent|OK|OK|fitbit|
+|beforeNovemberEvent|OK|OK|fitbit|
+
+## Fitbit Data Yield
+
+Checklist
+
+|time segment| single tz | multi tz|platform|
+|-|-|-|-|
+|30min|OK|OK|fitbit|
+|morning|OK|OK|fitbit|
+|daily|OK|OK|fitbit|
+|threeday|OK|OK|fitbit|
+|weekend|OK|OK|fitbit|
+|beforeMarchEvent|OK|OK|fitbit|
+|beforeNovemberEvent|OK|OK|fitbit|
--- a/docs/features/fitbit-steps-intraday.md
+++ b/docs/features/fitbit-steps-intraday.md
@ -29,6 +29,7 @@ Parameters description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`                | Set to `True` to extract `FITBIT_STEPS_INTRADAY` features from the `RAPIDS` provider|
 |`[FEATURES]`               |         Features to be computed from steps intraday data, see table below           |
+|`[REFERENCE_HOUR]`         | The reference point from which `firststeptime` and `laststeptime` are computed. The default is midnight. |
 |`[THRESHOLD_ACTIVE_BOUT]`  | Every minute with Fitbit steps data wil be labelled as `sedentary` if its step count is below this threshold, otherwise, `active`.    |
 |`[INCLUDE_ZERO_STEP_ROWS]` | Whether or not to include time segments with a 0 step count during the whole day.                          |

@ -42,6 +43,8 @@ Features description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
 |minsteps                   |steps          |The minimum step count during a time segment.
 |avgsteps                   |steps          |The average step count during a time segment.
 |stdsteps                   |steps          |The standard deviation of step count during a time segment.
+|firststeptime              |minutes        |Minutes until the first non-zero step count.
+|laststeptime               |minutes        |Minutes until the last non-zero step count.
 |countepisodesedentarybout  |bouts          |Number of sedentary bouts during a time segment.
 |sumdurationsedentarybout   |minutes        |Total duration of all sedentary bouts during a time segment.
 |maxdurationsedentarybout   |minutes        |The maximum duration of any sedentary bout during a time segment.
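
One possible reading of `firststeptime` and `laststeptime` is sketched below. This is not the RAPIDS implementation: the function names, the `(datetime, steps)` row layout, and the wrap-around behaviour for timestamps earlier than `[REFERENCE_HOUR]` are all assumptions made for illustration.

```python
from datetime import datetime

def minutes_after_reference(ts, reference_hour=0):
    """Minutes elapsed between the reference hour and ts, wrapping around midnight (assumed)."""
    minutes_of_day = ts.hour * 60 + ts.minute
    return (minutes_of_day - reference_hour * 60) % (24 * 60)

def first_last_step_time(rows, reference_hour=0):
    """rows: hypothetical (datetime, steps) per-minute records within one time segment."""
    # Only minutes with a non-zero step count are considered
    offsets = [minutes_after_reference(ts, reference_hour) for ts, steps in rows if steps > 0]
    if not offsets:
        return None, None
    return min(offsets), max(offsets)

rows = [
    (datetime(2020, 3, 6, 5, 58), 0),
    (datetime(2020, 3, 6, 5, 59), 12),
    (datetime(2020, 3, 6, 11, 59), 30),
    (datetime(2020, 3, 6, 12, 0), 0),
]
print(first_last_step_time(rows))                    # (359, 719)
print(first_last_step_time(rows, reference_hour=4))  # (119, 479)
```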
--- a/docs/features/phone-activity-recognition.md
+++ b/docs/features/phone-activity-recognition.md
@ -44,7 +44,7 @@ Features description for `[PHONE_ACTIVITY_RECOGNITION][PROVIDERS][RAPIDS]`:
 |count                   |rows             | Number of episodes.
 |mostcommonactivity      |activity type   | The most common activity type (e.g. `still`, `on_foot`, etc.). If there is a tie, the first one is chosen.
 |countuniqueactivities   |activity type   | Number of unique activities.
-|durationstationary      |minutes          | The total duration of `[ACTIVITY_CLASSES][STATIONARY]` episodes
+|durationstationary      |minutes          | The total duration of `[ACTIVITY_CLASSES][STATIONARY]` episodes of still and tilting activities
 |durationmobile          |minutes          | The total duration of `[ACTIVITY_CLASSES][MOBILE]` episodes of on foot, running, and on bicycle activities
 |durationvehicle         |minutes          | The total duration of `[ACTIVITY_CLASSES][VEHICLE]` episodes of on vehicle activity

--- a/docs/features/phone-applications-foreground.md
+++ b/docs/features/phone-applications-foreground.md
@ -33,25 +33,36 @@ Parameters description for `[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]`:
 |Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`| Set to `True` to extract `PHONE_APPLICATIONS_FOREGROUND` features from the `RAPIDS` provider|
+|`[INCLUDE_EPISODE_FEATURES]`| Set to `True` to extract features from application usage episodes using Screen data |
 |`[FEATURES]` |         Features to be computed, see table below
-|`[SINGLE_CATEGORIES]`     | An array of app categories to be *included* in the feature extraction computation. The special keyword `all` represents a category with all the apps from each participant. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
-|`[MULTIPLE_CATEGORIES]`   | An array of collections representing meta-categories (a group of categories). They key of each element is the name of the `meta-category` and the value is an array of member app categories. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
+|`[SINGLE_CATEGORIES]`     | An array of app categories to be *included* in the feature extraction computation. The special keyword `all` represents a category with all the apps from each participant. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
+|`[CUSTOM_CATEGORIES]`   | An array of collections representing your own app categories. The key of each element is the name of the custom category, and the value is an array of the package names (apps) included in that category.
+|`[MULTIPLE_CATEGORIES]`   | An array of collections representing meta-categories (a group of categories). The key of each element is the name of the `meta-category` and the value is an array of member app categories. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
 |`[SINGLE_APPS]`           | An array of apps to be *included* in the feature extraction computation. Use their package name (e.g. `com.google.android.youtube`) or the reserved keyword `top1global` (the most used app by a participant over the whole monitoring study)
-|`[EXCLUDED_CATEGORIES]`   | An array of app categories to be *excluded* from the feature extraction computation. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
+|`[EXCLUDED_CATEGORIES]`   | An array of app categories to be *excluded* from the feature extraction computation. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
 |`[EXCLUDED_APPS]`         | An array of apps to be excluded from the feature extraction computation. Use their package name, for example: `com.google.android.youtube`

 Features description for `[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]`:

 |Feature                    |Units      |Description|
 |-------------------------- |---------- |---------------------------|
-|count              |apps      | Number of times a single app or apps within a category were used (i.e. they were brought to the foreground either by tapping their icon or switching to it from another app)
+|countevent              |apps      | Number of times a single app or apps within a category were used (i.e. they were brought to the foreground either by tapping their icon or switching to it from another app)
 |timeoffirstuse     |minutes   | The time in minutes between 12:00am (midnight) and the first use of a single app or apps within a category during a `time_segment`
 |timeoflastuse      |minutes   | The time in minutes between 12:00am (midnight) and the last use of a single app or apps within a category during a `time_segment`
 |frequencyentropy   |nats      | The entropy of the used apps within a category during a `time_segment` (each app is seen as a unique event, the more apps were used, the higher the entropy). This is especially relevant when computed over all apps. Entropy cannot be obtained for a single app
+|countepisode              |apps      | Number of times a usage episode of a single app or apps within a category was logged. In contrast to `countevent`, if an app was used across more than one time segment (for example, across more than one 30-minute segment), the `countepisode` will be one on each time segment instance.
+|minduration        |minutes   | For a `time_segment`, the minimum duration an application was used in minutes
+|maxduration        |minutes   | For a `time_segment`, the maximum duration an application was used in minutes
+|meanduration       |minutes   | For a `time_segment`, the mean duration of all the applications used in minutes
+|sumduration        |minutes   | For a `time_segment`, the sum duration of all the applications used in minutes

 !!! note "Assumptions/Observations"
-    Features can be computed by app, by apps grouped under a single category (genre) and by multiple categories grouped together (meta-categories). For example, we can get features for `Facebook` (single app), for `Social Network` apps (a category including Facebook and other social media apps) or for `Social` (a meta-category formed by `Social Network` and `Social Media Tools` categories).
+    1. Features can be computed by app, by apps grouped under a single category (genre), by your own categories, or by multiple categories grouped together (meta-categories). For example, we can get features for `Facebook` (single app), for `Social Network` apps (a category including Facebook and other social media apps), for `Traditional Social Media` (a custom category that includes Twitter and Facebook), or for `Social` (a meta-category formed by `Social Network` and `Social Media Tools` categories).

-    Apps installed by default like YouTube are considered systems apps on some phones. We do an exact match to exclude apps where "genre" == `EXCLUDED_CATEGORIES` or "package_name" == `EXCLUDED_APPS`.
+    2. Apps installed by default like YouTube are considered system apps on some phones. We do an exact match to exclude apps where "genre" == `EXCLUDED_CATEGORIES` or "package_name" == `EXCLUDED_APPS`.

-    We provide three ways of classifying and app within a category (genre): a) by automatically scraping its official category from the Google Play Store, b) by using the catalogue created by Stachl et al. which we provide in RAPIDS (`data/external/stachl_application_genre_catalogue.csv`), or c) by manually creating a personalized catalogue. You can choose a, b or c by modifying `[APPLICATION_GENRES]` keys and values (see the Sensor parameters description table above).
+    3. We provide four ways of classifying an app within a category (genre): a) by automatically scraping its official category from the Google Play Store, b) by using the catalog created by Stachl et al., which we provide in RAPIDS (`data/external/stachl_application_genre_catalogue.csv`), c) by manually creating a personalized catalog, or d) by defining a custom category in `config.yaml`. You can choose a, b, or c by modifying `[APPLICATION_GENRES]` keys and values (see the first table of this page).
+
+    4. We count `episodes` and `events` separately. Events are single app logs (when an app was opened), but episodes span from the time an app was opened until a new app is in the foreground or the screen is locked. Episodes will be chunked across any overlapping time segments. The `top1global` of `episodes` might not be the same as the `top1global` of `events`.
+
+    5. The application episodes are calculated using the application foreground and screen unlock episode data. An application episode starts when the application is launched and ends when a new application is launched or the screen is locked.
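
The episode definition in the last note can be illustrated with a minimal sketch. It is not RAPIDS' actual code: the function name `build_app_episodes`, the tuple layouts, the package names, and the sample timestamps are hypothetical.

```python
# Hypothetical inputs (not the RAPIDS data format):
#   launches:        (timestamp_ms, package_name), sorted by time
#   unlock_episodes: (unlock_start_ms, lock_ms), sorted and non-overlapping
def build_app_episodes(launches, unlock_episodes):
    episodes = []
    for unlock_start, lock_time in unlock_episodes:
        # Keep only launches that happened while the screen was unlocked
        in_session = [(t, app) for t, app in launches if unlock_start <= t < lock_time]
        for i, (start, app) in enumerate(in_session):
            # An episode ends at the next launch, or at the screen lock for the last app
            end = in_session[i + 1][0] if i + 1 < len(in_session) else lock_time
            episodes.append((app, start, end))
    return episodes

launches = [
    (1_000, "com.example.mail"),
    (4_000, "com.example.chat"),
    (20_000, "com.example.maps"),
]
unlock_episodes = [(0, 10_000), (15_000, 30_000)]

episodes = build_app_episodes(launches, unlock_episodes)
for app, start, end in episodes:
    print(app, start, end)
```

In this sketch the chat episode is closed by the screen lock at 10 000 ms rather than by the next launch, which happens in a later unlock session.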
--- a/docs/features/phone-bluetooth.md
+++ b/docs/features/phone-bluetooth.md
@ -86,6 +86,7 @@ Features description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:
 !!! note "Assumptions/Observations"
    - Devices are classified as belonging to the participant (`own`) or to other people (`others`) using k-means based on the number of times and the number of days each device was detected across each participant's dataset. See [Doryab et al](../../citation#doryab-bluetooth) for more details.
    - If ownership cannot be computed because all devices were detected on only one day, they are all considered as `other`. Thus `all` and `other` features will be equal. The likelihood of this scenario decreases the more days of data you have.
+    - When searching for the most frequent device across 30-minute segments, the search range is equivalent to the sum of all segments of the same time period. For instance, the `countscansmostfrequentdeviceacrosssegments` for the time segment (`Fri 00:00:00, Fri 00:29:59`) will get the count in that segment of the most frequent device found within all (`00:00:00, 00:29:59`) time segments. To find `countscansmostfrequentdeviceacrosssegments` for `other` devices, the search range needs to filter out all `own` devices. This filtering is not needed for `countscansmostfrequentdeviceacrossdataset`: the most frequent device across the dataset stays the same for `countscansmostfrequentdeviceacrossdatasetall`, `countscansmostfrequentdeviceacrossdatasetown` and `countscansmostfrequentdeviceacrossdatasetother`. The same rule applies to the least frequent device across the dataset.
    - The most and least frequent devices will be the same across time segment instances and across the entire dataset when every time segment instance covers every hour of a dataset. For example, daily segments (00:00 to 23:59) fall in this category but morning segments (06:00am to 11:59am) or periodic 30-minute segments don't.

    ??? info "Example"
--- a/docs/features/phone-calls.md
+++ b/docs/features/phone-calls.md
@ -26,6 +26,7 @@ Parameters description for `[PHONE_CALLS][PROVIDERS][RAPIDS]`:
 | Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;        | Description |
 |-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 |`[COMPUTE]`| Set to `True` to extract `PHONE_CALLS` features from the `RAPIDS` provider|
+|`[FEATURES_TYPE]`| Set to `EPISODES` to extract features based on call episodes or `EVENTS` to extract features based on events.|
 | `[CALL_TYPES]`   | The particular call_type that will be analyzed. The options for this parameter are incoming, outgoing or missed.                                                                                                                                                 |
 | `[FEATURES]`    | Features to be computed for `outgoing`, `incoming`, and `missed` calls. Note that the same features are available for both incoming and outgoing calls, while missed calls has its own set of features. See the tables below. |

@ -60,4 +61,4 @@ Features description for `[PHONE_CALLS][PROVIDERS][RAPIDS]` missed calls:
 !!! note "Assumptions/Observations"
    1. Traces for iOS calls are unique even for the same contact calling a participant more than once which renders `countmostfrequentcontact` meaningless and `distinctcontacts` equal to the total number of traces. 
    2. `[CALL_TYPES]` and `[FEATURES]` keys in `config.yaml` need to match. For example, `[CALL_TYPES]` `outgoing` matches the `[FEATURES]` key `outgoing`
-    3. iOS calls data is transformed to match Android calls data format. See our [algorithm](algorithms/phone-algorithms.md#phone-calls)
+    3. iOS calls data is transformed to match Android calls data format.
--- a/docs/features/phone-keyboard.md
+++ b/docs/features/phone-keyboard.md
@ -6,6 +6,12 @@ Sensor parameters description for `[PHONE_KEYBOARD]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the keyboard data is stored

+## RAPIDS provider
+
+!!! info "Available time segments and platforms"
+    - Available for all time segments
+    - Available for Android only
+
 !!! info "File Sequence"
    ```bash
    - data/raw/{pid}/phone_keyboard_raw.csv
--- a/docs/features/phone-locations.md
+++ b/docs/features/phone-locations.md
@ -6,8 +6,9 @@ Sensor parameters description for `[PHONE_LOCATIONS]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the location data is stored
 |`[LOCATIONS_TO_USE]`| Type of location data to use, one of `ALL`, `GPS`, `ALL_RESAMPLED` or `FUSED_RESAMPLED`. This filter is based on the `provider` column of the locations table, `ALL` includes every row, `GPS` only includes rows where the provider is gps, `ALL_RESAMPLED` includes all rows after being resampled, and `FUSED_RESAMPLED` only includes rows where the provider is fused after being resampled.
-|`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]`| if `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled, a location row is resampled to the next valid timestamp (see the Assumptions/Observations below) only if the time difference between them is less or equal than this threshold (in minutes).
-|`[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`| if `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled, a location row is resampled at most for this long (in minutes)
+|`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]`| If `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled. A location row is resampled to the next valid timestamp (see the Assumptions/Observations below) only if the time difference between them is less than or equal to this threshold (in minutes).
+|`[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`| If `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled. A location row is resampled at most for this long (in minutes).
+|`[ACCURACY_LIMIT]` | An integer in meters. Any location row with an accuracy value greater than or equal to this limit is dropped. This number means there's a 68% probability the actual location is within this radius.
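
As a rough illustration of the `[ACCURACY_LIMIT]` filter, the sketch below drops every row whose accuracy value is at or above the limit. The function name, the dictionary row layout, and the `accuracy` key are assumptions for this example, not the RAPIDS code.

```python
def drop_inaccurate_rows(rows, accuracy_limit):
    """Keep only rows whose accuracy radius (in meters) is strictly below the limit."""
    return [row for row in rows if row["accuracy"] < accuracy_limit]

# Hypothetical location rows; a larger accuracy value means a less precise fix
rows = [
    {"timestamp": 1, "accuracy": 12.0},
    {"timestamp": 2, "accuracy": 100.0},
    {"timestamp": 3, "accuracy": 250.0},
]
kept = drop_inaccurate_rows(rows, accuracy_limit=100)
print([row["timestamp"] for row in kept])  # [1]
```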

 !!! note "Assumptions/Observations"
    **Types of location data to use**
@ -16,9 +17,9 @@ Sensor parameters description for `[PHONE_LOCATIONS]`:
    - If you want to use only the GPS provider, set `[LOCATIONS_TO_USE]` to `GPS`
    - If you want to use all providers, set `[LOCATIONS_TO_USE]` to `ALL`
    - If you collected location data from different providers, including the fused API, use `ALL_RESAMPLED`
-    - If your mobile client was configured to use fused location only or want to focus only on this provider, set `[LOCATIONS_TO_USE]` to `RESAMPLE_FUSED`.
+    - If your mobile client was configured to use fused location only or you want to focus only on this provider, set `[LOCATIONS_TO_USE]` to `FUSED_RESAMPLED`.
    
-    `ALL_RESAMPLED` and `RESAMPLE_FUSED` take the original location coordinates and replicate each pair forward in time as long as the phone was sensing data as indicated by the joined timestamps of [`[PHONE_DATA_YIELD][SENSORS]`](../phone-data-yield/). This is done because Google's API only logs a new location coordinate pair when it is sufficiently different in time or space from the previous one and because GPS and network providers can log data at variable rates.
+    `ALL_RESAMPLED` and `FUSED_RESAMPLED` take the original location coordinates and replicate each pair forward in time as long as the phone was sensing data as indicated by the joined timestamps of [`[PHONE_DATA_YIELD][SENSORS]`](../phone-data-yield/). This is done because Google's API only logs a new location coordinate pair when it is sufficiently different in time or space from the previous one and because GPS and network providers can log data at variable rates.

    There are two parameters associated with resampling fused location.
    
@ -41,6 +42,7 @@ These features are based on the original open-source implementation by [Barnett
    - data/raw/{pid}/phone_locations_raw.csv
    - data/interim/{pid}/phone_locations_processed.csv
    - data/interim/{pid}/phone_locations_processed_with_datetime.csv
+    - data/interim/{pid}/phone_locations_barnett_daily.csv
    - data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
    - data/processed/features/{pid}/phone_locations.csv
    ```
@ -52,7 +54,6 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][BARNETT]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `BARNETT` provider|
 |`[FEATURES]` |         Features to be computed, see table below
-|`[ACCURACY_LIMIT]` |   An integer in meters, any location rows with an accuracy higher than this is dropped. This number means there's a 68% probability the actual location is within this radius
 |`[IF_MULTIPLE_TIMEZONES]` |    Currently, `USE_MOST_COMMON` is the only value supported. If the location data for a participant belongs to multiple time zones, we select the most common because Barnett's algorithm can only handle one time zone 
 |`[MINUTES_DATA_USED]` |    Set to `True` to include an extra column in the final location feature file containing the number of minutes used to compute the features on each time segment. Use this for quality control purposes; the more data minutes exist for a period, the more reliable its features should be. For fused location, a single minute can contain more than one coordinate pair if the participant is moving fast enough.

@ -111,7 +112,9 @@ These features are based on the original implementation by [Doryab et al.](../..
    - data/raw/{pid}/phone_locations_raw.csv
    - data/interim/{pid}/phone_locations_processed.csv
    - data/interim/{pid}/phone_locations_processed_with_datetime.csv
-    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv
+    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv
+    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled.csv
+    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv
    - data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
    - data/processed/features/{pid}/phone_locations.csv
    ```
@ -121,9 +124,8 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:

 |Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
-|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `BARNETT` provider|
+|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `DORYAB` provider|
 |`[FEATURES]` |         Features to be computed, see table below
-|`[ACCURACY_LIMIT]` |   An integer in meters, any location rows with an accuracy higher than this will be dropped. This number means there's a 68% probability the true location is within this radius
 | `[DBSCAN_EPS]`             | The maximum distance in meters between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
 | `[DBSCAN_MINSAMPLES]`      | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point of a cluster. This includes the point itself.
 | `[THRESHOLD_STATIC]`       | It is the threshold value in km/hr which labels a row as Static or Moving.
@ -143,8 +145,8 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
 |locationvariance                                            |$meters^2$    |The sum of the variances of the latitude and longitude columns. 
 |loglocationvariance                                           | -          | Log of the sum of the variances of the latitude and longitude columns.
 |totaldistance                                                |meters        |Total distance traveled in a time segment using the haversine formula.
-|avgspeed                                                 |km/hr         |Average speed in a time segment considering only the instances labeled as Moving.
-|varspeed                                                      |km/hr         |Speed variance in a time segment considering only the instances labeled as Moving. 
+|avgspeed                                                 |km/hr         |Average speed in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment.
+|varspeed                                                      |km/hr         |Speed variance in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment.
 |{--circadianmovement--}                                      |-             | Deprecated, see Observations below. \ "It encodes the extent to which a person's location patterns follow a 24-hour circadian cycle.\" [Doryab et al.](../../citation#doryab-locations).
 |numberofsignificantplaces                                    |places        |Number of significant locations visited. It is calculated using the DBSCAN/OPTICS clustering algorithm which takes in EPS and MIN_SAMPLES as parameters to identify clusters. Each cluster is a significant place.
 |numberlocationtransitions                                    |transitions   |Number of movements between any two clusters in a time segment.
@ -165,7 +167,7 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:

 !!! note "Assumptions/Observations"
    **Significant Locations Identified**
-    Significant locations are determined using DBSCAN clustering on locations that a patient visit over the course of the period of data collection.
+    Significant locations are determined using `DBSCAN` or `OPTICS` clustering on the locations that a participant visited during the data collection period. The most significant location is the place where the participant stayed for the longest time.

    **Circadian Movement Calculation**
    Note Feb 3 2021. It seems the implementation of this feature is not correct; we suggest not to use this feature until a fix is in place. For a detailed description of how this should be calculated, see [Saeb et al](https://pubmed.ncbi.nlm.nih.gov/28344895/).
@ -195,4 +197,5 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
            the candidate will be regarded as the home cluster; otherwise, the home cluster will be the last valid day's cluster.
            If there are no valid clusters before that day, the first home location in the days after is used.

-
+    **Clustering algorithms**
+    The [`DBSCAN`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) and [`OPTICS`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html#r2c55e37003fe-1) algorithms are currently available. Duplicated locations are discarded while clustering. The `DBSCAN` algorithm takes the time spent at each location into consideration. The `OPTICS` algorithm ignores it because this is not supported in the current [scikit-learn](https://github.com/scikit-learn/scikit-learn/issues/12394) implementation.
--- a/docs/features/phone-screen.md
+++ b/docs/features/phone-screen.md
@ -32,7 +32,7 @@ Parameters description for `[PHONE_SCREEN][PROVIDERS][RAPIDS]`:
 |`[FEATURES]` |         Features to be computed, see table below
 |`[REFERENCE_HOUR_FIRST_USE]` |  The reference point from which `firstuseafter` is to be computed, default is midnight
 |`[IGNORE_EPISODES_SHORTER_THAN]` |  Ignore episodes that are shorter than this threshold (minutes). Set to 0 to disable this filter.
-|`[IGNORE_EPISODES_LONGER_THAN]` |  Ignore episodes that are longer than this threshold (minutes). Set to 0 to disable this filter.
+|`[IGNORE_EPISODES_LONGER_THAN]` |  Ignore episodes that are longer than this threshold (minutes). The default is 6 hours (360 minutes). Set to 0 to disable this filter.
 |`[EPISODE_TYPES]` |  Currently we only support `unlock` episodes (from when the phone is unlocked until the screen is off)


--- a/docs/img/hm-data-yield-participants-absolute-time.html
+++ b/docs/img/hm-data-yield-participants-absolute-time.html
--- a/docs/img/hm-data-yield-participants-absolute-time.png
+++ b/docs/img/hm-data-yield-participants-absolute-time.png
--- a/docs/img/hm-data-yield-participants-relative-time.html
+++ b/docs/img/hm-data-yield-participants-relative-time.html
--- a/docs/img/hm-data-yield-participants-relative-time.png
+++ b/docs/img/hm-data-yield-participants-relative-time.png
--- a/docs/img/hm-feature-correlations.html
+++ b/docs/img/hm-feature-correlations.html
--- a/docs/img/hm-feature-correlations.png
+++ b/docs/img/hm-feature-correlations.png
--- a/docs/img/hm-sensor-rows.html
+++ b/docs/img/hm-sensor-rows.html
--- a/docs/img/hm-sensor-rows.png
+++ b/docs/img/hm-sensor-rows.png
--- a/docs/index.md
+++ b/docs/index.md
@ -1,12 +1,12 @@
 # Welcome to RAPIDS documentation

-Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](workflow-examples/analysis.md) your analysis into reproducible workflows.
+Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows. Check out our [paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823)!

-RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables in collaboration with the [DBDP](https://dbdp.org/). 
+RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables (the latter in collaboration with the [DBDP](https://dbdp.org/)). 

 !!! tip "Where do I start?"

-    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](workflow-examples/minimal)
+    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](analysis/minimal)

    :material-play-speed: [Install](setup/installation), [configure](setup/configuration), and [execute](setup/execution) RAPIDS to [extract](features/feature-introduction.md) and [plot](visualizations/data-quality-visualizations.md) behavioral features

@ -25,7 +25,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc

 1. **Consistent analysis**. Every participant sensor dataset is analyzed in the same way and isolated from each other.
 2. **Efficient analysis**. Every analysis step is executed only once. Whenever your data or configuration changes, only the affected files are updated.
-5. **Parallel execution**. Thanks to Snakemake, your analysis can be executed over multiple cores without changing your code.
+5. **Parallel execution**. Thanks to [Snakemake](https://snakemake.github.io/), your analysis can be executed over multiple cores without changing your code.
 6. **Code-free features**. Extract any of the behavioral features offered by RAPIDS without writing any code.
 7. **Extensible code**. You can easily add your own data streams or behavioral features in R or Python, share them with the community, and keep authorship and citations.
 8. **Time zone aware**. Your data is adjusted to one or more time zones per participant.
@ -37,7 +37,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
 ## Users and Contributors

 ??? quote "Community Contributors"
-    Many thanks to our community contributions and the [whole team](../team):
+    Many thanks to the [whole team](./team) and our community contributors:

    - Agam Kumar (CMU)
    - Yasaman S. Sefidgar (University of Washington)
@ -46,7 +46,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
    - Stephen Price (CMU)
    - Neil Singh (University of Virginia)

-    Many thanks to the researchers that made [their work](../citation) open source:
+    Many thanks to the researchers that made [their work](./citation) open source:

    - Panda et al. [paper](https://pubmed.ncbi.nlm.nih.gov/31657854/)
    - Stachl et al. [paper](https://www.pnas.org/content/117/30/17680)
@ -57,9 +57,10 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc

 ??? quote "Publications using RAPIDS"
    - Predicting Symptoms of Depression and Anxiety Using Smartphone and Wearable Data [link](https://www.frontiersin.org/articles/10.3389/fpsyt.2021.625247/full)
-    - Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyper-parameter Optimization, and Feature Importance Analysis: An Exploratory Study [link](https://preprints.jmir.org/preprint/26540)
+    - Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyperparameter Optimization, and Feature Importance Analysis: Exploratory Study [link](https://mhealth.jmir.org/2021/7/e26540)
    -  Digital Biomarkers of Symptom Burden Self-Reported by Perioperative Patients Undergoing Pancreatic Surgery: Prospective Longitudinal Study [link](https://cancer.jmir.org/2021/2/e27975/)
-    - An Automated Machine Learning Pipeline for Monitoring and Forecasting Mobile Health Data [link](https://edas.info/showManuscript.php?m=1570708269&random=750318666&type=final&ext=pdf&title=PDF+file)
+    - An Automated Machine Learning Pipeline for Monitoring and Forecasting Mobile Health Data [link](https://ieeexplore.ieee.org/abstract/document/9483755/)
+    - Mobile Footprinting: Linking Individual Distinctiveness in Mobility Patterns to Mood, Sleep, and Brain Functional Connectivity [link](https://www.biorxiv.org/content/10.1101/2021.05.17.444568v1.abstract)

 <div class="users">
 <div><img alt="carnegie mellon university" loading="lazy" src="./img/logos/cmu.png" /></div>
--- a/docs/setup/installation.md
+++ b/docs/setup/installation.md
@ -35,14 +35,21 @@ You can install RAPIDS using Docker (the fastest), or native instructions for Ma
        ```
    7.  *Optional*. You can edit RAPIDS files with `vim` but we recommend using `Visual Studio Code` and its `Remote Containers` extension

-        ??? info "How to configure Remote Containers extension"
+        ??? info "How to configure the Remote Containers extension"
+            
+            - Make sure the RAPIDS Docker container is running

-            - Make sure RAPIDS container is running
-            - Install the [Remote - Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
-            - Go to the `Remote Explorer` panel on the left hand sidebar
-            - On the top right dropdown menu choose `Containers`
-            - Double click on the `moshiresearch/rapids` container in the`CONTAINERS` tree
-            - A new VS Code session should open on RAPIDS main folder inside the container.
+            - Install VS Code and its [Remote - Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
+
+            - Click the `Remote Explorer` icon on the left-hand sidebar (the icon is a computer monitor)
+
+            - On the top right dropdown menu, choose `Containers`
+
+            - Right-click on the `moshiresearch/rapids` container in the `CONTAINERS` tree and select `Attach to Container`. A new VS Code window should open
+
+            - In the new window, open the `/rapids/` folder via the `File/Open...` menu
+
+            - Run RAPIDS inside a terminal in VS Code. Open one with the `Terminal/New Terminal` menu

    !!! warning
        If you installed RAPIDS using Docker for Windows on Windows 10, the container will have [limits](https://stackoverflow.com/questions/43460770/docker-windows-container-memory-limit) on the amount of RAM it can use. If you find that RAPIDS crashes due to running out of memory, [increase](https://stackoverflow.com/a/56583203/6030343) this limit.
--- a/docs/setup/overview.md
+++ b/docs/setup/overview.md
@ -23,10 +23,10 @@ Let's review some key concepts we use throughout these docs:
    - [Add your own behavioral features](../../features/add-new-features/) (we can include them in RAPIDS if you want to share them with the community)
    - [Add support for new data streams](../../datastreams/add-new-data-streams/) if yours cannot be processed by RAPIDS yet
    - Create visualizations for [data quality control](../../visualizations/data-quality-visualizations/)  and [feature inspection](../../visualizations/feature-visualizations/)
-    - [Extending RAPIDS to organize your analysis](../../workflow-examples/analysis/) and publish a code repository along with your code
+    - [Extending RAPIDS to organize your analysis](../../analysis/complete-workflow-example/) and publish a code repository along with your code

 !!! hint
-    - We recommend you follow the [Minimal Example](../../workflow-examples/minimal/) tutorial to get familiar with RAPIDS
+    - We recommend you follow the [Minimal Example](../../analysis/minimal/) tutorial to get familiar with RAPIDS

    - In order to follow any of the previous tutorials, you will have to [Install](../installation/), [Configure](../configuration/), and learn how to [Execute](../execution/) RAPIDS.

--- a/docs/snippets/parsedfitbit_format.md
+++ b/docs/snippets/parsedfitbit_format.md
@ -43,9 +43,9 @@ All columns are mandatory; however, all except `device_id` and `local_date_time`

            |device_id                              |local_date_time   |heartrate_daily_restinghr |heartrate_daily_caloriesoutofrange  |heartrate_daily_caloriesfatburn  |heartrate_daily_caloriescardio  |heartrate_daily_caloriespeak   |
            |-------------------------------------- |----------------- |------- |-------------- |------------- |------------ |-------|
-            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-07        |72      |1200.6102      |760.3020      |15.2048      |0      |
-            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-08        |70      |1100.1120      |660.0012      |23.7088      |0      |
-            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-09        |69      |750.3615       |734.1516      |131.8579     |0      |
+            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-07 00:00:00  |72      |1200.6102      |760.3020      |15.2048      |0      |
+            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-08 00:00:00  |70      |1100.1120      |660.0012      |23.7088      |0      |
+            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-09 00:00:00  |69      |750.3615       |734.1516      |131.8579     |0      |

 ??? info "FITBIT_HEARTRATE_INTRADAY"

--- a/docs/team.md
+++ b/docs/team.md
@ -9,16 +9,12 @@ If you are interested in contributing feel free to submit a pull request or cont
 ??? abstract "About"
    Julio Vega is a postdoctoral associate at the Mobile Sensing + Health Institute. He is interested in personalized methodologies to monitor chronic conditions that affect daily human behavior using mobile and wearable data.

-    - *vegaju* at *upmc* . *edu*
-    - [Personal Website](https://juliovega.info/)
-
 ### Meng Li

 ??? abstract "About"
    Meng Li received her Master of Science degree in Information Science from the University of Pittsburgh. She is interested in applying machine learning algorithms to the medical field.

    - *lim11* at *upmc* . *edu*
-    - [Linkedin Profile](https://www.linkedin.com/in/meng-li-57238414a)
    - [Github Profile](https://github.com/Meng6)

 ###  Abhineeth Reddy Kunta
@ -49,10 +45,22 @@ If you are interested in contributing feel free to submit a pull request or cont
 ### Nikunj Goel

 ??? abstract "About"
-    Nik is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science. He earned his Bachelor of Technology degree in Information Technology from India. He is a Data Enthusiasts and passionate about finding the meaning out of raw data. In a long term, his goal is to create a breakthrough in Data Science and Deep Learning.
+    Nik is a graduate student at the University of Pittsburgh pursuing a Master of Science in Information Science. He earned his Bachelor of Technology degree in Information Technology in India. He is a data enthusiast and is passionate about finding meaning in raw data. In the long term, his goal is to create a breakthrough in Data Science and Deep Learning.

    - [Linkedin Profile](https://www.linkedin.com/in/nikunjgoel95/)

+### Kirtiraj Khandekar
+
+??? abstract "About"
+    Raj is a graduate student at the University of Pittsburgh pursuing a Master of Science in Information Science.
+
+### Weiyu Huang
+
+??? abstract "About"
+    Weiyu is a graduate student at the University of Pittsburgh pursuing a Master of Science in Information Science.
+
+    - [Github Profile](https://github.com/ChinW97)
+
 ## Community Contributors

 ### Agam Kumar
@ -88,6 +96,19 @@ If you are interested in contributing feel free to submit a pull request or cont
 ??? abstract "About"
    University of Virginia

+###  Ian Barnett
+
+??? abstract "About"
+    University of Pennsylvania
+    - [Profile](https://www.dbei.med.upenn.edu/bio/ian-j-barnett-phd)
+
+
+###  Shirley Anugrah Hayati
+
+??? abstract "About"
+    University of Pennsylvania
+    - [Personal Website](https://www.shirley.id/)
+
 ## Advisors

 ### Afsaneh Doryab
@ -98,4 +119,4 @@ If you are interested in contributing feel free to submit a pull request or cont
 ### Carissa Low

 ??? abstract "About"
-    - [Profile](https://www.moshi.pitt.edu/people/carissa-low-phd)
+    - [Profile](https://www.moshi.pitt.edu/people/carissa-low-phd)
--- a/docs/visualizations/data-quality-visualizations.md
+++ b/docs/visualizations/data-quality-visualizations.md
@ -20,7 +20,7 @@ These plots can be used as a rough indication of the smartphone monitoring cover
 ## 2. Heatmaps of overall data yield
 These heatmaps are a break down per time segment and per participant of [Visualization 1](#1-histograms-of-phone-data-yield). Heatmap's rows represent participants, columns represent time segment instances and the cells’ color represent the valid yielded minute or hour ratio for a participant during a time segment instance.

-As different participants might join a study on different dates and time segments can be of any length and start on any day, the x-axis can be labelled with the absolute time of the start of each time segment instance or the time delta between the start of each time segment instance minus the start of the first instance. These plots provide a quick study overview of the monitoring coverage per person and per time segment. 
+As different participants might join a study on different dates, and time segments can be of any length and start on any day, the x-axis can be labelled with the absolute time of each time segment instance or with the time delta between each time segment instance and the start of the first instance for each participant. These plots provide a quick study overview of the monitoring coverage per person and per time segment.

 The figure below shows the heatmap of the valid yielded minute ratio for participants example01 and example02 on daily segments and, as we inferred from the previous histogram, the lighter (yellow) color on most time segment instances (cells) indicate both phones sensed data without interruptions for most days (except for the first and last ones).

@ -63,7 +63,7 @@ The figure below shows this heatmap for phone sensors collected by participant e
 ## 4. Heatmap of sensor row count
 These heatmaps are a per-sensor breakdown of [Visualization 1](#1-histograms-of-phone-data-yield) and [Visualization 2](#2-heatmaps-of-overall-data-yield). Note that the second row (ratio of valid yielded minutes) of this heatmap matches the respective participant (bottom) row the screenshot in Visualization 2.

-In these heatmaps rows represent phone or Fitbit sensors, columns represent time segment instances and cell’s color shows the normalized (0 to 1) row count of each sensor within a time segment instance. RAPIDS creates one heatmap per participant and they can be used to judge missing data on a per participant and per sensor basis.
+In these heatmaps, rows represent phone or Fitbit sensors, columns represent time segment instances, and each cell’s color shows the normalized (0 to 1) row count of each sensor within a time segment instance. A grey cell represents missing data in that time segment instance. RAPIDS creates one heatmap per participant, and they can be used to judge missing data on a per-participant and per-sensor basis.

 The figure below shows data for 14 phone sensors (including data yield) of example01’s daily segments. From the top two rows, we can see that the phone was sensing data for most of the monitoring period (as suggested by Figure 3 and Figure 4). We can also infer how phone usage influenced the different sensor streams; there are peaks of screen events during the first day (Apr 23rd), peaks of location coordinates on Apr 26th and Apr 30th, and no sent or received SMS except for Apr 23rd, Apr 29th and Apr 30th (unlabeled row between screen and locations).

--- a/donotmakechanges.py
+++ b/donotmakechanges.py
@ -0,0 +1,39 @@
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
--- a/environment.yml
+++ b/environment.yml
@ -1,109 +1,30 @@
-name: rapids202012
+name: rapids
 channels:
  - conda-forge
-  - defaults
 dependencies:
-  - _py-xgboost-mutex=2.0
-  - appdirs=1.4.4
-  - arrow=0.16.0
-  - asn1crypto=1.4.0
-  - astropy=4.2
-  - attrs=20.3.0
-  - binaryornot=0.4.4
-  - blas=1.0
-  - brotlipy=0.7.0
-  - bzip2=1.0.8
-  - ca-certificates=2020.12.8
-  - certifi=2020.12.5
-  - cffi=1.14.4
-  - chardet=3.0.4
-  - click=7.1.2
-  - cookiecutter=1.6.0
-  - cryptography=3.3.1
-  - datrie=0.8.2
-  - docutils=0.16
-  - future=0.18.2
-  - gitdb=4.0.5
-  - gitdb2=4.0.2
-  - gitpython=3.1.11
-  - idna=2.10
-  - imbalanced-learn=0.6.2
-  - importlib-metadata=2.0.0
-  - importlib_metadata=2.0.0
-  - intel-openmp=2019.4
-  - jinja2=2.11.2
-  - jinja2-time=0.2.0
-  - joblib=1.0.0
-  - jsonschema=3.2.0
-  - libblas=3.8.0
-  - libcblas=3.8.0
-  - libcxx=10.0.0
-  - libedit=3.1.20191231
-  - libffi=3.3
-  - libgfortran
-  - liblapack=3.8.0
-  - libxgboost=0.90
-  - lightgbm=3.1.1
-  - llvm-openmp=10.0.0
-  - markupsafe=1.1.1
-  - mkl
-  - mkl-service=2.3.0
-  - mkl_fft=1.2.0
-  - mkl_random=1.1.1
-  - more-itertools=8.6.0
-  - ncurses=6.2
-  - numpy=1.19.2
-  - numpy-base=1.19.2
-  - openssl=1.1.1i
-  - pandas=1.1.5
-  - pbr=5.5.1
-  - pip=20.3.3
-  - plotly=4.14.1
-  - poyo=0.5.0
-  - psutil=5.7.2
-  - py-xgboost=0.90
-  - pycparser=2.20
-  - pyerfa=1.7.1.1
-  - pyopenssl=20.0.1
-  - pysocks=1.7.1
-  - python=3.7.9
-  - python-dateutil=2.8.1
-  - python_abi=3.7
-  - pytz=2020.4
-  - pyyaml=5.3.1
-  - readline=8.0
-  - requests=2.25.0
-  - retrying=1.3.3
-  - scikit-learn=0.23.2
-  - scipy=1.5.2
-  - setuptools=51.0.0
-  - six=1.15.0
-  - smmap=3.0.4
-  - smmap2=3.0.1
-  - sqlite=3.33.0
-  - threadpoolctl=2.1.0
-  - tk=8.6.10
-  - urllib3=1.25.11
-  - wheel=0.36.2
-  - whichcraft=0.6.1
-  - wrapt=1.12.1
-  - xgboost=0.90
-  - xz=5.2.5
-  - yaml=0.2.5
-  - zipp=3.4.0
-  - zlib=1.2.11
-  - pip:
-    - amply==0.1.4
-    - configargparse==0.15.1
-    - decorator==4.4.2
-    - ipython-genutils==0.2.0
-    - jupyter-core==4.6.3
-    - nbformat==5.0.7
-    - pulp==2.4
-    - pyparsing==2.4.7
-    - pyrsistent==0.15.5
-    - ratelimiter==1.2.0.post0
-    - snakemake==5.30.2
-    - toposort==1.5
-    - traitlets==4.3.3
-prefix: /usr/local/Caskroom/miniconda/base/envs/rapids202012
+    - auto-sklearn
+    - hmmlearn
+    - imbalanced-learn
+    - jsonschema
+    - lightgbm
+    - matplotlib
+    - numpy
+    - pandas
+    - peakutils
+    - pip
+    - plotly
+    - python-dateutil
+    - pytz
+    - pywavelets
+    - pyyaml
+    - scikit-learn
+    - scipy
+    - seaborn
+    - setuptools
+    - bioconda::snakemake 
+    - bioconda::snakemake-minimal
+    - tqdm
+    - xgboost
+    - pip:
+        - biosppy
+        - cr_features>=0.2
--- a/example_profile/Snakefile
+++ b/example_profile/Snakefile
@ -3,7 +3,7 @@ include: "../rules/common.smk"
 include: "../rules/renv.smk"
 include: "../rules/preprocessing.smk"
 include: "../rules/features.smk"
-include: "../rules/models.smk"
+include: "../rules/models_example.smk"
 include: "../rules/reports.smk"

 import itertools
@ -204,15 +204,29 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys():
            else:
                raise ValueError("Error: Add PHONE_LOCATIONS (and as many PHONE_SENSORS as you have) to [PHONE_DATA_YIELD][SENSORS] in config.yaml. This is necessary to compute phone_yielded_timestamps (time when the smartphone was sensing data) which is used to resample fused location data (ALL_RESAMPLED and RESAMPLED_FUSED)")

+        if provider == "BARNETT":
+            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_barnett_daily.csv", pid=config["PIDS"]))
+        if provider == "DORYAB":
+            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
+
        files_to_compute.extend(expand("data/raw/{pid}/phone_locations_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime.csv", pid=config["PIDS"]))
-        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_home.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_LOCATIONS"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_locations.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")

+for provider in config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"].keys():
+    if config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/raw/{pid}/fitbit_calories_intraday_raw.csv", pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/raw/{pid}/fitbit_calories_intraday_with_datetime.csv", pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/fitbit_calories_intraday_features/fitbit_calories_intraday_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
+        files_to_compute.extend(expand("data/processed/features/{pid}/fitbit_calories_intraday.csv", pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
+        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
+
 for provider in config["FITBIT_DATA_YIELD"]["PROVIDERS"].keys():
    if config["FITBIT_DATA_YIELD"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_heartrate_intraday_raw.csv", pid=config["PIDS"]))
@ -271,6 +285,12 @@ for provider in config["FITBIT_STEPS_SUMMARY"]["PROVIDERS"].keys():

 for provider in config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"].keys():
    if config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]:
+        
+        if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["TIME_BASED"]["EXCLUDE"] or config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
+            if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
+                files_to_compute.extend(expand("data/raw/{pid}/fitbit_sleep_summary_raw.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/fitbit_steps_intraday_with_datetime_exclude_sleep.csv", pid=config["PIDS"]))
+
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_steps_intraday_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_steps_intraday_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/fitbit_steps_intraday_features/fitbit_steps_intraday_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
@ -357,11 +377,21 @@ if config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_sensor_row_count_per_time_segment.html")

 if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
+    if not config["PHONE_DATA_YIELD"]["PROVIDERS"]["RAPIDS"]["COMPUTE"]:
+        raise ValueError("Error: [PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] must be True in config.yaml to get heatmaps of overall data yield.")
    files_to_compute.append("reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html")

 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")

+# Data Cleaning
+for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() +".csv", pid=config["PIDS"]))
+for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +".csv"))
+
 # Analysis Workflow Example
 models, scalers = [], []
 for model_name in config["PARAMS_FOR_ANALYSIS"]["MODEL_NAMES"]:
@ -379,7 +409,6 @@ files_to_compute.extend(expand("data/raw/{pid}/participant_target_with_datetime.
 files_to_compute.extend(expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]))

 # Individual model
-files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned.csv", pid=config["PIDS"]))
 files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
 files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv", pid=config["PIDS"], cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
 files_to_compute.extend(expand(
@ -392,7 +421,6 @@ files_to_compute.extend(expand(
                            scaler=scalers))

 # Population model
-files_to_compute.append("data/processed/features/all_participants/all_sensor_features_cleaned.csv")
 files_to_compute.append("data/processed/models/population_model/input.csv")
 files_to_compute.extend(expand("data/processed/models/population_model/output_{cv_method}/baselines.csv", cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
 files_to_compute.extend(expand(
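
Note on the new "Data Cleaning" block in the Snakefile: a minimal sketch of how the per-participant cleaning targets are assembled, one file per participant for every provider whose COMPUTE flag is set. The `config` values and the `cleaning_targets` helper are hypothetical, and a plain loop stands in for Snakemake's `expand`.

```python
# Toy stand-in for the relevant part of config.yaml (all values are hypothetical).
config = {
    "PIDS": ["p01", "p02"],
    "ALL_CLEANING_INDIVIDUAL": {
        "PROVIDERS": {
            "RAPIDS": {"COMPUTE": True},
            "OTHER": {"COMPUTE": False},  # hypothetical second provider, disabled
        }
    },
}


def cleaning_targets(config):
    """List one cleaned-features file per participant and enabled provider."""
    targets = []
    for provider, settings in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].items():
        if settings["COMPUTE"]:
            suffix = provider.lower()
            # Plain-Python equivalent of expand(..., pid=config["PIDS"]).
            targets.extend(
                f"data/processed/features/{pid}/all_sensor_features_cleaned_{suffix}.csv"
                for pid in config["PIDS"]
            )
    return targets


for path in cleaning_targets(config):
    print(path)
```

Disabled providers contribute no targets, so turning cleaning off in the config removes these files from the requested outputs.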
--- a/example_profile/example_config.yaml
+++ b/example_profile/example_config.yaml
@ -84,6 +84,7 @@ PHONE_APPLICATIONS_CRASHES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    PACKAGE_NAMES_HASHED: False
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD

@ -93,11 +94,13 @@ PHONE_APPLICATIONS_FOREGROUND:
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
+    PACKAGE_NAMES_HASHED: False
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS:
    RAPIDS:
      COMPUTE: True
+      INCLUDE_EPISODE_FEATURES: False
      SINGLE_CATEGORIES: ["all", "email"]
      MULTIPLE_CATEGORIES:
        social: ["socialnetworks", "socialmediatools"]
@ -105,7 +108,11 @@ PHONE_APPLICATIONS_FOREGROUND:
      SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
      EXCLUDED_CATEGORIES: ["system_apps"]
      EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
-      FEATURES: ["count", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
+      FEATURES:
+        APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
+        APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
+      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
+      IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable
      SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py

 # See https://www.rapids.science/latest/features/phone-applications-notifications/
@ -115,6 +122,7 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    PACKAGE_NAMES_HASHED: False
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD

@ -198,7 +206,11 @@ PHONE_DATA_YIELD:
 # See https://www.rapids.science/latest/features/phone-keyboard/
 PHONE_KEYBOARD:
  CONTAINER: keyboard
-  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      FEATURES: ["sessioncount","averageinterkeydelay","averagesessionlength","changeintextlengthlessthanminusone","changeintextlengthequaltominusone","changeintextlengthequaltoone","changeintextlengthmorethanone","maxtextlength","lastmessagelength","totalkeyboardtouches"]
+      SRC_SCRIPT: src/features/phone_keyboard/rapids/main.py

 # See https://www.rapids.science/latest/features/phone-light/
 PHONE_LIGHT:
@ -215,12 +227,12 @@ PHONE_LOCATIONS:
  LOCATIONS_TO_USE: FUSED_RESAMPLED # ALL, GPS, ALL_RESAMPLED, OR FUSED_RESAMPLED
  FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD: 30 # minutes, only replicate location samples to the next sensed bin if the phone did not stop collecting data for more than this threshold
  FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION: 720 # minutes, only replicate location samples to consecutive sensed bins if they were logged within this threshold after a valid location row
-  
+  ACCURACY_LIMIT: 51 # meters
+
  PROVIDERS:
    DORYAB:
      COMPUTE: True
      FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome"]
-      ACCURACY_LIMIT: 51 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      DBSCAN_EPS: 10 # meters
      DBSCAN_MINSAMPLES: 5
      THRESHOLD_STATIC : 1 # km/h
@ -236,7 +248,6 @@ PHONE_LOCATIONS:
    BARNETT:
      COMPUTE: False
      FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
-      ACCURACY_LIMIT: 51 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
      MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates gruped by minute) were used to compute features
      SRC_SCRIPT: src/features/phone_locations/barnett/main.R
@ -513,7 +524,7 @@ HEATMAP_SENSORS_PER_MINUTE_PER_TIME_SEGMENT:

 # See https://www.rapids.science/latest/visualizations/data-quality-visualizations/#4-heatmap-of-sensor-row-count
 HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT:
-  PLOT: True
+  PLOT: False
  SENSORS:  [PHONE_ACTIVITY_RECOGNITION, PHONE_APPLICATIONS_FOREGROUND, PHONE_BATTERY, PHONE_BLUETOOTH, PHONE_CALLS, PHONE_CONVERSATION, PHONE_LIGHT, PHONE_LOCATIONS, PHONE_MESSAGES, PHONE_SCREEN, PHONE_WIFI_CONNECTED, PHONE_WIFI_VISIBLE]

 # Features ------
@ -526,6 +537,46 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
  CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}


+########################################################################################################################
+#                                                    Data Cleaning                                                     #
+########################################################################################################################
+
+ALL_CLEANING_INDIVIDUAL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: True
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: False
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
+
+ALL_CLEANING_OVERALL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: True
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: False
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
+

 ########################################################################################################################
 #                                              Analysis Workflow Example                                               #
@ -543,12 +594,6 @@ PARAMS_FOR_ANALYSIS:
  TARGET:
    FOLDER: data/external/example_workflow
    CONTAINER: participant_target.csv
-
-  # Cleaning Parameters
-  COLS_NAN_THRESHOLD: 0.3
-  COLS_VAR_THRESHOLD: True
-  ROWS_NAN_THRESHOLD: 0.3
-  DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.75
  
  MODEL_NAMES: [LogReg, kNN , SVM, DT, RF, GB, XGBoost, LightGBM]
  CV_METHODS: [LeaveOneOut]
--- a/mkdocs.yml
+++ b/mkdocs.yml
@ -74,7 +74,7 @@ extra_css:
 nav:
  - Home: 'index.md'
  - Overview: setup/overview.md
-  - Minimal Example: workflow-examples/minimal.md
+  - Minimal Example: analysis/minimal.md
  - Citation: citation.md
  - Contributing: contributing.md
  - Setup:
@ -85,6 +85,7 @@ nav:
      - Introduction: datastreams/data-streams-introduction.md
      - Phone:
        - aware_mysql: datastreams/aware-mysql.md
+        - aware_micro_mysql: datastreams/aware-micro-mysql.md
        - aware_csv: datastreams/aware-csv.md
        - aware_influxdb (beta): datastreams/aware-influxdb.md
        - Mandatory Phone Format: datastreams/mandatory-phone-format.md
@ -140,8 +141,9 @@ nav:
  - Visualizations:
    - Data Quality: visualizations/data-quality-visualizations.md
    - Features: visualizations/feature-visualizations.md
-  - Analysis Workflows:
-    - Complete Example: workflow-examples/analysis.md
+  - Analysis:
+    - Data Cleaning: analysis/data-cleaning.md
+    - Complete Workflow Example: analysis/complete-workflow-example.md
  - Developers:
    - Git Flow: developers/git-flow.md
    - Remote Support: developers/remote-support.md
--- a/problems/app_categories.txt
+++ b/problems/app_categories.txt
@ -0,0 +1,33 @@
+Warning: 1241 parsing failures.
+row           col   expected actual                                                                            file
+  1 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  2 is_system_app an integer  FALSE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  3 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  4 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  5 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+... ............. .......... ...... ...............................................................................
+See problems(...) for more details.
+
+Warning message:
+The following named parsers don't match the column names: application_name
+Error: Problem with `filter()` input `..1`.
+✖ object 'application_name' not found
+ℹ Input `..1` is `!is.na(application_name)`.
+Backtrace:
+     █
+  1. ├─`%>%`(...)
+  2. ├─dplyr::mutate(...)
+  3. ├─utils::head(., -1)
+  4. ├─dplyr::select(., -c("timestamp"))
+  5. ├─dplyr::filter(., !is.na(application_name))
+  6. ├─dplyr:::filter.data.frame(., !is.na(application_name))
+  7. │ └─dplyr:::filter_rows(.data, ...)
+  8. │   ├─base::withCallingHandlers(...)
+  9. │   └─mask$eval_all_filter(dots, env_filter)
+ 10. └─base::.handleSimpleError(...)
+ 11.   └─dplyr:::h(simpleError(msg, call))
+Execution halted
+[Mon Dec 13 17:19:06 2021]
+Error in rule app_episodes:
+    jobid: 54
+    output: data/interim/p011/phone_app_episodes.csv
--- a/problems/locations_barnett.txt
+++ b/problems/locations_barnett.txt
@ -0,0 +1,5 @@
+Warning message:
+In barnett_daily_features(snakemake) :
+  Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:
+Location data rows within a daily time segment: 0
+Location data time span in days: 398.6
--- a/renv.lock
+++ b/renv.lock
--- a/renv/activate.R
+++ b/renv/activate.R
@ -14,9 +14,6 @@ local({
  # signal that we're loading renv during R startup
  Sys.setenv("RENV_R_INITIALIZING" = "true")
  on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE)
-  
-  if(grepl("Darwin", Sys.info()["sysname"], fixed = TRUE) & grepl("ARM64", Sys.info()["version"], fixed = TRUE)) # M1 Macs
-    Sys.setenv("TZDIR" = file.path(R.home(), "share", "zoneinfo"))

  # signal that we've consented to use renv
  options(renv.consent = TRUE)
--- a/rules/common.smk
+++ b/rules/common.smk
@ -23,10 +23,16 @@ def get_barnett_daily(wildcards):

 def get_locations_python_input(wildcards):
    if wildcards.provider_key.upper() == "DORYAB":
-        return "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv"
+        return "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv"
    else:
        return "data/interim/{pid}/phone_locations_processed_with_datetime.csv"

+def get_calls_input(wildcards):
+    if (wildcards.provider_key.upper() == "RAPIDS") and (config["PHONE_CALLS"]["PROVIDERS"]["RAPIDS"]["FEATURES_TYPE"] == "EPISODES"):
+        return "data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv"
+    else:
+        return "data/raw/{pid}/phone_calls_with_datetime.csv"
+
 def find_features_files(wildcards):
    feature_files = []
    for provider_key, provider in config[(wildcards.sensor_key).upper()]["PROVIDERS"].items():
@ -34,6 +40,17 @@ def find_features_files(wildcards):
            feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower()))
    return(feature_files)

+def find_joint_non_empatica_sensor_files(wildcards):
+    joined_files = []
+    for config_key in config.keys():
+        if config_key.startswith(("PHONE", "FITBIT")) and "PROVIDERS" in config[config_key] and isinstance(config[config_key]["PROVIDERS"], dict):
+            for provider_key, provider in config[config_key]["PROVIDERS"].items():
+                if "COMPUTE" in provider.keys() and provider["COMPUTE"]:
+                    joined_files.append("data/processed/features/{pid}/" + config_key.lower() + ".csv")
+                    break
+    return joined_files
+
+
 def optional_steps_sleep_input(wildcards):
    if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
        return "data/raw/{pid}/fitbit_sleep_summary_raw.csv"
@ -108,7 +125,16 @@ def input_tzcodes_file(wilcards):
        if not config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"].lower().endswith(".csv"):
            raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, instead you typed: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
        if not Path(config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]).exists():
-            raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, the file in the path you typed does not exist: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
+            try:
+                config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
+            except KeyError:
+                raise ValueError("To create TZCODES_FILE, a list of timezones should be created " +
+                                 "with the rule preprocessing.smk/prepare_tzcodes_file " +
+                                 "which will create a file specified as config['TIMEZONE']['MULTIPLE']['TZ_FILE']." +
+                                 "\n An alternative is to provide the file manually: " +
+                                 "[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, " +
+                                 "but the file in the path you typed does not exist: " +
+                                 config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
        return [config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]]
    return []

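
Note on `find_joint_non_empatica_sensor_files` in rules/common.smk: the sketch below mirrors its selection logic on a toy config, assuming the intent is to list one joined features file per PHONE or FITBIT sensor that has at least one provider enabled. The config contents and the function name `joint_non_empatica_files` are hypothetical.

```python
# Hypothetical config; only the shape mirrors config.yaml.
config = {
    "PIDS": ["p01"],
    "PHONE_CALLS": {"PROVIDERS": {"RAPIDS": {"COMPUTE": True}}},
    "PHONE_KEYBOARD": {"PROVIDERS": {"RAPIDS": {"COMPUTE": False}}},
    "PHONE_APPLICATIONS_CRASHES": {"PROVIDERS": None},  # no provider implemented
    "FITBIT_STEPS_INTRADAY": {
        "PROVIDERS": {"RAPIDS": {"COMPUTE": False}, "OTHER": {"COMPUTE": True}}
    },
    "EMPATICA_TEMPERATURE": {"PROVIDERS": {"DBDP": {"COMPUTE": True}}},
}


def joint_non_empatica_files(config):
    """Return one joined-features path per PHONE/FITBIT sensor with an enabled provider."""
    joined = []
    for key, section in config.items():
        if not key.startswith(("PHONE", "FITBIT")):
            continue  # skips EMPATICA sensors and non-sensor keys such as PIDS
        providers = section.get("PROVIDERS") if isinstance(section, dict) else None
        if not isinstance(providers, dict):
            continue  # sensors without implemented providers
        if any(provider.get("COMPUTE") for provider in providers.values()):
            # "{pid}" is left as a literal placeholder for Snakemake to fill in.
            joined.append("data/processed/features/{pid}/" + key.lower() + ".csv")
    return joined


print(joint_non_empatica_files(config))
```

Each sensor appears at most once regardless of how many of its providers are enabled, which matches the `break` in the original loop.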
--- a/rules/features.smk
+++ b/rules/features.smk
@ -32,7 +32,7 @@ rule phone_data_yield_r_features:
    output:
        "data/interim/{pid}/phone_data_yield_features/phone_data_yield_r_{provider_key}.csv"
    script:
-        "../src/features/entry.R" 
+        "../src/features/entry.R"

 rule phone_accelerometer_python_features:
    input:
@ -125,6 +125,7 @@ rule phone_applications_crashes_r_features:
 rule phone_applications_foreground_python_features:
    input:
        sensor_data = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv",
+        episode_data = "data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -138,6 +139,7 @@ rule phone_applications_foreground_python_features:
 rule phone_applications_foreground_r_features:
    input:
        sensor_data = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv",
+        episode_data = "data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -262,9 +264,17 @@ rule phone_bluetooth_r_features:
    script:
        "../src/features/entry.R"

-rule calls_python_features:
+rule calls_episodes:
    input:
-        sensor_data = "data/raw/{pid}/phone_calls_with_datetime.csv",
+        calls = "data/raw/{pid}/phone_calls_raw.csv"
+    output:
+        "data/interim/{pid}/phone_calls_episodes.csv"
+    script:
+        "../src/features/phone_calls/episodes/calls_episodes.py"
+
+rule phone_calls_python_features:
+    input:
+        sensor_data = get_calls_input,
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_CALLS"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -275,9 +285,9 @@ rule calls_python_features:
    script:
        "../src/features/entry.py"

-rule calls_r_features:
+rule phone_calls_r_features:
    input:
-        sensor_data = "data/raw/{pid}/phone_calls_with_datetime.csv",
+        sensor_data = get_calls_input,
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_CALLS"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -314,6 +324,40 @@ rule conversation_r_features:
    script:
        "../src/features/entry.R"

+rule preprocess_esm:
+    input: "data/raw/{pid}/phone_esm_with_datetime.csv"
+    params:
+        scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"]["STRAW"]["SCALES"]
+    output: "data/interim/{pid}/phone_esm_clean.csv"
+    script:
+        "../src/features/phone_esm/straw/preprocess.py"
+
+rule esm_features:
+    input:
+        sensor_data = "data/interim/{pid}/phone_esm_clean.csv",
+        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
+    params:
+        provider = lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        sensor_key = "phone_esm",
+        scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()]["SCALES"]
+    output: "data/interim/{pid}/phone_esm_features/phone_esm_python_{provider_key}.csv"
+    script:
+        "../src/features/entry.py"
+
+rule phone_speech_python_features:
+    input:
+        sensor_data = "data/raw/{pid}/phone_speech_with_datetime.csv",
+        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
+    params:
+        provider = lambda wildcards: config["PHONE_SPEECH"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        sensor_key = "phone_speech"
+    output:
+        "data/interim/{pid}/phone_speech_features/phone_speech_python_{provider_key}.csv"
+    script:
+        "../src/features/entry.py"
+
 rule phone_keyboard_python_features:
    input:
        sensor_data = "data/raw/{pid}/phone_keyboard_with_datetime.csv",
@ -372,7 +416,7 @@ rule phone_locations_add_doryab_extra_columns:
    params:
        provider = config["PHONE_LOCATIONS"]["PROVIDERS"]["DORYAB"]
    output: 
-        "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv"
+        "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv"
    script:
        "../src/features/phone_locations/doryab/add_doryab_extra_columns.py"

@ -448,6 +492,15 @@ rule screen_episodes:
    script:
        "../src/features/phone_screen/episodes/screen_episodes.R"

+rule app_episodes:
+    input:
+        screen = "data/interim/{pid}/phone_screen_episodes.csv",
+        app = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv"
+    output:
+        "data/interim/{pid}/phone_app_episodes.csv"
+    script:
+        "../src/features/phone_applications_foreground/episodes/app_episodes.R"
+
 rule phone_screen_python_features:
    input:
        sensor_episodes = "data/interim/{pid}/phone_screen_episodes_resampled_with_datetime.csv",
@ -742,22 +795,6 @@ rule fitbit_sleep_intraday_r_features:
    script:
        "../src/features/entry.R"

-rule merge_sensor_features_for_individual_participants:
-    input:
-        feature_files = input_merge_sensor_features_for_individual_participants
-    output:
-        "data/processed/features/{pid}/all_sensor_features.csv"
-    script:
-        "../src/features/utils/merge_sensor_features_for_individual_participants.R"
-
-rule merge_sensor_features_for_all_participants:
-    input:
-        feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
-    output:
-        "data/processed/features/all_participants/all_sensor_features.csv"
-    script:
-        "../src/features/utils/merge_sensor_features_for_all_participants.R"
-
 rule empatica_accelerometer_python_features:
    input:
        sensor_data = "data/raw/{pid}/empatica_accelerometer_with_datetime.csv",
@ -767,7 +804,8 @@ rule empatica_accelerometer_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_accelerometer"
    output:
-        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -793,7 +831,8 @@ rule empatica_heartrate_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_heartrate"
    output:
-        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -819,7 +858,8 @@ rule empatica_temperature_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_temperature"
    output:
-        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -845,7 +885,8 @@ rule empatica_electrodermal_activity_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_electrodermal_activity"
    output:
-        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -871,7 +912,8 @@ rule empatica_blood_volume_pulse_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_blood_volume_pulse"
    output:
-        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -897,7 +939,8 @@ rule empatica_inter_beat_interval_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_inter_beat_interval"
    output:
-        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -939,3 +982,48 @@ rule empatica_tags_r_features:
        "data/interim/{pid}/empatica_tags_features/empatica_tags_r_{provider_key}.csv"
    script:
        "../src/features/entry.R"
+
+rule merge_sensor_features_for_individual_participants:
+    input:
+        feature_files = input_merge_sensor_features_for_individual_participants
+    output:
+        "data/processed/features/{pid}/all_sensor_features.csv"
+    script:
+        "../src/features/utils/merge_sensor_features_for_individual_participants.R"
+
+rule merge_sensor_features_for_all_participants:
+    input:
+        feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
+    output:
+        "data/processed/features/all_participants/all_sensor_features.csv"
+    script:
+        "../src/features/utils/merge_sensor_features_for_all_participants.R"
+
+rule clean_sensor_features_for_individual_participants:
+    input:
+        sensor_data = rules.merge_sensor_features_for_individual_participants.output
+    wildcard_constraints:
+        pid = "("+"|".join(config["PIDS"])+")"
+    params:
+        provider = lambda wildcards: config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        script_extension = "{script_extension}",
+        sensor_key = "all_cleaning_individual"
+    output:
+        "data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv"
+    script:
+        "../src/features/entry.{params.script_extension}"
+
+rule clean_sensor_features_for_all_participants:
+    input:
+        sensor_data = rules.merge_sensor_features_for_all_participants.output
+    params:
+        provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        script_extension = "{script_extension}",
+        sensor_key = "all_cleaning_overall",
+        target = "{target}"
+    output:
+        "data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}_({target}).csv"
+    script:
+        "../src/features/entry.{params.script_extension}"
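The two cleaning rules above encode provider_key and script_extension in the output file name, and the per-participant rule restricts {pid} to the configured IDs, presumably so that paths under all_participants/ are not claimed by it. As a rough sketch of how such a path resolves to wildcard values, the following uses a plain regular expression in which each unconstrained wildcard matches ".+". The PIDS list, the PROVIDERS table and the helper parse_cleaned_path are hypothetical and only stand in for config values and for Snakemake's own matching.

```python
import re

# Hypothetical participant IDs; in the workflow these come from config["PIDS"].
PIDS = ["p01", "p02"]

# Hypothetical stand-in for config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].
PROVIDERS = {"STRAW": {"COMPUTE": True}, "RAPIDS": {"COMPUTE": False}}


def parse_cleaned_path(path, pids=PIDS):
    """Split a cleaned-features path into pid, provider_key and script_extension."""
    # Same shape as the wildcard constraint in the rule, e.g. "(p01|p02)".
    # re.escape is an extra precaution that the rule itself does not take.
    pid_regex = "(" + "|".join(re.escape(p) for p in pids) + ")"
    pattern = (
        r"data/processed/features/(?P<pid>" + pid_regex + r")"
        r"/all_sensor_features_cleaned_(?P<provider_key>.+)_(?P<script_extension>.+)\.csv"
    )
    match = re.fullmatch(pattern, path)
    if match is None:
        return None
    return match.groupdict()


wildcards = parse_cleaned_path(
    "data/processed/features/p01/all_sensor_features_cleaned_straw_py.csv"
)
# The rule looks the provider up by the upper-cased key.
provider = PROVIDERS[wildcards["provider_key"].upper()]

# A path under all_participants/ is rejected because of the pid constraint.
rejected = parse_cleaned_path(
    "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_(x).csv"
)
```

Without the constraint on pid, the second path would also fit the per-participant pattern, with pid set to all_participants.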
--- a/rules/models.smk
+++ b/rules/models.smk
@ -1,165 +1,52 @@
-rule download_demographic_data:
+rule merge_baseline_data:
+    input:
+        data = expand(config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/{container}", container=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["CONTAINER"])
+    output:
+        "data/raw/baseline_merged.csv"
+    script:
+        "../src/data/merge_baseline_data.py"
+
+rule download_baseline_data:
    input:
        participant_file = "data/external/participant_files/{pid}.yaml",
-        data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
+        data = "data/raw/baseline_merged.csv"
    output:
-        "data/raw/{pid}/participant_info_raw.csv"
+        "data/raw/{pid}/participant_baseline_raw.csv"
    script:
-        "../src/data/workflow_example/download_demographic_data.R"
+        "../src/data/download_baseline_data.py"

-rule demographic_features:
+rule baseline_features:
    input:
-        participant_info = "data/raw/{pid}/participant_info_raw.csv"
+        "data/raw/{pid}/participant_baseline_raw.csv"
    params:
-        pid = "{pid}",
-        features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
+        pid="{pid}",
+        features=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FEATURES"],
+        question_filename=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["QUESTION_LIST"]
    output:
-        "data/processed/features/{pid}/demographic_features.csv"
+        interim="data/interim/{pid}/baseline_questionnaires.csv",
+        features="data/processed/features/{pid}/baseline_features.csv"
    script:
-        "../src/features/workflow_example/demographic_features.py"
+        "../src/data/baseline_features.py"

-rule download_target_data:
+rule select_target:
    input:
-        participant_file = "data/external/participant_files/{pid}.yaml",
-        data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
-    output:
-        "data/raw/{pid}/participant_target_raw.csv"
-    script:
-        "../src/data/workflow_example/download_target_data.R"
-
-rule target_readable_datetime:
-    input:
-        sensor_input = "data/raw/{pid}/participant_target_raw.csv",
-        time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
-        pid_file = "data/external/participant_files/{pid}.yaml",
-        tzcodes_file = input_tzcodes_file,
+        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv"
    params:
-        device_type = "fitbit",
-        timezone_parameters = config["TIMEZONE"],
-        pid = "{pid}",
-        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
-        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
-    output:
-        "data/raw/{pid}/participant_target_with_datetime.csv"
-    script:
-        "../src/data/datetime/readable_datetime.R"
-
-rule parse_targets:
-    input:
-        targets = "data/raw/{pid}/participant_target_with_datetime.csv",
-        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
-    output:
-        "data/processed/targets/{pid}/parsed_targets.csv"
-    script:
-        "../src/models/workflow_example/parse_targets.py"
-
-rule clean_sensor_features_for_individual_participants:
-    input:
-        rules.merge_sensor_features_for_individual_participants.output
-    params:
-        cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
-        cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
-        rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
-        data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
-    output:
-        "data/processed/features/{pid}/all_sensor_features_cleaned.csv"
-    script:
-        "../src/models/workflow_example/clean_sensor_features.R"
-
-rule clean_sensor_features_for_all_participants:
-    input:
-        rules.merge_sensor_features_for_all_participants.output
-    params:
-        cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
-        cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
-        rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
-        data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
-    output:
-        "data/processed/features/all_participants/all_sensor_features_cleaned.csv"
-    script:
-        "../src/models/workflow_example/clean_sensor_features.R"
-
-rule merge_features_and_targets_for_individual_model:
-    input:
-        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned.csv",
-        targets = "data/processed/targets/{pid}/parsed_targets.csv",
+        target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"]
    output:
        "data/processed/models/individual_model/{pid}/input.csv"
    script:
-        "../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
+        "../src/models/select_targets.py"

 rule merge_features_and_targets_for_population_model:
    input:
-        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned.csv",
-        demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
-        targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
-    output:
-        "data/processed/models/population_model/input.csv"
-    script:
-        "../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
-
-rule baselines_for_individual_model:
-    input:
-        "data/processed/models/individual_model/{pid}/input.csv"
+        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_({target}).csv",
+        demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]),
    params:
-        cv_method = "{cv_method}",
-        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
+        target_variable="{target}"
    output:
-        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
-    log:
-        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
+        "data/processed/models/population_model/input_{target}.csv"
    script:
-        "../src/models/workflow_example/baselines.py"
+        "../src/models/merge_features_and_targets_for_population_model.py"

-rule baselines_for_population_model:
-    input:
-        "data/processed/models/population_model/input.csv"
-    params:
-        cv_method = "{cv_method}",
-        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
-    output:
-        "data/processed/models/population_model/output_{cv_method}/baselines.csv"
-    log:
-        "data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
-    script:
-        "../src/models/workflow_example/baselines.py"

-rule modelling_for_individual_participants:
-    input:
-        data = "data/processed/models/individual_model/{pid}/input.csv"
-    params:
-        model = "{model}",
-        cv_method = "{cv_method}",
-        scaler = "{scaler}",
-        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
-        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
-        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
-    output:
-        fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
-        fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
-        overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
-        fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
-    log:
-        "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
-    script:
-        "../src/models/workflow_example/modelling.py"
-
-rule modelling_for_all_participants:
-    input:
-        data = "data/processed/models/population_model/input.csv"
-    params:
-        model = "{model}",
-        cv_method = "{cv_method}",
-        scaler = "{scaler}",
-        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
-        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
-        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
-    output:
-        fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
-        fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
-        overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
-        fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
-    log:
-        "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
-    script:
-        "../src/models/workflow_example/modelling.py"
--- a/rules/models_example.smk
+++ b/rules/models_example.smk
@ -0,0 +1,139 @@
+rule download_demographic_data:
+    input:
+        participant_file = "data/external/participant_files/{pid}.yaml",
+        data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
+    output:
+        "data/raw/{pid}/participant_info_raw.csv"
+    script:
+        "../src/data/workflow_example/download_demographic_data.R"
+
+rule demographic_features:
+    input:
+        participant_info = "data/raw/{pid}/participant_info_raw.csv"
+    params:
+        pid = "{pid}",
+        features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
+    output:
+        "data/processed/features/{pid}/demographic_features.csv"
+    script:
+        "../src/features/workflow_example/demographic_features.py"
+
+rule download_target_data:
+    input:
+        participant_file = "data/external/participant_files/{pid}.yaml",
+        data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
+    output:
+        "data/raw/{pid}/participant_target_raw.csv"
+    script:
+        "../src/data/workflow_example/download_target_data.R"
+
+rule target_readable_datetime:
+    input:
+        sensor_input = "data/raw/{pid}/participant_target_raw.csv",
+        time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
+        pid_file = "data/external/participant_files/{pid}.yaml",
+        tzcodes_file = input_tzcodes_file,
+    params:
+        device_type = "fitbit",
+        timezone_parameters = config["TIMEZONE"],
+        pid = "{pid}",
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
+        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
+    output:
+        "data/raw/{pid}/participant_target_with_datetime.csv"
+    script:
+        "../src/data/datetime/readable_datetime.R"
+
+rule parse_targets:
+    input:
+        targets = "data/raw/{pid}/participant_target_with_datetime.csv",
+        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
+    output:
+        "data/processed/targets/{pid}/parsed_targets.csv"
+    script:
+        "../src/models/workflow_example/parse_targets.py"
+
+rule merge_features_and_targets_for_individual_model:
+    input:
+        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_rapids.csv",
+        targets = "data/processed/targets/{pid}/parsed_targets.csv",
+    output:
+        "data/processed/models/individual_model/{pid}/input.csv"
+    script:
+        "../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
+
+rule merge_features_and_targets_for_population_model:
+    input:
+        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_rapids.csv",
+        demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
+        targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
+    output:
+        "data/processed/models/population_model/input.csv"
+    script:
+        "../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
+
+rule baselines_for_individual_model:
+    input:
+        "data/processed/models/individual_model/{pid}/input.csv"
+    params:
+        cv_method = "{cv_method}",
+        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
+    output:
+        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
+    log:
+        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
+    script:
+        "../src/models/workflow_example/baselines.py"
+
+rule baselines_for_population_model:
+    input:
+        "data/processed/models/population_model/input.csv"
+    params:
+        cv_method = "{cv_method}",
+        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
+    output:
+        "data/processed/models/population_model/output_{cv_method}/baselines.csv"
+    log:
+        "data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
+    script:
+        "../src/models/workflow_example/baselines.py"
+
+rule modelling_for_individual_participants:
+    input:
+        data = "data/processed/models/individual_model/{pid}/input.csv"
+    params:
+        model = "{model}",
+        cv_method = "{cv_method}",
+        scaler = "{scaler}",
+        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
+        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
+        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
+    output:
+        fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
+        fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
+        overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
+        fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
+    log:
+        "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
+    script:
+        "../src/models/workflow_example/modelling.py"
+
+rule modelling_for_all_participants:
+    input:
+        data = "data/processed/models/population_model/input.csv"
+    params:
+        model = "{model}",
+        cv_method = "{cv_method}",
+        scaler = "{scaler}",
+        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
+        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
+        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
+    output:
+        fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
+        fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
+        overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
+        fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
+    log:
+        "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
+    script:
+        "../src/models/workflow_example/modelling.py"
--- a/rules/preprocessing.smk
+++ b/rules/preprocessing.smk
@ -4,6 +4,36 @@ rule create_example_participant_files:
    shell:
        "echo 'PHONE:\n  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n  PLATFORMS: [android]\n  LABEL: test-01\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\nFITBIT:\n  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n  LABEL: test-01\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example01.yaml && echo 'PHONE:\n  DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n  PLATFORMS: [ios]\n  LABEL: test-02\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\nFITBIT:\n  DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n  LABEL: test-02\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example02.yaml"

+# rule query_usernames_device_empatica_ids:
+#     params:
+#         baseline_folder = "/mnt/e/STRAWbaseline/"
+#     output:
+#         usernames_file = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"],
+#         timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
+#     script:
+#         "../../participants/prepare_usernames_file.py"
+
+rule prepare_tzcodes_file:
+    input:
+        timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
+    output:
+        tzcodes_file = config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]
+    script:
+        "../tools/create_multi_timezones_file.py"
+
+rule prepare_participants_csv:
+    input:
+        username_list = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"]
+    params:
+        data_configuration = config["PHONE_DATA_STREAMS"][config["PHONE_DATA_STREAMS"]["USE"]],
+        participants_table = "participants",
+        device_id_table = "esm",
+        start_end_date_table = "esm"
+    output:
+        participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
+    script:
+        "../src/data/translate_usernames_into_participants_data.R"
+
 rule create_participants_files:
    input:
        participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"] 
@ -98,7 +128,8 @@ rule process_phone_locations_types:
    params:
        consecutive_threshold = config["PHONE_LOCATIONS"]["FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD"],
        time_since_valid_location = config["PHONE_LOCATIONS"]["FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION"],
-        locations_to_use = config["PHONE_LOCATIONS"]["LOCATIONS_TO_USE"]
+        locations_to_use = config["PHONE_LOCATIONS"]["LOCATIONS_TO_USE"],
+        accuracy_limit = config["PHONE_LOCATIONS"]["ACCURACY_LIMIT"]
    output:
        "data/interim/{pid}/phone_locations_processed.csv"
    script:
@ -146,7 +177,6 @@ rule resample_episodes_with_datetime:
    script:
        "../src/data/datetime/readable_datetime.R"

-
 rule phone_application_categories:
    input:
        "data/raw/{pid}/phone_applications_{type}_with_datetime.csv"
@ -217,5 +247,33 @@ rule empatica_readable_datetime:
        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
    output:
        "data/raw/{pid}/empatica_{sensor}_with_datetime.csv"
+    resources:
+        mem_mb=50000
    script:
        "../src/data/datetime/readable_datetime.R"
+
+
+rule extract_event_information_from_esm:
+    input:
+        esm_raw_input = "data/raw/{pid}/phone_esm_raw.csv",
+        pid_file = "data/external/participant_files/{pid}.yaml"
+    params:
+        stage = "extract",
+        pid = "{pid}"
+    output:
+        "data/raw/ers/{pid}_ers.csv",
+        "data/raw/ers/{pid}_stress_event_targets.csv"
+    script:
+        "../src/features/phone_esm/straw/process_user_event_related_segments.py"
+
+rule merge_event_related_segments_files:
+    input:
+        ers_files = expand("data/raw/ers/{pid}_ers.csv", pid=config["PIDS"]),
+        se_files = expand("data/raw/ers/{pid}_stress_event_targets.csv", pid=config["PIDS"])
+    params:
+        stage = "merge"
+    output:
+        "data/external/straw_events.csv",
+        "data/external/stress_event_targets.csv"
+    script:
+        "../src/features/phone_esm/straw/process_user_event_related_segments.py"
--- a/rules/reports.smk
+++ b/rules/reports.smk
@ -1,6 +1,8 @@
 rule histogram_phone_data_yield:
    input:
        "data/processed/features/all_participants/all_sensor_features.csv"
+    params:
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/data_exploration/histogram_phone_data_yield.html"
    script:
@ -12,7 +14,8 @@ rule heatmap_sensors_per_minute_per_time_segment:
        participant_file = "data/external/participant_files/{pid}.yaml",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
-        pid = "{pid}"
+        pid = "{pid}",
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/interim/{pid}/heatmap_sensors_per_minute_per_time_segment.html"
    script:
@ -33,7 +36,9 @@ rule heatmap_sensor_row_count_per_time_segment:
        participant_file = "data/external/participant_files/{pid}.yaml",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
-        pid = "{pid}"
+        pid = "{pid}",
+        sensor_names = config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["SENSORS"],
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/interim/{pid}/heatmap_sensor_row_count_per_time_segment.html"
    script:
@ -49,11 +54,13 @@ rule merge_heatmap_sensor_row_count_per_time_segment:

 rule heatmap_phone_data_yield_per_participant_per_time_segment:
    input:
-        phone_data_yield = expand("data/processed/features/{pid}/phone_data_yield.csv", pid=config["PIDS"]),
-        participant_file = expand("data/external/participant_files/{pid}.yaml", pid=config["PIDS"]),
-        time_segments_labels = expand("data/interim/time_segments/{pid}_time_segments_labels.csv", pid=config["PIDS"])
+        participant_files = expand("data/external/participant_files/{pid}.yaml", pid=config["PIDS"]),
+        time_segments_file = config["TIME_SEGMENTS"]["FILE"],
+        phone_data_yield = "data/processed/features/all_participants/all_sensor_features.csv",
    params:
-        time = config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["TIME"]
+        pids = config["PIDS"],
+        time = config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["TIME"],
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html"
    script:
@ -63,6 +70,7 @@ rule heatmap_feature_correlation_matrix:
    input:
        all_sensor_features = "data/processed/features/all_participants/all_sensor_features.csv" # before data cleaning
    params:
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
        min_rows_ratio = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["MIN_ROWS_RATIO"],
        corr_threshold = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["CORR_THRESHOLD"],
        corr_method = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["CORR_METHOD"]
--- a/src/data/baseline_features.py
+++ b/src/data/baseline_features.py
@ -0,0 +1,182 @@
+import numpy as np
+import pandas as pd
+
+pid = snakemake.params["pid"]
+requested_features = snakemake.params["features"]
+baseline_interim = pd.DataFrame(columns=["qid", "question", "score_original", "score"])
+baseline_features = pd.DataFrame(columns=requested_features)
+question_filename = snakemake.params["question_filename"]
+
+JCQ_DEMAND = "JobEisen"
+JCQ_CONTROL = "JobControle"
+
+dict_JCQ_demand_control_reverse = {
+    JCQ_DEMAND: {
+        3: " [Od mene se ne zahteva,",
+        4: " [Imam dovolj časa, da končam",
+        5: " [Pri svojem delu se ne srečujem s konfliktnimi",
+    },
+    JCQ_CONTROL: {
+        2: " [Moje delo vključuje veliko ponavljajočega",
+        6: " [Pri svojem delu imam zelo malo svobode",
+    },
+}
+
+LIMESURVEY_JCQ_MIN = 1
+LIMESURVEY_JCQ_MAX = 4
+
+DEMAND_CONTROL_RATIO_MIN = 5 / (9 * 4)
+DEMAND_CONTROL_RATIO_MAX = (4 * 5) / 9
+
+JCQ_NORMS = {
+    "F": {
+        0: DEMAND_CONTROL_RATIO_MIN,
+        1: 0.45,
+        2: 0.52,
+        3: 0.62,
+        4: DEMAND_CONTROL_RATIO_MAX,
+    },
+    "M": {
+        0: DEMAND_CONTROL_RATIO_MIN,
+        1: 0.41,
+        2: 0.48,
+        3: 0.56,
+        4: DEMAND_CONTROL_RATIO_MAX,
+    },
+}
+
+participant_info = pd.read_csv(snakemake.input[0], parse_dates=["date_of_birth"])
+
+if not participant_info.empty:
+    if "age" in requested_features:
+        now = pd.Timestamp("now")
+        baseline_features.loc[0, "age"] = (
+            now - participant_info.loc[0, "date_of_birth"]
+        ).days / 365.25245
+    if "gender" in requested_features:
+        baseline_features.loc[0, "gender"] = participant_info.loc[0, "gender"]
+    if "startlanguage" in requested_features:
+        baseline_features.loc[0, "startlanguage"] = participant_info.loc[
+            0, "startlanguage"
+        ]
+    if (
+        ("limesurvey_demand" in requested_features)
+        or ("limesurvey_control" in requested_features)
+        or ("limesurvey_demand_control_ratio" in requested_features)
+    ):
+        participant_info_t = participant_info.T
+        rows_baseline = participant_info_t.index
+
+        if ("limesurvey_demand" in requested_features) or (
+            "limesurvey_demand_control_ratio" in requested_features
+        ):
+            # Find questions about demand, but disregard time (duration of filling in questionnaire)
+            rows_demand = rows_baseline.str.startswith(
+                JCQ_DEMAND
+            ) & ~rows_baseline.str.endswith("Time")
+            limesurvey_demand = (
+                participant_info_t[rows_demand]
+                .reset_index()
+                .rename(columns={"index": "question", 0: "score_original"})
+            )
+            # Extract question IDs from names such as JobEisen[3]
+            limesurvey_demand["qid"] = (
+                limesurvey_demand["question"].str.extract(r"\[(\d+)\]").astype(int)
+            )
+            limesurvey_demand["score"] = limesurvey_demand["score_original"]
+            # Identify rows that include questions to be reversed.
+            rows_demand_reverse = limesurvey_demand["qid"].isin(
+                dict_JCQ_demand_control_reverse[JCQ_DEMAND].keys()
+            )
+            # Reverse the score, so that the maximum value becomes the minimum etc.
+            limesurvey_demand.loc[rows_demand_reverse, "score"] = (
+                LIMESURVEY_JCQ_MAX
+                + LIMESURVEY_JCQ_MIN
+                - limesurvey_demand.loc[rows_demand_reverse, "score_original"]
+            )
+            baseline_interim = pd.concat([baseline_interim, limesurvey_demand], axis=0, ignore_index=True)
+            if "limesurvey_demand" in requested_features:
+                baseline_features.loc[0, "limesurvey_demand"] = limesurvey_demand[
+                    "score"
+                ].sum()
+
+        if ("limesurvey_control" in requested_features) or (
+            "limesurvey_demand_control_ratio" in requested_features
+        ):
+            # Find questions about control, but disregard time (duration of filling in questionnaire)
+            rows_control = rows_baseline.str.startswith(
+                JCQ_CONTROL
+            ) & ~rows_baseline.str.endswith("Time")
+            limesurvey_control = (
+                participant_info_t[rows_control]
+                .reset_index()
+                .rename(columns={"index": "question", 0: "score_original"})
+            )
+            # Extract question IDs from names such as JobControle[3]
+            limesurvey_control["qid"] = (
+                limesurvey_control["question"].str.extract(r"\[(\d+)\]").astype(int)
+            )
+            limesurvey_control["score"] = limesurvey_control["score_original"]
+            # Identify rows that include questions to be reversed.
+            rows_control_reverse = limesurvey_control["qid"].isin(
+                dict_JCQ_demand_control_reverse[JCQ_CONTROL].keys()
+            )
+            # Reverse the score, so that the maximum value becomes the minimum etc.
+            limesurvey_control.loc[rows_control_reverse, "score"] = (
+                LIMESURVEY_JCQ_MAX
+                + LIMESURVEY_JCQ_MIN
+                - limesurvey_control.loc[rows_control_reverse, "score_original"]
+            )
+
+            baseline_interim = pd.concat([baseline_interim, limesurvey_control], axis=0, ignore_index=True)
+
+            if "limesurvey_control" in requested_features:
+                baseline_features.loc[0, "limesurvey_control"] = limesurvey_control[
+                    "score"
+                ].sum()
+
+        if "limesurvey_demand_control_ratio" in requested_features:
+            if limesurvey_control["score"].sum():
+                limesurvey_demand_control_ratio = (
+                        limesurvey_demand["score"].sum() / limesurvey_control["score"].sum()
+                )
+            else:
+                limesurvey_demand_control_ratio = 0
+            if (
+                JCQ_NORMS[participant_info.loc[0, "gender"]][0]
+                <= limesurvey_demand_control_ratio
+                < JCQ_NORMS[participant_info.loc[0, "gender"]][1]
+            ):
+                limesurvey_quartile = 1
+            elif (
+                JCQ_NORMS[participant_info.loc[0, "gender"]][1]
+                <= limesurvey_demand_control_ratio
+                < JCQ_NORMS[participant_info.loc[0, "gender"]][2]
+            ):
+                limesurvey_quartile = 2
+            elif (
+                JCQ_NORMS[participant_info.loc[0, "gender"]][2]
+                <= limesurvey_demand_control_ratio
+                < JCQ_NORMS[participant_info.loc[0, "gender"]][3]
+            ):
+                limesurvey_quartile = 3
+            elif (
+                JCQ_NORMS[participant_info.loc[0, "gender"]][3]
+                <= limesurvey_demand_control_ratio
+                <= JCQ_NORMS[participant_info.loc[0, "gender"]][4]
+            ):
+                limesurvey_quartile = 4
+            else:
+                limesurvey_quartile = np.nan
+
+            baseline_features.loc[
+                0, "limesurvey_demand_control_ratio"
+            ] = limesurvey_demand_control_ratio
+            baseline_features.loc[
+                0, "limesurvey_demand_control_ratio_quartile"
+            ] = limesurvey_quartile
+
+if not baseline_interim.empty:
+    baseline_interim.to_csv(snakemake.output["interim"], index=False, encoding="utf-8")
+
+baseline_features.to_csv(snakemake.output["features"], index=False, encoding="utf-8")
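The scoring logic in baseline_features.py comes down to two small operations: mirroring the score of a reversed item on the 1 to 4 scale, and placing the demand/control ratio into a quartile using the gender-specific cut-offs. A minimal sketch of both follows. The constants are copied from the script, while the helper names reverse_score and quartile are hypothetical. The top cut-off is treated as inclusive here so that the maximum possible ratio still lands in quartile 4; that is a choice made for this sketch.

```python
import math

# Scale bounds and cut-offs copied from the script above.
JCQ_MIN, JCQ_MAX = 1, 4
RATIO_MIN = 5 / (9 * 4)
RATIO_MAX = (4 * 5) / 9
NORMS_F = [RATIO_MIN, 0.45, 0.52, 0.62, RATIO_MAX]


def reverse_score(score, low=JCQ_MIN, high=JCQ_MAX):
    """Mirror a score on the scale, so that the maximum becomes the minimum."""
    return high + low - score


def quartile(ratio, cutoffs):
    """Return the 1-based bin whose bounds contain ratio, or NaN if none does.

    Bins are half-open, [lower, upper), except the last one, which also
    includes its upper bound.
    """
    last = len(cutoffs) - 1
    for q in range(1, len(cutoffs)):
        below_upper = ratio <= cutoffs[q] if q == last else ratio < cutoffs[q]
        if cutoffs[q - 1] <= ratio and below_upper:
            return q
    return math.nan
```

For example, reverse_score(1) gives 4, and quartile(0.5, NORMS_F) gives 2 because 0.45 <= 0.5 < 0.52. A ratio of 0, which the script assigns when the control sum is zero, falls below the lowest cut-off and yields NaN.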
--- a/src/data/create_participants_files.R
+++ b/src/data/create_participants_files.R
@ -1,6 +1,6 @@
 source("renv/activate.R")

-library(RMariaDB)
+#library(RMariaDB)
 library(stringr)
 library(purrr)
 library(readr)
@ -58,7 +58,7 @@ participants %>%
      lines <- append(lines, empty_fitbit)

    if(add_empatica_section == TRUE && !is.na(row[empatica_device_id_column])){
-      lines <- append(lines, c("EMPATICA:", paste0("  DEVICE_IDS: [",row[empatica_device_id_column],"]"),
+      lines <- append(lines, c("EMPATICA:", paste0("  DEVICE_IDS: [",row$label,"]"),
                               paste("  LABEL:",row$label), paste("  START_DATE:", start_date), paste("  END_DATE:", end_date)))
    } else
      lines <- append(lines, empty_empatica)
@ -73,7 +73,7 @@ participants %>%
 file_lines <-readLines("./config.yaml")
 for (i in 1:length(file_lines)){
  if(startsWith(file_lines[i], "PIDS:")){
-    file_lines[i] <- paste0("PIDS: [", paste(participants$pid, collapse = ", "), "]")
+    file_lines[i] <- paste0("PIDS: ['", paste(participants$pid, collapse = "', '"), "']")
  }
 }
 writeLines(file_lines, con = "./config.yaml") 
--- a/src/data/datetime/assign_to_time_segment.R
+++ b/src/data/datetime/assign_to_time_segment.R
@ -5,13 +5,16 @@ options(scipen=999)

 assign_rows_to_segments <- function(data, segments){
  # This function is used by all segment types, we use data.tables because they are fast
+
  data <- data.table::as.data.table(data)
  data[, assigned_segments := ""]
  for(i in seq_len(nrow(segments))) {
    segment <- segments[i,]
+
    data[segment$segment_start_ts<= timestamp & segment$segment_end_ts >= timestamp,
         assigned_segments := stringi::stri_c(assigned_segments, segment$segment_id, sep = "|")]
  }
+  
  data[,assigned_segments:=substring(assigned_segments, 2)]
  data
 }
--- a/src/data/download_baseline_data.py
+++ b/src/data/download_baseline_data.py
@ -0,0 +1,14 @@
+import pandas as pd
+import yaml
+
+filename = snakemake.input["data"]
+baseline = pd.read_csv(filename)
+
+with open(snakemake.input["participant_file"], "r") as file:
+    participant = yaml.safe_load(file)
+
+username = participant["PHONE"]["LABEL"]
+
+baseline[baseline["username"] == username].to_csv(snakemake.output[0],
+                                                  index=False,
+                                                  encoding="utf-8",)
--- a/src/data/merge_baseline_data.py
+++ b/src/data/merge_baseline_data.py
@ -0,0 +1,30 @@
+import pandas as pd
+
+VARIABLES_TO_TRANSLATE = {
+    "Gebruikersnaam": "username",
+    "Geslacht": "gender",
+    "Geboortedatum": "date_of_birth",
+}
+
+filenames = snakemake.input["data"]
+
+baseline_dfs = []
+
+for fn in filenames:
+    baseline_dfs.append(pd.read_csv(fn,
+                                    parse_dates=["Geboortedatum"],
+                                    infer_datetime_format=True,
+                                    cache_dates=True,
+                                    ))
+
+baseline = (
+    pd.concat(baseline_dfs, join="inner")
+    .reset_index()
+    .drop(columns="index")
+)
+
+baseline.rename(columns=VARIABLES_TO_TRANSLATE, copy=False, inplace=True)
+
+baseline.to_csv(snakemake.output[0],
+                index=False,
+                encoding="utf-8",)
--- a/src/data/process_location_types.R
+++ b/src/data/process_location_types.R
@ -6,9 +6,10 @@ library(tidyr)
 consecutive_threshold <- snakemake@params[["consecutive_threshold"]]
 time_since_valid_location <- snakemake@params[["time_since_valid_location"]]
 locations_to_use <- snakemake@params[["locations_to_use"]]
+accuracy_limit <- snakemake@params[["accuracy_limit"]]

 locations <- read.csv(snakemake@input[["locations"]]) %>% 
-            filter(double_latitude != 0 & double_longitude != 0) %>% 
+            filter(double_latitude != 0 & double_longitude != 0 & accuracy < accuracy_limit) %>% 
            drop_na(double_longitude, double_latitude) %>% 
            group_by(timestamp) %>% # keep only the row with the best accuracy if two or more have the same timestamp
            filter(accuracy == min(accuracy, na.rm=TRUE)) %>%  
@ -63,7 +64,7 @@ if(locations_to_use == "ALL"){
            # you can think of consecutive_threshold as the period a location row is valid for
            mutate(limit = pmin(lead(timestamp, default = 9999999999999) - 1, limit + (1000 * 60 * consecutive_threshold)),
                    n_resample = (limit - timestamp)%/%60001,
-                    n_resample = if_else(n_resample == 0, 1, n_resample)) %>% 
+                    n_resample = n_resample + 1) %>% 
            drop_na(double_longitude, double_latitude) %>%
            uncount(weights = n_resample, .id = "id") %>% 
            mutate(provider = if_else(id > 1, "resampled", provider),
--- a/src/data/streams/aware_micro_mysql/container.R
+++ b/src/data/streams/aware_micro_mysql/container.R
@ -0,0 +1,85 @@
+# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
+library(RMariaDB)
+library(yaml)
+
+#' @description
+#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
+#' You can reuse most of this function if you are connecting to a DB or Web API.
+#' It's OK to delete this function if you don't need credentials, e.g., if you are pulling data from a CSV.
+#' @param group the yaml key containing the credentials to connect to a database
+#' @return dbEngine a database engine (connection) ready to perform queries
+get_db_engine <- function(group){
+  # The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
+  credentials <- read_yaml("./credentials.yaml")
+  if(!group %in% names(credentials))
+    stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_micro_mysql][DATABASE_GROUP] in config.yaml?"))
+  dbEngine <- dbConnect(MariaDB(), db = credentials[[group]][["database"]],
+                                    username = credentials[[group]][["user"]],
+                                    password = credentials[[group]][["password"]],
+                                    host = credentials[[group]][["host"]],
+                                    port = credentials[[group]][["port"]])
+  return(dbEngine)
+}
+
+# This file gets executed for each PHONE_SENSOR of each participant
+# If you are connecting to a database, the file containing its credentials is available at "./credentials.yaml"
+# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
+# You are not bound to databases or files, you can query a web API or whatever data source you need.
+
+#' @description
+#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
+#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
+#'  
+#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed, 
+#' and they have to assign the OS manually in the participant file
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @return The OS the device ran, "android" or "ios"
+
+infer_device_os <- function(stream_parameters, device){
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
+  message(paste0("Executing the following query to infer phone OS: ", query)) 
+  os <- dbGetQuery(dbEngine, query)
+  dbDisconnect(dbEngine)
+  
+  if(nrow(os) > 0)
+    return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
+  else
+    stop(paste("We cannot infer the OS of the following device id because it does not exist in the aware_device table:", device))
+  
+  return(os)
+}
+
+#' @description
+#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
+#' @param columns the columns needed from this sensor (we recommend to only return these columns instead of every column in sensor_container)
+#' @return A dataframe with the sensor data for device
+
+pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+
+  select_items <- c()
+  for (column in columns) {
+    select_items <- append(select_items, paste0("data->>'$.", column, "' ", column))
+  }
+
+  query <- paste0("SELECT ", paste(select_items, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
+
+  # Letting the user know what we are doing
+  message(paste0("Executing the following query to download data: ", query)) 
+  sensor_data <- dbGetQuery(dbEngine, query)
+  
+  dbDisconnect(dbEngine)
+  
+  if(nrow(sensor_data) == 0)
+    warning(paste0("The device '", device, "' did not have data in ", sensor_container))
+
+  return(sensor_data)
+}
+
--- a/src/data/streams/aware_micro_mysql/format.yaml
+++ b/src/data/streams/aware_micro_mysql/format.yaml
@ -0,0 +1,337 @@
+PHONE_ACCELEROMETER:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_ACTIVITY_RECOGNITION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: activity_name
+      ACTIVITY_TYPE: activity_type
+      CONFIDENCE: confidence
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: FLAG_TO_MUTATE
+      ACTIVITY_TYPE: FLAG_TO_MUTATE
+      CONFIDENCE: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        ACTIVITIES: activities
+        CONFIDENCE: confidence
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
+
+PHONE_APPLICATIONS_CRASHES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      APPLICATION_VERSION: application_version
+      ERROR_SHORT: error_short
+      ERROR_LONG: error_long
+      ERROR_CONDITION: error_condition
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_APPLICATIONS_FOREGROUND:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_APPLICATIONS_NOTIFICATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      TEXT: text
+      SOUND: sound
+      VIBRATE: vibrate
+      DEFAULTS: defaults
+      FLAGS: flags
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_BATTERY:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: battery_status
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: FLAG_TO_MUTATE
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+        BATTERY_STATUS: battery_status
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
+
+PHONE_BLUETOOTH:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_CALLS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: call_type
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: FLAG_TO_MUTATE
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+        CALL_TYPE: call_type
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
+
+PHONE_CONVERSATION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
+
+PHONE_KEYBOARD:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      BEFORE_TEXT: before_text
+      CURRENT_TEXT: current_text
+      IS_PASSWORD: is_password
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LIGHT:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LIGHT_LUX: double_light_lux
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOCATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOG:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_MESSAGES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MESSAGE_TYPE: message_type
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_SCREEN:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: screen_status
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        SCREEN_STATUS: screen_status
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
+
+PHONE_WIFI_CONNECTED:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_WIFI_VISIBLE:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
--- a/src/data/streams/aware_postgresql/container.R
+++ b/src/data/streams/aware_postgresql/container.R
@ -0,0 +1,212 @@
+# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
+library(RPostgres)
+# Needs libpq-dev for compiling from source.
+
+# Error installing package 'RPostgres':
+#   =====================================
+#   
+#   * installing *source* package 'RPostgres' ...
+# ** package 'RPostgres' successfully unpacked and MD5 sums checked
+# ** using staged installation
+# Using PKG_CFLAGS=
+#   Using PKG_LIBS=-lpq
+# Using PKG_PLOGR=
+#   ------------------------- ANTICONF ERROR ---------------------------
+#   Configuration failed because libpq was not found. Try installing:
+#   * deb: libpq-dev (Debian, Ubuntu, etc)
+# * rpm: postgresql-devel (Fedora, EPEL)
+# * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
+# * csw: postgresql_dev (Solaris)
+# * brew: libpq (OSX)
+# If libpq is already installed, check that either:
+#   (i)  'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains
+# a libpq.pc file; or
+# (ii) 'pg_config' is in your PATH.
+# If neither can detect , you can set INCLUDE_DIR
+# and LIB_DIR manually via:
+#   R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
+# --------------------------[ ERROR MESSAGE ]----------------------------
+#   <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
+# compilation terminated.
+
+library(dbplyr)
+library(yaml)
+
+#' @description
+#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
+#' You can reuse most of this function if you are connecting to a DB or Web API.
+#' It's OK to delete this function if you don't need credentials, e.g., if you are pulling data from a CSV.
+#' @param group the yaml key containing the credentials to connect to a database
+#' @return dbEngine a database engine (connection) ready to perform queries
+get_db_engine <- function(group){
+  # The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
+  credentials <- read_yaml("./credentials.yaml")
+  if(!group %in% names(credentials))
+    stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_postgresql][DATABASE_GROUP] in config.yaml?"))
+  dbEngine <- dbConnect(Postgres(), db = credentials[[group]][["database"]],
+                        user = credentials[[group]][["user"]],
+                        password = credentials[[group]][["password"]],
+                        host = credentials[[group]][["host"]],
+                        port = credentials[[group]][["port"]])
+  return(dbEngine)
+}
+
+# This file gets executed for each PHONE_SENSOR of each participant
+# If you are connecting to a database, the file containing its credentials is available at "./credentials.yaml"
+# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
+# You are not bound to databases or files, you can query a web API or whatever data source you need.
+
+#' @description
+#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
+#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
+#'  
+#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed, 
+#' and they have to assign the OS manually in the participant file
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @return The OS the device ran, "android" or "ios"
+
+infer_device_os <- function(stream_parameters, device){
+  #dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  #query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
+  #message(paste0("Executing the following query to infer phone OS: ", query)) 
+  #os <- dbGetQuery(dbEngine, query)
+  #dbDisconnect(dbEngine)
+  
+  #if(nrow(os) > 0)
+  #  return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
+  #else
+  stop(paste("We cannot infer the OS of the following device id because the aware_device table does not exist."))
+  
+  #return(os)
+}
+
+#' @description
+#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
+#' @param columns the columns needed from this sensor (we recommend to only return these columns instead of every column in sensor_container)
+#' @return A dataframe with the sensor data for device
+
+pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
+  # Letting the user know what we are doing
+  message(paste0("Executing the following query to download data: ", query)) 
+  sensor_data <- dbGetQuery(dbEngine, query)
+  
+  dbDisconnect(dbEngine)
+  
+  if(nrow(sensor_data) == 0)
+    warning(paste0("The device '", device, "' did not have data in ", sensor_container))
+  
+  return(sensor_data)
+}
+
+#' @description
+#' Gets participants' IDs for specified usernames.
+#'
+#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
+#' @param usernames A vector of usernames
+#' @param participants_container The name of the database table containing participants data, such as their username.
+#' @return A dataframe with participant IDs matching usernames
+
+pull_participants_ids <- function(stream_parameters, usernames, participants_container) {
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  
+  query_participant_id <- tbl(dbEngine, participants_container) %>% 
+    filter(username %in% !!usernames) %>% 
+    select(username, id)
+  
+  message(paste0("Executing the following query to get participants' IDs: \n", sql_render(query_participant_id)))
+  
+  participant_data <- query_participant_id %>% collect()
+
+  dbDisconnect(dbEngine)
+  
+  if(nrow(participant_data) == 0)
+    warning(paste0("We could not find requested usernames (", paste(usernames, collapse = ", "), ") in ", participants_container))
+  
+  return(participant_data)
+}
+
+#' @description
+#' Gets the distinct device IDs for specified participant IDs.
+#'
+#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
+#' @param participants_ids A vector of numeric participant IDs
+#' @param device_id_container The name of the database table which will be used to determine distinct device ID. Ideally, a table that reliably contains data, but not too much.
+#' @return A dataframe with a row matching each distinct device ID with a participant ID
+
+pull_participants_device_ids <- function(stream_parameters, participants_ids, device_id_container) {
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+
+  query_device_id <- tbl(dbEngine, device_id_container) %>%
+    filter(participant_id %in% !!participants_ids) %>% 
+    group_by(participant_id) %>% 
+    distinct(device_id, .keep_all = FALSE)
+  
+  message(
+    paste0(
+      "Executing the following query to get the distinct device IDs: \n",
+      sql_render(query_device_id),
+      "\n NOTE: This might take a long time."
+    )
+  )
+  
+  device_ids <- query_device_id %>% collect()
+  
+  dbDisconnect(dbEngine)
+  
+  if(nrow(device_ids) == 0)
+    warning(paste0("We could not find device IDs for requested participant IDs (", paste(participants_ids, collapse = ", "), ") in ", device_id_container))
+  
+  return(device_ids)
+}
+
+#' @description
+#' Gets start and end datetimes for specified participant IDs.
+#'
+#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
+#' @param participants_ids A vector of numeric participant IDs
+#' @param start_end_date_container The name of the database table which will be used to determine when a participant started and ended their participation. Briefing and debriefing EMAs can be meaningfully used here.
+#' @return A dataframe relating participant IDs with their start and end datetimes.
+
+pull_participants_start_end_dates <- function(stream_parameters, participants_ids, start_end_date_container) {
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+
+  query_timestamps <- tbl(dbEngine, start_end_date_container) %>% 
+    filter(
+      participant_id %in% !!participants_ids,
+      double_esm_user_answer_timestamp > 0
+    ) %>% 
+    group_by(participant_id) %>% 
+    summarise(
+      timestamp_min = min(double_esm_user_answer_timestamp, na.rm = TRUE),
+      timestamp_max = max(double_esm_user_answer_timestamp, na.rm = TRUE)
+    ) %>% 
+    select(participant_id, timestamp_min, timestamp_max)
+  
+  message(paste0("Executing the following query to get the starting and ending datetimes: \n", sql_render(query_timestamps)))
+  
+  start_end_timestamps <- query_timestamps %>% collect()
+  
+  if(nrow(start_end_timestamps) == 0)
+    warning(paste0("We could not find datetimes for requested participant IDs (", paste(participants_ids, collapse = ", "), ") in ", start_end_date_container))
+
+  start_end_times <- start_end_timestamps %>% 
+    mutate(    
+      datetime_start = as_datetime(timestamp_min/1000, tz = "UTC"),
+      datetime_end = as_datetime(timestamp_max/1000, tz = "UTC")
+    ) %>% 
+    select(-c(timestamp_min, timestamp_max))
+  
+  dbDisconnect(dbEngine)
+  
+  return(start_end_times)
+}
+
+
--- a/src/data/streams/aware_postgresql/format.yaml
+++ b/src/data/streams/aware_postgresql/format.yaml
@ -0,0 +1,372 @@
+PHONE_ACCELEROMETER:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_ACTIVITY_RECOGNITION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: activity_name
+      ACTIVITY_TYPE: activity_type
+      CONFIDENCE: confidence
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: FLAG_TO_MUTATE
+      ACTIVITY_TYPE: FLAG_TO_MUTATE
+      CONFIDENCE: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        ACTIVITIES: activities
+        CONFIDENCE: confidence
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
+
+PHONE_APPLICATIONS_CRASHES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      APPLICATION_VERSION: application_version
+      ERROR_SHORT: error_short
+      ERROR_LONG: error_long
+      ERROR_CONDITION: error_condition
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_APPLICATIONS_FOREGROUND:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_hash
+      APPLICATION_NAME: FLAG_TO_MUTATE
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS:
+        - src/data/streams/mutations/phone/straw/app_add_name.R
+
+PHONE_APPLICATIONS_NOTIFICATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_hash
+      APPLICATION_NAME: FLAG_TO_MUTATE
+      SOUND: sound
+      VIBRATE: vibrate
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS:
+        - src/data/streams/mutations/phone/straw/app_add_name.R
+
+PHONE_BATTERY:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: battery_status
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: FLAG_TO_MUTATE
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+        BATTERY_STATUS: battery_status
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
+
+PHONE_BLUETOOTH:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_CALLS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: call_type
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: FLAG_TO_MUTATE
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+        CALL_TYPE: call_type
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
+
+PHONE_CONVERSATION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
+
+PHONE_ESM:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: double_esm_user_answer_timestamp
+      DEVICE_ID: device_id
+      ESM_STATUS: esm_status
+      ESM_USER_ANSWER: esm_user_answer
+      ESM_JSON: esm_json
+      ESM_TRIGGER: esm_trigger
+      ESM_SESSION: esm_session
+      ESM_NOTIFICATION_ID: esm_notification_id
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS:
+
+PHONE_KEYBOARD:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      BEFORE_TEXT: before_text
+      CURRENT_TEXT: current_text
+      IS_PASSWORD: is_password
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LIGHT:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LIGHT_LUX: double_light_lux
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOCATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOG:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_MESSAGES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MESSAGE_TYPE: message_type
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_SCREEN:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: screen_status
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        SCREEN_STATUS: screen_status
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
+
+PHONE_WIFI_CONNECTED:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_WIFI_VISIBLE:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_SPEECH:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SPEECH_PROPORTION: speech_proportion
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SPEECH_PROPORTION: speech_proportion
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+
+
+
--- a/src/data/streams/empatica_zip/container.py
+++ b/src/data/streams/empatica_zip/container.py
@ -2,11 +2,16 @@ from zipfile import ZipFile
 import warnings
 from pathlib import Path
 import pandas as pd
+import numpy as np
 from pandas.core import indexing
 import yaml
 import csv
 from collections import OrderedDict
 from io import BytesIO, StringIO
+import sys, os
+
+from cr_features.hrv import get_HRV_features, get_patched_ibi_with_bvp
+from cr_features.helper_functions import empatica1d_to_array, empatica2d_to_array

 def processAcceleration(x, y, z):
    x = float(x)
@ -52,6 +57,8 @@ def extract_empatica_data(data,  sensor):
        df = pd.DataFrame.from_dict(ddict, orient='index', columns=[column])
        df[column] = df[column].astype(float)
        df.index.name = 'timestamp'
+        if df.empty:
+            return df

    elif sensor == 'EMPATICA_ACCELEROMETER':
        ddict = readFile(sensor_data_file, sensor)
@ -60,15 +67,22 @@ def extract_empatica_data(data,  sensor):
        df['y'] = df['y'].astype(float)
        df['z'] = df['z'].astype(float)
        df.index.name = 'timestamp'
+        if df.empty:
+            return df

    elif sensor == 'EMPATICA_INTER_BEAT_INTERVAL':
-        df = pd.read_csv(sensor_data_file, names=['timestamp', column], header=None)
+
+        df = pd.read_csv(sensor_data_file, names=['timings', column], header=None)
+        df['timestamp'] = df['timings']
+        if df.empty:
+            df = df.set_index('timestamp')
+            return df
        timestampstart = float(df['timestamp'][0])
-        df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
+        df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart        
        df = df.drop([0])
        df[column] = df[column].astype(float)
        df = df.set_index('timestamp')
-
+        
    else:
        raise ValueError(
            "sensor has an invalid name: {}".format(sensor))
@ -84,6 +98,10 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
    participant_data = pd.DataFrame(columns=columns_to_download.values())
    participant_data.set_index('timestamp', inplace=True)

+    with open('config.yaml', 'r') as stream:
+        config = yaml.load(stream, Loader=yaml.FullLoader)
+    cr_ibi_provider = config['EMPATICA_INTER_BEAT_INTERVAL']['PROVIDERS']['CR']
+
    available_zipfiles = list((Path(data_configuration["FOLDER"]) / Path(device)).rglob("*.zip"))
    if len(available_zipfiles) == 0:
        warnings.warn("There were no zip files in: {}. If you were expecting data for this participant the [EMPATICA][DEVICE_IDS] key in their participant file is missing the pid".format((Path(data_configuration["FOLDER"]) / Path(device))))
@ -94,7 +112,13 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
            listOfFileNames = zipFile.namelist()
            for fileName in listOfFileNames:
                if fileName == sensor_csv:
-                    participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName),  sensor)], axis=0)
+                    if sensor == "EMPATICA_INTER_BEAT_INTERVAL" and cr_ibi_provider.get('PATCH_WITH_BVP', False):
+                        participant_data = \
+                            pd.concat([participant_data, patch_ibi_with_bvp(zipFile.read('IBI.csv'), zipFile.read('BVP.csv'))], axis=0)
+                        #print("patch with ibi")
+                    else:
+                        participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName), sensor)], axis=0)
+                        #print("no patching")
                    warning = False
            if warning:
                warnings.warn("We could not find a zipped file for {} in {} (we tried to find {})".format(sensor, zipFile, sensor_csv))
@ -105,4 +129,54 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
    participant_data["device_id"] = device
    return(participant_data)

+def patch_ibi_with_bvp(ibi_data, bvp_data):
+    ibi_data_file = BytesIO(ibi_data).getvalue().decode('utf-8')
+    ibi_data_file = StringIO(ibi_data_file)
+
+    # Begin with the cr-features part
+    try:
+        ibi_data, ibi_start_timestamp = empatica2d_to_array(ibi_data_file)
+    except (IndexError, KeyError) as e:
+        # Checks whether IBI.csv is empty
+        # It may raise a KeyError if df is empty here: startTimeStamp = df.time[0]
+        df_test = pd.read_csv(ibi_data_file, names=['timings', 'inter_beat_interval'], header=None)
+        if df_test.empty:
+            df_test['timestamp'] = df_test['timings']
+            df_test = df_test.set_index('timestamp')
+            return df_test
+        else:
+            raise IndexError("Something went wrong with indices. Error that was previously caught:\n" + repr(e)) from e
+
+    bvp_data_file = BytesIO(bvp_data).getvalue().decode('utf-8')
+    bvp_data_file = StringIO(bvp_data_file)
+
+    bvp_data, bvp_start_timestamp, sample_rate = empatica1d_to_array(bvp_data_file)
+
+    hrv_time_and_freq_features, sample, bvp_rr, bvp_timings, peak_indx = \
+        get_HRV_features(bvp_data, ma=False, 
+                        detrend=False, m_deternd=False, low_pass=False, winsorize=True, 
+                        winsorize_value=25, hampel_fiter=False, median_filter=False, 
+                        mod_z_score_filter=True, sampling=64, feature_names=['meanHr'])
+    
+    ibi_timings, ibi_rr = get_patched_ibi_with_bvp(ibi_data[0], ibi_data[1], bvp_timings, bvp_rr)
+
+    df = \
+        pd.DataFrame(np.array([ibi_timings, ibi_rr]).transpose(), columns=['timestamp', 'inter_beat_interval'])
+    df.loc[-1] = [ibi_start_timestamp, 'IBI']  # prepend the IBI.csv header row (start timestamp, 'IBI')
+    df.index = df.index + 1  # shift the index so the header row gets label 0
+    df = df.sort_index()  # move the header row to the top
+
+    # Same conversion as in extract_empatica_data for IBI: offsets are added to the start timestamp
+    df['timings'] = df['timestamp']
+    timestampstart = float(df['timestamp'][0])
+    df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
+    df = df.drop([0])
+    df['inter_beat_interval'] = df['inter_beat_interval'].astype(float)
+    df = df.set_index('timestamp')
+
+    # Convert the timestamps from seconds to integer milliseconds
+    df.index *= 1000
+    df.index = df.index.astype(int)
+    return df
+
 # print(pull_data({'FOLDER': 'data/external/empatica'}, "e01", "EMPATICA_accelerometer", {'TIMESTAMP': 'timestamp', 'DEVICE_ID': 'device_id', 'DOUBLE_VALUES_0': 'x', 'DOUBLE_VALUES_1': 'y', 'DOUBLE_VALUES_2': 'z'}))
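
The IBI branch above treats the first row of `IBI.csv` as a header carrying the recording's start timestamp and every later row as an offset in seconds from that start. A minimal standard-library sketch of that conversion follows. The function name `ibi_offsets_to_timestamps` and the sample values are hypothetical, and the sketch rounds to the nearest millisecond where the diff uses `astype(int)`.

```python
import csv
from io import StringIO


def ibi_offsets_to_timestamps(ibi_csv_text):
    """Return (timestamp_ms, inter_beat_interval, timings) tuples.

    Hypothetical helper: assumes row 0 is "<start timestamp>, IBI" and each
    following row is "<offset in seconds>, <interval in seconds>".
    """
    rows = list(csv.reader(StringIO(ibi_csv_text)))
    if not rows:
        # Mirrors the empty-file early return in the diff
        return []
    start = float(rows[0][0])
    converted = []
    for offset, interval in rows[1:]:
        offset = float(offset)
        # Absolute time in milliseconds; the raw offset is kept as "timings"
        converted.append((int(round((start + offset) * 1000)), float(interval), offset))
    return converted


sample = "1600000000.0, IBI\n1.5,0.75\n2.25,0.8\n"
result = ibi_offsets_to_timestamps(sample)
```

Keeping the raw offset next to the absolute timestamp corresponds to the new `TIMINGS` column added to `format.yaml` and `rapids_columns.yaml`.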
--- a/src/data/streams/empatica_zip/format.yaml
+++ b/src/data/streams/empatica_zip/format.yaml
@ -50,6 +50,7 @@ EMPATICA_INTER_BEAT_INTERVAL:
    TIMESTAMP: timestamp
    DEVICE_ID: device_id
    INTER_BEAT_INTERVAL: inter_beat_interval
+    TIMINGS: timings
  MUTATION:
    COLUMN_MAPPINGS:
    SCRIPTS: # List any python or r scripts that mutate your raw data
--- a/src/data/streams/mutations/fitbit/parse_calories_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_calories_intraday_json.py
@ -18,7 +18,7 @@ def parseCaloriesData(calories_data):
            dataset = record["activities-calories-intraday"]["dataset"]
            for data in dataset:
                d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
-                d_datetime = datetime.combine(curr_date, d_time)
+                d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
                row_intraday = (device_id, data["level"], data["mets"], data["value"], d_datetime, 0)
                records_intraday.append(row_intraday)

--- a/src/data/streams/mutations/fitbit/parse_heartrate_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_heartrate_intraday_json.py
@ -32,13 +32,13 @@ def parseHeartrateZones(heartrate_data):
 def parseHeartrateIntradayData(records_intraday, dataset, device_id, curr_date, heartrate_zones_range):
    for data in dataset:
        d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
-        d_datetime = datetime.combine(curr_date, d_time)
+        d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
        d_hr =  data["value"]

        # Get heartrate zone by range: min <= heartrate < max
        d_hrzone = None
        for hrzone, hrrange in heartrate_zones_range.items():
-            if d_hr >= hrrange[0] and d_hr < hrrange[1]:
+            if d_hr >= hrrange[0] and d_hr <= hrrange[1]:
                d_hrzone = hrzone
                break
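
With the upper bound now inclusive, a heart rate equal to a zone's maximum is assigned to that zone, and because the loop stops at the first match, a value on a boundary shared by two zones goes to whichever zone is iterated first. A small sketch of that lookup follows. The function name `find_heartrate_zone` and the zone bounds are hypothetical illustration values, not taken from the Fitbit data.

```python
def find_heartrate_zone(heartrate, zones_range):
    """Return the first zone whose inclusive [min, max] range contains heartrate."""
    for zone, (low, high) in zones_range.items():
        if low <= heartrate <= high:
            return zone
    return None


# Hypothetical zone bounds where neighbouring zones share a boundary value
zones = {
    "outofrange": (30, 91),
    "fatburn": (91, 127),
    "cardio": (127, 154),
    "peak": (154, 220),
}

boundary_zone = find_heartrate_zone(91, zones)
top_zone = find_heartrate_zone(220, zones)
no_zone = find_heartrate_zone(10, zones)
```

Under the previous exclusive comparison, a reading equal to the maximum of the last zone would have been left without a zone.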

--- a/src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py
+++ b/src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py
@ -1,6 +1,5 @@
 import json
 import pandas as pd
-from datetime import datetime


 HR_SUMMARY_COLUMNS = ("device_id",
@ -55,7 +54,7 @@ def parseHeartrateData(heartrate_data):
    for record in heartrate_data.json_fitbit_column:
        record = json.loads(record)  # Parse text into JSON
        if "activities-heart" in record:
-            curr_date = datetime.strptime(record["activities-heart"][0]["dateTime"], "%Y-%m-%d")
+            curr_date = record["activities-heart"][0]["dateTime"] + " 00:00:00"

            record_summary = record["activities-heart"][0]
            row_summary = parseHeartrateSummaryData(record_summary, device_id, curr_date)
--- a/src/data/streams/mutations/fitbit/parse_sleep_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_sleep_intraday_json.py
@ -64,7 +64,7 @@ def parseOneRecordForV1(record, device_id, d_is_main_sleep, records_intraday, ty
        d_time = datetime.strptime(data["dateTime"], '%H:%M:%S').time()
        if is_before_midnight and d_time.hour == 0:
            curr_date = end_date
-        d_datetime = datetime.combine(curr_date, d_time)
+        d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")

        # API 1.2 stores original_level as strings, so we convert original_levels of API 1 to strings too
        # (1: "asleep", 2: "restless", 3: "awake")
@ -86,7 +86,7 @@ def parseOneRecordForV12(record, device_id, d_is_main_sleep, records_intraday, t

    if sleep_record_type == "classic":
        for data in record["levels"]["data"]:
-            d_datetime = dateutil.parser.parse(data["dateTime"])
+            d_datetime = data["dateTime"][:19].replace("T", " ")

            row_intraday = (device_id, type_episode_id, data["seconds"],
                data["level"], d_is_main_sleep, sleep_record_type,
@ -95,9 +95,10 @@ def parseOneRecordForV12(record, device_id, d_is_main_sleep, records_intraday, t
    else:
        # For sleep type "stages"
        for data in mergeLongAndShortData(record["levels"]):
+            d_datetime = data[0].strftime("%Y-%m-%d %H:%M:%S")
            row_intraday = (device_id, type_episode_id, 30,
                data[1], d_is_main_sleep, sleep_record_type,
-                data[0], 0)
+                d_datetime, 0)

            records_intraday.append(row_intraday)
    
--- a/src/data/streams/mutations/fitbit/parse_sleep_summary_json.py
+++ b/src/data/streams/mutations/fitbit/parse_sleep_summary_json.py
@ -1,8 +1,5 @@
-import json, yaml
+import json
 import pandas as pd
-import numpy as np
-from datetime import datetime, timedelta
-import dateutil.parser

 SLEEP_SUMMARY_COLUMNS = ("device_id", "efficiency",
                                "minutes_after_wakeup", "minutes_asleep", "minutes_awake", "minutes_to_fall_asleep", "minutes_in_bed",
@ -16,8 +13,8 @@ def parseOneSleepRecord(record, device_id, d_is_main_sleep, records_summary, epi
    
    sleep_record_type = episode_type

-    d_start_datetime = datetime.strptime(record["startTime"][:18], "%Y-%m-%dT%H:%M:%S")
-    d_end_datetime = datetime.strptime(record["endTime"][:18], "%Y-%m-%dT%H:%M:%S")
+    d_start_datetime = record["startTime"][:19].replace("T", " ")
+    d_end_datetime = record["endTime"][:19].replace("T", " ")
    # Summary data
    row_summary = (device_id, record["efficiency"],
                    record["minutesAfterWakeup"], record["minutesAsleep"], record["minutesAwake"], record["minutesToFallAsleep"], record["timeInBed"],
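
The change from `[:18]` to `[:19]` matters because an ISO-style value such as `2020-10-07T23:59:58.000` needs 19 characters to cover the full seconds field. The sketch below contrasts the two slices on an assumed sample value; `fitbit_iso_to_local_string` is a hypothetical name for the new one-line conversion.

```python
from datetime import datetime


def fitbit_iso_to_local_string(iso_value):
    """Keep 'YYYY-MM-DDTHH:MM:SS' (19 characters) and swap the 'T' for a space."""
    return iso_value[:19].replace("T", " ")


raw = "2020-10-07T23:59:58.000"  # assumed sample value

# Old behaviour: 18 characters cut the seconds to a single digit,
# which strptime still accepts for %S and parses as 5 seconds.
old_value = datetime.strptime(raw[:18], "%Y-%m-%dT%H:%M:%S")

# New behaviour: plain string handling, no datetime object involved.
new_value = fitbit_iso_to_local_string(raw)
```

Returning a plain string also keeps the summary parsers consistent with the intraday parsers in this diff, which now format their datetimes with `strftime("%Y-%m-%d %H:%M:%S")`.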
--- a/src/data/streams/mutations/fitbit/parse_steps_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_steps_intraday_json.py
@ -23,7 +23,7 @@ def parseStepsData(steps_data):
                dataset = record["activities-steps-intraday"]["dataset"]
                for data in dataset:
                    d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
-                    d_datetime = datetime.combine(curr_date, d_time)
+                    d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")

                    row_intraday = (device_id,
                        data["value"],
--- a/src/data/streams/mutations/fitbit/parse_steps_summary_json.py
+++ b/src/data/streams/mutations/fitbit/parse_steps_summary_json.py
@ -1,6 +1,5 @@
 import json
 import pandas as pd
-from datetime import datetime

 STEPS_COLUMNS = ("device_id", "steps", "local_date_time", "timestamp")

@ -16,7 +15,7 @@ def parseStepsData(steps_data):
    for record in steps_data.json_fitbit_column:
        record = json.loads(record)  # Parse text into JSON
        if "activities-steps" in record.keys():
-            curr_date = datetime.strptime(record["activities-steps"][0]["dateTime"], "%Y-%m-%d")
+            curr_date = record["activities-steps"][0]["dateTime"] + " 00:00:00"

            row_summary = (device_id,
                record["activities-steps"][0]["value"],
--- a/src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R
+++ b/src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R
@ -36,7 +36,8 @@ unify_ios_activity_recognition <- function(ios_gar){
                                         activities == "cycling" ~ "on_bicycle",
                                         activities == "walking" ~ "walking",
                                         activities == "running" ~ "running",
-                                         activities == "stationary" ~ "still"),
+                                         activities == "stationary" ~ "still",
+                                         activities == "unknown" ~ "unknown"),
               activity_type = case_when(activities == "automotive" ~ 0,
                                         activities == "cycling" ~ 1,
                                         activities == "walking" ~ 7,
--- a/src/data/streams/mutations/phone/aware/calls_ios_unification.R
+++ b/src/data/streams/mutations/phone/aware/calls_ios_unification.R
@ -39,7 +39,7 @@ unify_ios_calls <- function(ios_calls){
                        assigned_segments = first(assigned_segments))
        }
        else {
-            ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(call_duration),  timestamp = first(timestamp), device_id = first(device_id))
+            ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(as.numeric(call_duration)),  timestamp = first(timestamp), device_id = first(device_id))
        }
        ios_calls <- ios_calls %>% mutate(call_type = case_when(
            call_type_sequence == "1,2,4" | call_type_sequence == "2,1,4" ~ 1, # incoming
--- a/src/data/streams/mutations/phone/straw/app_add_name.R
+++ b/src/data/streams/mutations/phone/straw/app_add_name.R
@ -0,0 +1,8 @@
+source("renv/activate.R") # needed to use RAPIDS renv environment
+library(dplyr)
+
+main <- function(data, stream_parameters){
+    data <- data %>%
+      mutate(application_name = "hashed")
+    return(data)
+}
--- a/src/data/streams/mutations/phone/straw/app_add_name.py
+++ b/src/data/streams/mutations/phone/straw/app_add_name.py
@ -0,0 +1,5 @@
+import pandas as pd
+
+def main(data, stream_parameters):
+    data["application_name"] = "hashed"
+    return data
--- a/src/data/streams/rapids_columns.yaml
+++ b/src/data/streams/rapids_columns.yaml
@ -35,11 +35,8 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
  - DEVICE_ID
  - PACKAGE_NAME
  - APPLICATION_NAME
-  - TEXT
  - SOUND
  - VIBRATE
-  - DEFAULTS
-  - FLAGS

 PHONE_BATTERY:
  - TIMESTAMP
@ -70,6 +67,16 @@ PHONE_CONVERSATION:
  - DOUBLE_CONVO_START
  - DOUBLE_CONVO_END

+PHONE_ESM:
+  - TIMESTAMP
+  - DEVICE_ID
+  - ESM_STATUS
+  - ESM_USER_ANSWER
+  - ESM_JSON
+  - ESM_TRIGGER
+  - ESM_SESSION
+  - ESM_NOTIFICATION_ID
+
 PHONE_KEYBOARD:
  - TIMESTAMP
  - DEVICE_ID
@ -111,6 +118,11 @@ PHONE_SCREEN:
  - DEVICE_ID
  - SCREEN_STATUS

+PHONE_SPEECH:
+  - TIMESTAMP
+  - DEVICE_ID
+  - SPEECH_PROPORTION
+
 PHONE_WIFI_CONNECTED:
  - TIMESTAMP
  - DEVICE_ID
@ -220,6 +232,7 @@ EMPATICA_INTER_BEAT_INTERVAL:
  - TIMESTAMP
  - DEVICE_ID
  - INTER_BEAT_INTERVAL
+  - TIMINGS

 EMPATICA_TAGS:
  - TIMESTAMP
--- a/src/data/translate_usernames_into_participants_data.R
+++ b/src/data/translate_usernames_into_participants_data.R
@ -0,0 +1,62 @@
+source("renv/activate.R")
+source("src/data/streams/aware_postgresql/container.R")
+
+library(RPostgres)
+library(magrittr)
+library(tidyverse)
+library(lubridate)
+
+prepare_participants_file <- function() {
+
+  username_list_csv_location <- snakemake@input[["username_list"]]
+
+  data_configuration <- snakemake@params[["data_configuration"]]
+  participants_container <- snakemake@params[["participants_table"]]
+  device_id_container <- snakemake@params[["device_id_table"]]
+  start_end_date_container <- snakemake@params[["start_end_date_table"]]
+
+  output_data_file <- snakemake@output[["participants_file"]]
+
+  platform <- "android"
+  pid_format <- "p%03d"
+  datetime_format <- "%Y-%m-%d %H:%M:%S"
+
+  participant_data <- read_csv(username_list_csv_location, col_types = "cc", progress = FALSE)
+  usernames <- participant_data$label
+
+  participant_ids <- pull_participants_ids(data_configuration, usernames, participants_container)
+  participant_data %<>%
+    left_join(participant_ids, by = c("label" = "username")) %>%
+    rename(participant_id = id)
+
+  device_ids <- pull_participants_device_ids(data_configuration, participant_data$participant_id, device_id_container)
+  device_ids %<>%
+    filter(device_id != "") %>%
+    group_by(participant_id) %>%
+    summarise(device_ids = list(unique(device_id)))
+  participant_data %<>%
+    left_join(device_ids, by = "participant_id")
+
+  start_end_datetimes <- pull_participants_start_end_dates(data_configuration, participant_data$participant_id, start_end_date_container)
+  participant_data %<>%
+    left_join(start_end_datetimes, by = "participant_id")
+
+  participant_data %<>%
+    mutate(
+      pid = sprintf(pid_format, participant_id),
+      start_date = strftime(datetime_start, format = datetime_format, tz = "UTC", usetz = FALSE), # TODO Check what timezone is expected
+      end_date = strftime(datetime_end, format = datetime_format, tz = "UTC", usetz = FALSE),
+      device_id = map_chr(device_ids, str_c, collapse = ";"),
+      number_of_devices = map_int(device_ids, length),
+      fitbit_id = ""
+    ) %>%
+    rowwise() %>%
+    mutate(platform = str_c(replicate(number_of_devices, platform), collapse = ";")) %>%
+    ungroup() %>%
+    arrange(pid) %>%
+    select(pid, label, start_date, end_date, empatica_id, device_id, platform, fitbit_id)
+
+  write_csv(participant_data, output_data_file)
+}
+
+prepare_participants_file()
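
The final `mutate` step above builds three derived fields per participant: a zero-padded `pid`, a `;`-separated list of device ids, and a `platform` string repeated once per device. A rough Python equivalent of just that step is sketched below. The function name `build_participant_fields` and the sample device ids are hypothetical.

```python
def build_participant_fields(participant_id, device_ids, platform="android"):
    """Derive pid, device_id and platform the way the R mutate step does."""
    return {
        # "p%03d" pads the numeric id to three digits, e.g. 7 -> "p007"
        "pid": "p%03d" % participant_id,
        # All device ids of one participant are joined with ";"
        "device_id": ";".join(device_ids),
        # The platform is repeated once per device, also joined with ";"
        "platform": ";".join([platform] * len(device_ids)),
    }


fields = build_participant_fields(7, ["device-a", "device-b"])
```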
--- a/src/features/init.py
+++ b/src/features/init.py
--- a/src/features/all_cleaning_individual/rapids/main.R
+++ b/src/features/all_cleaning_individual/rapids/main.R
@ -0,0 +1,89 @@
+source("renv/activate.R")
+library(tidyr)
+library("dplyr", warn.conflicts = F)
+library(tidyverse)
+library(caret)
+library(corrr)
+
+rapids_cleaning <- function(sensor_data_files, provider){
+
+    clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
+    impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
+    cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
+    drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
+    rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
+    data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
+    data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
+    data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
+    drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
+
+    # Impute selected event features
+    if(as.logical(impute_selected_event_features$COMPUTE)){
+        if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
+            stop("Error: the RAPIDS provider imputes the selected event features based on the phone_data_yield_rapids_ratiovalidyieldedminutes column. Please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
+        }
+        column_names <- colnames(clean_features)
+        selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
+        selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
+        selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
+        selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
+        selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
+        selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
+        selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
+        
+        selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
+        clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
+    }
+    
+    # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
+    if(!data_yield_column %in% colnames(clean_features)){
+        stop(paste0("Error: the RAPIDS provider cleans data based on the ", data_yield_column, " column. Please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
+    }
+    if (data_yield_ratio_threshold > 0) {
+        clean_features <- clean_features %>%
+            filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
+    }
+
+    # Drop columns with a percentage of NA values above cols_nan_threshold
+    if(nrow(clean_features))
+        clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
+
+    # Drop columns with zero variance
+    if(drop_zero_variance_columns)
+        clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
+
+    # Drop highly correlated features
+    if(as.logical(drop_highly_correlated_features$COMPUTE)){
+        
+        min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
+        corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
+
+        features_for_corr <- clean_features %>% 
+            select_if(is.numeric) %>% 
+            select_if(sapply(., n_distinct, na.rm = T) > 1)
+
+        valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
+
+        if((nrow(features_for_corr) != 0) && (ncol(features_for_corr) != 0)){
+
+            highly_correlated_features <- features_for_corr %>% 
+                correlate(use = "pairwise.complete.obs", method = "spearman") %>% 
+                column_to_rownames(., var = "term") %>% 
+                as.matrix() %>% 
+                replace(!valid_pairs | is.na(.), 0) %>% 
+                findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
+
+            clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
+        
+        }
+    }
+
+    # Drop rows with a percentage of NA values above rows_nan_threshold
+    clean_features <- clean_features %>% 
+        mutate(percentage_na =  rowSums(is.na(.)) / ncol(.)) %>% 
+        filter(percentage_na <= rows_nan_threshold) %>% 
+        select(-percentage_na)
+
+    return(clean_features)
+}
+
--- a/src/features/all_cleaning_individual/straw/init.py
+++ b/src/features/all_cleaning_individual/straw/init.py
--- a/src/features/all_cleaning_individual/straw/main.py
+++ b/src/features/all_cleaning_individual/straw/main.py
@ -0,0 +1,180 @@
+import pandas as pd
+import numpy as np
+import math, sys, random
+import yaml
+
+from sklearn.impute import KNNImputer
+from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+sys.path.append('/rapids/')
+from src.features import empatica_data_yield as edy
+
+pd.set_option('display.max_columns', 20)
+
+def straw_cleaning(sensor_data_files, provider):
+    
+    features = pd.read_csv(sensor_data_files["sensor_data"][0])
+    
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.load(stream, Loader=yaml.FullLoader)
+
+    excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
+
+    # (1) FILTER OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
+    if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
+        target = config['PARAMS_FOR_ANALYSIS']['TARGET']['LABEL'] # get target label from config
+        if 'phone_esm_straw_' + target in features:
+            features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
+        else:
+            return features
+
+    # (2.1) QUALITY CHECK (DATA YIELD COLUMN) deletes the rows where E4 or phone data is low quality
+    phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
+    phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
+
+    if features.empty:
+        return features
+
+    features = edy.calculate_empatica_data_yield(features)
+
+    if phone_data_yield_column not in features.columns and "empatica_data_yield" not in features.columns:
+        raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
+        
+    # Drop rows where phone data yield is less than the given threshold
+    if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
+        features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+    
+    # Drop rows where empatica data yield is less than the given threshold
+    if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
+        features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+
+    if features.empty:
+        return features
+    
+    # (2.2) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
+    min_count =  math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
+    features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
+
+    # (3) REMOVE COLS WHOSE NAN COUNT REACHES THE THRESHOLD (the strict '<' below also drops all-NaN columns; use '<=' if all-NaN columns must be preserved)
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+
+    # (4) CONTEXTUAL IMPUTATION
+
+    # Impute selected phone features with a high number
+    impute_w_hn = [col for col in features.columns if \
+        "timeoffirstuse" in col or
+        "timeoflastuse" in col or
+        "timefirstcall" in col or
+        "timelastcall" in col or
+        "firstuseafter" in col or
+        "timefirstmessages" in col or
+        "timelastmessages" in col]
+    features[impute_w_hn] = features[impute_w_hn].fillna(1500)
+
+
+    # Impute special case (mostcommonactivity) and (homelabel)
+    impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
+    features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
+    features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
+    features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
+
+
+    # Impute selected phone features with 0
+    impute_zero = [col for col in features if \
+        col.startswith('phone_applications_foreground_rapids_') or
+        col.startswith('phone_battery_rapids_') or
+        col.startswith('phone_bluetooth_rapids_') or
+        col.startswith('phone_light_rapids_') or
+        col.startswith('phone_calls_rapids_') or
+        col.startswith('phone_messages_rapids_') or
+        col.startswith('phone_screen_rapids_') or
+        col.startswith('phone_wifi_visible')]
+
+    features[impute_zero+list(esm_cols.columns)] = features[impute_zero+list(esm_cols.columns)].fillna(0)
+
+    ## (5) STANDARDIZATION 
+    if provider["STANDARDIZATION"]:
+        features.loc[:, ~features.columns.isin(excluded_columns)] = StandardScaler().fit_transform(features.loc[:, ~features.columns.isin(excluded_columns)])
+
+    # (6) IMPUTATION: IMPUTE DATA WITH KNN METHOD
+    impute_cols = [col for col in features.columns if col not in excluded_columns]
+    features.reset_index(drop=True, inplace=True)
+    features[impute_cols] = impute(features[impute_cols], method="knn")
+
+    # (7) REMOVE COLS WHERE VARIANCE IS 0
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
+
+    if provider["COLS_VAR_THRESHOLD"]:
+        features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
+
+    fe5 = features.copy()
+
+    # (8) DROP HIGHLY CORRELATED FEATURES
+    drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
+    if drop_corr_features["COMPUTE"] and features.shape[0]: # If small amount of segments (rows) is present, do not execute correlation check
+        
+        numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
+
+        # Remove columns where NaN count threshold is passed
+        valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
+
+        corr_matrix = valid_features.corr().abs()
+        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
+        to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
+
+        features.drop(to_drop, axis=1, inplace=True)
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+
+    # (9) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
+    if features.isna().any().any():
+        raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
+
+    return features
+
+
+def k_nearest(df):
+    pd.set_option('display.max_columns', None)
+    imputer = KNNImputer(n_neighbors=3)
+    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
+
+
+def impute(df, method='zero'):
+
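+    return {  # lambdas defer evaluation so that only the requested method is computed
+        'zero': lambda: df.fillna(0),
+        'high_number': lambda: df.fillna(1500),
+        'mean': lambda: df.fillna(df.mean()),
+        'median': lambda: df.fillna(df.median()),
+        'knn': lambda: k_nearest(df)
+    }[method]()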
+
+
+def graph_bf_af(features, phase_name, plt_flag=False):
+    if plt_flag:
+        sns.set(rc={"figure.figsize":(16, 8)})
+        sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
+        plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
+
+    print(f"\n-------------{phase_name}-------------")
+    print("Rows number:", features.shape[0])
+    print("Columns number:", len(features.columns))
+    print("---------------------------------------------\n")
--- a/src/features/all_cleaning_overall/rapids/main.R
+++ b/src/features/all_cleaning_overall/rapids/main.R
@ -0,0 +1,89 @@
+source("renv/activate.R")
+library(tidyr)
+library("dplyr", warn.conflicts = F)
+library(tidyverse)
+library(caret)
+library(corrr)
+
+rapids_cleaning <- function(sensor_data_files, provider){
+
+    clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
+    impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
+    cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
+    drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
+    rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
+    data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
+    data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
+    data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
+    drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
+
+    # Impute selected event features
+    if(as.logical(impute_selected_event_features$COMPUTE)){
+        if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
+            stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
+        }
+        column_names <- colnames(clean_features)
+        selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
+        selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
+        selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
+        selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
+        selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
+        selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
+        selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
+        
+        selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
+        clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
+    }
+    
+    # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
+    if(!data_yield_column %in% colnames(clean_features)){
+        stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
+    }
+    if (data_yield_ratio_threshold > 0) {
+        clean_features <- clean_features %>% 
+        filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
+    }
+
+    # Drop columns with a percentage of NA values above cols_nan_threshold
+    if(nrow(clean_features))
+        clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
+
+    # Drop columns with zero variance
+    if(drop_zero_variance_columns)
+        clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
+
+    # Drop highly correlated features
+    if(as.logical(drop_highly_correlated_features$COMPUTE)){
+        
+        min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
+        corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
+
+        features_for_corr <- clean_features %>% 
+            select_if(is.numeric) %>% 
+            select_if(sapply(., n_distinct, na.rm = T) > 1)
+
+        valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
+
+        if((nrow(features_for_corr) != 0) && (ncol(features_for_corr) != 0)){
+
+            highly_correlated_features <- features_for_corr %>% 
+                correlate(use = "pairwise.complete.obs", method = "spearman") %>% 
+                column_to_rownames(., var = "term") %>% 
+                as.matrix() %>% 
+                replace(!valid_pairs | is.na(.), 0) %>% 
+                findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
+
+            clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
+        
+        }
+    }
+
+    # Drop rows with a percentage of NA values above rows_nan_threshold
+    clean_features <- clean_features %>% 
+        mutate(percentage_na =  rowSums(is.na(.)) / ncol(.)) %>% 
+        filter(percentage_na <= rows_nan_threshold) %>% 
+        select(-percentage_na)
+
+    return(clean_features)
+}
+
--- a/src/features/all_cleaning_overall/straw/init.py
+++ b/src/features/all_cleaning_overall/straw/init.py
--- a/src/features/all_cleaning_overall/straw/main.py
+++ b/src/features/all_cleaning_overall/straw/main.py
@ -0,0 +1,275 @@
+import pandas as pd
+import numpy as np
+import math, sys, random, warnings, yaml
+
+from sklearn.impute import KNNImputer
+from sklearn.preprocessing import StandardScaler, minmax_scale 
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+sys.path.append('/rapids/')
+from src.features import empatica_data_yield as edy
+
+def straw_cleaning(sensor_data_files, provider, target):
+
+    features = pd.read_csv(sensor_data_files["sensor_data"][0])
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.load(stream, Loader=yaml.FullLoader)
+
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
+
+    graph_bf_af(features, "1target_rows_before")
+
+    # (1.0) OVERRIDE STRESSFULNESS EVENT TARGETS IF ERS SEGMENTING_METHOD IS "STRESS_EVENT"
+    if config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] == "stress_event": 
+    
+        stress_events_targets = pd.read_csv("data/external/stress_event_targets.csv")   
+
+        if "appraisal_stressfulness_event_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
+            features.drop(columns=['phone_esm_straw_appraisal_stressfulness_event_mean'], inplace=True)
+            features = features.merge(stress_events_targets[["label", "appraisal_stressfulness_event"]] \
+                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
+                        .rename(columns={'appraisal_stressfulness_event': 'phone_esm_straw_appraisal_stressfulness_event_mean'})
+
+        if "appraisal_threat_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
+            features.drop(columns=['phone_esm_straw_appraisal_threat_mean'], inplace=True)
+            features = features.merge(stress_events_targets[["label", "appraisal_threat"]] \
+                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
+                        .rename(columns={'appraisal_threat': 'phone_esm_straw_appraisal_threat_mean'})
+
+        if "appraisal_challenge_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
+            features.drop(columns=['phone_esm_straw_appraisal_challenge_mean'], inplace=True)
+            features = features.merge(stress_events_targets[["label", "appraisal_challenge"]] \
+                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
+                        .rename(columns={'appraisal_challenge': 'phone_esm_straw_appraisal_challenge_mean'})
+
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    # (1.1) FILTER OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
+    if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
+        features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
+    
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+    graph_bf_af(features, "2target_rows_after")
+
+    # (2) QUALITY CHECK (DATA YIELD COLUMN) drops the rows where E4 or phone data is low quality
+    phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
+    phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
+
+    features = edy.calculate_empatica_data_yield(features)
+
+    if phone_data_yield_column not in features.columns and "empatica_data_yield" not in features.columns:
+        raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
+
+    hist = features[["empatica_data_yield", phone_data_yield_column]].hist()
+    plt.savefig('phone_E4_histogram.png', bbox_inches='tight')
+
+    # Drop rows where phone data yield is less than the given threshold
+    if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
+        hist = features[phone_data_yield_column].hist(bins=5)
+        plt.close()
+        features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+
+    # Drop rows where empatica data yield is less than the given threshold
+    if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
+        features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+    graph_bf_af(features, "3data_yield_drop_rows")
+
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+
+    # (3) CONTEXTUAL IMPUTATION
+
+    # Impute selected phone features with a high number
+    impute_w_hn = [col for col in features.columns if \
+        "timeoffirstuse" in col or
+        "timeoflastuse" in col or
+        "timefirstcall" in col or
+        "timelastcall" in col or
+        "firstuseafter" in col or
+        "timefirstmessages" in col or
+        "timelastmessages" in col]
+    features[impute_w_hn] = features[impute_w_hn].fillna(1500)
+
+    # Impute special case (mostcommonactivity) and (homelabel)
+    impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
+    features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
+    features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
+    features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
+
+    # Impute location features
+    impute_locations = [col for col in features \
+        if col.startswith('phone_locations_doryab_') and
+        'radiusgyration' not in col    
+    ]
+
+    # Impute selected phone, location, and esm features with 0
+    impute_zero = [col for col in features if \
+        col.startswith('phone_applications_foreground_rapids_') or
+        col.startswith('phone_activity_recognition_') or
+        col.startswith('phone_battery_rapids_') or
+        col.startswith('phone_bluetooth_rapids_') or
+        col.startswith('phone_light_rapids_') or
+        col.startswith('phone_calls_rapids_') or
+        col.startswith('phone_messages_rapids_') or
+        col.startswith('phone_screen_rapids_') or
+        col.startswith('phone_bluetooth_doryab_') or
+        col.startswith('phone_wifi_visible')
+        ]
+
+    features[impute_zero+impute_locations+list(esm_cols.columns)] = features[impute_zero+impute_locations+list(esm_cols.columns)].fillna(0)
+
+    pd.set_option('display.max_rows', None)
+
+    graph_bf_af(features, "4context_imp")
+ 
+    # (4) REMOVE COLS WHOSE NAN COUNT REACHES THE THRESHOLD (the strict '<' below also drops all-NaN columns; use '<=' if all-NaN columns must be preserved)
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
+
+    graph_bf_af(features, "5too_much_nans_cols")
+    # (5) REMOVE COLS WHERE VARIANCE IS 0
+
+    if provider["COLS_VAR_THRESHOLD"]:
+        features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
+
+    graph_bf_af(features, "6variance_drop")
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+    
+    # (6) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
+    min_count =  math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
+    features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
+
+    graph_bf_af(features, "7too_much_nans_rows")
+
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+    # (7) STANDARDIZATION
+    if provider["STANDARDIZATION"]:
+        nominal_cols = [col for col in features.columns if "mostcommonactivity" in col or "homelabel" in col] # Excluded nominal features
+        # Expected warning within this code block
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=RuntimeWarning)
+            if provider["TARGET_STANDARDIZATION"]:
+                features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols)] = \
+                    features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols)].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
+            else:
+                features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols + ['phone_esm_straw_' + target])] = \
+                    features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols + ['phone_esm_straw_' + target])].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
+
+    graph_bf_af(features, "8standardization")
+
+    # (8) IMPUTATION: IMPUTE DATA WITH KNN METHOD
+    features.reset_index(drop=True, inplace=True)
+    impute_cols = [col for col in features.columns if col not in excluded_columns and col != "pid"]
+
+    features[impute_cols] = impute(features[impute_cols], method="knn")
+
+    graph_bf_af(features, "9knn_after")
+
+
+    # (9) DROP HIGHLY CORRELATED FEATURES
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
+
+    drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
+    if drop_corr_features["COMPUTE"] and features.shape[0] > 5: # If small amount of segments (rows) is present, do not execute correlation check
+        
+        numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
+
+        # Remove columns where NaN count threshold is passed
+        valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
+
+        corr_matrix = valid_features.corr().abs()
+        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
+        to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
+
+        # sns.heatmap(corr_matrix, cmap="YlGnBu")
+        # plt.savefig(f'correlation_matrix.png', bbox_inches='tight')
+        # plt.close()
+
+        # s = corr_matrix.unstack()
+        # so = s.sort_values(ascending=False)
+
+        # pd.set_option('display.max_rows', None)
+        # sorted_upper = upper.unstack().sort_values(ascending=False)
+        # print(sorted_upper[sorted_upper > drop_corr_features["CORR_THRESHOLD"]])
+
+        features.drop(to_drop, axis=1, inplace=True)
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+
+    graph_bf_af(features, "10correlation_drop")
+
+    # Transform categorical columns to category dtype
+
+    cat1 = [col for col in features.columns if "mostcommonactivity" in col]
+    if cat1: # Transform columns to category dtype (mostcommonactivity)
+        features[cat1] = features[cat1].astype(int).astype('category')
+
+    cat2 = [col for col in features.columns if "homelabel" in col]
+    if cat2: # Transform columns to category dtype (homelabel)
+        features[cat2] = features[cat2].astype(int).astype('category')
+
+    # (10) DROP ALL WINDOW RELATED COLUMNS
+    win_count_cols = [col for col in features if "SO_windowsCount" in col]
+    if win_count_cols:
+        features.drop(columns=win_count_cols, inplace=True)
+
+    # (11) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
+    if features.isna().any().any():
+        raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
+
+
+    return features
+
+
+def k_nearest(df):
+    imputer = KNNImputer(n_neighbors=3)
+    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
+
+
+def impute(df, method='zero'):
+
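+    return {  # lambdas defer evaluation so that only the requested method is computed
+        'zero': lambda: df.fillna(0),
+        'high_number': lambda: df.fillna(1500),
+        'mean': lambda: df.fillna(df.mean()),
+        'median': lambda: df.fillna(df.median()),
+        'knn': lambda: k_nearest(df)
+    }[method]()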
+
+
+def graph_bf_af(features, phase_name, plt_flag=False):
+    if plt_flag:
+        sns.set(rc={"figure.figsize":(16, 8)})
+        sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
+        plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
+
+    print(f"\n-------------{phase_name}-------------")
+    print("Rows number:", features.shape[0])
+    print("Columns number:", len(features.columns))
+    print("NaN values:", features.isna().sum().sum())
+    print("---------------------------------------------\n")
--- a/src/features/cr_features_helper_methods.py
+++ b/src/features/cr_features_helper_methods.py
@ -0,0 +1,59 @@
+import pandas as pd
+import numpy as np
+import math as m
+
+import sys
+
+def extract_second_order_features(intraday_features, so_features_names, prefix=""):
+    
+    if prefix:
+        groupby_cols = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
+    else:
+        groupby_cols = ['local_segment']
+
+    if not intraday_features.empty:
+        so_features = pd.DataFrame()
+        #print(intraday_features.drop("level_1", axis=1).groupby(["local_segment"]).nsmallest())
+        if "mean" in so_features_names:
+            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).mean(numeric_only=True).add_suffix("_SO_mean")], axis=1)
+        
+        if "median" in so_features_names:
+            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).median(numeric_only=True).add_suffix("_SO_median")], axis=1)
+        
+        if "sd" in so_features_names:
+            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std(numeric_only=True).fillna(0).add_suffix("_SO_sd")], axis=1)
+        
+        if "nlargest" in so_features_names: # largest 5 -- maybe there is a faster groupby solution?
+            for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
+                so_features[column+"_SO_nlargest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nlargest(5).mean())
+        
+        if "nsmallest" in so_features_names: # smallest 5 -- maybe there is a faster groupby solution?
+            for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
+                so_features[column+"_SO_nsmallest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nsmallest(5).mean())
+        
+        if "count_windows" in so_features_names:
+            so_features["SO_windowsCount"] = intraday_features.groupby(groupby_cols).count()[prefix+"level_1"]
+
+        # numPeaksNonZero specialized for EDA sensor
+        if "eda_num_peaks_non_zero" in so_features_names and prefix+"numPeaks" in intraday_features.columns:
+            so_features[prefix+"SO_numPeaksNonZero"] = intraday_features.groupby(groupby_cols)[prefix+"numPeaks"].apply(lambda x: (x!=0).sum())
+
+        # numWindowsNonNaN specialized for BVP and IBI sensors
+        if "hrv_num_windows_non_nan" in so_features_names and prefix+"meanHr" in intraday_features.columns:
+            so_features[prefix+"SO_numWindowsNonNaN"] = intraday_features.groupby(groupby_cols)[prefix+"meanHr"].apply(lambda x: (~np.isnan(x)).sum())
+            
+        so_features.reset_index(inplace=True)
+
+    else:
+        so_features = pd.DataFrame(columns=groupby_cols)
+
+    return so_features
+
+def get_sample_rate(data): # TODO: get the sample rate information from the file's metadata
+    try:
+        timestamps_diff = data['timestamp'].diff().dropna().mean()
+        print("Timestamp diff:", timestamps_diff)
+    except Exception as e:
+        raise Exception("Error occurred while trying to get the mean sample rate from the data.") from e
+
+    return m.ceil(1000/timestamps_diff)
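
Note on `get_sample_rate`: the division happens outside the `try` block, so a single-row input (mean difference is NaN) or repeated timestamps (mean difference is 0) would fail with an unrelated error. A minimal sketch of a guarded variant follows, assuming timestamps are in milliseconds as the `1000/` factor suggests; the name `estimate_sample_rate` is hypothetical and not part of this change.

```python
import math

import pandas as pd


def estimate_sample_rate(data: pd.DataFrame) -> int:
    """Estimate the sampling rate in Hz from millisecond timestamps (hypothetical helper)."""
    mean_diff_ms = data["timestamp"].diff().dropna().mean()
    # The mean of an empty series is NaN; identical timestamps give 0.
    if pd.isna(mean_diff_ms) or mean_diff_ms <= 0:
        raise ValueError("Cannot estimate the sample rate: need at least two increasing timestamps.")
    return math.ceil(1000 / mean_diff_ms)


# A 32 Hz signal has one sample every 31.25 ms.
samples = pd.DataFrame({"timestamp": [0.0, 31.25, 62.5, 93.75]})
rate = estimate_sample_rate(samples)
print(rate)
```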
--- a/src/features/empatica_accelerometer/cr/main.py
+++ b/src/features/empatica_accelerometer/cr/main.py
@ -0,0 +1,75 @@
+import pandas as pd
+from scipy.stats import entropy
+
+from cr_features.helper_functions import convert_to2d, accelerometer_features, frequency_features
+from cr_features.calculate_features_old import calculateFeatures
+from cr_features.calculate_features import calculate_features
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+def extract_acc_features_from_intraday_data(acc_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    acc_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not acc_intraday_data.empty:   
+        sample_rate = 32
+     
+        acc_intraday_data = filter_data_by_segment(acc_intraday_data, time_segment)
+
+        if not acc_intraday_data.empty:
+
+            acc_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module
+            if window_length is None:
+                acc_intraday_features = \
+                    acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
+                    convert_to2d(x['double_values_0'], x.shape[0]), \
+                    convert_to2d(x['double_values_1'], x.shape[0]), \
+                    convert_to2d(x['double_values_2'], x.shape[0]), \
+                    fs=sample_rate, feature_names=features, show_progress=False)) 
+            else:
+                acc_intraday_features = \
+                    acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
+                    convert_to2d(x['double_values_0'], window_length*sample_rate), \
+                    convert_to2d(x['double_values_1'], window_length*sample_rate), \
+                    convert_to2d(x['double_values_2'], window_length*sample_rate), \
+                    fs=sample_rate, feature_names=features, show_progress=False)) 
+
+            acc_intraday_features.reset_index(inplace=True)
+
+    return acc_intraday_features
+
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    
+    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'double_values_0': 'float64',
+                    'double_values_1': 'float64', 'double_values_2': 'float64', 'local_date_time': 'str', 'local_date': "str",
+                    'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
+    acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)    
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # name of the features this function can compute
+    base_intraday_features_names = accelerometer_features + frequency_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    acc_intraday_features = extract_acc_features_from_intraday_data(acc_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+
+    if calc_windows:
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        acc_second_order_features = extract_second_order_features(acc_intraday_features, so_features_names)
+        return acc_intraday_features, acc_second_order_features
+
+    return acc_intraday_features
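
Note on windowing: each provider passes `window_length*sample_rate` to `convert_to2d`, so one window holds that many samples. The sketch below illustrates the idea with a hypothetical `to_windows` helper that drops an incomplete trailing window. This is an assumption for illustration only; the real `convert_to2d` in `cr_features` may treat the remainder differently (for example by padding).

```python
import numpy as np


def to_windows(signal, samples_per_window):
    """Reshape a 1D signal into rows of fixed-size windows (hypothetical helper).

    Assumption: an incomplete trailing window is dropped.
    """
    values = np.asarray(signal, dtype=float)
    n_windows = len(values) // samples_per_window
    return values[: n_windows * samples_per_window].reshape(n_windows, samples_per_window)


sample_rate = 32      # Hz, as hard-coded for the accelerometer above
window_length = 2     # seconds, illustrative value
signal = np.arange(150)

windows = to_windows(signal, window_length * sample_rate)
print(windows.shape)
```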
--- a/src/features/empatica_blood_volume_pulse/cr/main.py
+++ b/src/features/empatica_blood_volume_pulse/cr/main.py
@ -0,0 +1,73 @@
+import pandas as pd
+from sklearn.preprocessing import StandardScaler
+
+from cr_features.helper_functions import convert_to2d, hrv_features
+from cr_features.hrv import extract_hrv_features_2d_wrapper
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+# pd.set_option('display.max_rows', 1000)
+pd.set_option('display.max_columns', None)
+
+def extract_bvp_features_from_intraday_data(bvp_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    bvp_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not bvp_intraday_data.empty:
+        sample_rate = 64
+     
+        bvp_intraday_data = filter_data_by_segment(bvp_intraday_data, time_segment)
+
+        if not bvp_intraday_data.empty:
+
+            bvp_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module
+            if window_length is None:
+                bvp_intraday_features = \
+                    bvp_intraday_data.groupby('local_segment').apply(\
+                    lambda x: 
+                        extract_hrv_features_2d_wrapper(
+                            convert_to2d(x['blood_volume_pulse'], x.shape[0]), 
+                            sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
+
+            else:
+                bvp_intraday_features = \
+                    bvp_intraday_data.groupby('local_segment').apply(\
+                    lambda x: 
+                        extract_hrv_features_2d_wrapper(
+                            convert_to2d(x['blood_volume_pulse'], window_length*sample_rate), 
+                            sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features)) 
+
+            bvp_intraday_features.reset_index(inplace=True)
+
+    return bvp_intraday_features
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    bvp_intraday_data = pd.read_csv(sensor_data_files["sensor_data"])
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # name of the features this function can compute
+    base_intraday_features_names = hrv_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    bvp_intraday_features = extract_bvp_features_from_intraday_data(bvp_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+                                                                
+    if calc_windows:
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        bvp_second_order_features = extract_second_order_features(bvp_intraday_features, so_features_names)
+        return bvp_intraday_features, bvp_second_order_features
+
+    return bvp_intraday_features
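
Note on second-order features: `extract_second_order_features` computes `nlargest` and `nsmallest` with one groupby per column. A minimal sketch of a single-pass alternative is shown below. The function name `mean_of_n_largest` and the toy data are hypothetical, and no speed-up has been measured; it only removes the explicit column loop.

```python
import pandas as pd


def mean_of_n_largest(windows: pd.DataFrame, group_col: str, n: int = 5) -> pd.DataFrame:
    """Mean of the n largest window values per group, for every feature column."""
    return (
        windows.groupby(group_col)
        .agg(lambda s: s.nlargest(n).mean())  # applied to each column within each group
        .add_suffix("_SO_nlargest")
        .reset_index()
    )


windows = pd.DataFrame({
    "local_segment": ["a", "a", "a", "b", "b"],
    "meanHr": [60.0, 80.0, 70.0, 90.0, 100.0],
})
result = mean_of_n_largest(windows, "local_segment", n=2)
print(result)
```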
--- a/src/features/empatica_data_yield.py
+++ b/src/features/empatica_data_yield.py
@ -0,0 +1,32 @@
+import pandas as pd
+import numpy as np
+from datetime import datetime
+
+import sys, yaml
+
+def calculate_empatica_data_yield(features): # TODO
+
+    # Get time segment duration in seconds from all segments in features dataframe
+    datetime_start =  pd.to_datetime(features['local_segment_start_datetime'], format='%Y-%m-%d %H:%M:%S')
+    datetime_end = pd.to_datetime(features['local_segment_end_datetime'], format='%Y-%m-%d %H:%M:%S')
+    tseg_duration = (datetime_end - datetime_start).dt.total_seconds()
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.safe_load(stream)
+        
+    sensors = ["EMPATICA_ACCELEROMETER", "EMPATICA_TEMPERATURE", "EMPATICA_ELECTRODERMAL_ACTIVITY", "EMPATICA_INTER_BEAT_INTERVAL"]
+    for sensor in sensors:
+        features[f"{sensor.lower()}_data_yield"] = \
+            (features[f"{sensor.lower()}_cr_SO_windowsCount"] * config[sensor]["PROVIDERS"]["CR"]["WINDOWS"]["WINDOW_LENGTH"]) / tseg_duration \
+            if f'{sensor.lower()}_cr_SO_windowsCount' in features else 0
+
+    empatica_data_yield_cols = [sensor.lower() + "_data_yield" for sensor in sensors]
+    pd.set_option('display.max_rows', None)
+
+    # Cap yields above 1 at 1 (can happen when windows are not completely filled); NaN values are left as they are
+    features[empatica_data_yield_cols] = features[empatica_data_yield_cols].clip(upper=1)
+    
+    features["empatica_data_yield"] = features[empatica_data_yield_cols].mean(axis=1, numeric_only=True).fillna(0)
+    features.drop(empatica_data_yield_cols, axis=1, inplace=True) # Per-sensor yields are dropped on the assumption that no further operations on them (e.g., a weighted average) are needed later
+
+    return features
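
Note on the yield formula: per sensor, the yield is the number of computed windows times the window length, divided by the segment duration, capped at 1. A small worked sketch is given below; the helper name `data_yield` and the numbers are illustrative assumptions, not values from `config.yaml`.

```python
import numpy as np
import pandas as pd


def data_yield(windows_count: pd.Series, window_length_s: float, segment_duration_s: pd.Series) -> pd.Series:
    """Fraction of a time segment covered by feature windows, capped at 1."""
    ratio = windows_count * window_length_s / segment_duration_s
    return ratio.clip(upper=1)  # NaN stays NaN


counts = pd.Series([30, 70, np.nan])           # windows per segment
durations = pd.Series([1800.0, 1800.0, 1800.0])  # 30-minute segments, in seconds

yields = data_yield(counts, 30, durations)
print(yields)
```

With 30-second windows, 30 windows cover half of a 30-minute segment, while 70 windows would exceed it and are capped.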
--- a/src/features/empatica_electrodermal_activity/cr/main.py
+++ b/src/features/empatica_electrodermal_activity/cr/main.py
@ -0,0 +1,82 @@
+import pandas as pd
+import numpy as np
+from scipy.stats import entropy
+
+from cr_features.helper_functions import convert_to2d, gsr_features
+from cr_features.calculate_features import calculate_features
+from cr_features.gsr import extractGsrFeatures2D
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+#pd.set_option('display.max_columns', None)
+#pd.set_option('display.max_rows', None)
+#np.seterr(invalid='ignore')
+
+
+def extract_eda_features_from_intraday_data(eda_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    eda_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not eda_intraday_data.empty:   
+        sample_rate = 4  
+     
+        eda_intraday_data = filter_data_by_segment(eda_intraday_data, time_segment)
+
+        if not eda_intraday_data.empty: 
+
+            eda_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module 
+            if window_length is None:
+                eda_intraday_features = \
+                    eda_intraday_data.groupby('local_segment').apply(\
+                    lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], x.shape[0]), sampleRate=sample_rate, featureNames=features,
+                    threshold=.01, offset=1, riseTime=5, decayTime=15)) 
+            else:
+                eda_intraday_features = \
+                    eda_intraday_data.groupby('local_segment').apply(\
+                    lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], window_length*sample_rate), sampleRate=sample_rate, featureNames=features,
+                    threshold=.01, offset=1, riseTime=5, decayTime=15)) 
+
+            eda_intraday_features.reset_index(inplace=True)
+
+    return eda_intraday_features
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+
+    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'electrodermal_activity': 'float64', 'local_date_time': 'str', 
+                  'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
+
+    eda_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # name of the features this function can compute
+    base_intraday_features_names = gsr_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    eda_intraday_features = extract_eda_features_from_intraday_data(eda_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+
+    if calc_windows:
+        if provider["WINDOWS"]["IMPUTE_NANS"]:
+            eda_intraday_features[eda_intraday_features["numPeaks"] == 0] = \
+                eda_intraday_features[eda_intraday_features["numPeaks"] == 0].fillna(0)
+            pd.set_option('display.max_columns', None)
+            
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        eda_second_order_features = extract_second_order_features(eda_intraday_features, so_features_names)
+    
+        return eda_intraday_features, eda_second_order_features
+
+    return eda_intraday_features
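
Note on `IMPUTE_NANS`: windows without any detected peak have undefined peak statistics, and the block above fills those NaN values with 0 only where `numPeaks == 0`. The sketch below reproduces that selective fill on toy data; `meanPeakAmplitude` is a hypothetical column name used for illustration.

```python
import numpy as np
import pandas as pd

windows = pd.DataFrame({
    "numPeaks": [0, 2, 0, np.nan],
    "meanPeakAmplitude": [np.nan, 0.4, np.nan, np.nan],  # hypothetical feature column
})

# Fill NaN with 0 only in rows where no peaks were detected.
no_peaks = windows["numPeaks"] == 0
windows.loc[no_peaks] = windows.loc[no_peaks].fillna(0)
print(windows)
```

The last row keeps its NaN values because `numPeaks` itself is missing there, so the window is not known to be peak-free.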