Bring back requested fields in config.yaml.

Update coding files based on 7e565c34db98265afcda922a337493781fdd8ed5 in supermodule.
Completely remove PACKAGE_NAMES_HASHED and instead provide a differently structured file.
2023-04-19 11:07:58 +02:00 · 2023-04-18 22:58:42 +02:00 · 2023-04-18 22:45:12 +02:00 · 2023-04-18 22:40:11 +02:00 · 2023-04-18 21:34:59 +02:00 · 2023-04-18 21:23:26 +02:00
200 changed files with 12460 additions and 2051 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,13 @@
+# We'll let Git's auto-detection algorithm infer if a file is text. If it is,
+# enforce LF line endings regardless of OS or git configurations.
+* text=auto eol=lf
+
+# Isolate binary files in case the auto-detection algorithm fails and
+# marks them as text files (which could brick them).
+*.png binary
+*.jpg binary
+*.jpeg binary
+*.gif binary
+*.webp binary
+*.woff binary
+*.woff2 binary
--- a/.gitignore
+++ b/.gitignore
@@ -93,10 +93,17 @@ packrat/*

 # exclude data from source control by default
 data/external/*
+!/data/external/empatica/empatica1/E4 Data.zip
 !/data/external/.gitkeep
 !/data/external/stachl_application_genre_catalogue.csv
 !/data/external/timesegments*.csv
 !/data/external/wiki_tz.csv
+!/data/external/main_study_usernames.csv
+!/data/external/timezone.csv
+!/data/external/play_store_application_genre_catalogue.csv
+!/data/external/play_store_categories_count.csv
+
+
 data/raw/*
 !/data/raw/.gitkeep
 data/interim/*
@@ -114,3 +121,12 @@ settings.dcf
 tests/fakedata_generation/
 site/
 credentials.yaml
+
+# Docker container and other files
+.devcontainer
+
+# Calculating features module
+calculatingfeatures/
+
+# Temp folder for rapids data/external
+rapids_temp_data/
--- a/README.md
+++ b/README.md
@@ -11,3 +11,191 @@
 For more information refer to our [documentation](http://www.rapids.science)

 By [MoSHI](https://www.moshi.pitt.edu/), [University of Pittsburgh](https://www.pitt.edu/)
+
+## Installation 
+
+For RAPIDS installation refer to the [documentation](https://www.rapids.science/1.8/setup/installation/).
+
+### For the installation of the Docker version
+
+1. Follow the [instructions](https://www.rapids.science/1.8/setup/installation/) to set up RAPIDS via Docker (from scratch).
+
+2. Delete the current contents of the `/rapids/` folder while in a container session.
+    ```
+    cd ..
+    rm -rf rapids/{*,.*}
+    cd rapids
+    ```
+
+3. Clone the RAPIDS workspace from Git and check out a specific branch.
+    ```
+    git clone "https://repo.ijs.si/junoslukan/rapids.git" .
+    git checkout <branch_name>
+    ```
+
+4. Install the missing `libpq-dev` dependency with bash.
+    ```
+    apt-get update -y
+    apt-get install -y libpq-dev
+    ```
+
+5. Restore the R virtual environment.
+Type `R` to start an interactive R session and then run:
+    ```
+    renv::restore()
+    ```
+
+6. Install cr-features module 
+From: https://repo.ijs.si/matjazbostic/calculatingfeatures.git -> branch master. 
+Then follow the "cr-features module" section below.  
+
+7. Install all required packages from `environment.yml`. The `--prune` option also removes conda packages that are not present in the environment file.
+    ```
+    conda env update --file environment.yml --prune
+    ```
+
+8. If you wish to update your R or Python virtual environments:
+    ```
+    # R, in an interactive session:
+    renv::snapshot()
+    # Python:
+    conda env export --no-builds | sed 's/^.*libgfortran.*$/  - libgfortran/' | sed 's/^.*mkl=.*$/  - mkl/' >  environment.yml
+    ```
+
+### cr-features module 
+
+This RAPIDS extension uses the cr-features library, available [here](https://repo.ijs.si/matjazbostic/calculatingfeatures).
+
+To use the cr-features library:
+
+- Follow the installation instructions in the [README.md](https://repo.ijs.si/matjazbostic/calculatingfeatures/-/blob/master/README.md).
+
+- Copy the built `calculatingfeatures` folder into the RAPIDS workspace.
+
+- Install the cr-features package with:
+    ```
+    pip install path/to/the/calculatingfeatures/folder
+    # e.g. pip install ./calculatingfeatures if the folder is copied to the main parent directory
+    # The cr-features package has to be built and installed every time to get the newest version.
+    # Alternatively, the newest version of the Docker image must be used.
+    ```
+
+## Updating RAPIDS
+
+To update RAPIDS, first pull and merge [origin](https://github.com/carissalow/rapids), such as with:
+
+```commandline
+git fetch --progress "origin" refs/heads/master
+git merge --no-ff origin/master
+```
+
+Next, update the conda and R virtual environments.
+
+```bash
+R -e 'renv::restore(repos = c(CRAN = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))'
+```
+
+## Custom configuration
+### Credentials
+
+As mentioned under [Database in RAPIDS documentation](https://www.rapids.science/1.6/snippets/database/), a `credentials.yaml` file is needed to connect to a database.
+It should contain:
+
+```yaml
+PSQL_STRAW:
+  database: staw
+  host: 212.235.208.113
+  password: password
+  port: 5432
+  user: staw_db
+```
+
+where `password` needs to be specified as well.
+
+## Possible installation issues
+### Missing dependencies for RPostgres
+
+When installing the `RPostgres` R package (used to connect to the PostgreSQL database), an error might occur:
+
+```text
+------------------------- ANTICONF ERROR ---------------------------
+Configuration failed because libpq was not found. Try installing:
+   * deb: libpq-dev (Debian, Ubuntu, etc)
+   * rpm: postgresql-devel (Fedora, EPEL)
+   * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
+   * csw: postgresql_dev (Solaris)
+   * brew: libpq (OSX)
+If libpq is already installed, check that either:
+  (i)  'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains a libpq.pc file; or
+  (ii) 'pg_config' is in your PATH.
+If neither can detect , you can set INCLUDE_DIR
+and LIB_DIR manually via:
+  R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
+--------------------------[ ERROR MESSAGE ]----------------------------
+  <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
+compilation terminated.
+```
+
+The library requires `libpq` for compiling from source, so install accordingly.
+
+### Timezone environment variable for tidyverse (relevant for WSL2)
+
+One of the R packages, `tidyverse`, might need access to the `TZ` environment variable during the installation.
+On Ubuntu 20.04 on WSL2 this triggers the following error:
+
+```text
+> install.packages('tidyverse')
+
+ERROR: configuration failed for package ‘xml2’
+System has not been booted with systemd as init system (PID 1). Can't operate.
+Failed to create bus connection: Host is down
+Warning in system("timedatectl", intern = TRUE) :
+  running command 'timedatectl' had status 1
+Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
+  namespace ‘xml2’ 1.3.1 is already loaded, but >= 1.3.2 is required
+Calls: <Anonymous> ... namespaceImportFrom -> asNamespace -> loadNamespace
+Execution halted
+ERROR: lazy loading failed for package ‘tidyverse’
+```
+
+This happens because WSL2 does not use the `timedatectl` service, which provides this variable.
+
+```bash
+~$ timedatectl
+System has not been booted with systemd as init system (PID 1). Can't operate.
+Failed to create bus connection: Host is down
+```
+
+and later 
+
+```bash 
+Warning message:
+In system("timedatectl", intern = TRUE) :
+  running command 'timedatectl' had status 1
+Execution halted
+```
+
+This can be amended by setting the environment variable manually before attempting to install `tidyverse`:
+
+```bash
+export TZ='Europe/Ljubljana'
+```
+
+Note: if this is needed to avoid runtime issues, you need to either define this environment variable in each new terminal window or (better) define it in your `~/.bashrc` or `~/.bash_profile`.
+
+## Possible runtime issues
+### Unix end of line characters
+
+Upon running rapids, an error might occur:
+
+```bash
+/usr/bin/env: ‘python3\r’: No such file or directory
+```
+
+This is due to Windows style end of line characters. 
+To amend this, I added a `.gitattributes` file to force `git` to check out `rapids` using Unix EOL characters.
+If this still fails, `dos2unix` can be used to change them.
+
+### System has not been booted with systemd as init system (PID 1)
+
+See [the installation issue above](#timezone-environment-variable-for-tidyverse-relevant-for-wsl2).
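
Review note on the README's "Unix end of line characters" section: where `dos2unix` is not installed, the same conversion can be approximated in Python. This is a minimal sketch, not part of the commit. The function names `to_unix_line_endings` and `convert_file` are hypothetical, and only CRLF pairs are rewritten (lone `\r` characters are left alone).

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows CRLF line endings with Unix LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: Path) -> bool:
    """Rewrite `path` in place with LF endings; return True if the file changed."""
    original = path.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False


# A shebang line with a trailing "\r" is what triggers the
# "/usr/bin/env: 'python3\r'" error quoted in the README.
sample = b"#!/usr/bin/env python3\r\nprint('ok')\r\n"
print(to_unix_line_endings(sample))
```

Working on bytes rather than text avoids guessing the file encoding, which matters when the same helper is pointed at a mixed set of scripts.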
--- a/52
+++ b/52
@@ -5,6 +5,7 @@ include: "rules/common.smk"
 include: "rules/renv.smk"
 include: "rules/preprocessing.smk"
 include: "rules/features.smk"
+include: "rules/models.smk"
 include: "rules/reports.smk"

 import itertools
@@ -163,6 +164,25 @@ for provider in config["PHONE_CONVERSATION"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")

+for provider in config["PHONE_ESM"]["PROVIDERS"].keys():
+    if config["PHONE_ESM"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/raw/{pid}/phone_esm_raw.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/raw/{pid}/phone_esm_with_datetime.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/phone_esm_clean.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/phone_esm_features/phone_esm_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_ESM"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
+        files_to_compute.extend(expand("data/processed/features/{pid}/phone_esm.csv", pid=config["PIDS"]))
+        # files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv",pid=config["PIDS"]))
+        # files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
+
+for provider in config["PHONE_SPEECH"]["PROVIDERS"].keys():
+    if config["PHONE_SPEECH"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/raw/{pid}/phone_speech_raw.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/raw/{pid}/phone_speech_with_datetime.csv",pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/interim/{pid}/phone_speech_features/phone_speech_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_SPEECH"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
+        files_to_compute.extend(expand("data/processed/features/{pid}/phone_speech.csv", pid=config["PIDS"]))
+        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
+        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
+
 # We can delete these if's as soon as we add feature PROVIDERS to any of these sensors
 if isinstance(config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"], dict):
    for provider in config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"].keys():
@@ -317,7 +337,7 @@ for provider in config["EMPATICA_ACCELEROMETER"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/empatica_accelerometer.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
-
+     
 for provider in config["EMPATICA_HEARTRATE"]["PROVIDERS"].keys():
    if config["EMPATICA_HEARTRATE"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/empatica_heartrate_raw.csv", pid=config["PIDS"]))
@@ -363,7 +383,7 @@ for provider in config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/empatica_inter_beat_interval.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
-
+     
 if isinstance(config["EMPATICA_TAGS"]["PROVIDERS"], dict):
    for provider in config["EMPATICA_TAGS"]["PROVIDERS"].keys():
        if config["EMPATICA_TAGS"]["PROVIDERS"][provider]["COMPUTE"]:
@@ -394,6 +414,34 @@ if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")

+# Data Cleaning
+for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
+        if provider == "STRAW":
+            files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"]))
+        else:
+            files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"]))
+
+for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
+        if provider == "STRAW":
+            for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
+                files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_py_(" + target + ").csv"))
+        else:
+            files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv"))     
+
+# Baseline features
+if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]:
+    files_to_compute.extend(expand("data/raw/baseline_merged.csv"))
+    files_to_compute.extend(expand("data/raw/{pid}/participant_baseline_raw.csv", pid=config["PIDS"]))
+    files_to_compute.extend(expand("data/interim/{pid}/baseline_questionnaires.csv", pid=config["PIDS"]))
+    files_to_compute.extend(expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]))
+
+# Targets (labels)
+if config["PARAMS_FOR_ANALYSIS"]["TARGET"]["COMPUTE"]:
+    files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
+    for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
+        files_to_compute.extend(expand("data/processed/models/population_model/input_" + target + ".csv"))

 rule all:
    input:
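
Review note on the data-cleaning targets added above: the per-provider loops all follow the same pattern (check `COMPUTE`, pick a suffix, expand over `PIDS`). The sketch below shows one way that pattern could be factored out. It is an illustration only: `expand_paths` is a simplified, hypothetical stand-in for Snakemake's `expand()` so the example runs without Snakemake, and `cleaning_targets` is not a function from the commit.

```python
from itertools import product


def expand_paths(pattern, **wildcards):
    """Simplified stand-in for Snakemake's expand(): fill `pattern` with
    every combination of the given wildcard values."""
    keys = list(wildcards)
    value_lists = [v if isinstance(v, (list, tuple)) else [v] for v in wildcards.values()]
    return [pattern.format(**dict(zip(keys, combo))) for combo in product(*value_lists)]


def cleaning_targets(config):
    """Collect per-participant cleaned-feature files for providers with COMPUTE set."""
    targets = []
    for provider, settings in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].items():
        if not settings["COMPUTE"]:
            continue
        # Mirrors the diff: the STRAW provider is a Python script, the others are R.
        suffix = "py" if provider == "STRAW" else "R"
        pattern = (
            "data/processed/features/{pid}/all_sensor_features_cleaned_"
            + provider.lower() + "_" + suffix + ".csv"
        )
        targets.extend(expand_paths(pattern, pid=config["PIDS"]))
    return targets


# Toy configuration with made-up COMPUTE flags, using two participant IDs from config.yaml.
example_config = {
    "PIDS": ["p031", "p032"],
    "ALL_CLEANING_INDIVIDUAL": {
        "PROVIDERS": {"RAPIDS": {"COMPUTE": False}, "STRAW": {"COMPUTE": True}}
    },
}
print(cleaning_targets(example_config))
```

Inside the real Snakefile the stand-in would be dropped in favour of Snakemake's own `expand()`, which is already in scope there.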
--- a/init.py
+++ b/init.py
--- a/automl_test.py
+++ b/automl_test.py
@@ -0,0 +1,57 @@
+from pprint import pprint
+import sklearn.metrics
+import autosklearn.regression
+
+import datetime
+import importlib
+import os
+import sys
+
+import numpy as np
+import matplotlib.pyplot as plt
+import pandas as pd
+import seaborn as sns
+import yaml
+
+from sklearn import linear_model, svm, kernel_ridge, gaussian_process
+from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split
+from sklearn.metrics import mean_squared_error, r2_score
+from sklearn.impute import SimpleImputer
+
+model_input = pd.read_csv("data/processed/models/population_model/input_PANAS_negative_affect_mean.csv") # Standardized data
+
+model_input.dropna(axis=1, how="all", inplace=True)
+model_input.dropna(axis=0, how="any", subset=["target"], inplace=True)
+
+categorical_feature_colnames = ["gender", "startlanguage"]
+categorical_feature_colnames += [col for col in model_input.columns if "mostcommonactivity" in col or "homelabel" in col]
+categorical_features = model_input[categorical_feature_colnames].copy()
+mode_categorical_features = categorical_features.mode().iloc[0]
+categorical_features = categorical_features.fillna(mode_categorical_features)
+categorical_features = categorical_features.apply(lambda col: col.astype("category"))
+if not categorical_features.empty:
+    categorical_features = pd.get_dummies(categorical_features)
+numerical_features = model_input.drop(categorical_feature_colnames, axis=1)
+model_in = pd.concat([numerical_features, categorical_features], axis=1)
+
+index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]
+model_in.set_index(index_columns, inplace=True)
+
+X_train, X_test, y_train, y_test = train_test_split(model_in.drop(["target", "pid"], axis=1), model_in["target"], test_size=0.30)
+
+automl = autosklearn.regression.AutoSklearnRegressor(
+    time_left_for_this_task=7200,
+    per_run_time_limit=120
+)
+automl.fit(X_train, y_train, dataset_name='straw')
+
+print(automl.leaderboard())
+pprint(automl.show_models(), indent=4)
+
+train_predictions = automl.predict(X_train)
+print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
+test_predictions = automl.predict(X_test)
+print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
+
+import sys
+sys.exit()
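
Review note on `automl_test.py`: the script splits rows at random with `train_test_split` after dropping `pid`, so segments from one participant can end up in both the training and the test set. Since `LeaveOneGroupOut` is imported but unused, a participant-level split may have been intended; that is an assumption on my part. The sketch below shows the idea with the standard library only. The function name `split_by_participant` is hypothetical, and scikit-learn's `GroupShuffleSplit` offers a comparable ready-made splitter.

```python
import random


def split_by_participant(pids, test_fraction=0.30, seed=0):
    """Return (train_idx, test_idx) row positions such that all rows of a
    participant fall into exactly one of the two sets."""
    unique_pids = sorted(set(pids))
    rng = random.Random(seed)
    rng.shuffle(unique_pids)
    # Hold out at least one participant, otherwise roughly `test_fraction` of them.
    n_test = max(1, round(len(unique_pids) * test_fraction))
    test_pids = set(unique_pids[:n_test])
    train_idx = [i for i, pid in enumerate(pids) if pid not in test_pids]
    test_idx = [i for i, pid in enumerate(pids) if pid in test_pids]
    return train_idx, test_idx


# Toy example: one entry per row, as in a `pid` column.
pids = ["p031"] * 3 + ["p032"] * 2 + ["p033"] * 4 + ["p034"]
train_idx, test_idx = split_by_participant(pids)
print(len(train_idx), len(test_idx))
```

With the script's data frame, the returned positions could be applied with `model_in.iloc[train_idx]` and `model_in.iloc[test_idx]` before the `pid` column is dropped.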
--- a/config.yaml
+++ b/config.yaml
@@ -3,16 +3,17 @@
 ########################################################################################################################

 # See https://www.rapids.science/latest/setup/configuration/#participant-files
-PIDS: [test01]
+PIDS: ['p031', 'p032', 'p033', 'p034', 'p035', 'p036', 'p037', 'p038', 'p039', 'p040', 'p042', 'p043', 'p044', 'p045', 'p046', 'p049', 'p050', 'p052', 'p053', 'p054', 'p055', 'p057', 'p058', 'p059', 'p060', 'p061', 'p062', 'p064', 'p067', 'p068', 'p069', 'p070', 'p071', 'p072', 'p073', 'p074', 'p075', 'p076', 'p077', 'p078', 'p079', 'p080', 'p081', 'p082', 'p083', 'p084', 'p085', 'p086', 'p088', 'p089', 'p090', 'p091', 'p092', 'p093', 'p106', 'p107']

 # See https://www.rapids.science/latest/setup/configuration/#automatic-creation-of-participant-files
 CREATE_PARTICIPANT_FILES:
-  CSV_FILE_PATH: "data/external/example_participants.csv" # see docs for required format
+  USERNAMES_CSV: "data/external/main_study_usernames.csv"
+  CSV_FILE_PATH: "data/external/main_study_participants.csv" # see docs for required format
  PHONE_SECTION:
    ADD: True
    IGNORED_DEVICE_IDS: []
  FITBIT_SECTION:
-    ADD: True
+    ADD: False
    IGNORED_DEVICE_IDS: []
  EMPATICA_SECTION:
    ADD: True
@@ -20,19 +21,25 @@ CREATE_PARTICIPANT_FILES:

 # See https://www.rapids.science/latest/setup/configuration/#time-segments
 TIME_SEGMENTS: &time_segments
-  TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT
-  FILE: "data/external/timesegments_periodic.csv"
-  INCLUDE_PAST_PERIODIC_SEGMENTS: FALSE # Only relevant if TYPE=PERIODIC, see docs
+  TYPE: EVENT # FREQUENCY, PERIODIC, EVENT
+  FILE: "data/external/straw_events.csv"
+  INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs
+  TAILORED_EVENTS: # Only relevant if TYPE=EVENT
+    COMPUTE: True
+    SEGMENTING_METHOD: "30_before" # 30_before, 90_before, stress_event
+    INTERVAL_OF_INTEREST: 10 # duration of event of interest [minutes]
+    IOI_ERROR_TOLERANCE: 5 # interval of interest error tolerance (before and after IOI) [minutes]

 # See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study
 TIMEZONE: 
-    TYPE: SINGLE
+    TYPE: MULTIPLE
    SINGLE:
-      TZCODE: America/New_York
+      TZCODE: Europe/Ljubljana
    MULTIPLE:
-      TZCODES_FILE: data/external/multiple_timezones_example.csv
-      IF_MISSING_TZCODE: STOP
-      DEFAULT_TZCODE: America/New_York
+      TZ_FILE: data/external/timezone.csv
+      TZCODES_FILE: data/external/multiple_timezones.csv
+      IF_MISSING_TZCODE: USE_DEFAULT
+      DEFAULT_TZCODE: Europe/Ljubljana
      FITBIT: 
        ALLOW_MULTIPLE_TZ_PER_DEVICE: False
        INFER_FROM_SMARTPHONE_TZ: False
@@ -43,12 +50,15 @@ TIMEZONE:

 # See https://www.rapids.science/latest/setup/configuration/#data-stream-configuration
 PHONE_DATA_STREAMS:
-  USE: aware_mysql
+  USE: aware_postgresql
  
  # AVAILABLE:
  aware_mysql: 
    DATABASE_GROUP: MY_GROUP

+  aware_postgresql:
+    DATABASE_GROUP: PSQL_STRAW
+  
  aware_csv:
    FOLDER: data/external/aware_csv
  
@@ -65,7 +75,6 @@ PHONE_ACCELEROMETER:
      COMPUTE: False
      FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
      SRC_SCRIPT: src/features/phone_accelerometer/rapids/main.py
-    
    PANDA:
      COMPUTE: False
      VALID_SENSED_MINUTES: False
@@ -77,12 +86,12 @@ PHONE_ACCELEROMETER:
 # See https://www.rapids.science/latest/features/phone-activity-recognition/
 PHONE_ACTIVITY_RECOGNITION:
  CONTAINER: 
-    ANDROID: plugin_google_activity_recognition
+    ANDROID: google_ar
    IOS: plugin_ios_activity_recognition
  EPISODE_THRESHOLD_BETWEEN_ROWS: 5 # minutes. Max time difference for two consecutive rows to be considered within the same AR episode.
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["count", "mostcommonactivity", "countuniqueactivities", "durationstationary", "durationmobile", "durationvehicle"]
      ACTIVITY_CLASSES:
        STATIONARY: ["still", "tilting"]
@@ -95,33 +104,42 @@ PHONE_APPLICATIONS_CRASHES:
  CONTAINER: applications_crashes
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
-    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
-    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
-    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
+    CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
+    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD

 # See https://www.rapids.science/latest/features/phone-applications-foreground/
 PHONE_APPLICATIONS_FOREGROUND:
-  CONTAINER: applications_foreground
+  CONTAINER: applications
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
-    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
-    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
-    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
+    CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
+    # Refer to data/external/play_store_categories_count.csv for a list of categories (genres) and their frequency.
+    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
-      INCLUDE_EPISODE_FEATURES: False
-      SINGLE_CATEGORIES: ["all", "email"]
+      COMPUTE: True
+      INCLUDE_EPISODE_FEATURES: True
+      SINGLE_CATEGORIES: ["Productivity", "Tools", "Communication", "Education", "Social"]
      MULTIPLE_CATEGORIES:
-        social: ["socialnetworks", "socialmediatools"]
-        entertainment: ["entertainment", "gamingknowledge", "gamingcasual", "gamingadventure", "gamingstrategy", "gamingtoolscommunity", "gamingroleplaying", "gamingaction", "gaminglogic", "gamingsports", "gamingsimulation"]
+        games: ["Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing"]
+        social: ["Communication", "Social", "Dating"]
+        productivity: ["Tools", "Productivity", "Finance", "Education", "News & Magazines", "Business", "Books & Reference"]
+        health: ["Health & Fitness", "Lifestyle", "Food & Drink", "Sports", "Medical", "Parenting"]
+        entertainment: ["Shopping", "Music & Audio", "Entertainment", "Travel & Local", "Photography", "Video Players & Editors", "Personalization", "House & Home", "Art & Design", "Auto & Vehicles", "Entertainment,Music & Video",
+                        "Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing" # Add all games.
+        ]
+        maps_weather: ["Maps & Navigation", "Weather"]
      CUSTOM_CATEGORIES:
-        social_media: ["com.google.android.youtube", "com.snapchat.android", "com.instagram.android", "com.zhiliaoapp.musically", "com.facebook.katana"]
-        dating: ["com.tinder", "com.relance.happycouple", "com.kiwi.joyride"]
-      SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
-      EXCLUDED_CATEGORIES: []
-      EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
+      SINGLE_APPS: []
+      EXCLUDED_CATEGORIES: ["System", "STRAW"]
+      # Note: A special option here is "is_system_app".
+      # This excludes applications that have is_system_app = TRUE, which is a separate column in the table.
+      # However, all of these applications have been assigned System category.
+      # I will therefore filter by that category, which is a superset and is more complete. JL
+      EXCLUDED_APPS: []
      FEATURES: 
        APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
        APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
@@ -131,7 +149,7 @@ PHONE_APPLICATIONS_FOREGROUND:

 # See https://www.rapids.science/latest/features/phone-applications-notifications/
 PHONE_APPLICATIONS_NOTIFICATIONS:
-  CONTAINER: applications_notifications
+  CONTAINER: notifications
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scrapped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
@@ -145,7 +163,7 @@ PHONE_BATTERY:
  EPISODE_THRESHOLD_BETWEEN_ROWS: 30 # minutes. Max time difference for two consecutive rows to be considered within the same battery episode.
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["countdischarge", "sumdurationdischarge", "countcharge", "sumdurationcharge", "avgconsumptionrate", "maxconsumptionrate"]
      SRC_SCRIPT: src/features/phone_battery/rapids/main.py

@@ -159,7 +177,7 @@ PHONE_BLUETOOTH:
      SRC_SCRIPT: src/features/phone_bluetooth/rapids/main.R

    DORYAB:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: 
        ALL: 
            DEVICES: ["countscans", "uniquedevices", "meanscans", "stdscans"]
@@ -177,10 +195,10 @@ PHONE_BLUETOOTH:

 # See https://www.rapids.science/latest/features/phone-calls/
 PHONE_CALLS:
-  CONTAINER: calls
+  CONTAINER: call
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES_TYPE: EPISODES # EVENTS or EPISODES
      CALL_TYPES: [missed, incoming, outgoing]
      FEATURES:
@@ -190,7 +208,7 @@ PHONE_CALLS:
      SRC_SCRIPT: src/features/phone_calls/rapids/main.R

 # See https://www.rapids.science/latest/features/phone-conversation/
-PHONE_CONVERSATION:
+PHONE_CONVERSATION: # TODO Adapt for speech
  CONTAINER: 
    ANDROID: plugin_studentlife_audio_android
    IOS: plugin_studentlife_audio
@@ -209,14 +227,35 @@ PHONE_CONVERSATION:

 # See https://www.rapids.science/latest/features/phone-data-yield/
 PHONE_DATA_YIELD:
-  SENSORS: []
+  SENSORS: [ #PHONE_ACCELEROMETER,
+            PHONE_ACTIVITY_RECOGNITION,
+            PHONE_APPLICATIONS_FOREGROUND,
+            PHONE_APPLICATIONS_NOTIFICATIONS,
+            PHONE_BATTERY,
+            PHONE_BLUETOOTH,
+            PHONE_CALLS,
+            PHONE_LIGHT,
+            PHONE_LOCATIONS,
+            PHONE_MESSAGES,
+            PHONE_SCREEN,
+            PHONE_WIFI_VISIBLE]
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: [ratiovalidyieldedminutes, ratiovalidyieldedhours]
      MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: 0.5 # 0 to 1, minimum percentage of valid minutes in an hour to be considered valid.
      SRC_SCRIPT: src/features/phone_data_yield/rapids/main.R

+PHONE_ESM:
+  CONTAINER: esm
+  PROVIDERS:
+    STRAW:
+      COMPUTE: True
+      SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support", 
+              "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
+      FEATURES: [mean]
+      SRC_SCRIPT: src/features/phone_esm/straw/main.py
+
 # See https://www.rapids.science/latest/features/phone-keyboard/
 PHONE_KEYBOARD:
  CONTAINER: keyboard
@@ -228,10 +267,10 @@ PHONE_KEYBOARD:

 # See https://www.rapids.science/latest/features/phone-light/
 PHONE_LIGHT:
-  CONTAINER: light
+  CONTAINER: light_sensor
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"]
      SRC_SCRIPT: src/features/phone_light/rapids/main.py

@@ -245,7 +284,7 @@ PHONE_LOCATIONS:

  PROVIDERS:
    DORYAB:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome", "homelabel"]
      DBSCAN_EPS: 100 # meters
      DBSCAN_MINSAMPLES: 5
@@ -260,7 +299,7 @@ PHONE_LOCATIONS:
      SRC_SCRIPT: src/features/phone_locations/doryab/main.py

    BARNETT:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
      IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
      MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates gruped by minute) were used to compute features
@@ -275,10 +314,10 @@ PHONE_LOG:

 # See https://www.rapids.science/latest/features/phone-messages/
 PHONE_MESSAGES:
-  CONTAINER: messages
+  CONTAINER: sms
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      MESSAGES_TYPES : [received, sent]
      FEATURES: 
        received: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact]
@@ -290,7 +329,7 @@ PHONE_SCREEN:
  CONTAINER: screen
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      REFERENCE_HOUR_FIRST_USE: 0
      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
      IGNORE_EPISODES_LONGER_THAN: 360 # in minutes, set to 0 to disable
@@ -298,6 +337,15 @@ PHONE_SCREEN:
      EPISODE_TYPES: ["unlock"]
      SRC_SCRIPT: src/features/phone_screen/rapids/main.py

+# Custom added sensor
+PHONE_SPEECH:
+  CONTAINER: speech
+  PROVIDERS:
+    STRAW:
+      COMPUTE: True
+      FEATURES: ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
+      SRC_SCRIPT: src/features/phone_speech/straw/main.py
+
 # See https://www.rapids.science/latest/features/phone-wifi-connected/
 PHONE_WIFI_CONNECTED:
  CONTAINER: sensor_wifi
@ -312,7 +360,7 @@ PHONE_WIFI_VISIBLE:
  CONTAINER: wifi
  PROVIDERS:
    RAPIDS:
-      COMPUTE: False
+      COMPUTE: True
      FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
      SRC_SCRIPT: src/features/phone_wifi_visible/rapids/main.R

@ -415,7 +463,6 @@ FITBIT_SLEEP_INTRADAY:
        UNIFIED: [awake, asleep]
      SLEEP_TYPES: [main, nap, all]
      SRC_SCRIPT: src/features/fitbit_sleep_intraday/rapids/main.py
-  
    PRICE:
      COMPUTE: False
      FEATURES: [avgduration, avgratioduration, avgstarttimeofepisodemain, avgendtimeofepisodemain, avgmidpointofepisodemain, stdstarttimeofepisodemain, stdendtimeofepisodemain, stdmidpointofepisodemain, socialjetlag, rmssdmeanstarttimeofepisodemain, rmssdmeanendtimeofepisodemain, rmssdmeanmidpointofepisodemain, rmssdmedianstarttimeofepisodemain, rmssdmedianendtimeofepisodemain, rmssdmedianmidpointofepisodemain]
@ -451,13 +498,15 @@ FITBIT_STEPS_INTRADAY:
    RAPIDS:
      COMPUTE: False
      FEATURES:
-        STEPS: ["sum", "max", "min", "avg", "std"]
+        STEPS: ["sum", "max", "min", "avg", "std", "firststeptime", "laststeptime"]
        SEDENTARY_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
        ACTIVE_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
+      REFERENCE_HOUR: 0
      THRESHOLD_ACTIVE_BOUT: 10 # steps
      INCLUDE_ZERO_STEP_ROWS: False
      SRC_SCRIPT: src/features/fitbit_steps_intraday/rapids/main.py

+
 ########################################################################################################################
 #                                                 EMPATICA                                                             #
 ########################################################################################################################
@ -479,6 +528,15 @@ EMPATICA_ACCELEROMETER:
      COMPUTE: False
      FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
      SRC_SCRIPT: src/features/empatica_accelerometer/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ["totalMagnitudeBand", "absoluteMeanBand", "varianceBand"] # Acc features
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 15 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
+      SRC_SCRIPT: src/features/empatica_accelerometer/cr/main.py
+

 # See https://www.rapids.science/latest/features/empatica-heartrate/
 EMPATICA_HEARTRATE:
@ -497,6 +555,15 @@ EMPATICA_TEMPERATURE:
      COMPUTE: False
      FEATURES: ["maxtemp", "mintemp", "avgtemp", "mediantemp", "modetemp", "stdtemp", "diffmaxmodetemp", "diffminmodetemp", "entropytemp"]
      SRC_SCRIPT: src/features/empatica_temperature/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ["maximum", "minimum", "meanAbsChange", "longestStrikeAboveMean", "longestStrikeBelowMean", 
+                  "stdDev", "median", "meanChange", "sumSquared", "squareSumOfComponent", "sumOfSquareComponents"]
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 300 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
+      SRC_SCRIPT: src/features/empatica_temperature/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-electrodermal-activity/
 EMPATICA_ELECTRODERMAL_ACTIVITY:
@ -506,6 +573,19 @@ EMPATICA_ELECTRODERMAL_ACTIVITY:
      COMPUTE: False
      FEATURES: ["maxeda", "mineda", "avgeda", "medianeda", "modeeda", "stdeda", "diffmaxmodeeda", "diffminmodeeda", "entropyeda"]
      SRC_SCRIPT: src/features/empatica_electrodermal_activity/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ['mean', 'std', 'q25', 'q75', 'qd', 'deriv', 'power', 'numPeaks', 'ratePeaks', 'powerPeaks', 'sumPosDeriv', 'propPosDeriv', 'derivTonic', 
+                  'sigTonicDifference', 'freqFeats','maxPeakAmplitudeChangeBefore', 'maxPeakAmplitudeChangeAfter', 'avgPeakAmplitudeChangeBefore', 
+                  'avgPeakAmplitudeChangeAfter', 'avgPeakChangeRatio', 'maxPeakIncreaseTime', 'maxPeakDecreaseTime', 'maxPeakDuration', 'maxPeakChangeRatio',
+                  'avgPeakIncreaseTime', 'avgPeakDecreaseTime', 'avgPeakDuration', 'signalOverallChange', 'changeDuration', 'changeRate', 'significantIncrease', 
+                  'significantDecrease']
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 60 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', count_windows, eda_num_peaks_non_zero]
+        IMPUTE_NANS: True
+      SRC_SCRIPT: src/features/empatica_electrodermal_activity/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-blood-volume-pulse/
 EMPATICA_BLOOD_VOLUME_PULSE:
@ -515,6 +595,15 @@ EMPATICA_BLOOD_VOLUME_PULSE:
      COMPUTE: False
      FEATURES: ["maxbvp", "minbvp", "avgbvp", "medianbvp", "modebvp", "stdbvp", "diffmaxmodebvp", "diffminmodebvp", "entropybvp"]
      SRC_SCRIPT: src/features/empatica_blood_volume_pulse/dbdp/main.py
+    CR:
+      COMPUTE: False
+      FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
+                  'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 300 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
+      SRC_SCRIPT: src/features/empatica_blood_volume_pulse/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-inter-beat-interval/
 EMPATICA_INTER_BEAT_INTERVAL:
@ -524,6 +613,16 @@ EMPATICA_INTER_BEAT_INTERVAL:
      COMPUTE: False
      FEATURES: ["maxibi", "minibi", "avgibi", "medianibi", "modeibi", "stdibi", "diffmaxmodeibi", "diffminmodeibi", "entropyibi"]
      SRC_SCRIPT: src/features/empatica_inter_beat_interval/dbdp/main.py
+    CR:
+      COMPUTE: True
+      FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
+                  'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features            
+      PATCH_WITH_BVP: True
+      WINDOWS:
+        COMPUTE: True
+        WINDOW_LENGTH: 300 # specify window length in seconds
+        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
+      SRC_SCRIPT: src/features/empatica_inter_beat_interval/cr/main.py

 # See https://www.rapids.science/latest/features/empatica-tags/
 EMPATICA_TAGS:
@ -564,3 +663,96 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
  CORR_THRESHOLD: 0.1
  CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}

+
+########################################################################################################################
+#                                                    Data Cleaning                                                     #
+########################################################################################################################
+
+ALL_CLEANING_INDIVIDUAL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 1 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
+    STRAW:
+      COMPUTE: True
+      PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
+      COLS_NAN_THRESHOLD: 0.9 # set to 1 to remove only columns that are entirely (100%) NaN
+      COLS_VAR_THRESHOLD: True
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      STANDARDIZATION: True
+      SRC_SCRIPT: src/features/all_cleaning_individual/straw/main.py
+
+ALL_CLEANING_OVERALL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 1 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
+    STRAW:
+      COMPUTE: True
+      PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
+      COLS_NAN_THRESHOLD: 0.8 # set to 1 to remove only columns that are entirely (100%) NaN
+      COLS_VAR_THRESHOLD: True
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      STANDARDIZATION: True
+      TARGET_STANDARDIZATION: False
+      SRC_SCRIPT: src/features/all_cleaning_overall/straw/main.py
+
+
+########################################################################################################################
+#                                                      Baseline                                                        #
+########################################################################################################################
+
+PARAMS_FOR_ANALYSIS:
+  BASELINE:
+    COMPUTE: True
+    FOLDER: data/external/baseline
+    CONTAINER: [results-survey637813_final.csv,  # Slovenia
+                results-survey358134_final.csv,  # Belgium 1
+                results-survey413767_final.csv  # Belgium 2
+    ]
+    QUESTION_LIST: survey637813+question_text.csv
+    FEATURES: [age, gender, startlanguage, limesurvey_demand, limesurvey_control, limesurvey_demand_control_ratio, limesurvey_demand_control_ratio_quartile]
+    CATEGORICAL_FEATURES: [gender]
+
+  TARGET:
+    COMPUTE: True
+    LABEL: appraisal_stressfulness_event_mean
+    ALL_LABELS: [PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, JCQ_coworker_support_mean, appraisal_stressfulness_period_mean]
+                # PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, 
+                # JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean
--- a/data/external/aware_csv/calls.csv
+++ b/data/external/aware_csv/calls.csv
@ -0,0 +1,9 @@
+"_id","timestamp","device_id","call_type","call_duration","trace"
+1,1587663260695,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,14,"d5e84f8af01b2728021d4f43f53a163c0c90000c"
+2,1587739118007,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"47c125dc7bd163b8612cdea13724a814917b6e93"
+5,1587746544891,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,95,"9cc793ffd6e88b1d850ce540b5d7e000ef5650d4"
+6,1587911379859,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,63,"51fb9344e988049a3fec774c7ca622358bf80264"
+7,1587992647361,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"2a862a7730cfdfaf103a9487afe3e02935fd6e02"
+8,1588020039448,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",1,11,"a2c53f6a086d98622c06107780980cf1bb4e37bd"
+11,1588176189024,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,65,"56589df8c830c70e330b644921ed38e08d8fd1f3"
+12,1588197745079,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"cab458018a8ed3b626515e794c70b6f415318adc"
--- a/data/external/empatica/empatica1/E4
+++ b/data/external/empatica/empatica1/E4
--- a/data/external/main_study_usernames.csv
+++ b/data/external/main_study_usernames.csv
@ -0,0 +1,57 @@
+label,empatica_id
+uploader_79170,A0245B
+uploader_89788,A02731
+uploader_68294,A02705
+uploader_92856,A024AF
+uploader_23726,A0231C
+uploader_66620,A02305
+uploader_58435,A026B5
+uploader_87801,A022A8
+uploader_96055,A027BA
+uploader_69549,A0226C
+uploader_26363,A0263D
+uploader_72010,A023FA
+uploader_13997,A024AF
+uploader_31156,A02305
+uploader_63187,A027BA
+uploader_94821,A022A8
+uploader_65413,A023F1;A023FA
+uploader_36488,A02713
+uploader_91087,A0231C
+uploader_35174,A025D1
+uploader_73880,A02705
+uploader_78650,A02731
+uploader_70578,A0245B
+uploader_88313,A02736
+uploader_58482,A0261A
+uploader_80601,A027BA
+uploader_93729,A0226C
+uploader_61663,A0245B
+uploader_80848,A025D1
+uploader_57312,A023F9;A02361;A027A0
+uploader_52087,A02666
+uploader_98770,A02953
+uploader_51327,A0245F
+uploader_11737,A02732
+uploader_77440,A0264E
+uploader_57277,A02422
+uploader_13098,A026E5
+uploader_80719,A023C8
+uploader_54698,A02953
+uploader_95571,A02853
+uploader_21880,A024DC
+uploader_92905,A02920
+uploader_12108,A023F4
+uploader_17436,A026E5
+uploader_58440,A0273F
+uploader_22172,A0245F
+uploader_39250,A02422
+uploader_15311,A023F9
+uploader_45766,A02920
+uploader_23096,A02361
+uploader_78243,A02422
+uploader_58777,A0245F
+uploader_82941,A02666
+uploader_89606,A023F4
+uploader_82969,A023C8
+uploader_53573,A024DC;A02361
--- a/data/external/participant_files/p01.yaml
+++ b/data/external/participant_files/p01.yaml
@ -0,0 +1,11 @@
+PHONE:
+  DEVICE_IDS: [4b62a655-cbf0-4ac0-a448-06726f45b56a]
+  PLATFORMS: [android]
+  LABEL: uploader_53573
+  START_DATE: 2021-05-21 09:21:24
+  END_DATE: 2021-07-12 17:32:07
+EMPATICA:
+  DEVICE_IDS: [uploader_53573]
+  LABEL: uploader_53573
+  START_DATE: 2021-05-21 09:21:24
+  END_DATE: 2021-07-12 17:32:07
--- a/data/external/play_store_application_genre_catalogue.csv
+++ b/data/external/play_store_application_genre_catalogue.csv
--- a/data/external/play_store_categories_count.csv
+++ b/data/external/play_store_categories_count.csv
@ -0,0 +1,45 @@
+genre,n
+System,261
+Tools,96
+Productivity,71
+Health & Fitness,60
+Finance,54
+Communication,39
+Music & Audio,39
+Shopping,38
+Lifestyle,33
+Education,28
+News & Magazines,24
+Maps & Navigation,23
+Entertainment,21
+Business,18
+Travel & Local,18
+Books & Reference,16
+Social,16
+Weather,16
+Food & Drink,14
+Sports,14
+Other,13
+Photography,13
+Puzzle,13
+Video Players & Editors,12
+Card,9
+Casual,9
+Personalization,8
+Medical,7
+Board,5
+Strategy,4
+House & Home,3
+Trivia,3
+Word,3
+Adventure,2
+Art & Design,2
+Auto & Vehicles,2
+Dating,2
+Role Playing,2
+STRAW,2
+Simulation,2
+"Board,Brain Games",1
+"Entertainment,Music & Video",1
+Parenting,1
+Racing,1
--- a/data/external/timesegments_daily.csv
+++ b/data/external/timesegments_daily.csv
@ -0,0 +1,3 @@
+label,start_time,length,repeats_on,repeats_value
+daily,04:00:00,23H 59M 59S,every_day,0
+working_day,04:00:00,18H 00M 00S,every_day,0
--- a/data/external/timesegments_frequency.csv
+++ b/data/external/timesegments_frequency.csv
@ -1,2 +1,2 @@
 label,length
-thirtyminutes,30
+fiveminutes,5
--- a/data/external/timesegments_periodic.csv
+++ b/data/external/timesegments_periodic.csv
@ -1,9 +1,2 @@
 label,start_time,length,repeats_on,repeats_value
-threeday,00:00:00,2D 23H 59M 59S,every_day,0
-daily, 00:00:00,23H 59M 59S, every_day, 0
-morning,06:00:00,5H 59M 59S,every_day,0
-afternoon,12:00:00,5H 59M 59S,every_day,0
-evening,18:00:00,5H 59M 59S,every_day,0
-night,00:00:00,5H 59M 59S,every_day,0
-two_weeks_overlapping,00:00:00,13D 23H 59M 59S,every_day,0
-weekends,00:00:00,2D 23H 59M 59S,wday,5
+daily,00:00:00,23H 59M 59S,every_day,0
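The `length` column in the periodic time segment files above uses strings such as `23H 59M 59S` or `2D 23H 59M 59S`. As a minimal sketch of how such a value could be read, the snippet below converts it into a `datetime.timedelta`. The function name `parse_segment_length` and the regular expression are illustrative assumptions, not part of RAPIDS, and the sketch only covers the `D`/`H`/`M`/`S` form shown in these files (not the plain-minutes `length` used in `timesegments_frequency.csv`).

```python
import re
from datetime import timedelta

# Hypothetical pattern: optional days, hours, minutes and seconds, in that order.
_LENGTH_PATTERN = re.compile(
    r"^\s*(?:(?P<days>\d+)D)?\s*(?:(?P<hours>\d+)H)?"
    r"\s*(?:(?P<minutes>\d+)M)?\s*(?:(?P<seconds>\d+)S)?\s*$"
)


def parse_segment_length(length: str) -> timedelta:
    """Convert a length such as '2D 23H 59M 59S' into a timedelta."""
    match = _LENGTH_PATTERN.match(length)
    # Reject strings that do not fit the pattern or that contain no unit at all.
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"Unrecognised time segment length: {length!r}")
    parts = {unit: int(value) for unit, value in match.groupdict().items() if value is not None}
    return timedelta(**parts)


if __name__ == "__main__":
    for example in ("23H 59M 59S", "18H 00M 00S", "2D 23H 59M 59S"):
        print(example, "->", parse_segment_length(example))
```

Parsing into a `timedelta` makes it straightforward to compare segment lengths, for example to confirm that the `working_day` segment (18 hours) is shorter than the `daily` one.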
--- a/data/external/timezone.csv
+++ b/data/external/timezone.csv
--- a/docs/analysis/complete-workflow-example.md
+++ b/docs/analysis/complete-workflow-example.md
@ -1,8 +1,8 @@
 # Analysis Workflow Example

 !!! info "TL;DR"
-    - In addition to using RAPIDS to extract behavioral features and create plots, you can structure your data analysis within RAPIDS (i.e. cleaning your features and creating ML/statistical models)
-    - We include an analysis example in RAPIDS that covers raw data processing, cleaning, feature extraction, machine learning modeling, and evaluation
+    - In addition to using RAPIDS to extract behavioral features, create plots, and clean sensor features, you can structure your data analysis within RAPIDS (i.e. creating ML/statistical models and evaluating your models)
+    - We include an analysis example in RAPIDS that covers raw data processing, feature extraction, cleaning, machine learning modeling, and evaluation
    - Use this example as a guide to structure your own analysis within RAPIDS
    - RAPIDS analysis workflows are compatible with your favorite data science tools and libraries
    - RAPIDS analysis workflows are reproducible and we encourage you to publish them along with your research papers
@ -69,12 +69,12 @@ Note you will see a lot of warning messages, you can ignore them since they happ
 ??? info "6. Feature cleaning."
    In this stage we perform four steps to clean our sensor feature file. First, we discard days with a data yield hour ratio less than or equal to 0.75, i.e. we include days with at least 18 hours of data. Second, we drop columns (features) with more than 30% of missing rows. Third, we drop columns with zero variance. Fourth, we drop rows (days) with more than 30% of missing columns (features). In this cleaning stage several parameters are created and exposed in `example_profile/example_config.yaml`. 

-    After this step, we kept 163 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 109 features over 20 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stops researchers from collecting the same number of sensors than in Android phones. 
+    After this step, we kept 173 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 117 features over 22 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stop researchers from collecting data from as many sensors as on Android phones.
    
    Feature cleaning for the individual models is done in the `clean_sensor_features_for_individual_participants` rule and for the population model in the `clean_sensor_features_for_all_participants` rule in `rules/models.smk`.

 ??? info "7. Merge features and targets."
-    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/models.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/models.smk`. These two merged files are the input for our individual and population models. 
+    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/features.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/features.smk`. These two merged files are the input for our individual and population models. 

 ??? info "8. Modelling."
    This stage has three phases: model building, training and evaluation. 
--- a/docs/analysis/data-cleaning.md
+++ b/docs/analysis/data-cleaning.md
@ -0,0 +1,92 @@
+Data Cleaning
+=============
+
+The goal of this module is to perform basic cleaning tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. If you are interested in creating individual models (using each participant's features independently of the others), use [`ALL_CLEANING_INDIVIDUAL`]. If you are interested in creating population models (using everyone's data in the same model), use [`ALL_CLEANING_OVERALL`].
+    
+## Clean sensor features for individual participants
+
+!!! info "File Sequence"
+    ```bash
+    - data/processed/features/{pid}/all_sensor_features.csv
+    - data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}.csv
+    ```
+
+### RAPIDS provider
+
+Parameter descriptions for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
+
+|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
+|----------------|-----------------------------------------------------------------------------------------------------------------------------------|
+|`[COMPUTE]` | Set to `True` to execute the cleaning tasks described below. You can use the parameters of each task to tweak them or deactivate them|
+|`[IMPUTE_SELECTED_EVENT_FEATURES]`     | Fill NAs with 0 only for event-based features, see table below|
+|`[COLS_NAN_THRESHOLD]`                 | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable|
+|`[COLS_VAR_THRESHOLD]`                 | Set to `True` to discard columns with zero variance|
+|`[ROWS_NAN_THRESHOLD]`                 | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable|
+|`[DATA_YIELD_FEATURE]`                 | `RATIO_VALID_YIELDED_HOURS` or `RATIO_VALID_YIELDED_MINUTES`|
+|`[DATA_YIELD_RATIO_THRESHOLD]`         | Discard rows whose `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature is less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature used is determined by the `[DATA_YIELD_FEATURE]` parameter. Set to 0 to disable|
+|`[DROP_HIGHLY_CORRELATED_FEATURES]`    | Discard highly correlated features, see table below|
+
+Parameter descriptions for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
+
+|Parameters                             | Description                                                    |
+|-------------------------------------- |----------------------------------------------------------------|
+|`[COMPUTE]`                            | Set to `True` to fill NAs with 0 for phone event-based features |
+|`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Any missing (NA) value of a selected event-based feature in a time segment instance with phone data yield > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` will be replaced with a zero. See below for an explanation. |
+
+Parameter descriptions for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][DROP_HIGHLY_CORRELATED_FEATURES]`:
+
+|Parameters                             | Description                                                    |
+|-------------------------------------- |----------------------------------------------------------------|
+|`[COMPUTE]`                            | Set to `True` to drop highly correlated features |
+|`[MIN_OVERLAP_FOR_CORR_THRESHOLD]`     | Minimum ratio of observations required per pair of columns (features) for their correlation to be considered valid. |
+|`[CORR_THRESHOLD]` | The absolute values of pair-wise correlations are calculated. If two variables have a valid correlation higher than `[CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. |
+
+The steps below clean sensor features for individual participants. Currently, only the **phone sensors** are considered.
+
+??? info "1. Fill NA with 0 for the selected event features."
+    Some event features should be zero instead of NA. In this step, we fill those missing features with 0 when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column is higher than the `[IMPUTE_SELECTED_EVENT_FEATURES][MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` parameter. Plugins such as the Activity Recognition sensor are not considered. You can skip this step by setting `[IMPUTE_SELECTED_EVENT_FEATURES][COMPUTE]` to `False`.
+    
+    Take the phone calls sensor as an example. If there are no call records during a time segment for a participant, then either (1) the calls sensor was not working during that time segment; or (2) the calls sensor was working and the participant did not have any calls during that time segment. To differentiate these two situations, we assume the selected sensors are working when `phone_data_yield_rapids_ratiovalidyieldedminutes > [MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`.
+
+    The following phone event-based features are considered currently:
+
+      - Application foreground: countevent, countepisode, minduration, maxduration, meanduration, sumduration.
+      - Battery: all features.
+      - Calls: count, distinctcontacts, sumduration, minduration, maxduration, meanduration, modeduration.
+      - Keyboard: sessioncount, averagesessionlength, changeintextlengthlessthanminusone, changeintextlengthequaltominusone, changeintextlengthequaltoone, changeintextlengthmorethanone, maxtextlength, totalkeyboardtouches.
+      - Messages: count, distinctcontacts.
+      - Screen: sumduration, maxduration, minduration, avgduration, countepisode.
+      - WiFi: all connected and visible features.
+
+??? info "2. Discard unreliable rows."
+    Extracted features might not be reliable if the sensor only worked for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using the `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_HOURS`) for longer segments. We do not recommend skipping this step, but you can do so by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
+
+??? info "3. Discard columns (features) with too many missing values."
+    In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[COLS_NAN_THRESHOLD]` to 1.
+
+??? info "4. Discard columns (features) with zero variance."
+    In this step, we discard columns with zero variance. We do not recommend skipping this step, but you can do so by setting `[COLS_VAR_THRESHOLD]` to `False`.
+
+??? info "5. Drop highly correlated features."
+    As highly correlated features might not bring additional information and will increase the complexity of a model, we drop them in this step. The absolute values of pair-wise correlations are calculated. A correlation between two variables is regarded as valid only if the ratio of valid value pairs (i.e. non-NA pairs) is greater than or equal to `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a valid correlation coefficient higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. This step can be skipped by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
+
+??? info "6. Discard rows with too many missing values."
+    In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. In other words, we discard time segments (e.g. days) that did not have enough data to be considered reliable. This step is similar to step 2, except that the ratio is computed based on NA values instead of a phone data yield threshold. We do not recommend skipping this step, but you can do so by setting `[ROWS_NAN_THRESHOLD]` to 1.
+
+
+
+
+## Clean sensor features for all participants
+
+!!! info "File Sequence"
+    ```bash
+    - data/processed/features/all_participants/all_sensor_features.csv
+    - data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}.csv
+    ```
+
+
+### RAPIDS provider
+
+The parameter descriptions and the steps are the same as in the [RAPIDS provider](#rapids-provider) section above for individual participants.
+
+
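For orientation, steps 3 to 6 described in `docs/analysis/data-cleaning.md` above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the code in the `SRC_SCRIPT` files referenced by `config.yaml`: the function name `clean_features`, its keyword arguments, the conversion of the overlap ratio into `min_periods`, and the tie-breaking rule are all hypothetical choices made for the example.

```python
import math

import numpy as np
import pandas as pd


def clean_features(
    features: pd.DataFrame,
    cols_nan_threshold: float = 0.9,
    rows_nan_threshold: float = 0.33,
    drop_zero_variance: bool = True,
    corr_threshold: float = 0.95,
    min_overlap: float = 0.5,
) -> pd.DataFrame:
    """Hypothetical sketch of steps 3-6 on a numeric table (rows = time segments)."""
    df = features.copy()

    # Step 3: drop columns whose ratio of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean(axis=0) <= cols_nan_threshold]

    # Step 4: drop constant columns (at most one distinct non-missing value).
    if drop_zero_variance:
        df = df.loc[:, df.nunique(dropna=True) > 1]

    # Step 5: absolute pair-wise correlations; a pair counts as valid only if
    # enough rows have both values (assumed here: ratio of the row count).
    min_periods = max(1, math.ceil(min_overlap * len(df)))
    corr = df.corr(min_periods=min_periods).abs()
    corr = corr.mask(np.eye(len(corr), dtype=bool))  # ignore self-correlations
    mean_abs = corr.mean(axis=1)  # NaN (invalid) pairs are skipped
    columns = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if first in to_drop or second in to_drop:
                continue
            value = corr.loc[first, second]
            if pd.notna(value) and value > corr_threshold:
                # Remove the variable with the larger mean absolute correlation
                # (ties go to the first variable; an arbitrary choice).
                to_drop.add(first if mean_abs[first] >= mean_abs[second] else second)
    df = df.drop(columns=sorted(to_drop))

    # Step 6: drop rows whose ratio of missing values exceeds the threshold.
    return df.loc[df.isna().mean(axis=1) <= rows_nan_threshold]
```

The defaults mirror the values set for the `STRAW` provider of `ALL_CLEANING_INDIVIDUAL` in `config.yaml`; passing 1 for either NaN threshold keeps every column or row that has at least one value, which matches the "set to 1 to disable" comments.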
--- a/docs/workflow-examples/minimal.md
+++ b/docs/workflow-examples/minimal.md
--- a/docs/change-log.md
+++ b/docs/change-log.md
@ -1,6 +1,17 @@
 # Change Log
-## v.1.7.0
+## v1.8.0
+- Add data stream for AWARE Micro server
+- Fix the NA bug in PHONE_LOCATIONS BARNETT provider
+- Fix the bug of data type for call_duration field
+- Fix the index bug of heatmap_sensors_per_minute_per_time_segment
+## v1.7.1
+- Update docs for Git Flow section
+- Update RAPIDS paper information
+## v1.7.0
+- Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider
+- Update tests for Fitbit steps intraday features
 - Add tests for phone battery features
+- Add a data cleaning module to replace NAs with 0 in selected event-based features, discard unreliable rows and columns, discard columns with zero variance, and discard highly correlated columns
 ## v1.6.0
 - Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events
 - Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes
--- a/docs/citation.md
+++ b/docs/citation.md
@ -5,14 +5,10 @@

 ## RAPIDS

-If you used RAPIDS, please cite [this paper](https://preprints.jmir.org/preprint/23246).
+If you used RAPIDS, please cite [this paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823).

 !!! cite "RAPIDS et al. citation"
-    Vega J, Li M, Aguillera K, Goel N, Joshi E, Durica KC, Kunta AR, Low CA
-    RAPIDS: Reproducible Analysis Pipeline for Data Streams Collected with Mobile Devices
-    JMIR Preprints. 18/08/2020:23246
-    DOI: 10.2196/preprints.23246
-    URL: https://preprints.jmir.org/preprint/23246
+    Vega, J., Li, M., Aguillera, K., Goel, N., Joshi, E., Khandekar, K., ... & Low, C. A. (2021). Reproducible Analysis Pipeline for Data Streams (RAPIDS): Open-Source Software to Process Data Collected with Mobile Devices. Frontiers in Digital Health, 168.

 ## DBDP (all Empatica sensors)

--- a/docs/datastreams/aware-micro-mysql.md
+++ b/docs/datastreams/aware-micro-mysql.md
@ -0,0 +1,15 @@
+# `aware_micro_mysql`
+
+This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework's](https://awareframework.com/) [AWARE Micro](https://github.com/denzilferreira/aware-micro) server and stored in a MySQL database.
+
+## Container
+A MySQL database with a table per sensor, each containing the data for all participants. Sensor data is stored in a JSON field called `data` within each table.
+
+The script to connect and download data from this container is at:
+```bash
+src/data/streams/aware_micro_mysql/container.R
+```
+
+## Format
+
+--8<---- "docs/snippets/aware_format.md"
--- a/docs/datastreams/data-streams-introduction.md
+++ b/docs/datastreams/data-streams-introduction.md
@ -16,6 +16,7 @@ For reference, these are the data streams we currently support:
 | Data Stream | Device | Format | Container | Docs
 |--|--|--|--|--|
 | `aware_mysql`| Phone | AWARE app | MySQL | [link](../aware-mysql)
+| `aware_micro_mysql`| Phone | AWARE Micro server | MySQL | [link](../aware-micro-mysql)
 | `aware_csv`| Phone | AWARE app | CSV files | [link](../aware-csv)
 | `aware_influxdb` (beta)| Phone | AWARE app | InfluxDB | [link](../aware-influxdb)
 | `fitbitjson_mysql`| Fitbit | JSON (per [Fitbit's API](https://dev.fitbit.com/build/reference/web-api/)) | MySQL | [link](../fitbitjson-mysql)
--- a/docs/developers/git-flow.md
+++ b/docs/developers/git-flow.md
@ -127,9 +127,9 @@ git branch -d release/v[NEW_RELEASE]
 ```
 git checkout master
 git merge --ff-only develop
-git push
+git push # Unlock the master branch before merging
 ```
-1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_RELEASE]` (remember to add the change log)
+1. Release happens automatically after passing the tests

 ## Release a Hotfix
 1. Pull the latest master
@ -156,6 +156,6 @@ git branch -d hotfix/v[NEW_HOTFIX]
 ```
 git checkout master
 git merge --ff-only v[NEW_HOTFIX]
-git push
+git push # Unlock the master branch before merging
 ```
-1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_HOTFIX]` (remember to add the change log)
+1. Release happens automatically after passing the tests
--- a/docs/features/fitbit-steps-intraday.md
+++ b/docs/features/fitbit-steps-intraday.md
@ -29,6 +29,7 @@ Parameters description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`                | Set to `True` to extract `FITBIT_STEPS_INTRADAY` features from the `RAPIDS` provider|
 |`[FEATURES]`               |         Features to be computed from steps intraday data, see table below           |
+|`[REFERENCE_HOUR]`         | The reference point from which `firststeptime` and `laststeptime` are computed. The default is midnight. |
 |`[THRESHOLD_ACTIVE_BOUT]`  | Every minute with Fitbit steps data will be labelled as `sedentary` if its step count is below this threshold, otherwise, `active`.    |
 |`[INCLUDE_ZERO_STEP_ROWS]` | Whether or not to include time segments with a 0 step count during the whole day.                          |

@ -42,6 +43,8 @@ Features description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
 |minsteps                   |steps          |The minimum step count during a time segment.
 |avgsteps                   |steps          |The average step count during a time segment.
 |stdsteps                   |steps          |The standard deviation of step count during a time segment.
+|firststeptime              |minutes        |Minutes from the reference point (`[REFERENCE_HOUR]`) until the first non-zero step count.
+|laststeptime               |minutes        |Minutes from the reference point (`[REFERENCE_HOUR]`) until the last non-zero step count.
 |countepisodesedentarybout  |bouts          |Number of sedentary bouts during a time segment.
 |sumdurationsedentarybout   |minutes        |Total duration of all sedentary bouts during a time segment.
 |maxdurationsedentarybout   |minutes        |The maximum duration of any sedentary bout during a time segment.
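
Review note: the `firststeptime` and `laststeptime` descriptions above can be read as the following minimal sketch. It is not the provider's code. The function name, the `(timestamp, steps)` row layout, and the choice to anchor the reference hour on the day of the first non-zero row are assumptions made for illustration only.

```python
from datetime import datetime


def first_last_step_minutes(rows, reference_hour=0):
    """Return (firststeptime, laststeptime) in minutes after the reference hour.

    rows: iterable of (datetime, steps) pairs for one time segment (hypothetical layout).
    Returns (None, None) when no row has a non-zero step count.
    """
    # Keep only the timestamps at which at least one step was recorded.
    nonzero = sorted(ts for ts, steps in rows if steps > 0)
    if not nonzero:
        return None, None
    # Assumption: the reference point is reference_hour on the day of the first step.
    reference = nonzero[0].replace(hour=reference_hour, minute=0, second=0, microsecond=0)

    def minutes_since_reference(ts):
        return (ts - reference).total_seconds() / 60

    return minutes_since_reference(nonzero[0]), minutes_since_reference(nonzero[-1])


rows = [
    (datetime(2021, 1, 1, 7, 0), 0),
    (datetime(2021, 1, 1, 7, 30), 12),
    (datetime(2021, 1, 1, 21, 15), 40),
    (datetime(2021, 1, 1, 23, 0), 0),
]
first, last = first_last_step_minutes(rows)
print(first, last)
```

With the default midnight reference, the first non-zero minute (07:30) maps to 450 minutes and the last one (21:15) to 1275 minutes.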
--- a/docs/features/phone-keyboard.md
+++ b/docs/features/phone-keyboard.md
@ -6,6 +6,12 @@ Sensor parameters description for `[PHONE_KEYBOARD]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the keyboard data is stored

+## RAPIDS provider
+
+!!! info "Available time segments and platforms"
+    - Available for all time segments
+    - Available for Android only
+
 !!! info "File Sequence"
    ```bash
    - data/raw/{pid}/phone_keyboard_raw.csv
--- a/docs/index.md
+++ b/docs/index.md
@ -1,12 +1,12 @@
 # Welcome to RAPIDS documentation

-Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](workflow-examples/analysis.md) your analysis into reproducible workflows.
+Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows. Check out our [paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823)!

 RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables (the latter in collaboration with the [DBDP](https://dbdp.org/)). 

 !!! tip "Where do I start?"

-    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](workflow-examples/minimal)
+    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](analysis/minimal)

    :material-play-speed: [Install](setup/installation), [configure](setup/configuration), and [execute](setup/execution) RAPIDS to [extract](features/feature-introduction.md) and [plot](visualizations/data-quality-visualizations.md) behavioral features

--- a/docs/setup/overview.md
+++ b/docs/setup/overview.md
@ -23,10 +23,10 @@ Let's review some key concepts we use throughout these docs:
    - [Add your own behavioral features](../../features/add-new-features/) (we can include them in RAPIDS if you want to share them with the community)
    - [Add support for new data streams](../../datastreams/add-new-data-streams/) if yours cannot be processed by RAPIDS yet
    - Create visualizations for [data quality control](../../visualizations/data-quality-visualizations/)  and [feature inspection](../../visualizations/feature-visualizations/)
-    - [Extending RAPIDS to organize your analysis](../../workflow-examples/analysis/) and publish a code repository along with your code
+    - [Extending RAPIDS to organize your analysis](../../analysis/complete-workflow-example/) and publish a code repository along with your code

 !!! hint
-    - We recommend you follow the [Minimal Example](../../workflow-examples/minimal/) tutorial to get familiar with RAPIDS
+    - We recommend you follow the [Minimal Example](../../analysis/minimal/) tutorial to get familiar with RAPIDS

    - In order to follow any of the previous tutorials, you will have to [Install](../installation/), [Configure](../configuration/), and learn how to [Execute](../execution/) RAPIDS.

--- a/docs/team.md
+++ b/docs/team.md
@ -15,7 +15,6 @@ If you are interested in contributing feel free to submit a pull request or cont
    Meng Li received her Master of Science degree in Information Science from the University of Pittsburgh. She is interested in applying machine learning algorithms to the medical field.

    - *lim11* at *upmc* . *edu*
-    - [Linkedin Profile](https://www.linkedin.com/in/meng-li-57238414a)
    - [Github Profile](https://github.com/Meng6)

 ###  Abhineeth Reddy Kunta
--- a/donotmakechanges.py
+++ b/donotmakechanges.py
@ -0,0 +1,39 @@
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
+# !
+"""
+Please do not make any changes, as RAPIDS is running on tmux server ...
+"""
+# !
--- a/environment.yml
+++ b/environment.yml
@ -1,112 +1,30 @@
-name: rapids202108
+name: rapids
 channels:
  - conda-forge
-  - defaults
 dependencies:
-  - _py-xgboost-mutex=2.0
-  - appdirs=1.4.4
-  - arrow=0.16.0
-  - asn1crypto=1.4.0
-  - astropy=4.2.1
-  - attrs=20.3.0
-  - binaryornot=0.4.4
-  - blas=1.0
-  - brotlipy=0.7.0
-  - bzip2=1.0.8
-  - ca-certificates=2021.7.5
-  - certifi=2021.5.30
-  - cffi=1.14.4
-  - chardet=3.0.4
-  - click=7.1.2
-  - cookiecutter=1.6.0
-  - cryptography=3.3.1
-  - datrie=0.8.2
-  - docutils=0.16
-  - future=0.18.2
-  - gitdb=4.0.5
-  - gitdb2=4.0.2
-  - gitpython=3.1.11
-  - idna=2.10
-  - imbalanced-learn=0.6.2
-  - importlib-metadata=2.0.0
-  - importlib_metadata=2.0.0
-  - intel-openmp=2019.4
-  - jinja2=2.11.2
-  - jinja2-time=0.2.0
-  - joblib=1.0.0
-  - jsonschema=3.2.0
-  - libblas=3.8.0
-  - libcblas=3.8.0
-  - libcxx=10.0.0
-  - libedit=3.1.20191231
-  - libffi=3.3
-  - libgfortran
-  - liblapack=3.8.0
-  - libopenblas=0.3.10
-  - libxgboost=0.90
-  - lightgbm=3.1.1
-  - llvm-openmp=10.0.0
-  - markupsafe=1.1.1
-  - mkl
-  - mkl-service=2.3.0
-  - mkl_fft=1.2.0
-  - mkl_random=1.1.1
-  - more-itertools=8.6.0
-  - ncurses=6.2
-  - numpy=1.19.2
-  - numpy-base=1.19.2
-  - openblas=0.3.4
-  - openssl=1.1.1k
-  - pandas=1.1.5
-  - pbr=5.5.1
-  - pip=20.3.3
-  - plotly=4.14.1
-  - poyo=0.5.0
-  - psutil=5.7.2
-  - py-xgboost=0.90
-  - pycparser=2.20
-  - pyerfa=1.7.1.1
-  - pyopenssl=20.0.1
-  - pysocks=1.7.1
-  - python=3.7.9
-  - python-dateutil=2.8.1
-  - python_abi=3.7
-  - pytz=2020.4
-  - pyyaml=5.3.1
-  - readline=8.0
-  - requests=2.25.0
-  - retrying=1.3.3
-  - scikit-learn=0.23.2
-  - scipy=1.5.2
-  - setuptools=51.0.0
-  - six=1.15.0
-  - smmap=3.0.4
-  - smmap2=3.0.1
-  - sqlite=3.33.0
-  - threadpoolctl=2.1.0
-  - tk=8.6.10
-  - tqdm=4.62.0
-  - urllib3=1.25.11
-  - wheel=0.36.2
-  - whichcraft=0.6.1
-  - wrapt=1.12.1
-  - xgboost=0.90
-  - xz=5.2.5
-  - yaml=0.2.5
-  - zipp=3.4.0
-  - zlib=1.2.11
-  - pip:
-    - amply==0.1.4
-    - configargparse==0.15.1
-    - decorator==4.4.2
-    - ipython-genutils==0.2.0
-    - jupyter-core==4.6.3
-    - nbformat==5.0.7
-    - pulp==2.4
-    - pyparsing==2.4.7
-    - pyrsistent==0.15.5
-    - ratelimiter==1.2.0.post0
-    - snakemake==5.30.2
-    - toposort==1.5
-    - traitlets==4.3.3
-prefix: /usr/local/Caskroom/miniconda/base/envs/rapids202108
+    - auto-sklearn
+    - hmmlearn
+    - imbalanced-learn
+    - jsonschema
+    - lightgbm
+    - matplotlib
+    - numpy
+    - pandas
+    - peakutils
+    - pip
+    - plotly
+    - python-dateutil
+    - pytz
+    - pywavelets
+    - pyyaml
+    - scikit-learn
+    - scipy
+    - seaborn
+    - setuptools
+    - bioconda::snakemake
+    - bioconda::snakemake-minimal
+    - tqdm
+    - xgboost
+    - pip:
+        - biosppy
+        - cr_features>=0.2
--- a/example_profile/Snakefile
+++ b/example_profile/Snakefile
@ -3,7 +3,7 @@ include: "../rules/common.smk"
 include: "../rules/renv.smk"
 include: "../rules/preprocessing.smk"
 include: "../rules/features.smk"
-include: "../rules/models.smk"
+include: "../rules/models_example.smk"
 include: "../rules/reports.smk"

 import itertools
@ -384,6 +384,14 @@ if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")

+# Data Cleaning
+for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() +".csv", pid=config["PIDS"]))
+for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +".csv"))
+
 # Analysis Workflow Example
 models, scalers = [], []
 for model_name in config["PARAMS_FOR_ANALYSIS"]["MODEL_NAMES"]:
@ -401,7 +409,6 @@ files_to_compute.extend(expand("data/raw/{pid}/participant_target_with_datetime.
 files_to_compute.extend(expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]))

 # Individual model
-files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned.csv", pid=config["PIDS"]))
 files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
 files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv", pid=config["PIDS"], cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
 files_to_compute.extend(expand(
@ -414,7 +421,6 @@ files_to_compute.extend(expand(
                            scaler=scalers))

 # Population model
-files_to_compute.append("data/processed/features/all_participants/all_sensor_features_cleaned.csv")
 files_to_compute.append("data/processed/models/population_model/input.csv")
 files_to_compute.extend(expand("data/processed/models/population_model/output_{cv_method}/baselines.csv", cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
 files_to_compute.extend(expand(
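
Review note: the new "Data Cleaning" block in this Snakefile requests one cleaned file per participant and provider, plus one file for all participants. The sketch below mimics that logic in plain Python, without Snakemake's `expand`, so the resulting paths are easy to inspect. The helper name and the toy config are hypothetical.

```python
def cleaned_feature_targets(config):
    """List the cleaned feature files requested by the cleaning sections of a config."""
    targets = []
    # Individual cleaning: one file per participant for every enabled provider.
    for provider, settings in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].items():
        if settings["COMPUTE"]:
            targets.extend(
                f"data/processed/features/{pid}/all_sensor_features_cleaned_{provider.lower()}.csv"
                for pid in config["PIDS"]
            )
    # Overall cleaning: a single file covering all participants per enabled provider.
    for provider, settings in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].items():
        if settings["COMPUTE"]:
            targets.append(
                f"data/processed/features/all_participants/all_sensor_features_cleaned_{provider.lower()}.csv"
            )
    return targets


# Toy config for illustration; only the keys read above are present.
toy_config = {
    "PIDS": ["p01", "p02"],
    "ALL_CLEANING_INDIVIDUAL": {"PROVIDERS": {"RAPIDS": {"COMPUTE": True}}},
    "ALL_CLEANING_OVERALL": {"PROVIDERS": {"RAPIDS": {"COMPUTE": False}}},
}
targets = cleaned_feature_targets(toy_config)
print(targets)
```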
--- a/example_profile/example_config.yaml
+++ b/example_profile/example_config.yaml
@ -84,6 +84,7 @@ PHONE_APPLICATIONS_CRASHES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    PACKAGE_NAMES_HASHED: False
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD

@ -93,6 +94,7 @@ PHONE_APPLICATIONS_FOREGROUND:
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
+    PACKAGE_NAMES_HASHED: False
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS:
@ -120,6 +122,7 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    PACKAGE_NAMES_HASHED: False
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD

@ -521,7 +524,7 @@ HEATMAP_SENSORS_PER_MINUTE_PER_TIME_SEGMENT:

 # See https://www.rapids.science/latest/visualizations/data-quality-visualizations/#4-heatmap-of-sensor-row-count
 HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT:
-  PLOT: True
+  PLOT: False
  SENSORS:  [PHONE_ACTIVITY_RECOGNITION, PHONE_APPLICATIONS_FOREGROUND, PHONE_BATTERY, PHONE_BLUETOOTH, PHONE_CALLS, PHONE_CONVERSATION, PHONE_LIGHT, PHONE_LOCATIONS, PHONE_MESSAGES, PHONE_SCREEN, PHONE_WIFI_CONNECTED, PHONE_WIFI_VISIBLE]

 # Features ------
@ -534,6 +537,46 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
  CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}


+########################################################################################################################
+#                                                    Data Cleaning                                                     #
+########################################################################################################################
+
+ALL_CLEANING_INDIVIDUAL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: True
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: False
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
+
+ALL_CLEANING_OVERALL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: True
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: False
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: False
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
+

 ########################################################################################################################
 #                                              Analysis Workflow Example                                               #
@ -551,12 +594,6 @@ PARAMS_FOR_ANALYSIS:
  TARGET:
    FOLDER: data/external/example_workflow
    CONTAINER: participant_target.csv
-
-  # Cleaning Parameters
-  COLS_NAN_THRESHOLD: 0.3
-  COLS_VAR_THRESHOLD: True
-  ROWS_NAN_THRESHOLD: 0.3
-  DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.75
  
  MODEL_NAMES: [LogReg, kNN , SVM, DT, RF, GB, XGBoost, LightGBM]
  CV_METHODS: [LeaveOneOut]
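
Review note: the `COLS_NAN_THRESHOLD`, `COLS_VAR_THRESHOLD`, and `ROWS_NAN_THRESHOLD` settings added above can be illustrated with a small pure-Python sketch. The real cleaning runs in the R scripts named by `SRC_SCRIPT`, which this diff does not show, so the function below, the order of its steps, and the use of `None` for missing values are assumptions. The "greater than threshold" comparison is chosen so that a threshold of 1 disables the step, matching the config comments.

```python
def nan_ratio(values):
    """Fraction of missing (None) entries; 0.0 for an empty sequence."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def clean_features(rows, cols_nan_threshold=0.3, cols_var_threshold=True, rows_nan_threshold=0.3):
    """rows: list of dicts sharing the same keys, with None marking a missing value."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Drop columns whose ratio of missing values exceeds the threshold.
    kept = [c for c in columns if nan_ratio([r[c] for r in rows]) <= cols_nan_threshold]
    # Optionally drop columns with zero variance (a single distinct observed value).
    if cols_var_threshold:
        kept = [c for c in kept if len({r[c] for r in rows if r[c] is not None}) > 1]
    if not kept:
        return []
    # Drop rows whose ratio of missing values, over the kept columns, exceeds the threshold.
    cleaned = [{c: r[c] for c in kept} for r in rows]
    return [r for r in cleaned if nan_ratio(list(r.values())) <= rows_nan_threshold]


rows = [
    {"steps": 100, "calls": None, "const": 1},
    {"steps": 250, "calls": None, "const": 1},
    {"steps": None, "calls": 3, "const": 1},
    {"steps": 80, "calls": None, "const": 1},
]
cleaned = clean_features(rows)
print(cleaned)
```

In this toy input, `calls` is dropped for being 75% missing, `const` for having zero variance, and the third row for being entirely missing in the remaining column.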
--- a/mkdocs.yml
+++ b/mkdocs.yml
@ -74,7 +74,7 @@ extra_css:
 nav:
  - Home: 'index.md'
  - Overview: setup/overview.md
-  - Minimal Example: workflow-examples/minimal.md
+  - Minimal Example: analysis/minimal.md
  - Citation: citation.md
  - Contributing: contributing.md
  - Setup:
@ -85,6 +85,7 @@ nav:
      - Introduction: datastreams/data-streams-introduction.md
      - Phone:
        - aware_mysql: datastreams/aware-mysql.md
+        - aware_micro_mysql: datastreams/aware-micro-mysql.md
        - aware_csv: datastreams/aware-csv.md
        - aware_influxdb (beta): datastreams/aware-influxdb.md
        - Mandatory Phone Format: datastreams/mandatory-phone-format.md
@ -140,8 +141,9 @@ nav:
  - Visualizations:
    - Data Quality: visualizations/data-quality-visualizations.md
    - Features: visualizations/feature-visualizations.md
-  - Analysis Workflows:
-    - Complete Example: workflow-examples/analysis.md
+  - Analysis:
+    - Data Cleaning: analysis/data-cleaning.md
+    - Complete Workflow Example: analysis/complete-workflow-example.md
  - Developers:
    - Git Flow: developers/git-flow.md
    - Remote Support: developers/remote-support.md
--- a/problems/app_categories.txt
+++ b/problems/app_categories.txt
@ -0,0 +1,33 @@
+Warning: 1241 parsing failures.
+row           col   expected actual                                                                            file
+  1 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  2 is_system_app an integer  FALSE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  3 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  4 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+  5 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
+... ............. .......... ...... ...............................................................................
+See problems(...) for more details.
+
+Warning message:
+The following named parsers don't match the column names: application_name
+Error: Problem with `filter()` input `..1`.
+✖ object 'application_name' not found
+ℹ Input `..1` is `!is.na(application_name)`.
+Backtrace:
+     █
+  1. ├─`%>%`(...)
+  2. ├─dplyr::mutate(...)
+  3. ├─utils::head(., -1)
+  4. ├─dplyr::select(., -c("timestamp"))
+  5. ├─dplyr::filter(., !is.na(application_name))
+  6. ├─dplyr:::filter.data.frame(., !is.na(application_name))
+  7. │ └─dplyr:::filter_rows(.data, ...)
+  8. │   ├─base::withCallingHandlers(...)
+  9. │   └─mask$eval_all_filter(dots, env_filter)
+ 10. └─base::.handleSimpleError(...)
+ 11.   └─dplyr:::h(simpleError(msg, call))
+Execution halted
+[Mon Dec 13 17:19:06 2021]
+Error in rule app_episodes:
+    jobid: 54
+    output: data/interim/p011/phone_app_episodes.csv
--- a/problems/locations_barnett.txt
+++ b/problems/locations_barnett.txt
@ -0,0 +1,5 @@
+Warning message:
+In barnett_daily_features(snakemake) :
+  Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:
+Location data rows within a daily time segment: 0
+Location data time span in days: 398.6
--- a/renv.lock
+++ b/renv.lock
--- a/renv/activate.R
+++ b/renv/activate.R
@ -14,9 +14,6 @@ local({
  # signal that we're loading renv during R startup
  Sys.setenv("RENV_R_INITIALIZING" = "true")
  on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE)
-  
-  if(grepl("Darwin", Sys.info()["sysname"], fixed = TRUE) & grepl("ARM64", Sys.info()["version"], fixed = TRUE)) # M1 Macs
-    Sys.setenv("TZDIR" = file.path(R.home(), "share", "zoneinfo"))

  # signal that we've consented to use renv
  options(renv.consent = TRUE)
--- a/rules/common.smk
+++ b/rules/common.smk
@ -40,6 +40,17 @@ def find_features_files(wildcards):
            feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower()))
    return(feature_files)

+def find_joint_non_empatica_sensor_files(wildcards):
+    joined_files = []
+    for config_key in config.keys():
+        if config_key.startswith(("PHONE", "FITBIT")) and "PROVIDERS" in config[config_key] and isinstance(config[config_key]["PROVIDERS"], dict):
+            for provider_key, provider in config[config_key]["PROVIDERS"].items():
+                if "COMPUTE" in provider.keys() and provider["COMPUTE"]:
+                    joined_files.append("data/processed/features/{pid}/" + config_key.lower() + ".csv")
+                    break
+    return joined_files
+
+
 def optional_steps_sleep_input(wildcards):
    if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
        return "data/raw/{pid}/fitbit_sleep_summary_raw.csv"
@ -114,7 +125,16 @@ def input_tzcodes_file(wilcards):
        if not config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"].lower().endswith(".csv"):
            raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, instead you typed: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
        if not Path(config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]).exists():
-            raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, the file in the path you typed does not exist: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
+            try:
+                config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
+            except KeyError:
+                raise ValueError("To create TZCODES_FILE, a list of timezones should be created " +
+                                 "with the rule preprocessing.smk/prepare_tzcodes_file " +
+                                 "which will create a file specified as config['TIMEZONE']['MULTIPLE']['TZ_FILE']." +
+                                 "\n An alternative is to provide the file manually: " +
+                                 "[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, " +
+                                 "but the file in the path you typed does not exist: " +
+                                 config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
        return [config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]]
    return []
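
Review note: `find_joint_non_empatica_sensor_files`, added earlier in this file, reads the global Snakemake `config`. To check its behavior in isolation, the sketch below passes the config in as an argument and runs it on a toy dictionary. The function name, the extra `isinstance` guards, and the toy config are hypothetical. The selection rule itself (PHONE or FITBIT sections with at least one provider whose `COMPUTE` is true) follows the diff.

```python
def joint_non_empatica_sensor_files(config):
    """Return one per-sensor feature file path for each PHONE/FITBIT sensor in use."""
    joined_files = []
    for config_key, section in config.items():
        if not config_key.startswith(("PHONE", "FITBIT")):
            continue
        providers = section.get("PROVIDERS") if isinstance(section, dict) else None
        if not isinstance(providers, dict):
            continue  # e.g. sensors with no providers implemented
        # One enabled provider is enough to require the sensor's merged file.
        if any(isinstance(p, dict) and p.get("COMPUTE") for p in providers.values()):
            joined_files.append("data/processed/features/{pid}/" + config_key.lower() + ".csv")
    return joined_files


toy_config = {
    "PIDS": ["p01"],
    "PHONE_CALLS": {"PROVIDERS": {"RAPIDS": {"COMPUTE": True}}},
    "PHONE_LIGHT": {"PROVIDERS": {"RAPIDS": {"COMPUTE": False}}},
    "PHONE_APPLICATIONS_CRASHES": {"PROVIDERS": None},
    "FITBIT_STEPS_INTRADAY": {"PROVIDERS": {"RAPIDS": {"COMPUTE": True}}},
    "EMPATICA_TEMPERATURE": {"PROVIDERS": {"DBDP": {"COMPUTE": True}}},
}
joined = joint_non_empatica_sensor_files(toy_config)
print(joined)
```

The `{pid}` placeholder is left unexpanded on purpose, as in the diff, so that Snakemake can fill it in from the wildcard.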

--- a/rules/features.smk
+++ b/rules/features.smk
@ -32,7 +32,7 @@ rule phone_data_yield_r_features:
    output:
        "data/interim/{pid}/phone_data_yield_features/phone_data_yield_r_{provider_key}.csv"
    script:
-        "../src/features/entry.R" 
+        "../src/features/entry.R"

 rule phone_accelerometer_python_features:
    input:
@ -324,6 +324,40 @@ rule conversation_r_features:
    script:
        "../src/features/entry.R"

+rule preprocess_esm:
+    input: "data/raw/{pid}/phone_esm_with_datetime.csv"
+    params:
+        scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"]["STRAW"]["SCALES"]
+    output: "data/interim/{pid}/phone_esm_clean.csv"
+    script:
+        "../src/features/phone_esm/straw/preprocess.py"
+
+rule esm_features:
+    input:
+        sensor_data = "data/interim/{pid}/phone_esm_clean.csv",
+        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
+    params:
+        provider = lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        sensor_key = "phone_esm",
+        scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()]["SCALES"]
+    output: "data/interim/{pid}/phone_esm_features/phone_esm_python_{provider_key}.csv"
+    script:
+        "../src/features/entry.py"
+
+rule phone_speech_python_features:
+    input:
+        sensor_data = "data/raw/{pid}/phone_speech_with_datetime.csv",
+        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
+    params:
+        provider = lambda wildcards: config["PHONE_SPEECH"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        sensor_key = "phone_speech"
+    output: 
+        "data/interim/{pid}/phone_speech_features/phone_speech_python_{provider_key}.csv"
+    script:
+        "../src/features/entry.py"
+
 rule phone_keyboard_python_features:
    input:
        sensor_data = "data/raw/{pid}/phone_keyboard_with_datetime.csv",
@ -761,22 +795,6 @@ rule fitbit_sleep_intraday_r_features:
    script:
        "../src/features/entry.R"

-rule merge_sensor_features_for_individual_participants:
-    input:
-        feature_files = input_merge_sensor_features_for_individual_participants
-    output:
-        "data/processed/features/{pid}/all_sensor_features.csv"
-    script:
-        "../src/features/utils/merge_sensor_features_for_individual_participants.R"
-
-rule merge_sensor_features_for_all_participants:
-    input:
-        feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
-    output:
-        "data/processed/features/all_participants/all_sensor_features.csv"
-    script:
-        "../src/features/utils/merge_sensor_features_for_all_participants.R"
-
 rule empatica_accelerometer_python_features:
    input:
        sensor_data = "data/raw/{pid}/empatica_accelerometer_with_datetime.csv",
@ -786,7 +804,8 @@ rule empatica_accelerometer_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_accelerometer"
    output:
-        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -812,7 +831,8 @@ rule empatica_heartrate_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_heartrate"
    output:
-        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -838,7 +858,8 @@ rule empatica_temperature_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_temperature"
    output:
-        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -864,7 +885,8 @@ rule empatica_electrodermal_activity_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_electrodermal_activity"
    output:
-        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -890,7 +912,8 @@ rule empatica_blood_volume_pulse_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_blood_volume_pulse"
    output:
-        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -916,7 +939,8 @@ rule empatica_inter_beat_interval_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_inter_beat_interval"
    output:
-        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv"
+        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"

@ -958,3 +982,48 @@ rule empatica_tags_r_features:
        "data/interim/{pid}/empatica_tags_features/empatica_tags_r_{provider_key}.csv"
    script:
        "../src/features/entry.R"
+
+rule merge_sensor_features_for_individual_participants:
+    input:
+        feature_files = input_merge_sensor_features_for_individual_participants
+    output:
+        "data/processed/features/{pid}/all_sensor_features.csv"
+    script:
+        "../src/features/utils/merge_sensor_features_for_individual_participants.R"
+
+rule merge_sensor_features_for_all_participants:
+    input:
+        feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
+    output:
+        "data/processed/features/all_participants/all_sensor_features.csv"
+    script:
+        "../src/features/utils/merge_sensor_features_for_all_participants.R"
+
+rule clean_sensor_features_for_individual_participants:
+    input:
+        sensor_data = rules.merge_sensor_features_for_individual_participants.output
+    wildcard_constraints:
+        pid = "("+"|".join(config["PIDS"])+")"
+    params:
+        provider = lambda wildcards: config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        script_extension = "{script_extension}",
+        sensor_key = "all_cleaning_individual"
+    output:
+        "data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv"
+    script:
+        "../src/features/entry.{params.script_extension}"
+
+rule clean_sensor_features_for_all_participants:
+    input:
+        sensor_data = rules.merge_sensor_features_for_all_participants.output
+    params:
+        provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()],
+        provider_key = "{provider_key}",
+        script_extension = "{script_extension}",
+        sensor_key = "all_cleaning_overall",
+        target = "{target}"
+    output:
+        "data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}_({target}).csv"
+    script:
+        "../src/features/entry.{params.script_extension}"
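Note on the `wildcard_constraints` entry in `clean_sensor_features_for_individual_participants`: the expression `"("+"|".join(config["PIDS"])+")"` builds a regex alternation, so `{pid}` can only match one of the configured participant IDs. A minimal sketch of that mechanism with Python's `re` module is given here. The participant IDs and the helper name `pid_constraint` are hypothetical, and the `re.escape` call is an addition that the rule itself does not make (it only matters if an ID contains regex metacharacters).

```python
import re


def pid_constraint(pids):
    """Build a regex alternation such as "(p01|p02)" from participant IDs."""
    # re.escape is an addition for this sketch; the rule joins the raw IDs.
    return "(" + "|".join(re.escape(pid) for pid in pids) + ")"


# Hypothetical IDs standing in for config["PIDS"].
pids = ["p01", "p02", "p10"]

# Emulate how a constrained {pid} wildcard would be matched against a path.
pattern = re.compile(
    r"data/processed/features/(?P<pid>"
    + pid_constraint(pids)
    + r")/all_sensor_features\.csv"
)

match = pattern.fullmatch("data/processed/features/p02/all_sensor_features.csv")
print(match.group("pid") if match else None)

# "all_participants" is not a configured ID, so the constrained pattern rejects it.
no_match = pattern.fullmatch(
    "data/processed/features/all_participants/all_sensor_features.csv"
)
print(no_match)
```

With the constraint in place, a path under `all_participants/` is not mistaken for a per-participant file in this sketch.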
--- a/rules/models.smk
+++ b/rules/models.smk
@ -1,165 +1,52 @@
-rule download_demographic_data:
+rule merge_baseline_data:
+    input:
+        data = expand(config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/{container}", container=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["CONTAINER"])
+    output:
+        "data/raw/baseline_merged.csv"
+    script:
+        "../src/data/merge_baseline_data.py"
+
+rule download_baseline_data:
    input:
        participant_file = "data/external/participant_files/{pid}.yaml",
-        data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
+        data = "data/raw/baseline_merged.csv"
    output:
-        "data/raw/{pid}/participant_info_raw.csv"
+        "data/raw/{pid}/participant_baseline_raw.csv"
    script:
-        "../src/data/workflow_example/download_demographic_data.R"
+        "../src/data/download_baseline_data.py"

-rule demographic_features:
+rule baseline_features:
    input:
-        participant_info = "data/raw/{pid}/participant_info_raw.csv"
+        "data/raw/{pid}/participant_baseline_raw.csv"
    params:
-        pid = "{pid}",
-        features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
+        pid="{pid}",
+        features=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FEATURES"],
+        question_filename=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["QUESTION_LIST"]
    output:
-        "data/processed/features/{pid}/demographic_features.csv"
+        interim="data/interim/{pid}/baseline_questionnaires.csv",
+        features="data/processed/features/{pid}/baseline_features.csv"
    script:
-        "../src/features/workflow_example/demographic_features.py"
+        "../src/data/baseline_features.py"

-rule download_target_data:
+rule select_target:
    input:
-        participant_file = "data/external/participant_files/{pid}.yaml",
-        data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
-    output:
-        "data/raw/{pid}/participant_target_raw.csv"
-    script:
-        "../src/data/workflow_example/download_target_data.R"
-
-rule target_readable_datetime:
-    input:
-        sensor_input = "data/raw/{pid}/participant_target_raw.csv",
-        time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
-        pid_file = "data/external/participant_files/{pid}.yaml",
-        tzcodes_file = input_tzcodes_file,
+        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv"
    params:
-        device_type = "fitbit",
-        timezone_parameters = config["TIMEZONE"],
-        pid = "{pid}",
-        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
-        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
-    output:
-        "data/raw/{pid}/participant_target_with_datetime.csv"
-    script:
-        "../src/data/datetime/readable_datetime.R"
-
-rule parse_targets:
-    input:
-        targets = "data/raw/{pid}/participant_target_with_datetime.csv",
-        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
-    output:
-        "data/processed/targets/{pid}/parsed_targets.csv"
-    script:
-        "../src/models/workflow_example/parse_targets.py"
-
-rule clean_sensor_features_for_individual_participants:
-    input:
-        rules.merge_sensor_features_for_individual_participants.output
-    params:
-        cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
-        cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
-        rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
-        data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
-    output:
-        "data/processed/features/{pid}/all_sensor_features_cleaned.csv"
-    script:
-        "../src/models/workflow_example/clean_sensor_features.R"
-
-rule clean_sensor_features_for_all_participants:
-    input:
-        rules.merge_sensor_features_for_all_participants.output
-    params:
-        cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
-        cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
-        rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
-        data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
-    output:
-        "data/processed/features/all_participants/all_sensor_features_cleaned.csv"
-    script:
-        "../src/models/workflow_example/clean_sensor_features.R"
-
-rule merge_features_and_targets_for_individual_model:
-    input:
-        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned.csv",
-        targets = "data/processed/targets/{pid}/parsed_targets.csv",
+        target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"]
    output:
        "data/processed/models/individual_model/{pid}/input.csv"
    script:
-        "../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
+        "../src/models/select_targets.py"

 rule merge_features_and_targets_for_population_model:
    input:
-        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned.csv",
-        demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
-        targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
-    output:
-        "data/processed/models/population_model/input.csv"
-    script:
-        "../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
-
-rule baselines_for_individual_model:
-    input:
-        "data/processed/models/individual_model/{pid}/input.csv"
+        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_({target}).csv",
+        demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]),
    params:
-        cv_method = "{cv_method}",
-        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
+        target_variable="{target}"
    output:
-        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
-    log:
-        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
+        "data/processed/models/population_model/input_{target}.csv"
    script:
-        "../src/models/workflow_example/baselines.py"
+        "../src/models/merge_features_and_targets_for_population_model.py"

-rule baselines_for_population_model:
-    input:
-        "data/processed/models/population_model/input.csv"
-    params:
-        cv_method = "{cv_method}",
-        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
-    output:
-        "data/processed/models/population_model/output_{cv_method}/baselines.csv"
-    log:
-        "data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
-    script:
-        "../src/models/workflow_example/baselines.py"

-rule modelling_for_individual_participants:
-    input:
-        data = "data/processed/models/individual_model/{pid}/input.csv"
-    params:
-        model = "{model}",
-        cv_method = "{cv_method}",
-        scaler = "{scaler}",
-        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
-        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
-        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
-    output:
-        fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
-        fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
-        overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
-        fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
-    log:
-        "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
-    script:
-        "../src/models/workflow_example/modelling.py"
-
-rule modelling_for_all_participants:
-    input:
-        data = "data/processed/models/population_model/input.csv"
-    params:
-        model = "{model}",
-        cv_method = "{cv_method}",
-        scaler = "{scaler}",
-        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
-        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
-        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
-    output:
-        fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
-        fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
-        overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
-        fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
-    log:
-        "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
-    script:
-        "../src/models/workflow_example/modelling.py"
--- a/rules/models_example.smk
+++ b/rules/models_example.smk
@ -0,0 +1,139 @@
+rule download_demographic_data:
+    input:
+        participant_file = "data/external/participant_files/{pid}.yaml",
+        data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
+    output:
+        "data/raw/{pid}/participant_info_raw.csv"
+    script:
+        "../src/data/workflow_example/download_demographic_data.R"
+
+rule demographic_features:
+    input:
+        participant_info = "data/raw/{pid}/participant_info_raw.csv"
+    params:
+        pid = "{pid}",
+        features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
+    output:
+        "data/processed/features/{pid}/demographic_features.csv"
+    script:
+        "../src/features/workflow_example/demographic_features.py"
+
+rule download_target_data:
+    input:
+        participant_file = "data/external/participant_files/{pid}.yaml",
+        data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
+    output:
+        "data/raw/{pid}/participant_target_raw.csv"
+    script:
+        "../src/data/workflow_example/download_target_data.R"
+
+rule target_readable_datetime:
+    input:
+        sensor_input = "data/raw/{pid}/participant_target_raw.csv",
+        time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
+        pid_file = "data/external/participant_files/{pid}.yaml",
+        tzcodes_file = input_tzcodes_file,
+    params:
+        device_type = "fitbit",
+        timezone_parameters = config["TIMEZONE"],
+        pid = "{pid}",
+        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
+        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
+    output:
+        "data/raw/{pid}/participant_target_with_datetime.csv"
+    script:
+        "../src/data/datetime/readable_datetime.R"
+
+rule parse_targets:
+    input:
+        targets = "data/raw/{pid}/participant_target_with_datetime.csv",
+        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
+    output:
+        "data/processed/targets/{pid}/parsed_targets.csv"
+    script:
+        "../src/models/workflow_example/parse_targets.py"
+
+rule merge_features_and_targets_for_individual_model:
+    input:
+        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_rapids.csv",
+        targets = "data/processed/targets/{pid}/parsed_targets.csv",
+    output:
+        "data/processed/models/individual_model/{pid}/input.csv"
+    script:
+        "../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
+
+rule merge_features_and_targets_for_population_model:
+    input:
+        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_rapids.csv",
+        demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
+        targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
+    output:
+        "data/processed/models/population_model/input.csv"
+    script:
+        "../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
+
+rule baselines_for_individual_model:
+    input:
+        "data/processed/models/individual_model/{pid}/input.csv"
+    params:
+        cv_method = "{cv_method}",
+        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
+    output:
+        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
+    log:
+        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
+    script:
+        "../src/models/workflow_example/baselines.py"
+
+rule baselines_for_population_model:
+    input:
+        "data/processed/models/population_model/input.csv"
+    params:
+        cv_method = "{cv_method}",
+        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
+    output:
+        "data/processed/models/population_model/output_{cv_method}/baselines.csv"
+    log:
+        "data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
+    script:
+        "../src/models/workflow_example/baselines.py"
+
+rule modelling_for_individual_participants:
+    input:
+        data = "data/processed/models/individual_model/{pid}/input.csv"
+    params:
+        model = "{model}",
+        cv_method = "{cv_method}",
+        scaler = "{scaler}",
+        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
+        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
+        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
+    output:
+        fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
+        fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
+        overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
+        fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
+    log:
+        "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
+    script:
+        "../src/models/workflow_example/modelling.py"
+
+rule modelling_for_all_participants:
+    input:
+        data = "data/processed/models/population_model/input.csv"
+    params:
+        model = "{model}",
+        cv_method = "{cv_method}",
+        scaler = "{scaler}",
+        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
+        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
+        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
+    output:
+        fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
+        fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
+        overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
+        fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
+    log:
+        "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
+    script:
+        "../src/models/workflow_example/modelling.py"
--- a/rules/preprocessing.smk
+++ b/rules/preprocessing.smk
@ -4,6 +4,36 @@ rule create_example_participant_files:
    shell:
        "echo 'PHONE:\n  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n  PLATFORMS: [android]\n  LABEL: test-01\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\nFITBIT:\n  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n  LABEL: test-01\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example01.yaml && echo 'PHONE:\n  DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n  PLATFORMS: [ios]\n  LABEL: test-02\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\nFITBIT:\n  DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n  LABEL: test-02\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example02.yaml"

+# rule query_usernames_device_empatica_ids:
+#     params:
+#         baseline_folder = "/mnt/e/STRAWbaseline/"
+#     output:
+#         usernames_file = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"],
+#         timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
+#     script:
+#         "../../participants/prepare_usernames_file.py"
+
+rule prepare_tzcodes_file:
+    input:
+        timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
+    output:
+        tzcodes_file = config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]
+    script:
+        "../tools/create_multi_timezones_file.py"
+
+rule prepare_participants_csv:
+    input:
+        username_list = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"]
+    params:
+        data_configuration = config["PHONE_DATA_STREAMS"][config["PHONE_DATA_STREAMS"]["USE"]],
+        participants_table = "participants",
+        device_id_table = "esm",
+        start_end_date_table = "esm"
+    output:
+        participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
+    script:
+        "../src/data/translate_usernames_into_participants_data.R"
+
 rule create_participants_files:
    input:
        participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"] 
@ -147,7 +177,6 @@ rule resample_episodes_with_datetime:
    script:
        "../src/data/datetime/readable_datetime.R"

-
 rule phone_application_categories:
    input:
        "data/raw/{pid}/phone_applications_{type}_with_datetime.csv"
@ -218,5 +247,33 @@ rule empatica_readable_datetime:
        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
    output:
        "data/raw/{pid}/empatica_{sensor}_with_datetime.csv"
+    resources:
+        mem_mb=50000
    script:
        "../src/data/datetime/readable_datetime.R"
+
+
+rule extract_event_information_from_esm:
+    input:
+        esm_raw_input = "data/raw/{pid}/phone_esm_raw.csv",
+        pid_file = "data/external/participant_files/{pid}.yaml"
+    params:
+        stage = "extract",
+        pid = "{pid}"
+    output:
+        "data/raw/ers/{pid}_ers.csv",
+        "data/raw/ers/{pid}_stress_event_targets.csv"
+    script:
+        "../src/features/phone_esm/straw/process_user_event_related_segments.py"
+
+rule merge_event_related_segments_files:
+    input:
+        ers_files = expand("data/raw/ers/{pid}_ers.csv", pid=config["PIDS"]),
+        se_files = expand("data/raw/ers/{pid}_stress_event_targets.csv", pid=config["PIDS"])
+    params:
+        stage = "merge"
+    output:
+        "data/external/straw_events.csv",
+        "data/external/stress_event_targets.csv"
+    script:
+        "../src/features/phone_esm/straw/process_user_event_related_segments.py"
--- a/src/data/baseline_features.py
+++ b/src/data/baseline_features.py
@ -0,0 +1,182 @@
+import numpy as np
+import pandas as pd
+
+pid = snakemake.params["pid"]
+requested_features = snakemake.params["features"]
+baseline_interim = pd.DataFrame(columns=["qid", "question", "score_original", "score"])
+baseline_features = pd.DataFrame(columns=requested_features)
+question_filename = snakemake.params["question_filename"]
+
+JCQ_DEMAND = "JobEisen"
+JCQ_CONTROL = "JobControle"
+
+dict_JCQ_demand_control_reverse = {
+    JCQ_DEMAND: {
+        3: " [Od mene se ne zahteva,",
+        4: " [Imam dovolj časa, da končam",
+        5: " [Pri svojem delu se ne srečujem s konfliktnimi",
+    },
+    JCQ_CONTROL: {
+        2: " [Moje delo vključuje veliko ponavljajočega",
+        6: " [Pri svojem delu imam zelo malo svobode",
+    },
+}
+
+LIMESURVEY_JCQ_MIN = 1
+LIMESURVEY_JCQ_MAX = 4
+
+DEMAND_CONTROL_RATIO_MIN = 5 / (9 * 4)
+DEMAND_CONTROL_RATIO_MAX = (4 * 5) / 9
+
+JCQ_NORMS = {
+    "F": {
+        0: DEMAND_CONTROL_RATIO_MIN,
+        1: 0.45,
+        2: 0.52,
+        3: 0.62,
+        4: DEMAND_CONTROL_RATIO_MAX,
+    },
+    "M": {
+        0: DEMAND_CONTROL_RATIO_MIN,
+        1: 0.41,
+        2: 0.48,
+        3: 0.56,
+        4: DEMAND_CONTROL_RATIO_MAX,
+    },
+}
+
+participant_info = pd.read_csv(snakemake.input[0], parse_dates=["date_of_birth"])
+
+if not participant_info.empty:
+    if "age" in requested_features:
+        now = pd.Timestamp("now")
+        baseline_features.loc[0, "age"] = (
+            now - participant_info.loc[0, "date_of_birth"]
+        ).days / 365.2425
+    if "gender" in requested_features:
+        baseline_features.loc[0, "gender"] = participant_info.loc[0, "gender"]
+    if "startlanguage" in requested_features:
+        baseline_features.loc[0, "startlanguage"] = participant_info.loc[
+            0, "startlanguage"
+        ]
+    if (
+        ("limesurvey_demand" in requested_features)
+        or ("limesurvey_control" in requested_features)
+        or ("limesurvey_demand_control_ratio" in requested_features)
+    ):
+        participant_info_t = participant_info.T
+        rows_baseline = participant_info_t.index
+
+        if ("limesurvey_demand" in requested_features) or (
+            "limesurvey_demand_control_ratio" in requested_features
+        ):
+            # Find questions about demand, but disregard time (duration of filling in questionnaire)
+            rows_demand = rows_baseline.str.startswith(
+                JCQ_DEMAND
+            ) & ~rows_baseline.str.endswith("Time")
+            limesurvey_demand = (
+                participant_info_t[rows_demand]
+                .reset_index()
+                .rename(columns={"index": "question", 0: "score_original"})
+            )
+            # Extract question IDs from names such as JobEisen[3]
+            limesurvey_demand["qid"] = (
+                limesurvey_demand["question"].str.extract(r"\[(\d+)\]").astype(int)
+            )
+            limesurvey_demand["score"] = limesurvey_demand["score_original"]
+            # Identify rows that include questions to be reversed.
+            rows_demand_reverse = limesurvey_demand["qid"].isin(
+                dict_JCQ_demand_control_reverse[JCQ_DEMAND].keys()
+            )
+            # Reverse the score, so that the maximum value becomes the minimum etc.
+            limesurvey_demand.loc[rows_demand_reverse, "score"] = (
+                LIMESURVEY_JCQ_MAX
+                + LIMESURVEY_JCQ_MIN
+                - limesurvey_demand.loc[rows_demand_reverse, "score_original"]
+            )
+            baseline_interim = pd.concat([baseline_interim, limesurvey_demand], axis=0, ignore_index=True)
+            if "limesurvey_demand" in requested_features:
+                baseline_features.loc[0, "limesurvey_demand"] = limesurvey_demand[
+                    "score"
+                ].sum()
+
+        if ("limesurvey_control" in requested_features) or (
+            "limesurvey_demand_control_ratio" in requested_features
+        ):
+            # Find questions about control, but disregard time (duration of filling in questionnaire)
+            rows_control = rows_baseline.str.startswith(
+                JCQ_CONTROL
+            ) & ~rows_baseline.str.endswith("Time")
+            limesurvey_control = (
+                participant_info_t[rows_control]
+                .reset_index()
+                .rename(columns={"index": "question", 0: "score_original"})
+            )
+            # Extract question IDs from names such as JobControle[3]
+            limesurvey_control["qid"] = (
+                limesurvey_control["question"].str.extract(r"\[(\d+)\]").astype(int)
+            )
+            limesurvey_control["score"] = limesurvey_control["score_original"]
+            # Identify rows that include questions to be reversed.
+            rows_control_reverse = limesurvey_control["qid"].isin(
+                dict_JCQ_demand_control_reverse[JCQ_CONTROL].keys()
+            )
+            # Reverse the score, so that the maximum value becomes the minimum etc.
+            limesurvey_control.loc[rows_control_reverse, "score"] = (
+                LIMESURVEY_JCQ_MAX
+                + LIMESURVEY_JCQ_MIN
+                - limesurvey_control.loc[rows_control_reverse, "score_original"]
+            )
+
+            baseline_interim = pd.concat([baseline_interim, limesurvey_control], axis=0, ignore_index=True)
+
+            if "limesurvey_control" in requested_features:
+                baseline_features.loc[0, "limesurvey_control"] = limesurvey_control[
+                    "score"
+                ].sum()
+
+        if "limesurvey_demand_control_ratio" in requested_features:
+            if limesurvey_control["score"].sum():
+                limesurvey_demand_control_ratio = (
+                        limesurvey_demand["score"].sum() / limesurvey_control["score"].sum()
+                )
+            else:
+                limesurvey_demand_control_ratio = 0
+            # Look up the quartile from the gender-specific norm boundaries.
+            norms = JCQ_NORMS[participant_info.loc[0, "gender"]]
+            limesurvey_quartile = np.nan
+            for quartile in range(1, 5):
+                if (
+                    norms[quartile - 1]
+                    <= limesurvey_demand_control_ratio
+                    < norms[quartile]
+                ):
+                    limesurvey_quartile = quartile
+                    break
+
+            baseline_features.loc[
+                0, "limesurvey_demand_control_ratio"
+            ] = limesurvey_demand_control_ratio
+            baseline_features.loc[
+                0, "limesurvey_demand_control_ratio_quartile"
+            ] = limesurvey_quartile
+
+if not baseline_interim.empty:
+    baseline_interim.to_csv(snakemake.output["interim"], index=False, encoding="utf-8")
+
+baseline_features.to_csv(snakemake.output["features"], index=False, encoding="utf-8")
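Note on the scoring in `baseline_features.py`: negatively worded items are reversed with `LIMESURVEY_JCQ_MAX + LIMESURVEY_JCQ_MIN - score`, which mirrors a 1 to 4 answer around the middle of the scale, and the bounds `DEMAND_CONTROL_RATIO_MIN` and `DEMAND_CONTROL_RATIO_MAX` appear to be the smallest and largest possible ratio of the two scale sums. The sketch below reproduces both calculations in plain Python. The item counts (5 demand items, 9 control items) are an assumption read off the constants `5 / (9 * 4)` and `(4 * 5) / 9`, the helper names are hypothetical, and the answers are made-up values rather than participant data.

```python
LIKERT_MIN = 1  # mirrors LIMESURVEY_JCQ_MIN
LIKERT_MAX = 4  # mirrors LIMESURVEY_JCQ_MAX


def reverse_score(score, low=LIKERT_MIN, high=LIKERT_MAX):
    """Mirror a score around the scale midpoint: 1 <-> 4 and 2 <-> 3."""
    return high + low - score


def scale_sum(scores, reversed_ids):
    """Sum item scores, reversing items whose question ID is in reversed_ids."""
    return sum(
        reverse_score(score) if qid in reversed_ids else score
        for qid, score in scores.items()
    )


# Hypothetical answers keyed by question ID; items 3, 4 and 5 are reversed,
# as in dict_JCQ_demand_control_reverse[JCQ_DEMAND].
demand_answers = {1: 3, 2: 4, 3: 2, 4: 1, 5: 2}
demand = scale_sum(demand_answers, reversed_ids={3, 4, 5})
print(demand)

# Assumed item counts: 5 demand items and 9 control items.
N_DEMAND, N_CONTROL = 5, 9
# Smallest ratio: every demand item at the minimum, every control item at the maximum.
ratio_min = (N_DEMAND * LIKERT_MIN) / (N_CONTROL * LIKERT_MAX)
# Largest ratio: every demand item at the maximum, every control item at the minimum.
ratio_max = (N_DEMAND * LIKERT_MAX) / (N_CONTROL * LIKERT_MIN)
print(ratio_min, ratio_max)
```

Under these assumptions the two bounds coincide with the constants used as the outer edges of `JCQ_NORMS`.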
--- a/src/data/create_participants_files.R
+++ b/src/data/create_participants_files.R
@ -1,6 +1,6 @@
 source("renv/activate.R")

-library(RMariaDB)
+#library(RMariaDB)
 library(stringr)
 library(purrr)
 library(readr)
@ -58,7 +58,7 @@ participants %>%
      lines <- append(lines, empty_fitbit)

    if(add_empatica_section == TRUE && !is.na(row[empatica_device_id_column])){
-      lines <- append(lines, c("EMPATICA:", paste0("  DEVICE_IDS: [",row[empatica_device_id_column],"]"),
+      lines <- append(lines, c("EMPATICA:", paste0("  DEVICE_IDS: [",row$label,"]"),
                               paste("  LABEL:",row$label), paste("  START_DATE:", start_date), paste("  END_DATE:", end_date)))
    } else
      lines <- append(lines, empty_empatica)
--- a/src/data/datetime/assign_to_time_segment.R
+++ b/src/data/datetime/assign_to_time_segment.R
@ -5,13 +5,16 @@ options(scipen=999)

 assign_rows_to_segments <- function(data, segments){
  # This function is used by all segment types, we use data.tables because they are fast
+
  data <- data.table::as.data.table(data)
  data[, assigned_segments := ""]
  for(i in seq_len(nrow(segments))) {
    segment <- segments[i,]
+
    data[segment$segment_start_ts<= timestamp & segment$segment_end_ts >= timestamp,
         assigned_segments := stringi::stri_c(assigned_segments, segment$segment_id, sep = "|")]
  }
+  
  data[,assigned_segments:=substring(assigned_segments, 2)]
  data
 }
--- a/src/data/download_baseline_data.py
+++ b/src/data/download_baseline_data.py
@ -0,0 +1,14 @@
+import pandas as pd
+import yaml
+
+filename = snakemake.input["data"]
+baseline = pd.read_csv(filename)
+
+with open(snakemake.input["participant_file"], "r") as file:
+    participant = yaml.safe_load(file)
+
+username = participant["PHONE"]["LABEL"]
+
+baseline[baseline["username"] == username].to_csv(snakemake.output[0],
+                                                  index=False,
+                                                  encoding="utf-8",)
--- a/src/data/merge_baseline_data.py
+++ b/src/data/merge_baseline_data.py
@ -0,0 +1,30 @@
+import pandas as pd
+
+VARIABLES_TO_TRANSLATE = {
+    "Gebruikersnaam": "username",
+    "Geslacht": "gender",
+    "Geboortedatum": "date_of_birth",
+}
+
+filenames = snakemake.input["data"]
+
+baseline_dfs = []
+
+for fn in filenames:
+    baseline_dfs.append(pd.read_csv(fn,
+                                    parse_dates=["Geboortedatum"],
+                                    infer_datetime_format=True,
+                                    cache_dates=True,
+                                    ))
+
+baseline = (
+    pd.concat(baseline_dfs, join="inner")
+    .reset_index()
+    .drop(columns="index")
+)
+
+baseline.rename(columns=VARIABLES_TO_TRANSLATE, copy=False, inplace=True)
+
+baseline.to_csv(snakemake.output[0],
+                index=False,
+                encoding="utf-8",)
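Review note: the merge step in `merge_baseline_data.py` can be tried without Snakemake or pandas. The following is a minimal standard-library sketch, assuming made-up CSV exports; `merge_and_translate` is a hypothetical helper, not part of this change. It mimics the inner join on columns and the Dutch-to-English renaming, but it does not parse dates.

```python
import csv
import io

# Same Dutch-to-English mapping as in merge_baseline_data.py.
VARIABLES_TO_TRANSLATE = {
    "Gebruikersnaam": "username",
    "Geslacht": "gender",
    "Geboortedatum": "date_of_birth",
}


def merge_and_translate(csv_texts):
    """Concatenate CSV exports, keep only shared columns, translate their names."""
    tables = [list(csv.DictReader(io.StringIO(text))) for text in csv_texts]
    headers = [list(table[0].keys()) for table in tables if table]
    if not headers:
        return []
    # Columns present in every export, in the order of the first one (inner join).
    shared = [c for c in headers[0] if all(c in h for h in headers[1:])]
    return [
        {VARIABLES_TO_TRANSLATE.get(c, c): row[c] for c in shared}
        for table in tables
        for row in table
    ]


# Two made-up exports; the second has an extra column that the inner join drops.
export_a = "Gebruikersnaam,Geslacht,Geboortedatum\nuser_a,F,1990-01-31\n"
export_b = "Gebruikersnaam,Geslacht,Geboortedatum,Extra\nuser_b,M,1985-06-15,x\n"

merged = merge_and_translate([export_a, export_b])
print(merged)
```

Columns that appear in only some exports are dropped, which matches the `join="inner"` choice in the script.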
--- a/src/data/streams/aware_micro_mysql/container.R
+++ b/src/data/streams/aware_micro_mysql/container.R
@ -0,0 +1,85 @@
+# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
+library(RMariaDB)
+library(yaml)
+
+#' @description
+#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
+#' You can reuse most of this function if you are connecting to a DB or Web API.
+#' It's OK to delete this function if you don't need credentials, e.g., you are pulling data from a CSV.
+#' @param group the yaml key containing the credentials to connect to a database
+#' @return dbEngine a database engine (connection) ready to perform queries
+get_db_engine <- function(group){
+  # The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
+  credentials <- read_yaml("./credentials.yaml")
+  if(!group %in% names(credentials))
+    stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_micro_mysql][DATABASE_GROUP] in config.yaml?"))
+  dbEngine <- dbConnect(MariaDB(), db = credentials[[group]][["database"]],
+                                    username = credentials[[group]][["user"]],
+                                    password = credentials[[group]][["password"]],
+                                    host = credentials[[group]][["host"]],
+                                    port = credentials[[group]][["port"]])
+  return(dbEngine)
+}
+
+# This file gets executed for each PHONE_SENSOR of each participant
+# If you are connecting to a database, its credentials are read from "./credentials.yaml" (see get_db_engine above)
+# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
+# You are not bound to databases or files, you can query a web API or whatever data source you need.
+
+#' @description
+#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system (OS) a device was running.
+#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
+#'  
+#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed, 
+#' and they have to assign the OS manually in the participant file
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @return The OS the device ran, "android" or "ios"
+
+infer_device_os <- function(stream_parameters, device){
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
+  message(paste0("Executing the following query to infer phone OS: ", query)) 
+  os <- dbGetQuery(dbEngine, query)
+  dbDisconnect(dbEngine)
+  
+  if(nrow(os) > 0)
+    return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
+  else
+    stop(paste("We cannot infer the OS of the following device id because it does not exist in the aware_device table:", device))
+  
+  return(os)
+}
+
+#' @description
+#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
+#' @param columns the columns needed from this sensor (we recommend returning only these columns instead of every column in sensor_container)
+#' @return A dataframe with the sensor data for device
+
+pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+
+  select_items <- c()
+  for (column in columns) {
+    select_items <- append(select_items, paste0("data->>'$.", column, "' ", column))
+  }
+
+  query <- paste0("SELECT ", paste(select_items, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
+
+  # Letting the user know what we are doing
+  message(paste0("Executing the following query to download data: ", query)) 
+  sensor_data <- dbGetQuery(dbEngine, query)
+  
+  dbDisconnect(dbEngine)
+  
+  if(nrow(sensor_data) == 0)
+    warning(paste0("The device '", device, "' did not have data in ", sensor_container))
+
+  return(sensor_data)
+}
+
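Review note: `pull_data()` above selects each requested column out of a JSON `data` field with MySQL's `->>` operator. The Python sketch below rebuilds the same query string so the resulting SQL is easy to inspect; `build_json_query`, the table name `battery`, and the device ID are hypothetical and used for illustration only.

```python
def build_json_query(columns, sensor_container, device):
    """Build the SELECT statement the same way pull_data() in container.R does.

    `columns` maps RAPIDS column names to source column names, mirroring the
    named list that the R function receives.
    """
    # One JSON extraction per requested column, aliased to the column name.
    select_items = [f"data->>'$.{column}' {column}" for column in columns.values()]
    return (
        f"SELECT {','.join(select_items)} FROM {sensor_container}"
        f" WHERE {columns['DEVICE_ID']} = '{device}'"
    )


# Hypothetical inputs: two mapped columns, a table called "battery", one device.
columns = {"TIMESTAMP": "timestamp", "DEVICE_ID": "device_id"}
query = build_json_query(columns, "battery", "abc-123")
print(query)
```

As in the R code, the values are interpolated directly into the SQL text, so the table name and device ID have to come from trusted configuration.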
--- a/src/data/streams/aware_micro_mysql/format.yaml
+++ b/src/data/streams/aware_micro_mysql/format.yaml
@ -0,0 +1,337 @@
+PHONE_ACCELEROMETER:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_ACTIVITY_RECOGNITION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: activity_name
+      ACTIVITY_TYPE: activity_type
+      CONFIDENCE: confidence
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: FLAG_TO_MUTATE
+      ACTIVITY_TYPE: FLAG_TO_MUTATE
+      CONFIDENCE: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        ACTIVITIES: activities
+        CONFIDENCE: confidence
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
+
+PHONE_APPLICATIONS_CRASHES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      APPLICATION_VERSION: application_version
+      ERROR_SHORT: error_short
+      ERROR_LONG: error_long
+      ERROR_CONDITION: error_condition
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_APPLICATIONS_FOREGROUND:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_APPLICATIONS_NOTIFICATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      TEXT: text
+      SOUND: sound
+      VIBRATE: vibrate
+      DEFAULTS: defaults
+      FLAGS: flags
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_BATTERY:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: battery_status
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: FLAG_TO_MUTATE
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+        BATTERY_STATUS: battery_status
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
+
+PHONE_BLUETOOTH:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_CALLS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: call_type
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: FLAG_TO_MUTATE
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+        CALL_TYPE: call_type
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
+
+PHONE_CONVERSATION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
+
+PHONE_KEYBOARD:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      BEFORE_TEXT: before_text
+      CURRENT_TEXT: current_text
+      IS_PASSWORD: is_password
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LIGHT:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LIGHT_LUX: double_light_lux
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOCATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOG:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_MESSAGES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MESSAGE_TYPE: message_type
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_SCREEN:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: screen_status
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        SCREEN_STATUS: screen_status
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
+
+PHONE_WIFI_CONNECTED:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_WIFI_VISIBLE:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
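Review note: this diff does not show how RAPIDS consumes `format.yaml`, so the following is only a sketch of the idea under stated assumptions. It assumes that each `RAPIDS_COLUMN_MAPPINGS` entry renames a source column to its RAPIDS name and that `FLAG_TO_MUTATE` entries are skipped at this stage and left to the mutation scripts. `apply_column_mappings` and the sample row are hypothetical.

```python
FLAG_TO_MUTATE = "FLAG_TO_MUTATE"


def apply_column_mappings(row, mappings):
    """Rename source columns to RAPIDS names, skipping columns flagged for mutation."""
    renamed = {}
    for rapids_name, source_name in mappings.items():
        if source_name == FLAG_TO_MUTATE:
            # Assumption: a mutation script fills this column in later.
            continue
        renamed[rapids_name] = row[source_name]
    return renamed


# The PHONE_BATTERY / IOS mappings from format.yaml, written as a dict.
ios_battery_mappings = {
    "TIMESTAMP": "timestamp",
    "DEVICE_ID": "device_id",
    "BATTERY_STATUS": FLAG_TO_MUTATE,
    "BATTERY_LEVEL": "battery_level",
    "BATTERY_SCALE": "battery_scale",
}

# A made-up raw row as it might come back from pull_data().
raw_row = {
    "timestamp": 1681800000000,
    "device_id": "abc-123",
    "battery_status": 2,
    "battery_level": 80,
    "battery_scale": 100,
}

mapped = apply_column_mappings(raw_row, ios_battery_mappings)
print(mapped)
```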
--- a/src/data/streams/aware_postgresql/container.R
+++ b/src/data/streams/aware_postgresql/container.R
@ -0,0 +1,212 @@
+# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
+library(RPostgres)
+# Needs libpq-dev for compiling from source.
+
+# Error installing package 'RPostgres':
+#   =====================================
+#   
+#   * installing *source* package 'RPostgres' ...
+# ** package 'RPostgres' successfully unpacked and MD5 sums checked
+# ** using staged installation
+# Using PKG_CFLAGS=
+#   Using PKG_LIBS=-lpq
+# Using PKG_PLOGR=
+#   ------------------------- ANTICONF ERROR ---------------------------
+#   Configuration failed because libpq was not found. Try installing:
+#   * deb: libpq-dev (Debian, Ubuntu, etc)
+# * rpm: postgresql-devel (Fedora, EPEL)
+# * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
+# * csw: postgresql_dev (Solaris)
+# * brew: libpq (OSX)
+# If libpq is already installed, check that either:
+#   (i)  'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains
+# a libpq.pc file; or
+# (ii) 'pg_config' is in your PATH.
+# If neither can detect , you can set INCLUDE_DIR
+# and LIB_DIR manually via:
+#   R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
+# --------------------------[ ERROR MESSAGE ]----------------------------
+#   <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
+# compilation terminated.
+
+library(dbplyr)
+library(yaml)
+
+#' @description
+#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
+#' You can reuse most of this function if you are connecting to a DB or Web API.
+#' It's OK to delete this function if you don't need credentials, e.g., you are pulling data from a CSV.
+#' @param group the yaml key containing the credentials to connect to a database
+#' @return dbEngine a database engine (connection) ready to perform queries
+get_db_engine <- function(group){
+  # The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
+  credentials <- read_yaml("./credentials.yaml")
+  if(!group %in% names(credentials))
+    stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_postgresql][DATABASE_GROUP] in config.yaml?"))
+  dbEngine <- dbConnect(Postgres(), db = credentials[[group]][["database"]],
+                        user = credentials[[group]][["user"]],
+                        password = credentials[[group]][["password"]],
+                        host = credentials[[group]][["host"]],
+                        port = credentials[[group]][["port"]])
+  return(dbEngine)
+}
+
+# This file gets executed for each PHONE_SENSOR of each participant
+# If you are connecting to a database, its credentials are read from "./credentials.yaml" (see get_db_engine above)
+# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
+# You are not bound to databases or files, you can query a web API or whatever data source you need.
+
+#' @description
+#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system (OS) a device was running.
+#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
+#'  
+#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed, 
+#' and they have to assign the OS manually in the participant file
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @return The OS the device ran, "android" or "ios"
+
+infer_device_os <- function(stream_parameters, device){
+  #dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  #query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
+  #message(paste0("Executing the following query to infer phone OS: ", query)) 
+  #os <- dbGetQuery(dbEngine, query)
+  #dbDisconnect(dbEngine)
+  
+  #if(nrow(os) > 0)
+  #  return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
+  #else
+  stop(paste("We cannot infer the OS of the following device id because the aware_device table does not exist."))
+  
+  #return(os)
+}
+
+#' @description
+#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
+#' 
+#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
+#' @param device A device ID string
+#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
+#' @param columns the columns needed from this sensor (we recommend returning only these columns instead of every column in sensor_container)
+#' @return A dataframe with the sensor data for device
+
+pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
+  # Letting the user know what we are doing
+  message(paste0("Executing the following query to download data: ", query)) 
+  sensor_data <- dbGetQuery(dbEngine, query)
+  
+  dbDisconnect(dbEngine)
+  
+  if(nrow(sensor_data) == 0)
+    warning(paste0("The device '", device, "' did not have data in ", sensor_container))
+  
+  return(sensor_data)
+}
+
+#' @description
+#' Gets participants' IDs for specified usernames.
+#'
+#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
+#' @param usernames A vector of usernames
+#' @param participants_container The name of the database table containing participants data, such as their username.
+#' @return A dataframe with participant IDs matching usernames
+
+pull_participants_ids <- function(stream_parameters, usernames, participants_container) {
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+  
+  query_participant_id <- tbl(dbEngine, participants_container) %>% 
+    filter(username %in% !!usernames) %>% 
+    select(username, id)
+  
+  message(paste0("Executing the following query to get participants' IDs: \n", sql_render(query_participant_id)))
+  
+  participant_data <- query_participant_id %>% collect()
+
+  dbDisconnect(dbEngine)
+  
+  if(nrow(participant_data) == 0)
+    warning(paste0("We could not find the requested usernames (", paste(usernames, collapse = ", "), ") in ", participants_container))
+  
+  return(participant_data)
+}
+
+#' @description
+#' Gets distinct device IDs for specified participant IDs.
+#'
+#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
+#' @param participants_ids A vector of numeric participant IDs
+#' @param device_id_container The name of the database table which will be used to determine distinct device ID. Ideally, a table that reliably contains data, but not too much.
+#' @return A dataframe with a row matching each distinct device ID with a participant ID
+
+pull_participants_device_ids <- function(stream_parameters, participants_ids, device_id_container) {
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+
+  query_device_id <- tbl(dbEngine, device_id_container) %>%
+    filter(participant_id %in% !!participants_ids) %>% 
+    group_by(participant_id) %>% 
+    distinct(device_id, .keep_all = FALSE)
+  
+  message(
+    paste0(
+      "Executing the following query to get the distinct device IDs: \n",
+      sql_render(query_device_id),
+      "\n NOTE: This might take a long time."
+    )
+  )
+  
+  device_ids <- query_device_id %>% collect()
+  
+  dbDisconnect(dbEngine)
+  
+  if(nrow(device_ids) == 0)
+    warning(paste0("We could not find device IDs for the requested participant IDs (", paste(participants_ids, collapse = ", "), ") in ", device_id_container))
+  
+  return(device_ids)
+}
+
+#' @description
+#' Gets start and end datetimes for specified participant IDs.
+#'
+#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
+#' @param participants_ids A vector of numeric participant IDs
+#' @param start_end_date_container The name of the database table which will be used to determine when a participant started and ended their participation. Briefing and debriefing EMAs can be meaningfully used here.
+#' @return A dataframe relating participant IDs with their start and end datetimes.
+
+pull_participants_start_end_dates <- function(stream_parameters, participants_ids, start_end_date_container) {
+  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
+
+  query_timestamps <- tbl(dbEngine, start_end_date_container) %>% 
+    filter(
+      participant_id %in% !!participants_ids,
+      double_esm_user_answer_timestamp > 0
+    ) %>% 
+    group_by(participant_id) %>% 
+    summarise(
+      timestamp_min = min(double_esm_user_answer_timestamp, na.rm = TRUE),
+      timestamp_max = max(double_esm_user_answer_timestamp, na.rm = TRUE)
+    ) %>% 
+    select(participant_id, timestamp_min, timestamp_max)
+  
+  message(paste0("Executing the following query to get the starting and ending datetimes: \n", sql_render(query_timestamps)))
+  
+  start_end_timestamps <- query_timestamps %>% collect()
+  
+  if(nrow(start_end_timestamps) == 0)
+    warning(paste0("We could not find datetimes for the requested participant IDs (", paste(participants_ids, collapse = ", "), ") in ", start_end_date_container))
+
+  start_end_times <- start_end_timestamps %>% 
+    mutate(    
+      datetime_start = lubridate::as_datetime(timestamp_min/1000, tz = "UTC"),
+      datetime_end = lubridate::as_datetime(timestamp_max/1000, tz = "UTC")
+    ) %>% 
+    select(-c(timestamp_min, timestamp_max))
+  
+  dbDisconnect(dbEngine)
+  
+  return(start_end_times)
+}
+
+
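Review note: `pull_participants_start_end_dates()` takes the earliest and latest positive answer timestamp per participant and converts them from milliseconds to UTC datetimes. The Python sketch below reproduces that logic on in-memory tuples instead of a database table; `start_end_datetimes` and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def start_end_datetimes(answers):
    """Map participant_id -> (first, last) answer time in UTC.

    `answers` is an iterable of (participant_id, timestamp_ms) pairs;
    missing and non-positive timestamps are ignored, as in the R query.
    """
    bounds = {}
    for participant_id, timestamp_ms in answers:
        if timestamp_ms is None or timestamp_ms <= 0:
            continue
        low, high = bounds.get(participant_id, (timestamp_ms, timestamp_ms))
        bounds[participant_id] = (min(low, timestamp_ms), max(high, timestamp_ms))
    # Timestamps are in milliseconds, so divide by 1000 before converting.
    return {
        participant_id: (
            datetime.fromtimestamp(low / 1000, tz=timezone.utc),
            datetime.fromtimestamp(high / 1000, tz=timezone.utc),
        )
        for participant_id, (low, high) in bounds.items()
    }


# Made-up answers: participant 2 only has an invalid (zero) timestamp.
answers = [
    (1, 1_600_086_400_000),
    (1, 1_600_000_000_000),
    (2, 0),
]
start_end = start_end_datetimes(answers)
print(start_end)
```

Participants without any valid timestamp are absent from the result, which corresponds to the case in which the R function emits a warning.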
--- a/src/data/streams/aware_postgresql/format.yaml
+++ b/src/data/streams/aware_postgresql/format.yaml
@ -0,0 +1,372 @@
+PHONE_ACCELEROMETER:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_VALUES_0: double_values_0
+      DOUBLE_VALUES_1: double_values_1
+      DOUBLE_VALUES_2: double_values_2
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_ACTIVITY_RECOGNITION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: activity_name
+      ACTIVITY_TYPE: activity_type
+      CONFIDENCE: confidence
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      ACTIVITY_NAME: FLAG_TO_MUTATE
+      ACTIVITY_TYPE: FLAG_TO_MUTATE
+      CONFIDENCE: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        ACTIVITIES: activities
+        CONFIDENCE: confidence
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
+
+PHONE_APPLICATIONS_CRASHES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      APPLICATION_NAME: application_name
+      APPLICATION_VERSION: application_version
+      ERROR_SHORT: error_short
+      ERROR_LONG: error_long
+      ERROR_CONDITION: error_condition
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_APPLICATIONS_FOREGROUND:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_hash
+      APPLICATION_NAME: FLAG_TO_MUTATE
+      IS_SYSTEM_APP: is_system_app
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS:
+        - src/data/streams/mutations/phone/straw/app_add_name.R
+
+PHONE_APPLICATIONS_NOTIFICATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_hash
+      APPLICATION_NAME: FLAG_TO_MUTATE
+      SOUND: sound
+      VIBRATE: vibrate
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS:
+        - src/data/streams/mutations/phone/straw/app_add_name.R
+
+PHONE_BATTERY:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: battery_status
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BATTERY_STATUS: FLAG_TO_MUTATE
+      BATTERY_LEVEL: battery_level
+      BATTERY_SCALE: battery_scale
+    MUTATION:
+      COLUMN_MAPPINGS:
+        BATTERY_STATUS: battery_status
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
+
+PHONE_BLUETOOTH:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      BT_ADDRESS: bt_address
+      BT_NAME: bt_name
+      BT_RSSI: bt_rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_CALLS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: call_type
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      CALL_TYPE: FLAG_TO_MUTATE
+      CALL_DURATION: call_duration
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+        CALL_TYPE: call_type
+      SCRIPTS:
+        - "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
+
+PHONE_CONVERSATION:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_ENERGY: double_energy
+      INFERENCE: inference
+      DOUBLE_CONVO_START: double_convo_start
+      DOUBLE_CONVO_END: double_convo_end
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
+
+PHONE_ESM:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: double_esm_user_answer_timestamp
+      DEVICE_ID: device_id
+      ESM_STATUS: esm_status
+      ESM_USER_ANSWER: esm_user_answer
+      ESM_JSON: esm_json
+      ESM_TRIGGER: esm_trigger
+      ESM_SESSION: esm_session
+      ESM_NOTIFICATION_ID: esm_notification_id
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS:
+
+PHONE_KEYBOARD:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      PACKAGE_NAME: package_name
+      BEFORE_TEXT: before_text
+      CURRENT_TEXT: current_text
+      IS_PASSWORD: is_password
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LIGHT:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LIGHT_LUX: double_light_lux
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOCATIONS:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      DOUBLE_LATITUDE: double_latitude
+      DOUBLE_LONGITUDE: double_longitude
+      DOUBLE_BEARING: double_bearing
+      DOUBLE_SPEED: double_speed
+      DOUBLE_ALTITUDE: double_altitude
+      PROVIDER: provider
+      ACCURACY: accuracy
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_LOG:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      LOG_MESSAGE: log_message
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_MESSAGES:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MESSAGE_TYPE: message_type
+      TRACE: trace
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_SCREEN:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: screen_status
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SCREEN_STATUS: FLAG_TO_MUTATE
+    MUTATION:
+      COLUMN_MAPPINGS:
+        SCREEN_STATUS: screen_status
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+        - "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
+
+PHONE_WIFI_CONNECTED:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      MAC_ADDRESS: mac_address
+      SSID: ssid
+      BSSID: bssid
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_WIFI_VISIBLE:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SSID: ssid
+      BSSID: bssid
+      SECURITY: security
+      FREQUENCY: frequency
+      RSSI: rssi
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+PHONE_SPEECH:
+  ANDROID:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SPEECH_PROPORTION: speech_proportion
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+  IOS:
+    RAPIDS_COLUMN_MAPPINGS:
+      TIMESTAMP: timestamp
+      DEVICE_ID: device_id
+      SPEECH_PROPORTION: speech_proportion
+    MUTATION:
+      COLUMN_MAPPINGS:
+      SCRIPTS: # List any python or r scripts that mutate your raw data
+
+
+      
+
--- a/src/data/streams/empatica_zip/container.py
+++ b/src/data/streams/empatica_zip/container.py
@ -2,11 +2,16 @@ from zipfile import ZipFile
 import warnings
 from pathlib import Path
 import pandas as pd
+import numpy as np
 from pandas.core import indexing
 import yaml
 import csv
 from collections import OrderedDict
 from io import BytesIO, StringIO
+import sys, os
+
+from cr_features.hrv import get_HRV_features, get_patched_ibi_with_bvp
+from cr_features.helper_functions import empatica1d_to_array, empatica2d_to_array

 def processAcceleration(x, y, z):
    x = float(x)
@ -52,6 +57,8 @@ def extract_empatica_data(data,  sensor):
        df = pd.DataFrame.from_dict(ddict, orient='index', columns=[column])
        df[column] = df[column].astype(float)
        df.index.name = 'timestamp'
+        if df.empty:
+            return df

    elif sensor == 'EMPATICA_ACCELEROMETER':
        ddict = readFile(sensor_data_file, sensor)
@ -60,15 +67,22 @@ def extract_empatica_data(data,  sensor):
        df['y'] = df['y'].astype(float)
        df['z'] = df['z'].astype(float)
        df.index.name = 'timestamp'
+        if df.empty:
+            return df

    elif sensor == 'EMPATICA_INTER_BEAT_INTERVAL':
-        df = pd.read_csv(sensor_data_file, names=['timestamp', column], header=None)
+
+        df = pd.read_csv(sensor_data_file, names=['timings', column], header=None)
+        df['timestamp'] = df['timings']
+        if df.empty:
+            df = df.set_index('timestamp')
+            return df
        timestampstart = float(df['timestamp'][0])
-        df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
+        df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart        
        df = df.drop([0])
        df[column] = df[column].astype(float)
        df = df.set_index('timestamp')
-
+        
    else:
        raise ValueError(
            "sensor has an invalid name: {}".format(sensor))
@ -84,6 +98,10 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
    participant_data = pd.DataFrame(columns=columns_to_download.values())
    participant_data.set_index('timestamp', inplace=True)

+    with open('config.yaml', 'r') as stream:
+        config = yaml.load(stream, Loader=yaml.FullLoader)
+    cr_ibi_provider = config['EMPATICA_INTER_BEAT_INTERVAL']['PROVIDERS']['CR']
+
    available_zipfiles = list((Path(data_configuration["FOLDER"]) / Path(device)).rglob("*.zip"))
    if len(available_zipfiles) == 0:
        warnings.warn("There were no zip files in: {}. If you were expecting data for this participant the [EMPATICA][DEVICE_IDS] key in their participant file is missing the pid".format((Path(data_configuration["FOLDER"]) / Path(device))))
@ -94,7 +112,13 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
            listOfFileNames = zipFile.namelist()
            for fileName in listOfFileNames:
                if fileName == sensor_csv:
-                    participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName),  sensor)], axis=0)
+                    if sensor == "EMPATICA_INTER_BEAT_INTERVAL" and cr_ibi_provider.get('PATCH_WITH_BVP', False):
+                        participant_data = \
+                            pd.concat([participant_data, patch_ibi_with_bvp(zipFile.read('IBI.csv'), zipFile.read('BVP.csv'))], axis=0)
+                        #print("patch with ibi")
+                    else:
+                        participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName), sensor)], axis=0)
+                        #print("no patching")
                    warning = False
            if warning:
                warnings.warn("We could not find a zipped file for {} in {} (we tried to find {})".format(sensor, zipFile, sensor_csv))
@ -105,4 +129,54 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
    participant_data["device_id"] = device
    return(participant_data)

+def patch_ibi_with_bvp(ibi_data, bvp_data):
+    ibi_data_file = BytesIO(ibi_data).getvalue().decode('utf-8')
+    ibi_data_file = StringIO(ibi_data_file)
+
+    # Begin with the cr-features part
+    try:
+        ibi_data, ibi_start_timestamp = empatica2d_to_array(ibi_data_file)
+    except (IndexError, KeyError) as e:
+        # IBI.csv may be empty: empatica2d_to_array can then raise a KeyError at `startTimeStamp = df.time[0]`.
+        ibi_data_file.seek(0)  # rewind, since the failed parse may already have consumed the buffer
+        df_test = pd.read_csv(ibi_data_file, names=['timings', 'inter_beat_interval'], header=None)
+        if df_test.empty:
+            df_test['timestamp'] = df_test['timings']
+            df_test = df_test.set_index('timestamp')
+            return df_test
+        else:
+            raise IndexError(f"Something went wrong with indices. Error that was previously caught:\n{e!r}") from e
+
+    bvp_data_file = BytesIO(bvp_data).getvalue().decode('utf-8')
+    bvp_data_file = StringIO(bvp_data_file)
+
+    bvp_data, bvp_start_timestamp, sample_rate = empatica1d_to_array(bvp_data_file)
+
+    hrv_time_and_freq_features, sample, bvp_rr, bvp_timings, peak_indx = \
+        get_HRV_features(bvp_data, ma=False, 
+                        detrend=False, m_deternd=False, low_pass=False, winsorize=True, 
+                        winsorize_value=25, hampel_fiter=False, median_filter=False, 
+                        mod_z_score_filter=True, sampling=64, feature_names=['meanHr'])
+    
+    ibi_timings, ibi_rr = get_patched_ibi_with_bvp(ibi_data[0], ibi_data[1], bvp_timings, bvp_rr)
+
+    df = \
+        pd.DataFrame(np.array([ibi_timings, ibi_rr]).transpose(), columns=['timestamp', 'inter_beat_interval'])
+    df.loc[-1] = [ibi_start_timestamp, 'IBI']  # adding a row
+    df.index = df.index + 1  # shifting index
+    df = df.sort_index()  # sorting by index
+
+    # Repeated as in extract_empatica_data for IBI
+    df['timings'] = df['timestamp']
+    timestampstart = float(df['timestamp'][0])
+    df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart        
+    df = df.drop([0])
+    df['inter_beat_interval'] = df['inter_beat_interval'].astype(float)
+    df = df.set_index('timestamp')
+
+    # format timestamps
+    df.index *= 1000
+    df.index = df.index.astype(int)
+    return(df)
+
 # print(pull_data({'FOLDER': 'data/external/empatica'}, "e01", "EMPATICA_accelerometer", {'TIMESTAMP': 'timestamp', 'DEVICE_ID': 'device_id', 'DOUBLE_VALUES_0': 'x', 'DOUBLE_VALUES_1': 'y', 'DOUBLE_VALUES_2': 'z'}))
--- a/src/data/streams/empatica_zip/format.yaml
+++ b/src/data/streams/empatica_zip/format.yaml
@ -50,6 +50,7 @@ EMPATICA_INTER_BEAT_INTERVAL:
    TIMESTAMP: timestamp
    DEVICE_ID: device_id
    INTER_BEAT_INTERVAL: inter_beat_interval
+    TIMINGS: timings
  MUTATION:
    COLUMN_MAPPINGS:
    SCRIPTS: # List any python or r scripts that mutate your raw data
--- a/src/data/streams/mutations/phone/aware/calls_ios_unification.R
+++ b/src/data/streams/mutations/phone/aware/calls_ios_unification.R
@ -39,7 +39,7 @@ unify_ios_calls <- function(ios_calls){
                        assigned_segments = first(assigned_segments))
        }
        else {
-            ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(call_duration),  timestamp = first(timestamp), device_id = first(device_id))
+            ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(as.numeric(call_duration)),  timestamp = first(timestamp), device_id = first(device_id))
        }
        ios_calls <- ios_calls %>% mutate(call_type = case_when(
            call_type_sequence == "1,2,4" | call_type_sequence == "2,1,4" ~ 1, # incoming
--- a/src/data/streams/mutations/phone/straw/app_add_name.R
+++ b/src/data/streams/mutations/phone/straw/app_add_name.R
@ -0,0 +1,8 @@
+source("renv/activate.R") # needed to use RAPIDS renv environment
+library(dplyr)
+
+main <- function(data, stream_parameters){
+    data <- data %>%
+      mutate(application_name = "hashed")
+    return(data)
+}
--- a/src/data/streams/mutations/phone/straw/app_add_name.py
+++ b/src/data/streams/mutations/phone/straw/app_add_name.py
@ -0,0 +1,5 @@
+import pandas as pd
+
+def main(data, stream_parameters):
+    data["application_name"] = "hashed"
+    return(data)
--- a/src/data/streams/rapids_columns.yaml
+++ b/src/data/streams/rapids_columns.yaml
@ -35,11 +35,8 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
  - DEVICE_ID
  - PACKAGE_NAME
  - APPLICATION_NAME
-  - TEXT
  - SOUND
  - VIBRATE
-  - DEFAULTS
-  - FLAGS

 PHONE_BATTERY:
  - TIMESTAMP
@ -70,6 +67,16 @@ PHONE_CONVERSATION:
  - DOUBLE_CONVO_START
  - DOUBLE_CONVO_END

+PHONE_ESM:
+  - TIMESTAMP
+  - DEVICE_ID
+  - ESM_STATUS
+  - ESM_USER_ANSWER
+  - ESM_JSON
+  - ESM_TRIGGER
+  - ESM_SESSION
+  - ESM_NOTIFICATION_ID
+
 PHONE_KEYBOARD:
  - TIMESTAMP
  - DEVICE_ID
@ -111,6 +118,11 @@ PHONE_SCREEN:
  - DEVICE_ID
  - SCREEN_STATUS

+PHONE_SPEECH:
+  - TIMESTAMP
+  - DEVICE_ID
+  - SPEECH_PROPORTION
+
 PHONE_WIFI_CONNECTED:
  - TIMESTAMP
  - DEVICE_ID
@ -220,6 +232,7 @@ EMPATICA_INTER_BEAT_INTERVAL:
  - TIMESTAMP
  - DEVICE_ID
  - INTER_BEAT_INTERVAL
+  - TIMINGS

 EMPATICA_TAGS:
  - TIMESTAMP
--- a/src/data/translate_usernames_into_participants_data.R
+++ b/src/data/translate_usernames_into_participants_data.R
@ -0,0 +1,62 @@
+source("renv/activate.R")
+source("src/data/streams/aware_postgresql/container.R")
+
+library(RPostgres)
+library(magrittr)
+library(tidyverse)
+library(lubridate)
+
+prepare_participants_file <- function() {
+
+  username_list_csv_location <- snakemake@input[["username_list"]]
+
+  data_configuration <- snakemake@params[["data_configuration"]]
+  participants_container <- snakemake@params[["participants_table"]]
+  device_id_container <- snakemake@params[["device_id_table"]]
+  start_end_date_container <- snakemake@params[["start_end_date_table"]]
+
+  output_data_file <- snakemake@output[["participants_file"]]
+
+  platform <- "android"
+  pid_format <- "p%03d"
+  datetime_format <- "%Y-%m-%d %H:%M:%S"
+
+  participant_data <- read_csv(username_list_csv_location, col_types = "cc", progress = FALSE)
+  usernames <- participant_data$label
+
+  participant_ids <- pull_participants_ids(data_configuration, usernames, participants_container)
+  participant_data %<>%
+    left_join(participant_ids, by = c("label" = "username")) %>%
+    rename(participant_id = id)
+
+  device_ids <- pull_participants_device_ids(data_configuration, participant_data$participant_id, device_id_container)
+  device_ids %<>%
+    filter(device_id != "") %>%
+    group_by(participant_id) %>%
+    summarise(device_ids = list(unique(device_id)))
+  participant_data %<>%
+    left_join(device_ids, by = "participant_id")
+
+  start_end_datetimes <- pull_participants_start_end_dates(data_configuration, participant_data$participant_id, start_end_date_container)
+  participant_data %<>%
+    left_join(start_end_datetimes, by = "participant_id")
+
+  participant_data %<>%
+  mutate(
+    pid = sprintf(pid_format, participant_id),
+    start_date = strftime(datetime_start, format=datetime_format, tz = "UTC", usetz = FALSE), #TODO Check what timezone is expected
+    end_date = strftime(datetime_end, format=datetime_format, tz = "UTC", usetz = FALSE),
+    device_id = map_chr(device_ids, str_c, collapse = ";"),
+    number_of_devices = map_int(device_ids, length),
+    fitbit_id = ""
+    ) %>%
+  rowwise() %>%
+  mutate(platform = str_c(replicate(number_of_devices, platform), collapse = ";")) %>%
+  ungroup() %>%
+  arrange(pid) %>%
+  select(pid, label, start_date, end_date, empatica_id, device_id, platform, fitbit_id)
+
+  write_csv(participant_data, output_data_file)
+}
+
+prepare_participants_file()
--- a/src/features/init.py
+++ b/src/features/init.py
--- a/src/features/all_cleaning_individual/rapids/main.R
+++ b/src/features/all_cleaning_individual/rapids/main.R
@ -0,0 +1,89 @@
+source("renv/activate.R")
+library(tidyr)
+library("dplyr", warn.conflicts = F)
+library(tidyverse)
+library(caret)
+library(corrr)
+
+rapids_cleaning <- function(sensor_data_files, provider){
+
+    clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
+    impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
+    cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
+    drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
+    rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
+    data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
+    data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
+    data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
+    drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
+
+    # Impute selected event features
+    if(as.logical(impute_selected_event_features$COMPUTE)){
+        if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
+            stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
+        }
+        column_names <- colnames(clean_features)
+        selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
+        selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
+        selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
+        selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
+        selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
+        selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
+        selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
+        
+        selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
+        clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
+    }
+    
+    # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
+    if(!data_yield_column %in% colnames(clean_features)){
+        stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
+    }
+    if (data_yield_ratio_threshold > 0) {
+        clean_features <- clean_features %>% 
+        filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
+    }
+
+    # Drop columns with a percentage of NA values above cols_nan_threshold
+    if(nrow(clean_features))
+        clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
+
+    # Drop columns with zero variance
+    if(drop_zero_variance_columns)
+    clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
+
+    # Drop highly correlated features
+    if(as.logical(drop_highly_correlated_features$COMPUTE)){
+        
+        min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
+        corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
+
+        features_for_corr <- clean_features %>% 
+            select_if(is.numeric) %>% 
+            select_if(sapply(., n_distinct, na.rm = T) > 1)
+
+        valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
+
+        if((nrow(features_for_corr) != 0) & (ncol(features_for_corr) != 0)){
+
+            highly_correlated_features <- features_for_corr %>% 
+                correlate(use = "pairwise.complete.obs", method = "spearman") %>% 
+                column_to_rownames(., var = "term") %>% 
+                as.matrix() %>% 
+                replace(!valid_pairs | is.na(.), 0) %>% 
+                findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
+
+            clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
+        
+        }
+    }
+
+    # Drop rows with a percentage of NA values above rows_nan_threshold
+    clean_features <- clean_features %>% 
+        mutate(percentage_na =  rowSums(is.na(.)) / ncol(.)) %>% 
+        filter(percentage_na <= rows_nan_threshold) %>% 
+        select(-percentage_na)
+
+    return(clean_features)
+}
+
--- a/src/features/all_cleaning_individual/straw/init.py
+++ b/src/features/all_cleaning_individual/straw/init.py
--- a/src/features/all_cleaning_individual/straw/main.py
+++ b/src/features/all_cleaning_individual/straw/main.py
@ -0,0 +1,180 @@
+import pandas as pd
+import numpy as np
+import math, sys, random
+import yaml
+
+from sklearn.impute import KNNImputer
+from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+sys.path.append('/rapids/')
+from src.features import empatica_data_yield as edy
+
+pd.set_option('display.max_columns', 20)
+
+def straw_cleaning(sensor_data_files, provider):
+    
+    features = pd.read_csv(sensor_data_files["sensor_data"][0])
+    
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.load(stream, Loader=yaml.FullLoader)
+
+    excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
+
+    # (1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
+    if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
+        target = config['PARAMS_FOR_ANALYSIS']['TARGET']['LABEL'] # get target label from config
+        if 'phone_esm_straw_' + target in features:
+            features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
+        else:
+            return features
+
+    # (2.1) QUALITY CHECK (DATA YIELD COLUMN) deletes the rows where E4 or phone data is low quality
+    phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
+    phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
+
+    if features.empty:
+        return features
+
+    features = edy.calculate_empatica_data_yield(features)
+
+    if phone_data_yield_column not in features.columns and "empatica_data_yield" not in features.columns:
+        raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
+        
+    # Drop rows where phone data yield is less than the given threshold
+    if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
+        features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+    
+    # Drop rows where empatica data yield is less than the given threshold
+    if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
+        features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+
+    if features.empty:
+        return features
+    
+    # (2.2) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
+    min_count =  math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
+    features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
+
+    # (3) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows)
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+
+    # (4) CONTEXTUAL IMPUTATION
+
+    # Impute selected phone features with a high number
+    impute_w_hn = [col for col in features.columns if \
+        "timeoffirstuse" in col or
+        "timeoflastuse" in col or
+        "timefirstcall" in col or
+        "timelastcall" in col or
+        "firstuseafter" in col or
+        "timefirstmessages" in col or
+        "timelastmessages" in col]
+    features[impute_w_hn] = features[impute_w_hn].fillna(1500)
+
+
+    # Impute special case (mostcommonactivity) and (homelabel)
+    impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
+    features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
+    features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
+    features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - large negative placeholder for loglocationvariance
+
+
+    # Impute selected phone features with 0
+    impute_zero = [col for col in features if \
+        col.startswith('phone_applications_foreground_rapids_') or
+        col.startswith('phone_battery_rapids_') or
+        col.startswith('phone_bluetooth_rapids_') or
+        col.startswith('phone_light_rapids_') or
+        col.startswith('phone_calls_rapids_') or
+        col.startswith('phone_messages_rapids_') or
+        col.startswith('phone_screen_rapids_') or
+        col.startswith('phone_wifi_visible')]
+
+    features[impute_zero+list(esm_cols.columns)] = features[impute_zero+list(esm_cols.columns)].fillna(0)
+
+    ## (5) STANDARDIZATION 
+    if provider["STANDARDIZATION"]:
+        features.loc[:, ~features.columns.isin(excluded_columns)] = StandardScaler().fit_transform(features.loc[:, ~features.columns.isin(excluded_columns)])
+
+    # (6) IMPUTATION: IMPUTE DATA WITH KNN METHOD
+    impute_cols = [col for col in features.columns if col not in excluded_columns]
+    features.reset_index(drop=True, inplace=True)
+    features[impute_cols] = impute(features[impute_cols], method="knn")
+
+    # (7) REMOVE COLS WHERE VARIANCE IS 0
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
+
+    if provider["COLS_VAR_THRESHOLD"]:
+        features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
+
+    fe5 = features.copy()
+
+    # (8) DROP HIGHLY CORRELATED FEATURES
+    drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
+    if drop_corr_features["COMPUTE"] and features.shape[0]: # If small amount of segments (rows) is present, do not execute correlation check
+        
+        numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
+
+        # Remove columns where NaN count threshold is passed
+        valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
+
+        corr_matrix = valid_features.corr().abs()
+        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
+        to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
+
+        features.drop(to_drop, axis=1, inplace=True)
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+
+    # (9) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
+    if features.isna().any().any():
+        raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
+
+    return features
+
+
+def k_nearest(df):
+    pd.set_option('display.max_columns', None)
+    imputer = KNNImputer(n_neighbors=3)
+    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
+
+
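+def impute(df, method='zero'):
+    # Dispatch lazily so that only the requested imputer runs (KNN is expensive).
+    return {
+        'zero': lambda: df.fillna(0),
+        'high_number': lambda: df.fillna(1500),
+        'mean': lambda: df.fillna(df.mean()),
+        'median': lambda: df.fillna(df.median()),
+        'knn': lambda: k_nearest(df)
+    }[method]()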
+
+
+def graph_bf_af(features, phase_name, plt_flag=False):
+    if plt_flag:
+        sns.set(rc={"figure.figsize":(16, 8)})
+        sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
+        plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
+
+    print(f"\n-------------{phase_name}-------------")
+    print("Rows number:", features.shape[0])
+    print("Columns number:", len(features.columns))
+    print("---------------------------------------------\n")
--- a/src/features/all_cleaning_overall/rapids/main.R
+++ b/src/features/all_cleaning_overall/rapids/main.R
@ -0,0 +1,89 @@
+source("renv/activate.R")
+library(tidyr)
+library("dplyr", warn.conflicts = F)
+library(tidyverse)
+library(caret)
+library(corrr)
+
+rapids_cleaning <- function(sensor_data_files, provider){
+
+    clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
+    impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
+    cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
+    drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
+    rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
+    data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
+    data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
+    data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
+    drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
+
+    # Impute selected event features
+    if(as.logical(impute_selected_event_features$COMPUTE)){
+        if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
+            stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
+        }
+        column_names <- colnames(clean_features)
+        selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
+        selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
+        selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
+        selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
+        selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
+        selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
+        selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
+        
+        selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
+        clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
+    }
+    
+    # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
+    if(!data_yield_column %in% colnames(clean_features)){
+        stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
+    }
+    if (data_yield_ratio_threshold > 0) {
+        clean_features <- clean_features %>% 
+        filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
+    }
+
+    # Drop columns with a percentage of NA values above cols_nan_threshold
+    if(nrow(clean_features))
+        clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
+
+    # Drop columns with zero variance
+    if(drop_zero_variance_columns)
+        clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
+
+    # Drop highly correlated features
+    if(as.logical(drop_highly_correlated_features$COMPUTE)){
+        
+        min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
+        corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
+
+        features_for_corr <- clean_features %>% 
+            select_if(is.numeric) %>% 
+            select_if(sapply(., n_distinct, na.rm = T) > 1)
+
+        valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
+
+        if((nrow(features_for_corr) != 0) && (ncol(features_for_corr) != 0)){
+
+            highly_correlated_features <- features_for_corr %>% 
+                correlate(use = "pairwise.complete.obs", method = "spearman") %>% 
+                column_to_rownames(., var = "term") %>% 
+                as.matrix() %>% 
+                replace(!valid_pairs | is.na(.), 0) %>% 
+                findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
+
+            clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
+        
+        }
+    }
+
+    # Drop rows with a percentage of NA values above rows_nan_threshold
+    clean_features <- clean_features %>% 
+        mutate(percentage_na =  rowSums(is.na(.)) / ncol(.)) %>% 
+        filter(percentage_na <= rows_nan_threshold) %>% 
+        select(-percentage_na)
+
+    return(clean_features)
+}
+
--- a/src/features/all_cleaning_overall/straw/init.py
+++ b/src/features/all_cleaning_overall/straw/init.py
--- a/src/features/all_cleaning_overall/straw/main.py
+++ b/src/features/all_cleaning_overall/straw/main.py
@ -0,0 +1,275 @@
+import pandas as pd
+import numpy as np
+import math, sys, random, warnings, yaml
+
+from sklearn.impute import KNNImputer
+from sklearn.preprocessing import StandardScaler, minmax_scale 
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+sys.path.append('/rapids/')
+from src.features import empatica_data_yield as edy
+
+def straw_cleaning(sensor_data_files, provider, target):
+
+    features = pd.read_csv(sensor_data_files["sensor_data"][0])
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.safe_load(stream)
+
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
+
+    graph_bf_af(features, "1target_rows_before")
+
+    # (1.0) OVERRIDE STRESSFULNESS EVENT TARGETS IF ERS SEGMENTING_METHOD IS "STRESS_EVENT"
+    if config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] == "stress_event": 
+    
+        stress_events_targets = pd.read_csv("data/external/stress_event_targets.csv")   
+
+        if "appraisal_stressfulness_event_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
+            features.drop(columns=['phone_esm_straw_appraisal_stressfulness_event_mean'], inplace=True)
+            features = features.merge(stress_events_targets[["label", "appraisal_stressfulness_event"]] \
+                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
+                        .rename(columns={'appraisal_stressfulness_event': 'phone_esm_straw_appraisal_stressfulness_event_mean'})
+
+        if "appraisal_threat_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
+            features.drop(columns=['phone_esm_straw_appraisal_threat_mean'], inplace=True)
+            features = features.merge(stress_events_targets[["label", "appraisal_threat"]] \
+                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
+                        .rename(columns={'appraisal_threat': 'phone_esm_straw_appraisal_threat_mean'})
+
+        if "appraisal_challenge_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
+            features.drop(columns=['phone_esm_straw_appraisal_challenge_mean'], inplace=True)
+            features = features.merge(stress_events_targets[["label", "appraisal_challenge"]] \
+                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
+                        .rename(columns={'appraisal_challenge': 'phone_esm_straw_appraisal_challenge_mean'})
+
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    # (1.1) FILTER OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
+    if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
+        features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
+    
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+    graph_bf_af(features, "2target_rows_after")
+
+    # (2) QUALITY CHECK (DATA YIELD COLUMN) drops the rows where E4 or phone data is low quality
+    phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
+    phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
+
+    features = edy.calculate_empatica_data_yield(features)
+
+    if phone_data_yield_column not in features.columns or "empatica_data_yield" not in features.columns:
+        raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
+
+    hist = features[["empatica_data_yield", phone_data_yield_column]].hist()
+    plt.savefig(f'phone_E4_histogram.png', bbox_inches='tight')
+
+    # Drop rows where phone data yield is less than the given threshold
+    if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
+        hist = features[phone_data_yield_column].hist(bins=5)
+        plt.close()
+        features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+
+    # Drop rows where empatica data yield is less than the given threshold
+    if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
+        features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
+
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+    graph_bf_af(features, "3data_yield_drop_rows")
+
+
+    # (3) CONTEXTUAL IMPUTATION
+
+    # Impute selected phone features with a high number
+    impute_w_hn = [col for col in features.columns if \
+        "timeoffirstuse" in col or
+        "timeoflastuse" in col or
+        "timefirstcall" in col or
+        "timelastcall" in col or
+        "firstuseafter" in col or
+        "timefirstmessages" in col or
+        "timelastmessages" in col]
+    features[impute_w_hn] = features[impute_w_hn].fillna(1500)
+
+    # Impute special case (mostcommonactivity) and (homelabel)
+    impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
+    features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
+    features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
+
+    impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
+    features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
+
+    # Impute location features
+    impute_locations = [col for col in features \
+        if col.startswith('phone_locations_doryab_') and
+        'radiusgyration' not in col    
+    ]
+
+    # Impute selected phone, location, and esm features with 0
+    impute_zero = [col for col in features if \
+        col.startswith('phone_applications_foreground_rapids_') or
+        col.startswith('phone_activity_recognition_') or
+        col.startswith('phone_battery_rapids_') or
+        col.startswith('phone_bluetooth_rapids_') or
+        col.startswith('phone_light_rapids_') or
+        col.startswith('phone_calls_rapids_') or
+        col.startswith('phone_messages_rapids_') or
+        col.startswith('phone_screen_rapids_') or
+        col.startswith('phone_bluetooth_doryab_') or
+        col.startswith('phone_wifi_visible')
+        ]
+
+    features[impute_zero+impute_locations+list(esm_cols.columns)] = features[impute_zero+impute_locations+list(esm_cols.columns)].fillna(0)
+
+    pd.set_option('display.max_rows', None)
+
+    graph_bf_af(features, "4context_imp")
+ 
+    # (4) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows)
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
+
+    features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
+
+    graph_bf_af(features, "5too_much_nans_cols")
+    # (5) REMOVE COLS WHERE VARIANCE IS 0
+
+    if provider["COLS_VAR_THRESHOLD"]:
+        features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
+
+    graph_bf_af(features, "6variance_drop")
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+    
+    # (6) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
+    min_count =  math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
+    features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
+
+    graph_bf_af(features, "7too_much_nans_rows")
+
+    if features.empty:
+        return pd.DataFrame(columns=excluded_columns)
+
+    # (7) STANDARDIZATION
+    if provider["STANDARDIZATION"]:
+        nominal_cols = [col for col in features.columns if "mostcommonactivity" in col or "homelabel" in col] # Excluded nominal features
+        # Expected warning within this code block
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=RuntimeWarning)
+            if provider["TARGET_STANDARDIZATION"]:
+                features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols)] = \
+                    features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols)].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
+            else:
+                features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols + ['phone_esm_straw_' + target])] = \
+                    features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols + ['phone_esm_straw_' + target])].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
+
+    graph_bf_af(features, "8standardization")
+
+    # (8) IMPUTATION: IMPUTE DATA WITH KNN METHOD
+    features.reset_index(drop=True, inplace=True)
+    impute_cols = [col for col in features.columns if col not in excluded_columns and col != "pid"]
+
+    features[impute_cols] = impute(features[impute_cols], method="knn")
+
+    graph_bf_af(features, "9knn_after")
+
+
+    # (9) DROP HIGHLY CORRELATED FEATURES
+    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
+
+    drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
+    if drop_corr_features["COMPUTE"] and features.shape[0] > 5: # If small amount of segments (rows) is present, do not execute correlation check
+        
+        numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
+
+        # Remove columns where NaN count threshold is passed
+        valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
+
+        corr_matrix = valid_features.corr().abs()
+        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
+        to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
+
+        # sns.heatmap(corr_matrix, cmap="YlGnBu")
+        # plt.savefig(f'correlation_matrix.png', bbox_inches='tight')
+        # plt.close()
+
+        # s = corr_matrix.unstack()
+        # so = s.sort_values(ascending=False)
+
+        # pd.set_option('display.max_rows', None)
+        # sorted_upper = upper.unstack().sort_values(ascending=False)
+        # print(sorted_upper[sorted_upper > drop_corr_features["CORR_THRESHOLD"]])
+
+        features.drop(to_drop, axis=1, inplace=True)
+
+    # Preserve esm cols if deleted (has to come after drop cols operations)
+    for esm in esm_cols:
+        if esm not in features:
+            features[esm] = esm_cols[esm]
+
+    graph_bf_af(features, "10correlation_drop")
+
+    # Transform categorical columns to category dtype
+
+    cat1 = [col for col in features.columns if "mostcommonactivity" in col]
+    if cat1: # Transform columns to category dtype (mostcommonactivity)
+        features[cat1] = features[cat1].astype(int).astype('category')
+
+    cat2 = [col for col in features.columns if "homelabel" in col]
+    if cat2: # Transform columns to category dtype (homelabel)
+        features[cat2] = features[cat2].astype(int).astype('category')
+
+    # (10) DROP ALL WINDOW RELATED COLUMNS
+    win_count_cols = [col for col in features if "SO_windowsCount" in col]
+    if win_count_cols:
+        features.drop(columns=win_count_cols, inplace=True)
+
+    # (11) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
+    if features.isna().any().any():
+        raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
+
+
+    return features
+
+
+def k_nearest(df):
+    imputer = KNNImputer(n_neighbors=3)
+    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index) # Keep the original index so assignment back to the source frame aligns
+
+
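+def impute(df, method='zero'):
+
+    # Map each method to a callable so that only the requested imputation runs
+    # (a dict of precomputed results would also run KNN on every call).
+    return {
+        'zero': lambda: df.fillna(0),
+        'high_number': lambda: df.fillna(1500),
+        'mean': lambda: df.fillna(df.mean()),
+        'median': lambda: df.fillna(df.median()),
+        'knn': lambda: k_nearest(df)
+    }[method]()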
+
+
+def graph_bf_af(features, phase_name, plt_flag=False):
+    if plt_flag:
+        sns.set(rc={"figure.figsize":(16, 8)})
+        sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
+        plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
+
+    print(f"\n-------------{phase_name}-------------")
+    print("Rows number:", features.shape[0])
+    print("Columns number:", len(features.columns))
+    print("NaN values:", features.isna().sum().sum())
+    print("---------------------------------------------\n")
--- a/src/features/cr_features_helper_methods.py
+++ b/src/features/cr_features_helper_methods.py
@ -0,0 +1,59 @@
+import pandas as pd
+import numpy as np
+import math as m
+
+import sys
+
+def extract_second_order_features(intraday_features, so_features_names, prefix=""):
+    
+    if prefix:
+        groupby_cols = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
+    else:
+        groupby_cols = ['local_segment']
+
+    if not intraday_features.empty:
+        so_features = pd.DataFrame()
+        #print(intraday_features.drop("level_1", axis=1).groupby(["local_segment"]).nsmallest())
+        if "mean" in so_features_names:
+            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).mean(numeric_only=True).add_suffix("_SO_mean")], axis=1)
+        
+        if "median" in so_features_names:
+            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).median(numeric_only=True).add_suffix("_SO_median")], axis=1)
+        
+        if "sd" in so_features_names:
+            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std(numeric_only=True).fillna(0).add_suffix("_SO_sd")], axis=1)
+        
+        if "nlargest" in so_features_names: # largest 5 -- maybe there is a faster groupby solution?
+            for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
+                so_features[column+"_SO_nlargest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nlargest(5).mean())
+        
+        if "nsmallest" in so_features_names: # smallest 5 -- maybe there is a faster groupby solution?
+            for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
+                so_features[column+"_SO_nsmallest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nsmallest(5).mean())
+        
+        if "count_windows" in so_features_names:
+            so_features["SO_windowsCount"] = intraday_features.groupby(groupby_cols).count()[prefix+"level_1"]
+
+        # numPeaksNonZero specialized for EDA sensor
+        if "eda_num_peaks_non_zero" in so_features_names and prefix+"numPeaks" in intraday_features.columns:
+            so_features[prefix+"SO_numPeaksNonZero"] = intraday_features.groupby(groupby_cols)[prefix+"numPeaks"].apply(lambda x: (x!=0).sum())
+
+        # numWindowsNonNaN specialized for BVP and IBI sensors
+        if "hrv_num_windows_non_nan" in so_features_names and prefix+"meanHr" in intraday_features.columns:
+            so_features[prefix+"SO_numWindowsNonNaN"] = intraday_features.groupby(groupby_cols)[prefix+"meanHr"].apply(lambda x: (~np.isnan(x)).sum())
+            
+        so_features.reset_index(inplace=True)
+
+    else:
+        so_features = pd.DataFrame(columns=groupby_cols)
+
+    return so_features
+
+def get_sample_rate(data): # TODO: get the sample rate information from the file's metadata
+    try:
+        timestamps_diff = data['timestamp'].diff().dropna().mean()
+        print("Timestamp diff:", timestamps_diff)
+    except Exception as e:
+        raise Exception("Error occurred while trying to get the mean sample rate from the data.") from e
+
+    return m.ceil(1000/timestamps_diff)
--- a/src/features/empatica_accelerometer/cr/main.py
+++ b/src/features/empatica_accelerometer/cr/main.py
@ -0,0 +1,75 @@
+import pandas as pd
+from scipy.stats import entropy
+
+from cr_features.helper_functions import convert_to2d, accelerometer_features, frequency_features
+from cr_features.calculate_features_old import calculateFeatures
+from cr_features.calculate_features import calculate_features
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+def extract_acc_features_from_intraday_data(acc_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    acc_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not acc_intraday_data.empty:   
+        sample_rate = 32
+     
+        acc_intraday_data = filter_data_by_segment(acc_intraday_data, time_segment)
+
+        if not acc_intraday_data.empty:
+
+            acc_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module
+            if window_length is None:
+                acc_intraday_features = \
+                    acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
+                    convert_to2d(x['double_values_0'], x.shape[0]), \
+                    convert_to2d(x['double_values_1'], x.shape[0]), \
+                    convert_to2d(x['double_values_2'], x.shape[0]), \
+                    fs=sample_rate, feature_names=features, show_progress=False)) 
+            else:
+                acc_intraday_features = \
+                    acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
+                    convert_to2d(x['double_values_0'], window_length*sample_rate), \
+                    convert_to2d(x['double_values_1'], window_length*sample_rate), \
+                    convert_to2d(x['double_values_2'], window_length*sample_rate), \
+                    fs=sample_rate, feature_names=features, show_progress=False)) 
+
+            acc_intraday_features.reset_index(inplace=True)
+
+    return acc_intraday_features
+
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    
+    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'double_values_0': 'float64',
+                    'double_values_1': 'float64', 'double_values_2': 'float64', 'local_date_time': 'str', 'local_date': "str",
+                    'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
+    acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)    
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # names of the features this function can compute
+    base_intraday_features_names = accelerometer_features + frequency_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    acc_intraday_features = extract_acc_features_from_intraday_data(acc_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+
+    if calc_windows:
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        acc_second_order_features = extract_second_order_features(acc_intraday_features, so_features_names)
+        return acc_intraday_features, acc_second_order_features
+
+    return acc_intraday_features
--- a/src/features/empatica_blood_volume_pulse/cr/main.py
+++ b/src/features/empatica_blood_volume_pulse/cr/main.py
@ -0,0 +1,73 @@
+import pandas as pd
+from sklearn.preprocessing import StandardScaler
+
+from cr_features.helper_functions import convert_to2d, hrv_features
+from cr_features.hrv import extract_hrv_features_2d_wrapper
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+# pd.set_option('display.max_rows', 1000)
+pd.set_option('display.max_columns', None)
+
+def extract_bvp_features_from_intraday_data(bvp_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    bvp_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not bvp_intraday_data.empty:
+        sample_rate = 64
+     
+        bvp_intraday_data = filter_data_by_segment(bvp_intraday_data, time_segment)
+
+        if not bvp_intraday_data.empty:
+
+            bvp_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module
+            if window_length is None:
+                bvp_intraday_features = \
+                    bvp_intraday_data.groupby('local_segment').apply(\
+                    lambda x: 
+                        extract_hrv_features_2d_wrapper(
+                            convert_to2d(x['blood_volume_pulse'], x.shape[0]), 
+                            sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
+
+            else:
+                bvp_intraday_features = \
+                    bvp_intraday_data.groupby('local_segment').apply(\
+                    lambda x: 
+                        extract_hrv_features_2d_wrapper(
+                            convert_to2d(x['blood_volume_pulse'], window_length*sample_rate), 
+                            sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features)) 
+
+            bvp_intraday_features.reset_index(inplace=True)
+
+    return bvp_intraday_features
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    bvp_intraday_data = pd.read_csv(sensor_data_files["sensor_data"])
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # names of the features this function can compute
+    base_intraday_features_names = hrv_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    bvp_intraday_features = extract_bvp_features_from_intraday_data(bvp_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+                                                                
+    if calc_windows:
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        bvp_second_order_features = extract_second_order_features(bvp_intraday_features, so_features_names)
+        return bvp_intraday_features, bvp_second_order_features
+
+    return bvp_intraday_features
--- a/src/features/empatica_data_yield.py
+++ b/src/features/empatica_data_yield.py
@ -0,0 +1,32 @@
+import pandas as pd
+import numpy as np
+from datetime import datetime
+
+import sys, yaml
+
+def calculate_empatica_data_yield(features): # TODO
+
+    # Get time segment duration in seconds from all segments in features dataframe
+    datetime_start =  pd.to_datetime(features['local_segment_start_datetime'], format='%Y-%m-%d %H:%M:%S')
+    datetime_end = pd.to_datetime(features['local_segment_end_datetime'], format='%Y-%m-%d %H:%M:%S')
+    tseg_duration = (datetime_end - datetime_start).dt.total_seconds()
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.safe_load(stream)
+        
+    sensors = ["EMPATICA_ACCELEROMETER", "EMPATICA_TEMPERATURE", "EMPATICA_ELECTRODERMAL_ACTIVITY", "EMPATICA_INTER_BEAT_INTERVAL"]
+    for sensor in sensors:
+        features[f"{sensor.lower()}_data_yield"] = \
+            (features[f"{sensor.lower()}_cr_SO_windowsCount"] * config[sensor]["PROVIDERS"]["CR"]["WINDOWS"]["WINDOW_LENGTH"]) / tseg_duration \
+            if f'{sensor.lower()}_cr_SO_windowsCount' in features else 0
+
+    empatica_data_yield_cols = [sensor.lower() + "_data_yield" for sensor in sensors]
+    pd.set_option('display.max_rows', None)
+
+    # Assigns 1 to values that are over 1 (in case of windows not being filled fully)
+    features[empatica_data_yield_cols] = features[empatica_data_yield_cols].apply(lambda x: [y if y <= 1 or np.isnan(y) else 1 for y in x])
+    
+    features["empatica_data_yield"] = features[empatica_data_yield_cols].mean(axis=1, numeric_only=True).fillna(0)
+    features.drop(empatica_data_yield_cols, axis=1, inplace=True) # Drop the per-sensor yields, assuming more advanced operations (e.g., a weighted average) will not be needed later
+
+    return features
--- a/src/features/empatica_electrodermal_activity/cr/main.py
+++ b/src/features/empatica_electrodermal_activity/cr/main.py
@ -0,0 +1,82 @@
+import pandas as pd
+import numpy as np
+from scipy.stats import entropy
+
+from cr_features.helper_functions import convert_to2d, gsr_features
+from cr_features.calculate_features import calculate_features
+from cr_features.gsr import extractGsrFeatures2D
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+#pd.set_option('display.max_columns', None)
+#pd.set_option('display.max_rows', None)
+#np.seterr(invalid='ignore')
+
+
+def extract_eda_features_from_intraday_data(eda_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    eda_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not eda_intraday_data.empty:   
+        sample_rate = 4  
+     
+        eda_intraday_data = filter_data_by_segment(eda_intraday_data, time_segment)
+
+        if not eda_intraday_data.empty: 
+
+            eda_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module 
+            if window_length is None:
+                eda_intraday_features = \
+                    eda_intraday_data.groupby('local_segment').apply(\
+                    lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], x.shape[0]), sampleRate=sample_rate, featureNames=features,
+                    threshold=.01, offset=1, riseTime=5, decayTime=15)) 
+            else:
+                eda_intraday_features = \
+                    eda_intraday_data.groupby('local_segment').apply(\
+                    lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], window_length*sample_rate), sampleRate=sample_rate, featureNames=features,
+                    threshold=.01, offset=1, riseTime=5, decayTime=15)) 
+
+            eda_intraday_features.reset_index(inplace=True)
+
+    return eda_intraday_features
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+
+    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'electrodermal_activity': 'float64', 'local_date_time': 'str', 
+                  'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
+
+    eda_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # name of the features this function can compute
+    base_intraday_features_names = gsr_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    eda_intraday_features = extract_eda_features_from_intraday_data(eda_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+
+    if calc_windows:
+        if provider["WINDOWS"]["IMPUTE_NANS"]:
+            eda_intraday_features[eda_intraday_features["numPeaks"] == 0] = \
+                eda_intraday_features[eda_intraday_features["numPeaks"] == 0].fillna(0)
+            pd.set_option('display.max_columns', None)
+            
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        eda_second_order_features = extract_second_order_features(eda_intraday_features, so_features_names)
+    
+        return eda_intraday_features, eda_second_order_features
+
+    return eda_intraday_features
--- a/src/features/empatica_inter_beat_interval/cr/main.py
+++ b/src/features/empatica_inter_beat_interval/cr/main.py
@ -0,0 +1,83 @@
+import pandas as pd
+from sklearn.preprocessing import StandardScaler
+import numpy as np
+
+from cr_features.helper_functions import convert_ibi_to2d_time, hrv_features
+from cr_features.hrv import extract_hrv_features_2d_wrapper, get_HRV_features
+from cr_features_helper_methods import extract_second_order_features
+
+import math
+import sys
+
+# pd.set_option('display.max_rows', 1000)
+pd.set_option('display.max_columns', None)
+
+
+def extract_ibi_features_from_intraday_data(ibi_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    ibi_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not ibi_intraday_data.empty:   
+
+        ibi_intraday_data = filter_data_by_segment(ibi_intraday_data, time_segment)
+
+        if not ibi_intraday_data.empty:
+
+            ibi_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module
+            if window_length is None:
+                ibi_intraday_features = \
+                    ibi_intraday_data.groupby('local_segment').apply(\
+                    lambda x: 
+                        extract_hrv_features_2d_wrapper(
+                            signal_2D = \
+                                convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], math.ceil(x['timings'].iloc[-1]))[0], 
+                            ibi_timings = \
+                                convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], math.ceil(x['timings'].iloc[-1]))[1], 
+                            sampling=None, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features)) 
+            else:
+                ibi_intraday_features = \
+                    ibi_intraday_data.groupby('local_segment').apply(\
+                    lambda x: 
+                        extract_hrv_features_2d_wrapper(
+                            signal_2D = convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], window_length)[0],
+                            ibi_timings = convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], window_length)[1],
+                            sampling=None, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features)) 
+
+            ibi_intraday_features.reset_index(inplace=True)
+
+    return ibi_intraday_features
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+
+    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'inter_beat_interval': 'float64', 'timings': 'float64', 'local_date_time': 'str', 
+                  'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
+
+    ibi_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
+
+    requested_intraday_features = provider["FEATURES"]
+    
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+
+    # name of the features this function can compute
+    base_intraday_features_names = hrv_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    ibi_intraday_features = extract_ibi_features_from_intraday_data(ibi_intraday_data, intraday_features_to_compute, 
+                                                                requested_window_length, time_segment, filter_data_by_segment)
+
+    if calc_windows:        
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        ibi_second_order_features = extract_second_order_features(ibi_intraday_features, so_features_names)
+
+        return ibi_intraday_features, ibi_second_order_features
+
+    return ibi_intraday_features
--- a/src/features/empatica_temperature/cr/main.py
+++ b/src/features/empatica_temperature/cr/main.py
@ -0,0 +1,68 @@
+import pandas as pd
+from scipy.stats import entropy
+
+from cr_features.helper_functions import convert_to2d, generic_features
+from cr_features.calculate_features_old import calculateFeatures
+from cr_features.calculate_features import calculate_features
+from cr_features_helper_methods import extract_second_order_features
+
+import sys
+
+def extract_temp_features_from_intraday_data(temperature_intraday_data, features, window_length, time_segment, filter_data_by_segment):
+    temperature_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
+
+    if not temperature_intraday_data.empty:
+        sample_rate = 4
+
+        temperature_intraday_data = filter_data_by_segment(temperature_intraday_data, time_segment)
+
+        if not temperature_intraday_data.empty:
+
+            temperature_intraday_features = pd.DataFrame()
+
+            # apply methods from calculate features module
+            if window_length is None:
+                temperature_intraday_features = \
+                    temperature_intraday_data.groupby('local_segment').apply(\
+                    lambda x: calculate_features(convert_to2d(x['temperature'], x.shape[0]), fs=sample_rate, feature_names=features, show_progress=False))
+            else:
+                temperature_intraday_features = \
+                    temperature_intraday_data.groupby('local_segment').apply(\
+                    lambda x: calculate_features(convert_to2d(x['temperature'], window_length*sample_rate), fs=sample_rate, feature_names=features, show_progress=False))
+
+
+            temperature_intraday_features.reset_index(inplace=True)
+
+    return temperature_intraday_features
+
+
+def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'temperature': 'float64', 'local_date_time': 'str', 
+                'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
+
+    temperature_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
+
+    requested_intraday_features = provider["FEATURES"]
+
+    calc_windows = kwargs.get('calc_windows', False)
+
+    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
+        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
+    else:
+        requested_window_length = None
+    
+    # name of the features this function can compute
+    base_intraday_features_names = generic_features
+    # the subset of requested features this function can compute
+    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
+
+    # extract features from intraday data
+    temperature_intraday_features = extract_temp_features_from_intraday_data(temperature_intraday_data, intraday_features_to_compute, 
+                                                                        requested_window_length, time_segment, filter_data_by_segment)
+
+    if calc_windows:
+        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
+        temperature_second_order_features = extract_second_order_features(temperature_intraday_features, so_features_names)
+        return temperature_intraday_features, temperature_second_order_features
+
+    return temperature_intraday_features
--- a/src/features/entry.R
+++ b/src/features/entry.R
@ -4,13 +4,19 @@ library("dplyr",warn.conflicts = F)
 library("tidyr")

 sensor_data_files <- snakemake@input
-sensor_data_files$time_segments_labels <- NULL
-time_segments_file <- snakemake@input[["time_segments_labels"]]

 provider <- snakemake@params["provider"][["provider"]]
 provider_key <- snakemake@params["provider_key"]
 sensor_key <- snakemake@params["sensor_key"]

-sensor_features <- fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)
+if(sensor_key == "all_cleaning_individual" | sensor_key == "all_cleaning_overall"){
+    # Data cleaning
+    sensor_features <- run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files)
+}else{
+    # Extract sensor features
+    sensor_data_files$time_segments_labels <- NULL
+    time_segments_file <- snakemake@input[["time_segments_labels"]]
+    sensor_features <- fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)
+}

 write.csv(sensor_features, snakemake@output[[1]], row.names = FALSE)
--- a/src/features/entry.py
+++ b/src/features/entry.py
@ -1,14 +1,38 @@
 import pandas as pd
-from utils.utils import fetch_provider_features
+from utils.utils import fetch_provider_features, run_provider_cleaning_script
+
+import sys

 sensor_data_files = dict(snakemake.input)
-del sensor_data_files["time_segments_labels"]
-time_segments_file = snakemake.input["time_segments_labels"]

 provider = snakemake.params["provider"]
 provider_key = snakemake.params["provider_key"]
 sensor_key = snakemake.params["sensor_key"]

-sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)
+calc_windows = bool(provider.get("WINDOWS") and provider["WINDOWS"].get("COMPUTE", False))

-sensor_features.to_csv(snakemake.output[0], index=False)
+if sensor_key == "all_cleaning_individual" or sensor_key == "all_cleaning_overall":
+    # Data cleaning
+    if "overall" in sensor_key:
+        sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files, snakemake.params["target"])
+    else:
+        sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files)
+else:
+    # Extract sensor features
+    del sensor_data_files["time_segments_labels"]
+    time_segments_file = snakemake.input["time_segments_labels"]
+
+    if calc_windows:
+        window_features, second_order_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file, calc_windows=True)        
+
+        window_features.to_csv(snakemake.output[1], index=False)
+        second_order_features.to_csv(snakemake.output[0], index=False)
+
+    elif "empatica" in sensor_key: 
+        pd.DataFrame().to_csv(snakemake.output[1], index=False)
+
+    if not calc_windows:
+        sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file, calc_windows=False)
+
+if not calc_windows:
+    sensor_features.to_csv(snakemake.output[0], index=False)
--- a/src/features/fitbit_steps_intraday/rapids/main.py
+++ b/src/features/fitbit_steps_intraday/rapids/main.py
@ -1,9 +1,10 @@
 import pandas as pd
 import numpy as np

-def statsFeatures(steps_data, features_to_compute, features_type, steps_features):
+def statsFeatures(steps_data, features_to_compute, features_type, steps_features, *args, **kwargs):
    if features_type == "steps" or features_type == "sumsteps":
        col_name = "steps"
+        reference_hour = kwargs["reference_hour"]
    elif features_type == "durationsedentarybout" or features_type == "durationactivebout":
        col_name = "duration"
    else:
@ -23,6 +24,10 @@ def statsFeatures(steps_data, features_to_compute, features_type, steps_features
        steps_features["median" + features_type] = steps_data.groupby(["local_segment"])[col_name].median()
    if "std" + features_type in features_to_compute:
        steps_features["std" + features_type] = steps_data.groupby(["local_segment"])[col_name].std()
+    if (col_name == "steps") and ("firststeptime" in features_to_compute):
+        steps_features["firststeptime"] = steps_data[steps_data["steps"].ne(0)].groupby(["local_segment"])["local_time"].first().apply(lambda x: (int(x.split(":")[0]) - reference_hour) * 60 + int(x.split(":")[1]) + (int(x.split(":")[2]) / 60))
+    if (col_name == "steps") and ("laststeptime" in features_to_compute):
+        steps_features["laststeptime"] = steps_data[steps_data["steps"].ne(0)].groupby(["local_segment"])["local_time"].last().apply(lambda x: (int(x.split(":")[0]) - reference_hour) * 60 + int(x.split(":")[1]) + (int(x.split(":")[2]) / 60))

    return steps_features

@ -38,11 +43,11 @@ def getBouts(steps_data):

    return bouts

-def extractStepsFeaturesFromIntradayData(steps_intraday_data, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features):
+def extractStepsFeaturesFromIntradayData(steps_intraday_data, reference_hour, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features):
    steps_intraday_features = pd.DataFrame()

    # statistics features of steps count
-    steps_intraday_features = statsFeatures(steps_intraday_data, intraday_features_to_compute_steps, "steps", steps_intraday_features)
+    steps_intraday_features = statsFeatures(steps_intraday_data, intraday_features_to_compute_steps, "steps", steps_intraday_features, reference_hour=reference_hour)

    # sedentary bout: less than THRESHOLD_ACTIVE_BOUT (default: 10) steps in a minute
    # active bout: greater or equal to THRESHOLD_ACTIVE_BOUT (default: 10) steps in a minute
@ -66,6 +71,7 @@ def extractStepsFeaturesFromIntradayData(steps_intraday_data, threshold_active_b

 def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):

+    reference_hour = provider["REFERENCE_HOUR"]
    threshold_active_bout = provider["THRESHOLD_ACTIVE_BOUT"]
    include_zero_step_rows = provider["INCLUDE_ZERO_STEP_ROWS"]

@ -73,11 +79,11 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se

    requested_intraday_features = provider["FEATURES"]

-    requested_intraday_features_steps = [x + "steps" for x in requested_intraday_features["STEPS"]]
+    requested_intraday_features_steps = [x + "steps" if x not in ["firststeptime", "laststeptime"] else x for x in requested_intraday_features["STEPS"]]
    requested_intraday_features_sedentarybout = [x + "sedentarybout" for x in requested_intraday_features["SEDENTARY_BOUT"]]
    requested_intraday_features_activebout = [x + "activebout" for x in requested_intraday_features["ACTIVE_BOUT"]]
    # name of the features this function can compute
-    base_intraday_features_steps = ["sumsteps", "maxsteps", "minsteps", "avgsteps", "stdsteps"]
+    base_intraday_features_steps = ["sumsteps", "maxsteps", "minsteps", "avgsteps", "stdsteps", "firststeptime", "laststeptime"]
    base_intraday_features_sedentarybout = ["countepisodesedentarybout", "sumdurationsedentarybout", "maxdurationsedentarybout", "mindurationsedentarybout", "avgdurationsedentarybout", "stddurationsedentarybout"]
    base_intraday_features_activebout = ["countepisodeactivebout", "sumdurationactivebout", "maxdurationactivebout", "mindurationactivebout", "avgdurationactivebout", "stddurationactivebout"]
    # the subset of requested features this function can compute
@ -99,6 +105,6 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
        steps_intraday_data = filter_data_by_segment(steps_intraday_data, time_segment)

        if not steps_intraday_data.empty:
-            steps_intraday_features = extractStepsFeaturesFromIntradayData(steps_intraday_data, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features)
+            steps_intraday_features = extractStepsFeaturesFromIntradayData(steps_intraday_data, reference_hour, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features)
    
    return steps_intraday_features
--- a/src/features/phone_activity_recognition/rapids/main.py
+++ b/src/features/phone_activity_recognition/rapids/main.py
@ -1,5 +1,4 @@
 import pandas as pd
-import numpy as np

 def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):

@ -31,11 +30,13 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
                if "duration" + column.lower() in features_to_compute:
                    filtered_data = ar_episodes[ar_episodes["activity_name"].isin(pd.Series(activity_labels))]
                    if not filtered_data.empty:
-                        ar_features["duration" + column.lower()] = ar_episodes[ar_episodes["activity_name"].isin(pd.Series(activity_labels))].groupby(["local_segment"])["duration"].sum().fillna(0)
+                        ar_features["duration" + column.lower()] = ar_episodes[ar_episodes["activity_name"].isin(pd.Series(activity_labels))].groupby(["local_segment"])["duration"].sum()
                    else:
                        ar_features["duration" + column.lower()] = 0

            ar_features.index.names = ["local_segment"]
            ar_features = ar_features.reset_index()
    
+    ar_features.fillna(value={"count": 0, "countuniqueactivities": 0, "durationstationary": 0, "durationmobile": 0, "durationvehicle": 0, "mostcommonactivity": 4}, inplace=True)
+
    return ar_features
--- a/src/features/phone_applications_foreground/rapids/main.py
+++ b/src/features/phone_applications_foreground/rapids/main.py
@ -9,57 +9,41 @@ def compute_features(filtered_data, apps_type, requested_features, apps_features
    if "timeoffirstuse" in requested_features:
        time_first_event = filtered_data.sort_values(by="timestamp", ascending=True).drop_duplicates(subset="local_segment", keep="first").set_index("local_segment")
        if time_first_event.empty:
-            apps_features["timeoffirstuse" + apps_type] = np.nan
+            apps_features["timeoffirstuse" + apps_type] = 1500 # np.nan
        else:
            apps_features["timeoffirstuse" + apps_type] = time_first_event["local_hour"] * 60 + time_first_event["local_minute"]
    if "timeoflastuse" in requested_features:
        time_last_event = filtered_data.sort_values(by="timestamp", ascending=False).drop_duplicates(subset="local_segment", keep="first").set_index("local_segment")
        if time_last_event.empty:
-            apps_features["timeoflastuse" + apps_type] = np.nan
+            apps_features["timeoflastuse" + apps_type] = 1500 # np.nan
        else:
            apps_features["timeoflastuse" + apps_type] = time_last_event["local_hour"] * 60 + time_last_event["local_minute"]
    if "frequencyentropy" in requested_features:
        apps_with_count = filtered_data.groupby(["local_segment","application_name"]).count().sort_values(by="timestamp", ascending=False).reset_index()
        if (len(apps_with_count.index) < 2 ):
-            apps_features["frequencyentropy" + apps_type] = np.nan
+            apps_features["frequencyentropy" + apps_type] = 0 # np.nan
        else:    
            apps_features["frequencyentropy" + apps_type] = apps_with_count.groupby("local_segment")["timestamp"].agg(entropy)
    if "countevent" in requested_features:
        apps_features["countevent" + apps_type] = filtered_data.groupby(["local_segment"]).count()["timestamp"]
-        apps_features.fillna(value={"countevent" + apps_type: 0}, inplace=True)

    if "countepisode" in requested_features:
        apps_features["countepisode" + apps_type] = filtered_data.groupby(["local_segment"]).count()["start_timestamp"]
-        apps_features.fillna(value={"countepisode" + apps_type: 0}, inplace=True)

    if "minduration" in requested_features:
-        grouped_data = filtered_data.groupby(by = ['local_segment'])['duration'].min()
-        if grouped_data.empty:
-            apps_features["minduration" + apps_type] = np.nan
-        else:
-            apps_features["minduration" + apps_type] = grouped_data
+        apps_features["minduration" + apps_type] = filtered_data.groupby(by = ["local_segment"])["duration"].min()
            
    if "maxduration" in requested_features:
-        grouped_data = filtered_data.groupby(by = ['local_segment'])['duration'].max()
-        if grouped_data.empty:
-            apps_features["maxduration" + apps_type] = np.nan
-        else:
-            apps_features["maxduration" + apps_type] = grouped_data
+        apps_features["maxduration" + apps_type] = filtered_data.groupby(by = ["local_segment"])["duration"].max()
            
    if "meanduration" in requested_features:
-        grouped_data = filtered_data.groupby(by = ['local_segment'])['duration'].mean()
-        if grouped_data.empty:
-            apps_features["meanduration" + apps_type] = np.nan
-        else:
-            apps_features["meanduration" + apps_type] = grouped_data
+        apps_features["meanduration" + apps_type] = filtered_data.groupby(by = ["local_segment"])["duration"].mean()
            
    if "sumduration" in requested_features:
-        grouped_data = filtered_data.groupby(by = ['local_segment'])['duration'].sum()
-        if grouped_data.empty:
-            apps_features["sumduration" + apps_type] = np.nan
-        else:
-            apps_features["sumduration" + apps_type] = grouped_data
-    apps_features.index.names = ['local_segment']
+        apps_features["sumduration" + apps_type] = filtered_data.groupby(by = ["local_segment"])["duration"].sum()
+    
+    apps_features.index.names = ["local_segment"]
+    
    return apps_features

 def process_app_features(data, requested_features, time_segment, provider, filter_data_by_segment):
@ -145,4 +129,6 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
        
        features = pd.merge(episodes_features, features, how='outer', on='local_segment')

+    features.fillna(value={feature_name: 0 for feature_name in features.columns if feature_name.startswith(("countevent", "countepisode", "minduration", "maxduration", "meanduration", "sumduration"))}, inplace=True)
+    
    return features
--- a/src/features/phone_bluetooth/doryab/main.py
+++ b/src/features/phone_bluetooth/doryab/main.py
@ -14,8 +14,8 @@ def deviceFeatures(devices, ownership, common_devices, features_to_compute, feat
        features = features.join(device_value_counts.groupby("local_segment")["bt_address"].nunique().to_frame("uniquedevices" + ownership), how="outer")
    if "meanscans" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].mean().to_frame("meanscans" + ownership), how="outer")
-    if "stdscans" in features_to_compute:
-        features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership), how="outer")
+    if "stdscans" in features_to_compute: 
+        features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership).fillna(0), how="outer")
    # Most frequent device within segments, across segments, and across dataset
    if "countscansmostfrequentdevicewithinsegments" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].max().to_frame("countscansmostfrequentdevicewithinsegments" + ownership), how="outer")
--- a/src/features/phone_calls/rapids/main.R
+++ b/src/features/phone_calls/rapids/main.R
@ -88,6 +88,16 @@ rapids_features <- function(sensor_data_files, time_segment, provider){
        features <- call_features_of_type(calls_of_type, features_type, call_type, time_segment, requested_features)
        call_features <- merge(call_features, features, all=TRUE)
    }
-    call_features <- call_features %>% mutate_at(vars(contains("countmostfrequentcontact") | contains("distinctcontacts") | contains("count")), list( ~ replace_na(., 0)))
+
+    # Fill selected columns with a high number
+    time_cols <- select(call_features, contains("timefirstcall") |  contains("timelastcall")) %>% 
+        colnames(.)
+
+    call_features <- call_features %>% 
+        mutate_at(., time_cols, ~replace(., is.na(.), 1500))
+
+    # Fill NA values with 0
+    call_features <- call_features %>% mutate_all(~replace(., is.na(.), 0))
+
    return(call_features)
 }
--- a/src/features/phone_conversation/rapids/main.py
+++ b/src/features/phone_conversation/rapids/main.py
@ -140,7 +140,7 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
            if "voicemaxenergy" in features_to_compute:
                conversation_features["voicemaxenergy"] = conversation_data[conversation_data['inference']==2].groupby(["local_segment"])["double_energy"].max()

-
+            conversation_features.fillna(value={feature_name: 0 for feature_name in conversation_features.columns if feature_name not in ["timefirstconversation", "timelastconversation", "sdconversationduration", "noisesdenergy", "voicesdenergy"]}, inplace=True)
            conversation_features = conversation_features.reset_index()

    return conversation_features
--- a/src/features/phone_data_yield/rapids/main.R
+++ b/src/features/phone_data_yield/rapids/main.R
@ -3,9 +3,11 @@ library(tidyr)
 library(readr)

 compute_data_yield_features <- function(data, feature_name, time_segment, provider){
+  
  data <- data %>% filter_data_by_segment(time_segment)
-  if(nrow(data) == 0)
+  if(nrow(data) == 0){
    return(tibble(local_segment = character(), ratiovalidyieldedminutes = numeric(), ratiovalidyieldedhours = numeric()))
+  }
  features <- data %>%
    separate(timestamps_segment, into = c("start_timestamp", "end_timestamp"), convert = T, sep = ",") %>% 
    mutate(duration_minutes = (end_timestamp - start_timestamp) / 60000,
--- a/src/features/phone_esm/straw/esm.py
+++ b/src/features/phone_esm/straw/esm.py
@ -0,0 +1,274 @@
+from collections.abc import Collection
+
+import numpy as np
+import pandas as pd
+from pytz import timezone
+import datetime, json
+
+# from config.models import ESM, Participant
+# from features import helper
+
+ESM_STATUS_ANSWERED = 2
+
+GROUP_SESSIONS_BY = ["device_id", "esm_session"] # 'participant_id
+
+SESSION_STATUS_UNANSWERED = "ema_unanswered"
+SESSION_STATUS_DAY_FINISHED = "day_finished"
+SESSION_STATUS_COMPLETE = "ema_completed"
+
+ANSWER_DAY_FINISHED = "DayFinished3421"
+ANSWER_DAY_OFF = "DayOff3421"
+ANSWER_SET_EVENING = "DayFinishedSetEvening"
+
+MAX_MORNING_LENGTH = 3
+# When the participant was not yet at work at the time of the first (morning) EMA,
+# only three items were answered.
+# Two sleep related items and one indicating NOT starting work yet.
+# Daytime EMAs are all longer, in fact they always consist of at least 6 items.
+
+
+TZ_LJ = timezone("Europe/Ljubljana")
+COLUMN_TIMESTAMP = "timestamp"
+COLUMN_TIMESTAMP_ESM = "double_esm_user_answer_timestamp"
+
+
+def get_date_from_timestamp(df_aware) -> pd.DataFrame:
+    """
+    Transform a UNIX timestamp into a datetime (with Ljubljana timezone).
+    Additionally, extract only the date part, where anything until 4 AM is considered the same day.
+
+    Parameters
+    ----------
+    df_aware: pd.DataFrame
+        Any AWARE-type data as defined in models.py.
+
+    Returns
+    -------
+    df_aware: pd.DataFrame
+        The same dataframe with datetime_lj and date_lj columns added.
+
+    """
+    if COLUMN_TIMESTAMP_ESM in df_aware:
+        column_timestamp = COLUMN_TIMESTAMP_ESM
+    else:
+        column_timestamp = COLUMN_TIMESTAMP
+
+    df_aware["datetime_lj"] = df_aware[column_timestamp].apply(
+        lambda x: datetime.datetime.fromtimestamp(x / 1000.0, tz=TZ_LJ)
+    )
+    df_aware = df_aware.assign(
+        date_lj=lambda x: (x.datetime_lj - datetime.timedelta(hours=4)).dt.date
+    )
+    # Since daytime EMAs could *theoretically* last beyond midnight, but never after 4 AM,
+    # the datetime is first translated to 4 h earlier.
+
+    return df_aware
+
+
+def preprocess_esm(df_esm: pd.DataFrame) -> pd.DataFrame:
+    """
+    Convert timestamps into human-readable datetimes and dates
+    and expand the JSON column into several Pandas DF columns.
+
+    Parameters
+    ----------
+    df_esm: pd.DataFrame
+        A dataframe of esm data.
+
+    Returns
+    -------
+    df_esm_preprocessed: pd.DataFrame
+        A dataframe with added columns: datetime in Ljubljana timezone and all fields from ESM_JSON column.
+    """
+    df_esm = get_date_from_timestamp(df_esm)
+
+    df_esm_json = df_esm["esm_json"].apply(json.loads)
+    df_esm_json = pd.json_normalize(df_esm_json).drop(
+        columns=["esm_trigger"]
+    )  # The esm_trigger column is already present in the main df.
+    return df_esm.join(df_esm_json)
+
+
+def classify_sessions_by_completion(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
+    """
+    For each distinct EMA session, determine how the participant responded to it.
+    Possible outcomes are: SESSION_STATUS_UNANSWERED, SESSION_STATUS_DAY_FINISHED, and SESSION_STATUS_COMPLETE
+
+    This is done in three steps.
+
+    First, the esm_status is considered.
+    If any of the ESMs in a session has a status *other than* "answered", then this session is taken as unfinished.
+
+    Second, the sessions which do not represent full questionnaires are identified.
+    These are sessions where participants only marked they are finished with the day or have not yet started working.
+
+    Third, the sessions with only one item are marked with their trigger.
+    We never offered questionnaires with single items, so we can be sure these are unfinished.
+
+    Finally, all sessions that remain are marked as completed.
+    By going through different possibilities in expl_esm_adherence.ipynb, this turned out to be a reasonable option.
+
+    Parameters
+    ----------
+    df_esm_preprocessed: pd.DataFrame
+        A preprocessed dataframe of esm data, which must include the session ID (esm_session).
+
+    Returns
+    -------
+    df_session_counts: pd.DataFrame
+        A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with their statuses and the number of items.
+    """
+    sessions_grouped = df_esm_preprocessed.groupby(GROUP_SESSIONS_BY)
+
+    # 0. First, assign all session statuses as NaN.
+    df_session_counts = pd.DataFrame(sessions_grouped.count()["timestamp"]).rename(
+        columns={"timestamp": "esm_session_count"}
+    )
+    df_session_counts["session_response"] = np.nan
+
+    # 1. Identify all ESMs with status other than answered.
+    esm_not_answered = sessions_grouped.apply(
+        lambda x: (x.esm_status != ESM_STATUS_ANSWERED).any()
+    )
+    df_session_counts.loc[
+        esm_not_answered, "session_response"
+    ] = SESSION_STATUS_UNANSWERED
+
+    # 2. Identify non-sessions, i.e. answers about the end of the day.
+    non_session = sessions_grouped.apply(
+        lambda x: (
+            (x.esm_user_answer == ANSWER_DAY_FINISHED)  # I finished working for today.
+            | (x.esm_user_answer == ANSWER_DAY_OFF)  # I am not going to work today.
+            | (
+                x.esm_user_answer == ANSWER_SET_EVENING
+            )  # When would you like to answer the evening EMA?
+        ).any()
+    )
+    df_session_counts.loc[non_session, "session_response"] = SESSION_STATUS_DAY_FINISHED
+
+    # 3. Identify sessions appearing only once, as those were not true EMAs for sure.
+    singleton_sessions = (df_session_counts.esm_session_count == 1) & (
+        df_session_counts.session_response.isna()
+    )
+    df_session_1 = df_session_counts[singleton_sessions]
+    df_esm_unique_session = df_session_1.join(
+        df_esm_preprocessed.set_index(GROUP_SESSIONS_BY), how="left"
+    )
+    df_esm_unique_session = df_esm_unique_session.assign(
+        session_response=lambda x: x.esm_trigger
+    )["session_response"]
+    df_session_counts.loc[
+        df_esm_unique_session.index, "session_response"
+    ] = df_esm_unique_session
+
+    # 4. Mark the remaining sessions as completed.
+    df_session_counts.loc[
+        df_session_counts.session_response.isna(), "session_response"
+    ] = SESSION_STATUS_COMPLETE
+
+    return df_session_counts
+
+
+def classify_sessions_by_time(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
+    """
+    For each EMA session, determine the time of the first user answer and its time type (morning, workday, or evening.)
+
+    Parameters
+    ----------
+    df_esm_preprocessed: pd.DataFrame
+        A preprocessed dataframe of esm data, which must include the session ID (esm_session).
+
+    Returns
+    -------
+    df_session_time: pd.DataFrame
+        A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with their time type and timestamp of first answer.
+    """
+    df_session_time = (
+        df_esm_preprocessed.sort_values(["datetime_lj"]) # "participant_id"
+        .groupby(GROUP_SESSIONS_BY)
+        .first()[["time", "datetime_lj"]]
+    )
+    return df_session_time
+
+
+def classify_sessions_by_completion_time(
+    df_esm_preprocessed: pd.DataFrame,
+) -> pd.DataFrame:
+    """
+    The point of this function is to not only classify sessions by using the previously defined functions.
+    It also serves to "correct" the time type of some EMA sessions.
+
+    A morning questionnaire could seamlessly transition into a daytime questionnaire,
+        if the participant was already at work.
+    In this case, the "time" label changed mid-session.
+    Because of the way classify_sessions_by_time works, this questionnaire was classified as "morning".
+    But for all intents and purposes, it can be treated as a "daytime" EMA.
+
+    The way this scenario is differentiated from a true "morning" questionnaire,
+        where the participant is NOT yet at work, is by considering their length.
+
+    Parameters
+    ----------
+    df_esm_preprocessed: pd.DataFrame
+        A preprocessed dataframe of esm data, which must include the session ID (esm_session).
+
+    Returns
+    -------
+    df_session_counts_time: pd.DataFrame
+        A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with statuses, the number of items,
+            their time type (with some morning EMAs reclassified) and timestamp of first answer.
+
+    """
+    df_session_counts = classify_sessions_by_completion(df_esm_preprocessed)
+    df_session_time = classify_sessions_by_time(df_esm_preprocessed)
+
+    df_session_counts_time = df_session_time.join(df_session_counts)
+
+    morning_transition_to_daytime = (df_session_counts_time.time == "morning") & (
+        df_session_counts_time.esm_session_count > MAX_MORNING_LENGTH
+    )
+
+    df_session_counts_time.loc[morning_transition_to_daytime, "time"] = "daytime"
+
+    return df_session_counts_time
+
+
+# def clean_up_esm(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
+#     """
+#     This function eliminates invalid ESM responses.
+#     It removes unanswered ESMs and those that indicate end of work and similar.
+#     It also extracts a numeric answer from strings such as "4 - I strongly agree".
+
+#     Parameters
+#     ----------
+#     df_esm_preprocessed: pd.DataFrame
+#         A preprocessed dataframe of esm data.
+
+#     Returns
+#     -------
+#     df_esm_clean: pd.DataFrame
+#         A subset of the original dataframe.
+
+#     """
+#     df_esm_clean = df_esm_preprocessed[
+#         df_esm_preprocessed["esm_status"] == ESM_STATUS_ANSWERED
+#     ]
+#     df_esm_clean = df_esm_clean[
+#         ~df_esm_clean["esm_user_answer"].isin(
+#             [ANSWER_DAY_FINISHED, ANSWER_DAY_OFF, ANSWER_SET_EVENING]
+#         )
+#     ]
+#     df_esm_clean["esm_user_answer_numeric"] = np.nan
+#     esm_type_numeric = [
+#         ESM.ESM_TYPE.get("radio"),
+#         ESM.ESM_TYPE.get("scale"),
+#         ESM.ESM_TYPE.get("number"),
+#     ]
+#     df_esm_clean.loc[
+#         df_esm_clean["esm_type"].isin(esm_type_numeric)
+#     ] = df_esm_clean.loc[df_esm_clean["esm_type"].isin(esm_type_numeric)].assign(
+#         esm_user_answer_numeric=lambda x: x.esm_user_answer.str.slice(stop=1).astype(
+#             int
+#         )
+#     )
+#     return df_esm_clean
--- a/src/features/phone_esm/straw/esm_JCQ.py
+++ b/src/features/phone_esm/straw/esm_JCQ.py
@ -0,0 +1,108 @@
+import pandas as pd
+
+JCQ_ORIGINAL_MAX = 4
+JCQ_ORIGINAL_MIN = 1
+
+dict_JCQ_demand_control_reverse = {
+    75: (
+        "I was NOT asked",
+        "Men legde mij geen overdreven",
+        "Men legde mij GEEN overdreven",  # Capitalized in some versions
+        "Od mene se NI zahtevalo",
+    ),
+    76: (
+        "I had enough time to do my work",
+        "Ik had voldoende tijd om mijn werk",
+        "Imela sem dovolj časa, da končam",
+        "Imel sem dovolj časa, da končam",
+    ),
+    77: (
+        "I was free of conflicting demands",
+        "Er werden mij op het werk geen tegenstrijdige",
+        "Er werden mij op het werk GEEN tegenstrijdige",  # Capitalized in some versions
+        "Pri svojem delu se NISEM srečeval",
+    ),
+    79: (
+        "My job involved a lot of repetitive work",
+        "Mijn taak omvatte veel repetitief werk",
+        "Moje delo je vključevalo veliko ponavljajočega",
+    ),
+    85: (
+        "On my job, I had very little freedom",
+        "In mijn taak had ik zeer weinig vrijheid",
+        "Pri svojem delu sem imel zelo malo svobode",
+        "Pri svojem delu sem imela zelo malo svobode",
+    ),
+}
+
+
+def reverse_jcq_demand_control_scoring(
+    df_esm_jcq_demand_control: pd.DataFrame,
+) -> pd.DataFrame:
+    """
+    This function recodes answers in the Job Content Questionnaire (JCQ) by first incrementing them by 1,
+    to be in line with original (1-4) scoring.
+    Then, some answers are reversed (i.e. 1 becomes 4 etc.), because the questions are negatively phrased.
+    These answers are listed in dict_JCQ_demand_control_reverse and identified by their question ID.
+    However, the existing data is checked against literal phrasing of these questions
+        to protect against wrong numbering of questions (differing question IDs).
+
+    Parameters
+    ----------
+    df_esm_jcq_demand_control: pd.DataFrame
+        A cleaned up dataframe, which must also include esm_user_answer_numeric.
+
+    Returns
+    -------
+    df_esm_jcq_demand_control: pd.DataFrame
+        The same dataframe with a column esm_user_score containing answers recoded and reversed.
+    """
+    df_esm_jcq_demand_control_unique_answers = (
+        df_esm_jcq_demand_control.groupby("question_id")
+        .esm_instructions.value_counts()
+        .rename()
+        .reset_index()
+    )
+    # Tabulate all possible answers to each question (group by question ID).
+    for q_id in dict_JCQ_demand_control_reverse.keys():
+        # Look through all answers that need to be reversed.
+        possible_answers = df_esm_jcq_demand_control_unique_answers.loc[
+            df_esm_jcq_demand_control_unique_answers["question_id"] == q_id,
+            "esm_instructions",
+        ]
+        # These are all answers to a given question (by q_id).
+        answers_matches = possible_answers.str.startswith(
+            dict_JCQ_demand_control_reverse.get(q_id)
+        )
+        # See if they are expected, i.e. included in the dictionary.
+        if not answers_matches.all():
+            print("One of the answers that occur in the data should not be reversed.")
+            print("This was the answer found in the data: ")
+            raise KeyError(possible_answers[~answers_matches])
+            # In case there is an unexpected answer, raise an exception.
+
+    try:
+        df_esm_jcq_demand_control = df_esm_jcq_demand_control.assign(
+            esm_user_score=lambda x: x.esm_user_answer_numeric + 1
+        )
+        # Increment the original answer by 1
+        # to keep in line with traditional scoring (JCQ_ORIGINAL_MIN - JCQ_ORIGINAL_MAX).
+        df_esm_jcq_demand_control[
+            df_esm_jcq_demand_control["question_id"].isin(
+                dict_JCQ_demand_control_reverse.keys()
+            )
+        ] = df_esm_jcq_demand_control[
+            df_esm_jcq_demand_control["question_id"].isin(
+                dict_JCQ_demand_control_reverse.keys()
+            )
+        ].assign(
+            esm_user_score=lambda x: JCQ_ORIGINAL_MAX
+            + JCQ_ORIGINAL_MIN
+            - x.esm_user_score
+        )
+        # Reverse the items that require it.
+    except AttributeError as e:
+        print("Please clean the dataframe first using esm_preprocess.clean_up_esm.")
+        print(e)
+
+    return df_esm_jcq_demand_control
--- a/src/features/phone_esm/straw/esm_preprocess.py
+++ b/src/features/phone_esm/straw/esm_preprocess.py
@ -0,0 +1,135 @@
+import json
+
+import numpy as np
+import pandas as pd
+
+ESM_TYPE = {
+    "text": 1,
+    "radio": 2,
+    "checkbox": 3,
+    "likert": 4,
+    "quick_answers": 5,
+    "scale": 6,
+    "datetime": 7,
+    "pam": 8,
+    "number": 9,
+    "web": 10,
+    "date": 11,
+}
+
+QUESTIONNAIRE_IDS = {
+    "sleep_quality": 1,
+    "PANAS_positive_affect": 8,
+    "PANAS_negative_affect": 9,
+    "JCQ_job_demand": 10,
+    "JCQ_job_control": 11,
+    "JCQ_supervisor_support": 12,
+    "JCQ_coworker_support": 13,
+    "PFITS_supervisor": 14,
+    "PFITS_coworkers": 15,
+    "UWES_vigor": 16,
+    "UWES_dedication": 17,
+    "UWES_absorption": 18,
+    "COPE_active": 19,
+    "COPE_support": 20,
+    "COPE_emotions": 21,
+    "balance_life_work": 22,
+    "balance_work_life": 23,
+    "recovery_experience_detachment": 24,
+    "recovery_experience_relaxation": 25,
+    "symptoms": 26,
+    "appraisal_stressfulness_event": 87,
+    "appraisal_threat": 88,
+    "appraisal_challenge": 89,
+    "appraisal_event_time": 90,
+    "appraisal_event_duration": 91,
+    "appraisal_event_work_related": 92,
+    "appraisal_stressfulness_period": 93,
+    "late_work": 94,
+    "work_hours": 95,
+    "left_work": 96,
+    "activities": 97,
+    "coffee_breaks": 98,
+    "at_work_yet": 99,
+}
+
+ESM_STATUS_ANSWERED = 2
+
+GROUP_SESSIONS_BY = ["participant_id", "device_id", "esm_session"]
+
+SESSION_STATUS_UNANSWERED = "ema_unanswered"
+SESSION_STATUS_DAY_FINISHED = "day_finished"
+SESSION_STATUS_COMPLETE = "ema_completed"
+
+ANSWER_DAY_FINISHED = "DayFinished3421"
+ANSWER_DAY_OFF = "DayOff3421"
+ANSWER_SET_EVENING = "DayFinishedSetEvening"
+
+MAX_MORNING_LENGTH = 3
+# When the participant was not yet at work at the time of the first (morning) EMA,
+# only three items were answered.
+# Two sleep related items and one indicating NOT starting work yet.
+# Daytime EMAs are all longer, in fact they always consist of at least 6 items.
+
+
+def preprocess_esm(df_esm: pd.DataFrame) -> pd.DataFrame:
+    """
+    Convert timestamps into human-readable datetimes and dates
+    and expand the JSON column into several Pandas DF columns.
+
+    Parameters
+    ----------
+    df_esm: pd.DataFrame
+        A dataframe of esm data.
+
+    Returns
+    -------
+    df_esm_preprocessed: pd.DataFrame
+        A dataframe with added columns: datetime in Ljubljana timezone and all fields from ESM_JSON column.
+    """
+    df_esm_json = df_esm["esm_json"].apply(json.loads)
+    df_esm_json = pd.json_normalize(df_esm_json).drop(
+        columns=["esm_trigger"]
+    )  # The esm_trigger column is already present in the main df.
+    return df_esm.join(df_esm_json)
+
+
+def clean_up_esm(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
+    """
+    This function eliminates invalid ESM responses.
+    It removes unanswered ESMs and those that indicate end of work and similar.
+    It also extracts a numeric answer from strings such as "4 - I strongly agree".
+
+    Parameters
+    ----------
+    df_esm_preprocessed: pd.DataFrame
+        A preprocessed dataframe of esm data.
+
+    Returns
+    -------
+    df_esm_clean: pd.DataFrame
+        A subset of the original dataframe.
+
+    """
+    df_esm_clean = df_esm_preprocessed[
+        df_esm_preprocessed["esm_status"] == ESM_STATUS_ANSWERED
+    ]
+    df_esm_clean = df_esm_clean[
+        ~df_esm_clean["esm_user_answer"].isin(
+            [ANSWER_DAY_FINISHED, ANSWER_DAY_OFF, ANSWER_SET_EVENING]
+        )
+    ]
+    df_esm_clean["esm_user_answer_numeric"] = np.nan
+    esm_type_numeric = [
+        ESM_TYPE.get("radio"),
+        ESM_TYPE.get("scale"),
+        ESM_TYPE.get("number"),
+    ]
+    df_esm_clean.loc[
+        df_esm_clean["esm_type"].isin(esm_type_numeric)
+    ] = df_esm_clean.loc[df_esm_clean["esm_type"].isin(esm_type_numeric)].assign(
+        esm_user_answer_numeric=lambda x: x.esm_user_answer.str.slice(stop=1).astype(
+            int
+        )
+    )
+    return df_esm_clean
--- a/src/features/phone_esm/straw/main.py
+++ b/src/features/phone_esm/straw/main.py
@ -0,0 +1,66 @@
+import pandas as pd
+
+QUESTIONNAIRE_IDS = {
+    "sleep_quality": 1,
+    "PANAS_positive_affect": 8,
+    "PANAS_negative_affect": 9,
+    "JCQ_job_demand": 10,
+    "JCQ_job_control": 11,
+    "JCQ_supervisor_support": 12,
+    "JCQ_coworker_support": 13,
+    "PFITS_supervisor": 14,
+    "PFITS_coworkers": 15,
+    "UWES_vigor": 16,
+    "UWES_dedication": 17,
+    "UWES_absorption": 18,
+    "COPE_active": 19,
+    "COPE_support": 20,
+    "COPE_emotions": 21,
+    "balance_life_work": 22,
+    "balance_work_life": 23,
+    "recovery_experience_detachment": 24,
+    "recovery_experience_relaxation": 25,
+    "symptoms": 26,
+    "appraisal_stressfulness_event": 87,
+    "appraisal_threat": 88,
+    "appraisal_challenge": 89,
+    "appraisal_event_time": 90,
+    "appraisal_event_duration": 91,
+    "appraisal_event_work_related": 92,
+    "appraisal_stressfulness_period": 93,
+    "late_work": 94,
+    "work_hours": 95,
+    "left_work": 96,
+    "activities": 97,
+    "coffee_breaks": 98,
+    "at_work_yet": 99,
+}
+
+
+def straw_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    esm_data = pd.read_csv(sensor_data_files["sensor_data"])
+    requested_features = provider["FEATURES"]
+    # name of the features this function can compute
+    requested_scales = provider["SCALES"]
+    base_features_names = ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support", 
+                            "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
+    #TODO Check valid questionnaire and feature names.
+    # the subset of requested features this function can compute
+    features_to_compute = list(set(requested_features) & set(base_features_names))
+    esm_features = pd.DataFrame(columns=["local_segment"] + features_to_compute)
+    if not esm_data.empty:
+        esm_data = filter_data_by_segment(esm_data, time_segment)
+
+        if not esm_data.empty:
+            esm_features = pd.DataFrame()
+            for scale in requested_scales:
+                questionnaire_id = QUESTIONNAIRE_IDS[scale]
+                mask = esm_data["questionnaire_id"] == questionnaire_id
+                esm_features[scale + "_mean"] = esm_data.loc[mask].groupby(["local_segment"])["esm_user_score"].mean()
+                #TODO Create the column esm_user_score in esm_clean. Currently, this is only done when reversing.
+
+            esm_features = esm_features.reset_index()
+            if 'index' in esm_features:  # In case of an empty esm_features df
+                esm_features.rename(columns={'index': 'local_segment'}, inplace=True)
+
+    return esm_features
--- a/src/features/phone_esm/straw/preprocess.py
+++ b/src/features/phone_esm/straw/preprocess.py
@ -0,0 +1,25 @@
+from esm_preprocess import *
+from esm_JCQ import reverse_jcq_demand_control_scoring
+
+requested_scales = snakemake.params["scales"]
+
+df_esm = pd.read_csv(snakemake.input[0])
+df_esm_preprocessed = preprocess_esm(df_esm)
+
+if not all([scale in QUESTIONNAIRE_IDS for scale in requested_scales]):
+    unknown_scales = set(requested_scales) - set(QUESTIONNAIRE_IDS.keys())
+    print("The requested questionnaire name should be one of the following:")
+    print(QUESTIONNAIRE_IDS.keys())
+    raise ValueError("You requested scales not collected: ", unknown_scales)
+
+df_esm_clean = clean_up_esm(df_esm_preprocessed)
+df_esm_clean["esm_user_score"] = df_esm_clean["esm_user_answer_numeric"]
+
+for scale in requested_scales:
+    questionnaire_id = QUESTIONNAIRE_IDS[scale]
+    mask = df_esm_clean["questionnaire_id"] == questionnaire_id
+    if scale.startswith("JCQ"):
+        df_esm_clean.loc[mask] = reverse_jcq_demand_control_scoring(df_esm_clean.loc[mask])
+    #TODO Reverse other questionnaires if needed and/or adapt esm_user_score to original scoring.
+
+df_esm_clean.to_csv(snakemake.output[0], index=False)
--- a/src/features/phone_esm/straw/process_user_event_related_segments.py
+++ b/src/features/phone_esm/straw/process_user_event_related_segments.py
@ -0,0 +1,260 @@
+import pandas as pd
+import numpy as np
+import datetime
+
+import math, sys, yaml
+
+from esm_preprocess import clean_up_esm
+from esm import classify_sessions_by_completion_time, preprocess_esm
+
+input_data_files = dict(snakemake.input)
+
+def format_timestamp(x):
+    """Format a duration given in seconds as "<h>H <m>M <s>S", with the parts separated by spaces.
+    Parts that equal zero are omitted, e.g., "<m>M <s>S" or just "<s>S".
+
+    Args:
+        x (int): duration in seconds
+
+    Returns:
+        str: formatted duration using the "<h>H <m>M <s>S" syntax
+    """
+    tstring=""
+    space = False
+    if x//3600 > 0:
+        tstring += f"{x//3600}H"
+        space = True
+    if x % 3600 // 60 > 0:
+        tstring += f" {x % 3600 // 60}M" if "H" in tstring else f"{x % 3600 // 60}M"
+    if x % 60 > 0:  
+        tstring += f" {x % 60}S" if "M" in tstring or "H" in tstring else f"{x % 60}S"
+    
+    return tstring
+
+
+def extract_ers(esm_df):
+    """This method has two major functionalities: 
+        (1) It prepares the STRAW event-related segments file from the ESM file. The execution protocol depends on
+            the segmenting method specified in the config.yaml file.
+        (2) It prepares and writes csv with targets and corresponding time segments labels. This is later used 
+            in the overall cleaning script (straw).
+    
+    Details about each segmenting method are listed below by each corresponding condition. Refer to the RAPIDS documentation for the 
+    ERS file format: https://www.rapids.science/1.9/setup/configuration/#time-segments -> event segments
+
+    Args:
+        esm_df (DataFrame): ESM data read from the file of the current participant.
+
+    Returns:
+        extracted_ers (DataFrame): dataframe with all necessary information to write event-related segments file 
+        in the correct format.
+    """
+
+    pd.set_option("display.max_rows", 100)
+    pd.set_option("display.max_columns", None)
+
+    with open('config.yaml', 'r') as stream:
+        config = yaml.safe_load(stream)  # safe_load avoids constructing arbitrary Python objects
+
+    pd.DataFrame(columns=["label"]).to_csv(snakemake.output[1]) # Create an empty stress_events_targets file 
+
+    esm_preprocessed = clean_up_esm(preprocess_esm(esm_df))
+
+    # Take only ema_completed sessions responses
+    classified = classify_sessions_by_completion_time(esm_preprocessed)
+    esm_filtered_sessions = classified[classified["session_response"] == 'ema_completed'].reset_index()[['device_id', 'esm_session']]
+    esm_df = esm_preprocessed.merge(esm_filtered_sessions, on=['device_id', 'esm_session'], how='inner') # match (device_id, esm_session) pairs jointly
+    
+    segmenting_method = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"]
+    
+    if segmenting_method in ["30_before", "90_before"]: # takes an x-minute period before the questionnaire + the duration of the questionnaire
+        """ '30-minutes and 90-minutes before' share the same fundamental logic, with a couple of deviations that are explained below.
+        Both take x-minute period before the questionnaire that is summed with the questionnaire duration.
+        All questionnaire durations over 15 minutes are excluded from the querying.
+        """
+        # Extract time-relevant information
+        extracted_ers = esm_df.groupby(["device_id", "esm_session"])['timestamp'].apply(lambda x: math.ceil((x.max() - x.min()) / 1000)).reset_index() # questionnaire length
+        extracted_ers["label"] = f"straw_event_{segmenting_method}_" + snakemake.params["pid"] + "_" + extracted_ers.index.astype(str).str.zfill(3) 
+        extracted_ers[['event_timestamp', 'device_id']] = esm_df.groupby(["device_id", "esm_session"])['timestamp'].min().reset_index()[['timestamp', 'device_id']]
+        extracted_ers = extracted_ers[extracted_ers["timestamp"] <= 15 * 60].reset_index(drop=True) # ensure that the longest duration of the questionnaire answering is 15 min
+        extracted_ers["shift_direction"] = -1 
+
+        if segmenting_method == "30_before":
+            """The method 30-minutes before simply takes 30 minutes before the questionnaire and sums it with the questionnaire duration.
+            The timestamps are formatted with the help of format_timestamp() method.
+            """
+            time_before_questionnaire = 30 * 60 # in seconds (30 minutes)
+
+            extracted_ers["length"] = (extracted_ers["timestamp"] + time_before_questionnaire).apply(lambda x: format_timestamp(x))
+            extracted_ers["shift"] = time_before_questionnaire
+            extracted_ers["shift"] = extracted_ers["shift"].apply(lambda x: format_timestamp(x))
+        
+        elif segmenting_method == "90_before":
+            """The method 90-minutes before has an important condition. If the time between the current and the previous questionnaire is
+            longer than 90 minutes, it takes 90 minutes; otherwise it takes the original time difference between the questionnaires.
+            """
+            time_before_questionnaire = 90 * 60 # in seconds (90 minutes)
+
+            extracted_ers[['end_event_timestamp', 'device_id']] = esm_df.groupby(["device_id", "esm_session"])['timestamp'].max().reset_index()[['timestamp', 'device_id']]
+
+            extracted_ers['diffs'] = extracted_ers['event_timestamp'].astype('int64') - extracted_ers['end_event_timestamp'].shift(1, fill_value=0).astype('int64')
+            extracted_ers.loc[extracted_ers['diffs'] > time_before_questionnaire * 1000, 'diffs'] = time_before_questionnaire * 1000
+            
+            extracted_ers["diffs"] = (extracted_ers["diffs"] / 1000).apply(lambda x: math.ceil(x))
+
+            extracted_ers["length"] = (extracted_ers["timestamp"] + extracted_ers["diffs"]).apply(lambda x: format_timestamp(x))
+            extracted_ers["shift"] = extracted_ers["diffs"].apply(lambda x: format_timestamp(x))
+
+    elif segmenting_method == "stress_event":
+        """
+        TODO: update documentation for this condition
+        This is a special case of the method as it consists of two important parts:
+            (1) Generating of the ERS file (same as the methods above) and
+            (2) Generating targets file alongside with the correct time segment labels.
+        
+        This extracts event-related segments, depending on the event time and duration specified by the participant in the next
+        questionnaire. Additionally, 5 minutes before the specified start time of this event are included to account for the
+        possibility of the participant not remembering the start time precisely => this parameter can be manipulated with the variable
+        "time_before_event" which is defined below.
+
+        If the participant marked that no stressful event happened, the default of 30 minutes before the event is chosen.
+        In this case, se_threat and se_challenge are NaN.
+
+        By default, this method also excludes all events that are longer than 2.5 hours so that the segments are easily comparable.
+        """
+
+        ioi = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["INTERVAL_OF_INTEREST"] * 60 # interval of interest in seconds
+        ioi_error_tolerance = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["IOI_ERROR_TOLERANCE"]  * 60 # interval of interest error tolerance in seconds 
+
+        # Get and join required data
+        extracted_ers = esm_df.groupby(["device_id", "esm_session"])['timestamp'].apply(lambda x: math.ceil((x.max() - x.min()) / 1000)).reset_index().rename(columns={'timestamp': 'session_length'}) # questionnaire length
+        extracted_ers = extracted_ers[extracted_ers["session_length"] <= 15 * 60].reset_index(drop=True) # ensure that the longest duration of the questionnaire answering is 15 min
+        session_start_timestamp = esm_df.groupby(['device_id', 'esm_session'])['timestamp'].min().to_frame().rename(columns={'timestamp': 'session_start_timestamp'}) # questionnaire start timestamp
+        session_end_timestamp = esm_df.groupby(['device_id', 'esm_session'])['timestamp'].max().to_frame().rename(columns={'timestamp': 'session_end_timestamp'}) # questionnaire end timestamp
+        
+        # Users' answers for the stressfulness event (se) start times and durations 
+        se_time = esm_df[esm_df.questionnaire_id == 90.].set_index(['device_id', 'esm_session'])['esm_user_answer'].to_frame().rename(columns={'esm_user_answer': 'se_time'})
+        se_duration = esm_df[esm_df.questionnaire_id == 91.].set_index(['device_id', 'esm_session'])['esm_user_answer'].to_frame().rename(columns={'esm_user_answer': 'se_duration'})
+
+        # Make se_durations to the appropriate lengths
+
+        # Extract the 3 targets that will be transferred in the csv file to the cleaning script.
+        se_stressfulness_event_tg = esm_df[esm_df.questionnaire_id == 87.].set_index(['device_id', 'esm_session'])['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_stressfulness_event'})
+        se_threat_tg = esm_df[esm_df.questionnaire_id == 88.].groupby(["device_id", "esm_session"]).mean(numeric_only=True)['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_threat'})
+        se_challenge_tg = esm_df[esm_df.questionnaire_id == 89.].groupby(["device_id", "esm_session"]).mean(numeric_only=True)['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_challenge'})
+
+        # All relevant features are joined by inner join to remove standalone columns (e.g., stressfulness event target has larger count)
+        extracted_ers = extracted_ers.join(session_start_timestamp, on=['device_id', 'esm_session'], how='inner') \
+                                     .join(session_end_timestamp, on=['device_id', 'esm_session'], how='inner') \
+                                     .join(se_stressfulness_event_tg, on=['device_id', 'esm_session'], how='inner') \
+                                     .join(se_time, on=['device_id', 'esm_session'], how='left') \
+                                     .join(se_duration, on=['device_id', 'esm_session'], how='left') \
+                                     .join(se_threat_tg, on=['device_id', 'esm_session'], how='left') \
+                                     .join(se_challenge_tg, on=['device_id', 'esm_session'], how='left')
+
+        # Filter out the sessions that are not useful. Because of the ambiguity, this excludes:
+        # (1) straw event times that are marked as "0 - I don't remember"
+        extracted_ers = extracted_ers[~extracted_ers.se_time.astype(str).str.startswith("0 - ")]
+        extracted_ers.reset_index(drop=True, inplace=True)
+
+        extracted_ers.loc[extracted_ers.se_duration.astype(str).str.startswith("0 - "), 'se_duration'] = 0
+
+        # Add a default duration in case the participant answered that no stressful event occurred
+        extracted_ers["se_duration"] = extracted_ers["se_duration"].fillna(int((ioi + 2*ioi_error_tolerance) * 1000))
+
+        # Prepare data to fit the data structure in the CSV file ...
+        # Use the session start timestamp as the event time if no stress event occurred
+        extracted_ers['se_time'] = extracted_ers['se_time'].fillna(extracted_ers['session_start_timestamp'])
+        # The value is either an int (timestamp [ms]), which stays the same, or a datetime str, which is converted to a timestamp in milliseconds
+        extracted_ers['event_timestamp'] = extracted_ers['se_time'].apply(lambda x: x if isinstance(x, int) else pd.to_datetime(x).timestamp() * 1000).astype('int64')
+        extracted_ers['shift_direction'] = -1
+
+        """>>>>> begin section (could be optimized) <<<<<"""
+
+        # Checks whether the duration is marked with "1 - It's still ongoing" which means that the end of the current questionnaire
+        # is taken as end time of the segment. Else the user input duration is taken. 
+        extracted_ers['se_duration'] = \
+            np.where(
+                extracted_ers['se_duration'].astype(str).str.startswith("1 - "),
+                extracted_ers['session_end_timestamp'] - extracted_ers['event_timestamp'], 
+                extracted_ers['se_duration']
+            )
+
+        # This converts the rows of timestamps in milliseconds and the rows with datetime... to timestamps in seconds.
+        extracted_ers['se_duration'] = \
+            extracted_ers['se_duration'].apply(lambda x: math.ceil(x / 1000) if isinstance(x, int) else (pd.to_datetime(x).hour * 60 + pd.to_datetime(x).minute) * 60)
+
+        # Check explicitly whether the min duration is at least 0. This will eliminate rows that would be investigated after the end of the questionnaire.
+        extracted_ers = extracted_ers[extracted_ers['session_end_timestamp'] - extracted_ers['event_timestamp'] >= 0]
+        # Double-check whether the min se_duration is at least 0. Filter out the rest. Negative values are considered invalid.
+        extracted_ers = extracted_ers[extracted_ers["se_duration"] >= 0].reset_index(drop=True)
+
+        """>>>>> end section <<<<<"""
+
+        # Simply override all durations to be of an equal amount
+        extracted_ers['se_duration'] = ioi + 2*ioi_error_tolerance 
+
+        # If the target is 0, shift by the total stress event duration; otherwise shift by ioi_error_tolerance
+        extracted_ers['shift'] = \
+            np.where(
+                extracted_ers['appraisal_stressfulness_event'] == 0,
+                extracted_ers['se_duration'], 
+                ioi_error_tolerance
+            )
+
+        extracted_ers['shift'] = extracted_ers['shift'].apply(lambda x: format_timestamp(int(x)))
+        extracted_ers['length'] = extracted_ers['se_duration'].apply(lambda x: format_timestamp(int(x)))
+
+        # Drop event_timestamp duplicates in case the user is referencing the same event over multiple questionnaires
+        extracted_ers.drop_duplicates(subset=["event_timestamp"], keep='first', inplace=True)
+        extracted_ers.reset_index(drop=True, inplace=True)
+
+        extracted_ers["label"] = f"straw_event_{segmenting_method}_" + snakemake.params["pid"] + "_" + extracted_ers.index.astype(str).str.zfill(3)
+
+        # Write the csv of extracted ERS labels with targets related to stressfulness event   
+        extracted_ers[["label", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]].to_csv(snakemake.output[1], index=False)
+        
+    else:
+        raise Exception("Please select correct target method for the event-related segments.")
+        extracted_ers = pd.DataFrame(columns=["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"])
+
+    return extracted_ers[["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"]]
+
+
+"""
+Here the code is executed - this .py file is used both for extracting the STRAW time_segments file for an individual
+participant, and for merging all participants' files into one combined file, which is later used for assigning
+time segments to all sensors.
+
+There are two files involved (see rules extract_event_information_from_esm and merge_event_related_segments_files in preprocessing.smk)
+(1) ERS file which contains all the information about the time segment timings and
+(2) targets file which has corresponding target value for the segment label which is later used to merge with other features in the cleaning script.
+For more information, see the comment in the method above.
+"""
+if snakemake.params["stage"] == "extract": 
+    esm_df = pd.read_csv(input_data_files['esm_raw_input'])
+
+    extracted_ers = extract_ers(esm_df)
+
+    extracted_ers.to_csv(snakemake.output[0], index=False)
+
+elif snakemake.params["stage"] == "merge":
+
+    input_data_files = dict(snakemake.input)
+    straw_events = pd.DataFrame(columns=["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"])
+    stress_events_targets = pd.DataFrame(columns=["label", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"])
+
+    for input_file in input_data_files["ers_files"]:
+        ers_df = pd.read_csv(input_file)
+        straw_events = pd.concat([straw_events, ers_df], axis=0, ignore_index=True)
+
+    straw_events.to_csv(snakemake.output[0], index=False)
+
+    for input_file in input_data_files["se_files"]:
+        se_df = pd.read_csv(input_file)
+        stress_events_targets = pd.concat([stress_events_targets, se_df], axis=0, ignore_index=True)
+
+    stress_events_targets.to_csv(snakemake.output[1], index=False)
+
+
+
--- a/src/features/phone_keyboard/rapids/main.py
+++ b/src/features/phone_keyboard/rapids/main.py
@ -59,6 +59,7 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
            if "totalkeyboardtouches" in features_to_compute:
                keyboard_features["totalkeyboardtouches"] = keyboard_data.groupby(['local_segment','sessionNumber'])['is_password'].count().reset_index().groupby(['local_segment'])['is_password'].mean()
            
+            keyboard_features.fillna(value={"sessioncount": 0, "averagesessionlength": 0, "changeintextlengthlessthanminusone": 0, "changeintextlengthequaltominusone": 0, "changeintextlengthequaltoone": 0, "changeintextlengthmorethanone": 0, "maxtextlength": 0, "totalkeyboardtouches": 0}, inplace=True)
            keyboard_features = keyboard_features.reset_index()

    return keyboard_features
--- a/src/features/phone_light/rapids/main.py
+++ b/src/features/phone_light/rapids/main.py
@ -29,7 +29,7 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
            if "medianlux" in features_to_compute:
                light_features["medianlux"] = light_data.groupby(["local_segment"])["double_light_lux"].median()
            if "stdlux" in features_to_compute:
-                light_features["stdlux"] = light_data.groupby(["local_segment"])["double_light_lux"].std()
+                light_features["stdlux"] = light_data.groupby(["local_segment"])["double_light_lux"].std().fillna(0)
            
            light_features = light_features.reset_index()

--- a/src/features/phone_locations/barnett/daily_features.R
+++ b/src/features/phone_locations/barnett/daily_features.R
@ -25,9 +25,11 @@ barnett_daily_features <- function(snakemake){
  datetime_end_regex = "[0-9]{4}[\\-|\\/][0-9]{2}[\\-|\\/][0-9]{2} 23:59:59"
  location <- location %>% 
    mutate(is_daily = str_detect(assigned_segments, paste0(".*#", datetime_start_regex, ",", datetime_end_regex, ".*")))
-  
-  if(nrow(segment_labels) == 0 || nrow(location) == 0 || all(location$is_daily == FALSE) || (max(location$timestamp) - min(location$timestamp) < 86400000)){
-    warning("Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:",
+
+  does_not_span = nrow(segment_labels) == 0 || nrow(location) == 0 || all(location$is_daily == FALSE) || (max(location$timestamp) - min(location$timestamp) < 86400000)
+
+  if(is.na(does_not_span) || does_not_span){
+      warning("Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:",
            "\nLocation data rows within a daily time segment: ", nrow(filter(location, is_daily)),
            "\nLocation data time span in days: ", round((max(location$timestamp) - min(location$timestamp)) / 86400000, 2)
            )
--- a/src/features/phone_locations/doryab/add_doryab_extra_columns.py
+++ b/src/features/phone_locations/doryab/add_doryab_extra_columns.py
@ -115,7 +115,7 @@ cluster_on = provider["CLUSTER_ON"]
 strategy = provider["INFER_HOME_LOCATION_STRATEGY"]
 days_threshold = provider["MINIMUM_DAYS_TO_DETECT_HOME_CHANGES"]

-if not location_data.timestamp.is_monotonic:
+if not location_data.timestamp.is_monotonic_increasing:
    location_data.sort_values(by=["timestamp"], inplace=True)

 location_data["duration_in_seconds"] = -1 * location_data.timestamp.diff(-1) / 1000
--- a/src/features/phone_locations/doryab/main.py
+++ b/src/features/phone_locations/doryab/main.py
@ -37,7 +37,8 @@ def variance_and_logvariance_features(location_data, location_features):
    location_data["longitude_for_wvar"] = (location_data["double_longitude"] - location_data["longitude_wavg"]) ** 2 * location_data["duration"] * 60

    location_features["locationvariance"] = ((location_data_grouped["latitude_for_wvar"].sum() + location_data_grouped["longitude_for_wvar"].sum()) / (location_data_grouped["duration"].sum() * 60 - 1)).fillna(0)
-    location_features["loglocationvariance"] = np.log10(location_features["locationvariance"]).replace(-np.inf, np.nan)
+    
+    location_features["loglocationvariance"] = np.log10(location_features["locationvariance"]).replace(-np.inf, -1000000)

    return location_features

@ -180,8 +181,11 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
    location_features = location_features.merge(location_entropy(stationary_data_without_outliers), how="outer", left_index=True, right_index=True)

    # time at home
-    stationary_data["time_at_home"] = stationary_data.apply(lambda row: row["duration"] if row["distance_from_home"] <= radius_from_home else 0, axis=1)
-    location_features["timeathome"] = stationary_data[["local_segment", "time_at_home"]].groupby(["local_segment"])["time_at_home"].sum()
+    if stationary_data.empty:
+        location_features["timeathome"] = 0
+    else:
+        stationary_data["time_at_home"] = stationary_data.apply(lambda row: row["duration"] if row["distance_from_home"] <= radius_from_home else 0, axis=1)
+        location_features["timeathome"] = stationary_data[["local_segment", "time_at_home"]].groupby(["local_segment"])["time_at_home"].sum()

    # home label
    location_features["homelabel"] = stationary_data[["local_segment", "home_label"]].groupby(["local_segment"]).agg(lambda x: pd.Series.mode(x)[0])
--- a/src/features/phone_messages/rapids/main.R
+++ b/src/features/phone_messages/rapids/main.R
@ -65,6 +65,15 @@ rapids_features <- function(sensor_data_files, time_segment, provider){
        features <- message_features_of_type(messages_of_type, message_type, time_segment, requested_features)
        messages_features <- merge(messages_features, features, all=TRUE)
    }
-    messages_features <- messages_features %>% mutate_at(vars(contains("countmostfrequentcontact") | contains("distinctcontacts") | contains("count")), list( ~ replace_na(., 0)))
+    # Fill the selected time columns with a high number
+    time_cols <- select(messages_features, contains("timefirstmessages") |  contains("timelastmessages")) %>% 
+    colnames(.)
+
+    messages_features <- messages_features %>% 
+        mutate_at(., time_cols, ~replace(., is.na(.), 1500))
+    
+    # Fill NA values with 0
+    messages_features <- messages_features %>% mutate_all(~replace(., is.na(.), 0))
+    
    return(messages_features)
 }
--- a/src/features/phone_screen/rapids/main.py
+++ b/src/features/phone_screen/rapids/main.py
@ -15,7 +15,7 @@ def getEpisodeDurationFeatures(screen_data, time_segment, episode, features, ref
    if "avgduration" in features:
        duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].mean().rename(columns = {"duration":"avgduration" + episode})], axis = 1)
    if "stdduration" in features:
-        duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].std().rename(columns = {"duration":"stdduration" + episode})], axis = 1)
+        duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].std().fillna(0).rename(columns = {"duration":"stdduration" + episode})], axis = 1)
    if "firstuseafter" + "{0:0=2d}".format(reference_hour_first_use) in features:
        screen_data_episode_after_hour = screen_data_episode.copy()
        screen_data_episode_after_hour["hour"] = pd.to_datetime(screen_data_episode["local_start_date_time"]).dt.hour
@ -62,6 +62,7 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
                screen_features = pd.concat([screen_features, getEpisodeDurationFeatures(screen_data, time_segment, episode, features_episodes_to_compute, reference_hour_first_use)], axis=1)

        if not screen_features.empty:
+            screen_features.fillna(value={feature_name: 0 for feature_name in screen_features.columns if not feature_name.startswith(("stdduration", "firstuseafter"))}, inplace=True)
            screen_features = screen_features.reset_index()

    return screen_features
--- a/src/features/phone_speech/straw/main.py
+++ b/src/features/phone_speech/straw/main.py
@ -0,0 +1,30 @@
+import pandas as pd
+
+
+def straw_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+    speech_data = pd.read_csv(sensor_data_files["sensor_data"])
+    requested_features = provider["FEATURES"]
+    # names of the features this function can compute
+    base_features_names = ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
+    features_to_compute = list(set(requested_features) & set(base_features_names))
+    speech_features = pd.DataFrame(columns=["local_segment"] + features_to_compute)
+    
+    if not speech_data.empty:
+        speech_data = filter_data_by_segment(speech_data, time_segment)
+
+        if not speech_data.empty:
+            speech_features = pd.DataFrame()
+            if "meanspeech" in features_to_compute:
+                speech_features["meanspeech"] = speech_data.groupby(["local_segment"])['speech_proportion'].mean()
+            if "stdspeech" in features_to_compute:
+                speech_features["stdspeech"] = speech_data.groupby(["local_segment"])['speech_proportion'].std()
+            if "nlargest" in features_to_compute:
+                speech_features["nlargest"] = speech_data.groupby(["local_segment"])['speech_proportion'].apply(lambda x: x.nlargest(5).mean())
+            if "nsmallest" in features_to_compute:
+                speech_features["nsmallest"] = speech_data.groupby(["local_segment"])['speech_proportion'].apply(lambda x: x.nsmallest(5).mean())
+            if "medianspeech" in features_to_compute:
+                speech_features["medianspeech"] = speech_data.groupby(["local_segment"])['speech_proportion'].median()
+            
+            speech_features = speech_features.reset_index()
+
+    return speech_features
--- a/src/features/phone_wifi_visible/rapids/main.R
+++ b/src/features/phone_wifi_visible/rapids/main.R
@ -9,21 +9,26 @@ compute_wifi_feature <- function(data, feature, time_segment){
              "countscans" = data %>% summarise(!!feature := n()),
              "uniquedevices" = data %>% summarise(!!feature := n_distinct(bssid)))
    return(data)
+
   } else if(feature == "countscansmostuniquedevice"){
     # Get the most scanned device
-    mostuniquedevice <- data %>% 
+    mostuniquedevice <- data %>%
+      filter(bssid != "") %>% 
      group_by(bssid) %>% 
      mutate(N=n()) %>% 
      ungroup() %>%
      filter(N == max(N)) %>% 
      head(1) %>% # if there are multiple device with the same amount of scans pick the first one only
      pull(bssid)
+
    data <- data %>% filter_data_by_segment(time_segment)
+
    return(data %>% 
             filter(bssid == mostuniquedevice) %>%
             group_by(local_segment) %>% 
-             summarise(!!feature := n()) %>%
-             replace(is.na(.), 0))
+             summarise(!!feature := n())
+    )
+
  }
 }

@ -43,6 +48,6 @@ rapids_features <- function(sensor_data_files, time_segment, provider){
    feature <- compute_wifi_feature(wifi_data, feature_name, time_segment)
    features <- merge(features, feature, by="local_segment", all = TRUE)
  }
-
+  features <- features %>% mutate_all(~replace(., is.na(.), 0))
  return(features)
 }
--- a/src/features/utils/merge_standardized_sensor_features_for_all_participants.R
+++ b/src/features/utils/merge_standardized_sensor_features_for_all_participants.R
@ -0,0 +1,17 @@
+source("renv/activate.R")
+
+library(tidyr)
+library(purrr)
+library("dplyr", warn.conflicts = F)
+library(stringr)
+
+feature_files  <- snakemake@input[["feature_files"]]
+
+
+features_of_all_participants <- tibble(filename = feature_files) %>% # create a data frame
+  mutate(file_contents = map(filename, ~ read.csv(., stringsAsFactors = F, colClasses = c(local_segment = "character", local_segment_label = "character", local_segment_start_datetime="character", local_segment_end_datetime="character"))),
+         pid = str_match(filename, ".*/(.*)/z_all_sensor_features.csv")[,2]) %>%
+  unnest(cols = c(file_contents)) %>%
+  select(-filename)
+
+write.csv(features_of_all_participants, snakemake@output[[1]], row.names = FALSE)
--- a/src/features/utils/utils.R
+++ b/src/features/utils/utils.R
@ -83,4 +83,14 @@ fetch_provider_features <- function(provider, provider_key, sensor_key, sensor_d
                                                "(.*)#(.*),(.*)", 
                                                remove = FALSE)
    return(sensor_features)
-}
+}
+
+run_provider_cleaning_script <- function(provider, provider_key, sensor_key, sensor_data_files){
+  source(provider[["SRC_SCRIPT"]])
+  print(paste(rapids_log_tag, "Processing", sensor_key, provider_key))
+  
+  cleaning_function <- match.fun(paste0(tolower(provider_key), "_cleaning"))
+  sensor_features <- cleaning_function(sensor_data_files, provider)
+
+  return(sensor_features)
+}