Add aware_mysql_split phone data stream

2021-06-02 12:10:31 -04:00
731 changed files with 2024 additions and 32648 deletions
--- a/.gitattributes
+++ b/.gitattributes
@ -1,7 +0,0 @@
 # We'll let Git's auto-detection algorithm infer if a file is text. If it is,
 # enforce LF line endings regardless of OS or git configurations.
 * text=auto eol=lf
 # Isolate binary files in case the auto-detection algorithm fails and
 # marks them as text files (which could brick them).
 *.{png,jpg,jpeg,gif,webp,woff,woff2} binary
--- a/.gitignore
+++ b/.gitignore
@ -93,17 +93,10 @@ packrat/*
 # exclude data from source control by default
 data/external/*
 !/data/external/empatica/empatica1/E4 Data.zip
 !/data/external/.gitkeep
 !/data/external/stachl_application_genre_catalogue.csv
 !/data/external/timesegments*.csv
 !/data/external/wiki_tz.csv
 !/data/external/main_study_usernames.csv
 !/data/external/timezone.csv
 !/data/external/play_store_application_genre_catalogue.csv
 !/data/external/play_store_categories_count.csv
 data/raw/*
 !/data/raw/.gitkeep
 data/interim/*
@ -121,12 +114,3 @@ settings.dcf
 tests/fakedata_generation/
 site/
 credentials.yaml
 # Docker container and other files
 .devcontainer
 # Calculating features module
 calculatingfeatures/
 # Temp folder for rapids data/external
 rapids_temp_data/
--- a/README.md
+++ b/README.md
@ -11,191 +11,3 @@
 For more information refer to our [documentation](http://www.rapids.science)
 By [MoSHI](https://www.moshi.pitt.edu/), [University of Pittsburgh](https://www.pitt.edu/)
 ## Installation 
 For RAPIDS installation refer to the [documentation](https://www.rapids.science/1.8/setup/installation/)
 ### For the installation of the Docker version
 1. Follow the [instructions](https://www.rapids.science/1.8/setup/installation/) to set up RAPIDS via Docker (from scratch).
 2. Delete the current contents of the /rapids/ folder while in a container session.
    ```
    cd ..
    rm -rf rapids/{*,.*}
    cd rapids
    ```
 3. Clone the RAPIDS workspace from Git and check out a specific branch.
    ```
    git clone "https://repo.ijs.si/junoslukan/rapids.git" .
    git checkout <branch_name>
    ```
 4. Install the missing `libpq-dev` dependency with bash.
    ```
    apt-get update -y
    apt-get install -y libpq-dev
    ```
 5. Restore the R virtual environment (renv).
 Type `R` to start an interactive R session and then run:
    ```
    renv::restore()
    ```
 6. Install cr-features module 
 From: https://repo.ijs.si/matjazbostic/calculatingfeatures.git -> branch master. 
 Then follow the "cr-features module" section below.  
 7. Install all required packages from environment.yml. The `--prune` flag also removes conda packages that are not present in the environment file.
    ```
    conda env update --file environment.yml --prune
    ```
 8. To update your R or Python virtual environments:
    ```
    R in interactive session:
    renv::snapshot()
    Python: 
    conda env export --no-builds | sed 's/^.*libgfortran.*$/  - libgfortran/' | sed 's/^.*mkl=.*$/  - mkl/' >  environment.yml
    ```
 ### cr-features module 
 This RAPIDS extension uses the cr-features library, available [here](https://repo.ijs.si/matjazbostic/calculatingfeatures).
 To use the cr-features library:
 - Follow the installation instructions in the [README.md](https://repo.ijs.si/matjazbostic/calculatingfeatures/-/blob/master/README.md).
 - Copy the built calculatingfeatures folder into the RAPIDS workspace.
 - Install the cr-features package with:
    ```
    pip install path/to/the/calculatingfeatures/folder
    e.g. pip install ./calculatingfeatures if the folder is copied to the main parent directory
    The cr-features package has to be built and installed every time to get the newest version.
    Alternatively, the newest version of the Docker image must be used.
    ```
 ## Updating RAPIDS
 To update RAPIDS, first pull and merge [origin](https://github.com/carissalow/rapids), such as with:
 ```commandline
 git fetch --progress "origin" refs/heads/master
 git merge --no-ff origin/master
 ```
 Next, update the conda and R virtual environments.
 ```bash
 R -e 'renv::restore(repos = c(CRAN = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))'
 ```
 ## Custom configuration
 ### Credentials
 As mentioned under [Database in RAPIDS documentation](https://www.rapids.science/1.6/snippets/database/), a `credentials.yaml` file is needed to connect to a database.
 It should contain:
 ```yaml
 PSQL_STRAW:
  database: staw
  host: 212.235.208.113
  password: password
  port: 5432
  user: staw_db
 ```
 where `password` needs to be specified as well.
 ## Possible installation issues
 ### Missing dependencies for RPostgres
 When installing the `RPostgres` R package (used to connect to the PostgreSQL database), the following error might occur:
 ```text
 ------------------------- ANTICONF ERROR ---------------------------
 Configuration failed because libpq was not found. Try installing:
   * deb: libpq-dev (Debian, Ubuntu, etc)
   * rpm: postgresql-devel (Fedora, EPEL)
   * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
   * csw: postgresql_dev (Solaris)
   * brew: libpq (OSX)
 If libpq is already installed, check that either:
  (i)  'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains a libpq.pc file; or
  (ii) 'pg_config' is in your PATH.
 If neither can detect , you can set INCLUDE_DIR
 and LIB_DIR manually via:
  R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
 --------------------------[ ERROR MESSAGE ]----------------------------
  <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
 compilation terminated.
 ```
 The library requires `libpq` for compiling from source, so install accordingly.
 ### Timezone environment variable for tidyverse (relevant for WSL2)
 One of the R packages, `tidyverse`, might need access to the `TZ` environment variable during installation.
 On Ubuntu 20.04 on WSL2 this triggers the following error:
 ```text
 > install.packages('tidyverse')
 ERROR: configuration failed for package ‘xml2’
 System has not been booted with systemd as init system (PID 1). Can't operate.
 Failed to create bus connection: Host is down
 Warning in system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1
 Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  namespace ‘xml2’ 1.3.1 is already loaded, but >= 1.3.2 is required
 Calls: <Anonymous> ... namespaceImportFrom -> asNamespace -> loadNamespace
 Execution halted
 ERROR: lazy loading failed for package ‘tidyverse’
 ```
 This happens because WSL2 does not use the `timedatectl` service, which provides this variable.
 ```bash
 ~$ timedatectl
 System has not been booted with systemd as init system (PID 1). Can't operate.
 Failed to create bus connection: Host is down
 ```
 and later 
 ```bash 
 Warning message:
 In system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1
 Execution halted
 ```
 This can be amended by setting the environment variable manually before attempting to install `tidyverse`:
 ```bash
 export TZ='Europe/Ljubljana'
 ```
 Note: if this is needed to avoid runtime issues, you need to either define this environment variable in each new terminal window or (better) define it in your `~/.bashrc` or `~/.bash_profile`.
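When the installation is launched from a script rather than an interactive shell, the same workaround can be applied to the child process environment. The following is a minimal sketch, not part of RAPIDS: the helper name `env_with_tz` and its parameters are hypothetical, and the default zone simply mirrors the `export` line above.

```python
import os


def env_with_tz(default_tz="Europe/Ljubljana", environ=None):
    """Return a copy of the environment in which TZ is guaranteed to be set.

    An existing, non-empty TZ value is kept; otherwise default_tz is used.
    `environ` defaults to os.environ and is never modified in place.
    """
    env = dict(os.environ if environ is None else environ)
    if not env.get("TZ"):
        env["TZ"] = default_tz
    return env


if __name__ == "__main__":
    # Show which TZ value a child process would receive.
    print(env_with_tz()["TZ"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the R installation, so that the variable is defined without editing `~/.bashrc`.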
 ## Possible runtime issues
 ### Unix end of line characters
 Upon running rapids, an error might occur:
 ```bash
 /usr/bin/env: ‘python3\r’: No such file or directory
 ```
 This is due to Windows style end of line characters. 
 To amend this, I added a `.gitattributes` file to force `git` to check out `rapids` using Unix EOL characters.
 If this still fails, `dos2unix` can be used to change them.
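If `dos2unix` is not available, the conversion can be approximated in a few lines of Python. This is a minimal sketch under the assumption that the affected files are plain-text scripts; the function name `crlf_to_lf` is hypothetical and not part of RAPIDS.

```python
from pathlib import Path


def crlf_to_lf(path):
    """Rewrite a file in place with Unix (LF) line endings.

    Returns True if the file contained Windows (CRLF) line endings and was
    rewritten, False if it was already clean and left untouched.
    """
    file_path = Path(path)
    raw = file_path.read_bytes()
    fixed = raw.replace(b"\r\n", b"\n")
    if fixed == raw:
        return False
    file_path.write_bytes(fixed)
    return True
```

It works on bytes, so it should only be applied to text files; binary files would be corrupted by the replacement.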
 ### System has not been booted with systemd as init system (PID 1)
 See [the installation issue above](#Timezone-environment-variable-for-tidyverse-(relevant-for-WSL2)).
--- a/64
+++ b/64
@ -5,7 +5,6 @@ include: "rules/common.smk"
 include: "rules/renv.smk"
 include: "rules/preprocessing.smk"
 include: "rules/features.smk"
 include: "rules/models.smk"
 include: "rules/reports.smk"
 import itertools
@ -46,12 +45,7 @@ for provider in config["PHONE_MESSAGES"]["PROVIDERS"].keys():
 for provider in config["PHONE_CALLS"]["PROVIDERS"].keys():
    if config["PHONE_CALLS"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/phone_calls_raw.csv", pid=config["PIDS"]))
-        if (provider == "RAPIDS") and (config["PHONE_CALLS"]["PROVIDERS"][provider]["FEATURES_TYPE"] == "EPISODES"):
+        files_to_compute.extend(expand("data/raw/{pid}/phone_calls_with_datetime.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes_resampled.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
        else:
            files_to_compute.extend(expand("data/raw/{pid}/phone_calls_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_calls_features/phone_calls_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_CALLS"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_calls.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
@ -128,10 +122,6 @@ for provider in config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv", pid=config["PIDS"]))
        if config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["INCLUDE_EPISODE_FEATURES"]:
            files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes_resampled.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_applications_foreground_features/phone_applications_foreground_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_applications_foreground.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
@ -164,25 +154,6 @@ for provider in config["PHONE_CONVERSATION"]["PROVIDERS"].keys():
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
 for provider in config["PHONE_ESM"]["PROVIDERS"].keys():
    if config["PHONE_ESM"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/phone_esm_raw.csv",pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_esm_with_datetime.csv",pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_esm_clean.csv",pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_esm_features/phone_esm_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_ESM"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_esm.csv", pid=config["PIDS"]))
        # files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv",pid=config["PIDS"]))
        # files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
 for provider in config["PHONE_SPEECH"]["PROVIDERS"].keys():
    if config["PHONE_SPEECH"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/phone_speech_raw.csv",pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_speech_with_datetime.csv",pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_speech_features/phone_speech_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_SPEECH"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_speech.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
 # We can delete these if's as soon as we add feature PROVIDERS to any of these sensors
 if isinstance(config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"], dict):
    for provider in config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"].keys():
@ -237,8 +208,7 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys():
        if provider == "BARNETT":
            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_barnett_daily.csv", pid=config["PIDS"]))
        if provider == "DORYAB":
-            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv", pid=config["PIDS"]))
+            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_locations_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed.csv", pid=config["PIDS"]))
@ -407,41 +377,11 @@ if config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_sensor_row_count_per_time_segment.html")
 if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
    if not config["PHONE_DATA_YIELD"]["PROVIDERS"]["RAPIDS"]["COMPUTE"]:
        raise ValueError("Error: [PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] must be True in config.yaml to get heatmaps of overall data yield.")
    files_to_compute.append("reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html")
 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")
 # Data Cleaning
 for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
        if provider == "STRAW":
            files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"]))
        else:
            files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"]))
 for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
        if provider == "STRAW":
            for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
                files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_py_(" + target + ").csv"))
        else:
            files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv"))     
 # Baseline features
 if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]:
    files_to_compute.extend(expand("data/raw/baseline_merged.csv"))
    files_to_compute.extend(expand("data/raw/{pid}/participant_baseline_raw.csv", pid=config["PIDS"]))
    files_to_compute.extend(expand("data/interim/{pid}/baseline_questionnaires.csv", pid=config["PIDS"]))
    files_to_compute.extend(expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]))
 # Targets (labels)
 if config["PARAMS_FOR_ANALYSIS"]["TARGET"]["COMPUTE"]:
    files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
    for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
        files_to_compute.extend(expand("data/processed/models/population_model/input_" + target + ".csv"))
 rule all:
    input:
--- a/init.py
+++ b/init.py
--- a/automl_test.py
+++ b/automl_test.py
@ -1,57 +0,0 @@
 from pprint import pprint
 import sklearn.metrics
 import autosklearn.regression
 import datetime
 import importlib
 import os
 import sys
 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd
 import seaborn as sns
 import yaml
 from sklearn import linear_model, svm, kernel_ridge, gaussian_process
 from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split
 from sklearn.metrics import mean_squared_error, r2_score
 from sklearn.impute import SimpleImputer
 model_input = pd.read_csv("data/processed/models/population_model/input_PANAS_negative_affect_mean.csv") # Standardized data
 model_input.dropna(axis=1, how="all", inplace=True)
 model_input.dropna(axis=0, how="any", subset=["target"], inplace=True)
 categorical_feature_colnames = ["gender", "startlanguage"]
 categorical_feature_colnames += [col for col in model_input.columns if "mostcommonactivity" in col or "homelabel" in col]
 categorical_features = model_input[categorical_feature_colnames].copy()
 mode_categorical_features = categorical_features.mode().iloc[0]
 categorical_features = categorical_features.fillna(mode_categorical_features)
 categorical_features = categorical_features.apply(lambda col: col.astype("category"))
 if not categorical_features.empty:
    categorical_features = pd.get_dummies(categorical_features)
 numerical_features = model_input.drop(categorical_feature_colnames, axis=1)
 model_in = pd.concat([numerical_features, categorical_features], axis=1)
 index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]
 model_in.set_index(index_columns, inplace=True)
 X_train, X_test, y_train, y_test = train_test_split(model_in.drop(["target", "pid"], axis=1), model_in["target"], test_size=0.30)
 automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=7200,
    per_run_time_limit=120
 )
 automl.fit(X_train, y_train, dataset_name='straw')
 print(automl.leaderboard())
 pprint(automl.show_models(), indent=4)
 train_predictions = automl.predict(X_train)
 print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
 test_predictions = automl.predict(X_test)
 print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
 import sys
 sys.exit()
--- a/config.yaml
+++ b/config.yaml
@ -3,17 +3,16 @@
 ########################################################################################################################
 # See https://www.rapids.science/latest/setup/configuration/#participant-files
-PIDS: ['p031', 'p032', 'p033', 'p034', 'p035', 'p036', 'p037', 'p038', 'p039', 'p040', 'p042', 'p043', 'p044', 'p045', 'p046', 'p049', 'p050', 'p052', 'p053', 'p054', 'p055', 'p057', 'p058', 'p059', 'p060', 'p061', 'p062', 'p064', 'p067', 'p068', 'p069', 'p070', 'p071', 'p072', 'p073', 'p074', 'p075', 'p076', 'p077', 'p078', 'p079', 'p080', 'p081', 'p082', 'p083', 'p084', 'p085', 'p086', 'p088', 'p089', 'p090', 'p091', 'p092', 'p093', 'p106', 'p107']
+PIDS: [test01]
 # See https://www.rapids.science/latest/setup/configuration/#automatic-creation-of-participant-files
 CREATE_PARTICIPANT_FILES:
-  USERNAMES_CSV: "data/external/main_study_usernames.csv"
+  CSV_FILE_PATH: "data/external/example_participants.csv" # see docs for required format
  CSV_FILE_PATH: "data/external/main_study_participants.csv" # see docs for required format
  PHONE_SECTION:
    ADD: True
    IGNORED_DEVICE_IDS: []
  FITBIT_SECTION:
-    ADD: False
+    ADD: True
    IGNORED_DEVICE_IDS: []
  EMPATICA_SECTION:
    ADD: True
@ -21,25 +20,19 @@ CREATE_PARTICIPANT_FILES:
 # See https://www.rapids.science/latest/setup/configuration/#time-segments
 TIME_SEGMENTS: &time_segments
-  TYPE: EVENT # FREQUENCY, PERIODIC, EVENT
+  TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT
-  FILE: "data/external/straw_events.csv"
+  FILE: "data/external/timesegments_periodic.csv"
-  INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs
+  INCLUDE_PAST_PERIODIC_SEGMENTS: FALSE # Only relevant if TYPE=PERIODIC, see docs
  TAILORED_EVENTS: # Only relevant if TYPE=EVENT
    COMPUTE: True
    SEGMENTING_METHOD: "30_before" # 30_before, 90_before, stress_event
    INTERVAL_OF_INTEREST: 10 # duration of event of interest [minutes]
    IOI_ERROR_TOLERANCE: 5 # interval of interest error tolerance (before and after IOI) [minutes]
 # See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study
 TIMEZONE: 
-    TYPE: MULTIPLE
+    TYPE: SINGLE
    SINGLE:
-      TZCODE: Europe/Ljubljana
+      TZCODE: America/New_York
    MULTIPLE:
-      TZ_FILE: data/external/timezone.csv
+      TZCODES_FILE: data/external/multiple_timezones_example.csv
-      TZCODES_FILE: data/external/multiple_timezones.csv
+      IF_MISSING_TZCODE: STOP
-      IF_MISSING_TZCODE: USE_DEFAULT
+      DEFAULT_TZCODE: America/New_York
      DEFAULT_TZCODE: Europe/Ljubljana
      FITBIT: 
        ALLOW_MULTIPLE_TZ_PER_DEVICE: False
        INFER_FROM_SMARTPHONE_TZ: False
@ -50,19 +43,19 @@ TIMEZONE:
 # See https://www.rapids.science/latest/setup/configuration/#data-stream-configuration
 PHONE_DATA_STREAMS:
-  USE: aware_postgresql
+  USE: aware_mysql
  # AVAILABLE:
-  aware_mysql: 
+  aware_mysql: # one table per sensor with all participants' data
    DATABASE_GROUP: MY_GROUP
-  aware_postgresql:
+  aware_mysql_split: # one table per sensor per participant
-    DATABASE_GROUP: PSQL_STRAW
+    DATABASE_GROUP: MY_GROUP
-  aware_csv:
+  aware_csv: # one CSV file per sensor with all participants' data
    FOLDER: data/external/aware_csv
-  aware_influxdb: 
+  aware_influxdb: # one table per sensor with all participants' data
    DATABASE_GROUP: MY_GROUP
 # Sensors ------
@ -75,6 +68,7 @@ PHONE_ACCELEROMETER:
      COMPUTE: False
      FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
      SRC_SCRIPT: src/features/phone_accelerometer/rapids/main.py
    PANDA:
      COMPUTE: False
      VALID_SENSED_MINUTES: False
@ -86,12 +80,12 @@ PHONE_ACCELEROMETER:
 # See https://www.rapids.science/latest/features/phone-activity-recognition/
 PHONE_ACTIVITY_RECOGNITION:
  CONTAINER: 
-    ANDROID: google_ar
+    ANDROID: plugin_google_activity_recognition
    IOS: plugin_ios_activity_recognition
  EPISODE_THRESHOLD_BETWEEN_ROWS: 5 # minutes. Max time difference for two consecutive rows to be considered within the same AR episode.
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: ["count", "mostcommonactivity", "countuniqueactivities", "durationstationary", "durationmobile", "durationvehicle"]
      ACTIVITY_CLASSES:
        STATIONARY: ["still", "tilting"]
@ -104,52 +98,35 @@ PHONE_APPLICATIONS_CRASHES:
  CONTAINER: applications_crashes
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
-    CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
+    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
-    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
-    SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
+    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
 # See https://www.rapids.science/latest/features/phone-applications-foreground/
 PHONE_APPLICATIONS_FOREGROUND:
-  CONTAINER: applications
+  CONTAINER: applications_foreground
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
-    CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
+    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
-    # Refer to data/external/play_store_categories_count.csv for a list of categories (genres) and their frequency.
+    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
-    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
+    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
    SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
-      INCLUDE_EPISODE_FEATURES: True
+      SINGLE_CATEGORIES: ["all", "email"]
      SINGLE_CATEGORIES: ["Productivity", "Tools", "Communication", "Education", "Social"]
      MULTIPLE_CATEGORIES:
-        games: ["Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing"]
+        social: ["socialnetworks", "socialmediatools"]
-        social: ["Communication", "Social", "Dating"]
+        entertainment: ["entertainment", "gamingknowledge", "gamingcasual", "gamingadventure", "gamingstrategy", "gamingtoolscommunity", "gamingroleplaying", "gamingaction", "gaminglogic", "gamingsports", "gamingsimulation"]
-        productivity: ["Tools", "Productivity", "Finance", "Education", "News & Magazines", "Business", "Books & Reference"]
+      SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
-        health: ["Health & Fitness", "Lifestyle", "Food & Drink", "Sports", "Medical", "Parenting"]
+      EXCLUDED_CATEGORIES: []
-        entertainment: ["Shopping", "Music & Audio", "Entertainment", "Travel & Local", "Photography", "Video Players & Editors", "Personalization", "House & Home", "Art & Design", "Auto & Vehicles", "Entertainment,Music & Video",
+      EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
-                        "Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing" # Add all games.
+      FEATURES: ["count", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
        ]
        maps_weather: ["Maps & Navigation", "Weather"]
      CUSTOM_CATEGORIES:
      SINGLE_APPS: []
      EXCLUDED_CATEGORIES: ["System", "STRAW"]
      # Note: A special option here is "is_system_app".
      # This excludes applications that have is_system_app = TRUE, which is a separate column in the table.
      # However, all of these applications have been assigned System category.
      # I will therefore filter by that category, which is a superset and is more complete. JL
      EXCLUDED_APPS: []
      FEATURES: 
        APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
        APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
      IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable
      SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py
 # See https://www.rapids.science/latest/features/phone-applications-notifications/
 PHONE_APPLICATIONS_NOTIFICATIONS:
-  CONTAINER: notifications
+  CONTAINER: applications_notifications
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
@ -163,7 +140,7 @@ PHONE_BATTERY:
  EPISODE_THRESHOLD_BETWEEN_ROWS: 30 # minutes. Max time difference for two consecutive rows to be considered within the same battery episode.
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: ["countdischarge", "sumdurationdischarge", "countcharge", "sumdurationcharge", "avgconsumptionrate", "maxconsumptionrate"]
      SRC_SCRIPT: src/features/phone_battery/rapids/main.py
@ -177,7 +154,7 @@ PHONE_BLUETOOTH:
      SRC_SCRIPT: src/features/phone_bluetooth/rapids/main.R
    DORYAB:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: 
        ALL: 
            DEVICES: ["countscans", "uniquedevices", "meanscans", "stdscans"]
@ -195,11 +172,10 @@ PHONE_BLUETOOTH:
 # See https://www.rapids.science/latest/features/phone-calls/
 PHONE_CALLS:
-  CONTAINER: call
+  CONTAINER: calls
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES_TYPE: EPISODES # EVENTS or EPISODES
      CALL_TYPES: [missed, incoming, outgoing]
      FEATURES:
        missed:  [count, distinctcontacts, timefirstcall, timelastcall, countmostfrequentcontact]
@ -208,7 +184,7 @@ PHONE_CALLS:
      SRC_SCRIPT: src/features/phone_calls/rapids/main.R
 # See https://www.rapids.science/latest/features/phone-conversation/
-PHONE_CONVERSATION: # TODO Adapt for speech
+PHONE_CONVERSATION:
  CONTAINER: 
    ANDROID: plugin_studentlife_audio_android
    IOS: plugin_studentlife_audio
@ -227,35 +203,14 @@ PHONE_CONVERSATION: # TODO Adapt for speech
 # See https://www.rapids.science/latest/features/phone-data-yield/
 PHONE_DATA_YIELD:
-  SENSORS: [#PHONE_ACCELEROMETER,
+  SENSORS: []
            PHONE_ACTIVITY_RECOGNITION,
            PHONE_APPLICATIONS_FOREGROUND,
            PHONE_APPLICATIONS_NOTIFICATIONS,
            PHONE_BATTERY,
            PHONE_BLUETOOTH,
            PHONE_CALLS,
            PHONE_LIGHT,
            PHONE_LOCATIONS,
            PHONE_MESSAGES,
            PHONE_SCREEN,
            PHONE_WIFI_VISIBLE]
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: [ratiovalidyieldedminutes, ratiovalidyieldedhours]
      MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: 0.5 # 0 to 1, minimum percentage of valid minutes in an hour to be considered valid.
      SRC_SCRIPT: src/features/phone_data_yield/rapids/main.R
 PHONE_ESM:
  CONTAINER: esm
  PROVIDERS:
    STRAW:
      COMPUTE: True
      SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support", 
              "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
      FEATURES: [mean]
      SRC_SCRIPT: src/features/phone_esm/straw/main.py
 # See https://www.rapids.science/latest/features/phone-keyboard/
 PHONE_KEYBOARD:
  CONTAINER: keyboard
@ -267,10 +222,10 @@ PHONE_KEYBOARD:
 # See https://www.rapids.science/latest/features/phone-light/
 PHONE_LIGHT:
-  CONTAINER: light_sensor
+  CONTAINER: light
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"]
      SRC_SCRIPT: src/features/phone_light/rapids/main.py
@ -280,12 +235,12 @@ PHONE_LOCATIONS:
  LOCATIONS_TO_USE: ALL_RESAMPLED # ALL, GPS, ALL_RESAMPLED, OR FUSED_RESAMPLED
  FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD: 30 # minutes, only replicate location samples to the next sensed bin if the phone did not stop collecting data for more than this threshold
  FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION: 720 # minutes, only replicate location samples to consecutive sensed bins if they were logged within this threshold after a valid location row
  ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy equal or higher than this. This number means there's a 68% probability the true location is within this radius
  PROVIDERS:
    DORYAB:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome", "homelabel"]
      ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      DBSCAN_EPS: 100 # meters
      DBSCAN_MINSAMPLES: 5
      THRESHOLD_STATIC : 1 # km/h
@ -299,8 +254,9 @@ PHONE_LOCATIONS:
      SRC_SCRIPT: src/features/phone_locations/doryab/main.py
    BARNETT:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
      ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
      MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates grouped by minute) were used to compute features
      SRC_SCRIPT: src/features/phone_locations/barnett/main.R
@ -314,10 +270,10 @@ PHONE_LOG:
 # See https://www.rapids.science/latest/features/phone-messages/
 PHONE_MESSAGES:
-  CONTAINER: sms
+  CONTAINER: messages
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      MESSAGES_TYPES : [received, sent]
      FEATURES: 
        received: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact]
@ -329,23 +285,14 @@ PHONE_SCREEN:
  CONTAINER: screen
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      REFERENCE_HOUR_FIRST_USE: 0
      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
-      IGNORE_EPISODES_LONGER_THAN: 360 # in minutes, set to 0 to disable
+      IGNORE_EPISODES_LONGER_THAN: 0 # in minutes, set to 0 to disable
      FEATURES: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration", "firstuseafter"] # "episodepersensedminutes" needs to be added later
      EPISODE_TYPES: ["unlock"]
      SRC_SCRIPT: src/features/phone_screen/rapids/main.py
 # Custom added sensor
 PHONE_SPEECH:
  CONTAINER: speech
  PROVIDERS:
    STRAW:
      COMPUTE: True
      FEATURES: ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
      SRC_SCRIPT: src/features/phone_speech/straw/main.py
 # See https://www.rapids.science/latest/features/phone-wifi-connected/
 PHONE_WIFI_CONNECTED:
  CONTAINER: sensor_wifi
@ -360,7 +307,7 @@ PHONE_WIFI_VISIBLE:
  CONTAINER: wifi
  PROVIDERS:
    RAPIDS:
-      COMPUTE: True
+      COMPUTE: False
      FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
      SRC_SCRIPT: src/features/phone_wifi_visible/rapids/main.R
@ -463,6 +410,7 @@ FITBIT_SLEEP_INTRADAY:
        UNIFIED: [awake, asleep]
      SLEEP_TYPES: [main, nap, all]
      SRC_SCRIPT: src/features/fitbit_sleep_intraday/rapids/main.py
    PRICE:
      COMPUTE: False
      FEATURES: [avgduration, avgratioduration, avgstarttimeofepisodemain, avgendtimeofepisodemain, avgmidpointofepisodemain, stdstarttimeofepisodemain, stdendtimeofepisodemain, stdmidpointofepisodemain, socialjetlag, rmssdmeanstarttimeofepisodemain, rmssdmeanendtimeofepisodemain, rmssdmeanmidpointofepisodemain, rmssdmedianstarttimeofepisodemain, rmssdmedianendtimeofepisodemain, rmssdmedianmidpointofepisodemain]
@ -498,15 +446,13 @@ FITBIT_STEPS_INTRADAY:
    RAPIDS:
      COMPUTE: False
      FEATURES:
-        STEPS: ["sum", "max", "min", "avg", "std", "firststeptime", "laststeptime"]
+        STEPS: ["sum", "max", "min", "avg", "std"]
        SEDENTARY_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
        ACTIVE_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
      REFERENCE_HOUR: 0
      THRESHOLD_ACTIVE_BOUT: 10 # steps
      INCLUDE_ZERO_STEP_ROWS: False
      SRC_SCRIPT: src/features/fitbit_steps_intraday/rapids/main.py
 ########################################################################################################################
 #                                                 EMPATICA                                                             #
 ########################################################################################################################
@ -528,15 +474,6 @@ EMPATICA_ACCELEROMETER:
      COMPUTE: False
      FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
      SRC_SCRIPT: src/features/empatica_accelerometer/dbdp/main.py
    CR:
      COMPUTE: True
      FEATURES: ["totalMagnitudeBand", "absoluteMeanBand", "varianceBand"] # Acc features
      WINDOWS:
        COMPUTE: True
        WINDOW_LENGTH: 15 # specify window length in seconds
        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
      SRC_SCRIPT: src/features/empatica_accelerometer/cr/main.py
 # See https://www.rapids.science/latest/features/empatica-heartrate/
 EMPATICA_HEARTRATE:
@ -555,15 +492,6 @@ EMPATICA_TEMPERATURE:
      COMPUTE: False
      FEATURES: ["maxtemp", "mintemp", "avgtemp", "mediantemp", "modetemp", "stdtemp", "diffmaxmodetemp", "diffminmodetemp", "entropytemp"]
      SRC_SCRIPT: src/features/empatica_temperature/dbdp/main.py
    CR:
      COMPUTE: True
      FEATURES: ["maximum", "minimum", "meanAbsChange", "longestStrikeAboveMean", "longestStrikeBelowMean", 
                  "stdDev", "median", "meanChange", "sumSquared", "squareSumOfComponent", "sumOfSquareComponents"]
      WINDOWS:
        COMPUTE: True
        WINDOW_LENGTH: 300 # specify window length in seconds
        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
      SRC_SCRIPT: src/features/empatica_temperature/cr/main.py
 # See https://www.rapids.science/latest/features/empatica-electrodermal-activity/
 EMPATICA_ELECTRODERMAL_ACTIVITY:
@ -573,19 +501,6 @@ EMPATICA_ELECTRODERMAL_ACTIVITY:
      COMPUTE: False
      FEATURES: ["maxeda", "mineda", "avgeda", "medianeda", "modeeda", "stdeda", "diffmaxmodeeda", "diffminmodeeda", "entropyeda"]
      SRC_SCRIPT: src/features/empatica_electrodermal_activity/dbdp/main.py
    CR:
      COMPUTE: True
      FEATURES: ['mean', 'std', 'q25', 'q75', 'qd', 'deriv', 'power', 'numPeaks', 'ratePeaks', 'powerPeaks', 'sumPosDeriv', 'propPosDeriv', 'derivTonic', 
                  'sigTonicDifference', 'freqFeats','maxPeakAmplitudeChangeBefore', 'maxPeakAmplitudeChangeAfter', 'avgPeakAmplitudeChangeBefore', 
                  'avgPeakAmplitudeChangeAfter', 'avgPeakChangeRatio', 'maxPeakIncreaseTime', 'maxPeakDecreaseTime', 'maxPeakDuration', 'maxPeakChangeRatio',
                  'avgPeakIncreaseTime', 'avgPeakDecreaseTime', 'avgPeakDuration', 'signalOverallChange', 'changeDuration', 'changeRate', 'significantIncrease', 
                  'significantDecrease']
      WINDOWS:
        COMPUTE: True
        WINDOW_LENGTH: 60 # specify window length in seconds
        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', count_windows, eda_num_peaks_non_zero]
        IMPUTE_NANS: True
      SRC_SCRIPT: src/features/empatica_electrodermal_activity/cr/main.py
 # See https://www.rapids.science/latest/features/empatica-blood-volume-pulse/
 EMPATICA_BLOOD_VOLUME_PULSE:
@ -595,15 +510,6 @@ EMPATICA_BLOOD_VOLUME_PULSE:
      COMPUTE: False
      FEATURES: ["maxbvp", "minbvp", "avgbvp", "medianbvp", "modebvp", "stdbvp", "diffmaxmodebvp", "diffminmodebvp", "entropybvp"]
      SRC_SCRIPT: src/features/empatica_blood_volume_pulse/dbdp/main.py
    CR:
      COMPUTE: False
      FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
                  'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
      WINDOWS:
        COMPUTE: True
        WINDOW_LENGTH: 300 # specify window length in seconds
        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
      SRC_SCRIPT: src/features/empatica_blood_volume_pulse/cr/main.py
 # See https://www.rapids.science/latest/features/empatica-inter-beat-interval/
 EMPATICA_INTER_BEAT_INTERVAL:
@ -613,16 +519,6 @@ EMPATICA_INTER_BEAT_INTERVAL:
      COMPUTE: False
      FEATURES: ["maxibi", "minibi", "avgibi", "medianibi", "modeibi", "stdibi", "diffmaxmodeibi", "diffminmodeibi", "entropyibi"]
      SRC_SCRIPT: src/features/empatica_inter_beat_interval/dbdp/main.py
    CR:
      COMPUTE: True
      FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
                  'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features            
      PATCH_WITH_BVP: True
      WINDOWS:
        COMPUTE: True
        WINDOW_LENGTH: 300 # specify window length in seconds
        SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
      SRC_SCRIPT: src/features/empatica_inter_beat_interval/cr/main.py
 # See https://www.rapids.science/latest/features/empatica-tags/
 EMPATICA_TAGS:
@ -663,96 +559,3 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
  CORR_THRESHOLD: 0.1
  CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}
 ########################################################################################################################
 #                                                    Data Cleaning                                                     #
 ########################################################################################################################
 ALL_CLEANING_INDIVIDUAL:
  PROVIDERS:
    RAPIDS:
      COMPUTE: False
      IMPUTE_SELECTED_EVENT_FEATURES:
        COMPUTE: False
        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
      COLS_NAN_THRESHOLD: 1 # set to 1 to disable
      COLS_VAR_THRESHOLD: True
      ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
      DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
      DROP_HIGHLY_CORRELATED_FEATURES:
        COMPUTE: True
        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
        CORR_THRESHOLD: 0.95
      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
    STRAW:
      COMPUTE: True
      PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
      PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
      EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
      ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
      COLS_NAN_THRESHOLD: 0.9 # set to 1 to remove only columns that contains all (100% of) NaN
      COLS_VAR_THRESHOLD: True
      DROP_HIGHLY_CORRELATED_FEATURES:
        COMPUTE: True
        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
        CORR_THRESHOLD: 0.95
      STANDARDIZATION: True
      SRC_SCRIPT: src/features/all_cleaning_individual/straw/main.py
 ALL_CLEANING_OVERALL:
  PROVIDERS:
    RAPIDS:
      COMPUTE: False
      IMPUTE_SELECTED_EVENT_FEATURES:
        COMPUTE: False
        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
      COLS_NAN_THRESHOLD: 1 # set to 1 to disable
      COLS_VAR_THRESHOLD: True
      ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
      DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
      DROP_HIGHLY_CORRELATED_FEATURES:
        COMPUTE: True
        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
        CORR_THRESHOLD: 0.95
      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
    STRAW:
      COMPUTE: True
      PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
      PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
      EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
      ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
      COLS_NAN_THRESHOLD: 0.8 # set to 1 to remove only columns that contains all (100% of) NaN
      COLS_VAR_THRESHOLD: True
      DROP_HIGHLY_CORRELATED_FEATURES:
        COMPUTE: True
        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
        CORR_THRESHOLD: 0.95
      STANDARDIZATION: True
      TARGET_STANDARDIZATION: False
      SRC_SCRIPT: src/features/all_cleaning_overall/straw/main.py
 ########################################################################################################################
 #                                                      Baseline                                                        #
 ########################################################################################################################
 PARAMS_FOR_ANALYSIS:
  BASELINE:
    COMPUTE: True
    FOLDER: data/external/baseline
    CONTAINER: [results-survey637813_final.csv,  # Slovenia
                results-survey358134_final.csv,  # Belgium 1
                results-survey413767_final.csv  # Belgium 2
    ]
    QUESTION_LIST: survey637813+question_text.csv
    FEATURES: [age, gender, startlanguage, limesurvey_demand, limesurvey_control, limesurvey_demand_control_ratio, limesurvey_demand_control_ratio_quartile]
    CATEGORICAL_FEATURES: [gender]
  TARGET:
    COMPUTE: True
    LABEL: appraisal_stressfulness_event_mean
    ALL_LABELS: [PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, JCQ_coworker_support_mean, appraisal_stressfulness_period_mean]
                # PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, 
                # JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean
--- a/data/external/aware_csv/calls.csv
+++ b/data/external/aware_csv/calls.csv
@ -1,9 +0,0 @@
 "_id","timestamp","device_id","call_type","call_duration","trace"
 1,1587663260695,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,14,"d5e84f8af01b2728021d4f43f53a163c0c90000c"
 2,1587739118007,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"47c125dc7bd163b8612cdea13724a814917b6e93"
 5,1587746544891,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,95,"9cc793ffd6e88b1d850ce540b5d7e000ef5650d4"
 6,1587911379859,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,63,"51fb9344e988049a3fec774c7ca622358bf80264"
 7,1587992647361,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"2a862a7730cfdfaf103a9487afe3e02935fd6e02"
 8,1588020039448,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",1,11,"a2c53f6a086d98622c06107780980cf1bb4e37bd"
 11,1588176189024,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,65,"56589df8c830c70e330b644921ed38e08d8fd1f3"
 12,1588197745079,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"cab458018a8ed3b626515e794c70b6f415318adc"
--- a/data/external/empatica/empatica1/E4
+++ b/data/external/empatica/empatica1/E4
--- a/data/external/main_study_usernames.csv
+++ b/data/external/main_study_usernames.csv
@ -1,57 +0,0 @@
 label,empatica_id
 uploader_79170,A0245B
 uploader_89788,A02731
 uploader_68294,A02705
 uploader_92856,A024AF
 uploader_23726,A0231C
 uploader_66620,A02305
 uploader_58435,A026B5
 uploader_87801,A022A8
 uploader_96055,A027BA
 uploader_69549,A0226C
 uploader_26363,A0263D
 uploader_72010,A023FA
 uploader_13997,A024AF
 uploader_31156,A02305
 uploader_63187,A027BA
 uploader_94821,A022A8
 uploader_65413,A023F1;A023FA
 uploader_36488,A02713
 uploader_91087,A0231C
 uploader_35174,A025D1
 uploader_73880,A02705
 uploader_78650,A02731
 uploader_70578,A0245B
 uploader_88313,A02736
 uploader_58482,A0261A
 uploader_80601,A027BA
 uploader_93729,A0226C
 uploader_61663,A0245B
 uploader_80848,A025D1
 uploader_57312,A023F9;A02361;A027A0
 uploader_52087,A02666
 uploader_98770,A02953
 uploader_51327,A0245F
 uploader_11737,A02732
 uploader_77440,A0264E
 uploader_57277,A02422
 uploader_13098,A026E5
 uploader_80719,A023C8
 uploader_54698,A02953
 uploader_95571,A02853
 uploader_21880,A024DC
 uploader_92905,A02920
 uploader_12108,A023F4
 uploader_17436,A026E5
 uploader_58440,A0273F
 uploader_22172,A0245F
 uploader_39250,A02422
 uploader_15311,A023F9
 uploader_45766,A02920
 uploader_23096,A02361
 uploader_78243,A02422
 uploader_58777,A0245F
 uploader_82941,A02666
 uploader_89606,A023F4
 uploader_82969,A023C8
 uploader_53573,A024DC;A02361
--- a/data/external/participant_files/p01.yaml
+++ b/data/external/participant_files/p01.yaml
@ -1,11 +0,0 @@
 PHONE:
  DEVICE_IDS: [4b62a655-cbf0-4ac0-a448-06726f45b56a]
  PLATFORMS: [android]
  LABEL: uploader_53573
  START_DATE: 2021-05-21 09:21:24
  END_DATE: 2021-07-12 17:32:07
 EMPATICA:
  DEVICE_IDS: [uploader_53573]
  LABEL: uploader_53573
  START_DATE: 2021-05-21 09:21:24
  END_DATE: 2021-07-12 17:32:07
--- a/data/external/play_store_application_genre_catalogue.csv
+++ b/data/external/play_store_application_genre_catalogue.csv
--- a/data/external/play_store_categories_count.csv
+++ b/data/external/play_store_categories_count.csv
@ -1,45 +0,0 @@
 genre,n
 System,261
 Tools,96
 Productivity,71
 Health & Fitness,60
 Finance,54
 Communication,39
 Music & Audio,39
 Shopping,38
 Lifestyle,33
 Education,28
 News & Magazines,24
 Maps & Navigation,23
 Entertainment,21
 Business,18
 Travel & Local,18
 Books & Reference,16
 Social,16
 Weather,16
 Food & Drink,14
 Sports,14
 Other,13
 Photography,13
 Puzzle,13
 Video Players & Editors,12
 Card,9
 Casual,9
 Personalization,8
 Medical,7
 Board,5
 Strategy,4
 House & Home,3
 Trivia,3
 Word,3
 Adventure,2
 Art & Design,2
 Auto & Vehicles,2
 Dating,2
 Role Playing,2
 STRAW,2
 Simulation,2
 "Board,Brain Games",1
 "Entertainment,Music & Video",1
 Parenting,1
 Racing,1
--- a/data/external/timesegments_daily.csv
+++ b/data/external/timesegments_daily.csv
@ -1,3 +0,0 @@
 label,start_time,length,repeats_on,repeats_value
 daily,04:00:00,23H 59M 59S,every_day,0
 working_day,04:00:00,18H 00M 00S,every_day,0
--- a/data/external/timesegments_frequency.csv
+++ b/data/external/timesegments_frequency.csv
@ -1,2 +1,2 @@
 label,length
-fiveminutes,5
+thirtyminutes,30
--- a/data/external/timesegments_periodic.csv
+++ b/data/external/timesegments_periodic.csv
@ -1,2 +1,9 @@
 label,start_time,length,repeats_on,repeats_value
-daily,00:00:00,23H 59M 59S,every_day,0
+threeday,00:00:00,2D 23H 59M 59S,every_day,0
 daily, 00:00:00,23H 59M 59S, every_day, 0
 morning,06:00:00,5H 59M 59S,every_day,0
 afternoon,12:00:00,5H 59M 59S,every_day,0
 evening,18:00:00,5H 59M 59S,every_day,0
 night,00:00:00,5H 59M 59S,every_day,0
 two_weeks_overlapping,00:00:00,13D 23H 59M 59S,every_day,0
 weekends,00:00:00,2D 23H 59M 59S,wday,5
--- a/data/external/timezone.csv
+++ b/data/external/timezone.csv
--- a/docs/analysis/data-cleaning.md
+++ b/docs/analysis/data-cleaning.md
@ -1,92 +0,0 @@
 Data Cleaning
 =============
 The goal of this module is to perform basic cleaning tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. If you are interested in creating individual models (using each participant's features independently of the others), use [`ALL_CLEANING_INDIVIDUAL`]. If you are interested in creating population models (using everyone's data in the same model), use [`ALL_CLEANING_OVERALL`].
 ## Clean sensor features for individual participants
 !!! info "File Sequence"
    ```bash
    - data/processed/features/{pid}/all_sensor_features.csv
    - data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}.csv
    ```
 ### RAPIDS provider
 Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
 |Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]` | Set to `True` to execute the cleaning tasks described below. You can use the parameters of each task to tweak them or deactivate them|
 |`[IMPUTE_SELECTED_EVENT_FEATURES]`     | Fill NAs with 0 only for event-based features, see table below
 |`[COLS_NAN_THRESHOLD]`                 | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable
 |`[COLS_VAR_THRESHOLD]`                 | Set to `True` to discard columns with zero variance
 |`[ROWS_NAN_THRESHOLD]`                 | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable
 |`[DATA_YIELD_FEATURE]`                 | `RATIO_VALID_YIELDED_HOURS` or `RATIO_VALID_YIELDED_MINUTES`
 |`[DATA_YIELD_RATIO_THRESHOLD]`         | Discard rows with `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature name is determined by `[DATA_YIELD_FEATURE]` parameter. Set to 0 to disable
 |`[DROP_HIGHLY_CORRELATED_FEATURES]`    | Discard highly correlated features, see table below
 Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
 |Parameters                             | Description                                                    |
 |-------------------------------------- |----------------------------------------------------------------|
 |`[COMPUTE]`                            | Set to `True` to fill NAs with 0 for phone event-based features
 |`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Any missing (NA) value of a selected event-based feature in a time segment instance with phone data yield > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` will be replaced with zero. See below for an explanation. |
 Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][DROP_HIGHLY_CORRELATED_FEATURES]`:
 |Parameters                             | Description                                                    |
 |-------------------------------------- |----------------------------------------------------------------|
 |`[COMPUTE]`                            | Set to `True` to drop highly correlated features
 |`[MIN_OVERLAP_FOR_CORR_THRESHOLD]`     | Minimum ratio of observations required per pair of columns (features) to be considered as a valid correlation. 
 |`[CORR_THRESHOLD]` | The absolute values of pair-wise correlations are calculated. If two variables have a valid correlation higher than `[CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation.
 The steps to clean sensor features for individual participants are described below. Currently, only **phone sensors** are considered.
 ??? info "1. Fill NA with 0 for the selected event features."
    Some event features should be zero instead of NA. In this step, we fill those missing features with 0 when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column is higher than the `[IMPUTE_SELECTED_EVENT_FEATURES][MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` parameter. Plugins such as the Activity Recognition sensor are not considered. You can skip this step by setting `[IMPUTE_SELECTED_EVENT_FEATURES][COMPUTE]` to `False`.
    Take the phone calls sensor as an example. If there are no call records during a time segment for a participant, then either (1) the calls sensor was not working during that time segment; or (2) the calls sensor was working and the participant did not have any calls during that time segment. To differentiate these two situations, we assume the selected sensors are working when `phone_data_yield_rapids_ratiovalidyieldedminutes > [MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`.
    The following phone event-based features are considered currently:
      - Application foreground: countevent, countepisode, minduration, maxduration, meanduration, sumduration.
      - Battery: all features.
      - Calls: count, distinctcontacts, sumduration, minduration, maxduration, meanduration, modeduration.
      - Keyboard: sessioncount, averagesessionlength, changeintextlengthlessthanminusone, changeintextlengthequaltominusone, changeintextlengthequaltoone, changeintextlengthmorethanone, maxtextlength, totalkeyboardtouches.
      - Messages: count, distinctcontacts.
      - Screen: sumduration, maxduration, minduration, avgduration, countepisode.
      - WiFi: all connected and visible features.
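
The imputation rule in step 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the RAPIDS implementation: rows are modeled as dictionaries with `None` standing in for NA, and both the helper `impute_event_features` and the example column `phone_calls_rapids_incoming_count` are hypothetical names chosen for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Data yield column named in the documentation above.
YIELD_COLUMN = "phone_data_yield_rapids_ratiovalidyieldedminutes"


def impute_event_features(rows: List[Row], event_columns: List[str], min_yield: float) -> List[Row]:
    """Return copies of the rows where missing event features become 0
    whenever the row's data yield is strictly greater than min_yield."""
    imputed = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        data_yield = row.get(YIELD_COLUMN)
        # Only impute when the phone is assumed to have been sensing.
        if data_yield is not None and data_yield > min_yield:
            for column in event_columns:
                if new_row.get(column) is None:
                    new_row[column] = 0
        imputed.append(new_row)
    return imputed


# Hypothetical feature column used only for illustration.
CALLS = "phone_calls_rapids_incoming_count"
rows = [
    {YIELD_COLUMN: 0.80, CALLS: None},  # sensor assumed working: NA becomes 0
    {YIELD_COLUMN: 0.10, CALLS: None},  # too little data: NA is kept
]
cleaned = impute_event_features(rows, [CALLS], min_yield=0.33)
```

With the default `MIN_DATA_YIELDED_MINUTES_TO_IMPUTE` of 0.33, only the first row is imputed; the second keeps its missing value because a yield of 0.10 cannot distinguish "no calls" from "sensor not working".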
 ??? info "2. Discard unreliable rows."
    Extracted features might not be reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using the `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_HOURS`) for longer segments. We do not recommend skipping this step, but you can do so by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
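
A rough sketch of the row filter in step 2, again in plain Python and not taken from the RAPIDS source. The function name `discard_unreliable_rows` is hypothetical, and treating a missing data yield value as unreliable is an assumption made for this example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Maps the [DATA_YIELD_FEATURE] option to the column names given above.
YIELD_COLUMNS = {
    "RATIO_VALID_YIELDED_MINUTES": "phone_data_yield_rapids_ratiovalidyieldedminutes",
    "RATIO_VALID_YIELDED_HOURS": "phone_data_yield_rapids_ratiovalidyieldedhours",
}


def discard_unreliable_rows(rows: List[Row], data_yield_feature: str, threshold: float) -> List[Row]:
    """Keep rows whose data yield is at least the threshold; 0 disables the step."""
    if threshold == 0:
        return list(rows)
    column = YIELD_COLUMNS[data_yield_feature]
    # Assumption: a row without a data yield value is treated as unreliable.
    return [row for row in rows if row.get(column) is not None and row[column] >= threshold]


minutes = YIELD_COLUMNS["RATIO_VALID_YIELDED_MINUTES"]
segments = [{minutes: 0.9}, {minutes: 0.2}, {minutes: None}]
reliable = discard_unreliable_rows(segments, "RATIO_VALID_YIELDED_MINUTES", threshold=0.5)
unfiltered = discard_unreliable_rows(segments, "RATIO_VALID_YIELDED_MINUTES", threshold=0)
```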
 ??? info "3. Discard columns (features) with too many missing values."
    In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[COLS_NAN_THRESHOLD]` to 1.
 ??? info "4. Discard columns (features) with zero variance."
    In this step, we discard columns with zero variance. We do not recommend skipping this step, but you can do so by setting `[COLS_VAR_THRESHOLD]` to `False`.
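
Steps 3 and 4 both operate on columns, so they can be sketched together. The snippet below is an illustrative approximation, not the RAPIDS code: the table is a dictionary of column lists with `None` for NA, the helper name is hypothetical, and "zero variance" is taken to mean that all observed values in a column are identical.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[float]]]


def drop_sparse_and_constant_columns(table: Table, cols_nan_threshold: float,
                                     cols_var_threshold: bool) -> Table:
    """Drop columns with too many missing values and, optionally, constant columns."""
    kept = {}
    for name, values in table.items():
        missing_ratio = sum(v is None for v in values) / len(values) if values else 0.0
        # A threshold of 1 disables this check because the ratio never exceeds 1.
        if missing_ratio > cols_nan_threshold:
            continue
        observed = [v for v in values if v is not None]
        # Assumption: a column whose observed values are all equal has zero variance.
        if cols_var_threshold and len(set(observed)) == 1:
            continue
        kept[name] = values
    return kept


features = {
    "a": [1.0, 2.0, None, 4.0],     # 25% missing, varies: kept
    "b": [None, None, None, 1.0],   # 75% missing: dropped
    "c": [5.0, 5.0, 5.0, 5.0],      # constant: dropped
}
filtered = drop_sparse_and_constant_columns(features, cols_nan_threshold=0.5, cols_var_threshold=True)
disabled = drop_sparse_and_constant_columns(features, cols_nan_threshold=1, cols_var_threshold=False)
```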
 ??? info "5. Drop highly correlated features."
    As highly correlated features might not bring additional information and will increase the complexity of a model, we drop them in this step. The absolute values of pair-wise correlations are calculated. Each correlation between two variables is regarded as valid only if the ratio of valid value pairs (i.e. non-NA pairs) is greater than or equal to `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a correlation coefficient higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. This step can be skipped by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
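
The correlation filter in step 5 can be approximated with the simplified greedy sketch below. It is not the RAPIDS implementation: the function names are hypothetical, Pearson correlation is assumed, pairs are visited in column order, and ties in mean absolute correlation are broken by dropping the first variable of the pair.

```python
import math
from typing import Dict, List, Optional

Column = List[Optional[float]]


def pairwise_abs_corr(x: Column, y: Column, min_overlap: float) -> Optional[float]:
    """Absolute Pearson correlation over non-NA pairs, or None if it is not valid."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not x or len(pairs) < 2 or len(pairs) / len(x) < min_overlap:
        return None
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for constant data
    return abs(cov / math.sqrt(var_x * var_y))


def drop_highly_correlated(table: Dict[str, Column], min_overlap: float,
                           corr_threshold: float) -> List[str]:
    """Return the names of the columns that survive the correlation filter."""
    names = list(table)
    corr = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            value = pairwise_abs_corr(table[a], table[b], min_overlap)
            if value is not None:
                corr[(a, b)] = corr[(b, a)] = value

    def mean_abs_corr(name: str) -> float:
        values = [v for (first, _), v in corr.items() if first == name]
        return sum(values) / len(values) if values else 0.0

    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.get((a, b), 0.0) > corr_threshold:
                # Remove the variable that is more correlated with everything else.
                dropped.add(a if mean_abs_corr(a) >= mean_abs_corr(b) else b)
    return [name for name in names if name not in dropped]


data = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],    # perfectly correlated with x
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],     # weakly correlated with x and y
    "w": [1.0, None, None, None, 5.0],  # overlap 0.4 < 0.5: no valid correlation
}
survivors = drop_highly_correlated(data, min_overlap=0.5, corr_threshold=0.95)
```

In this example exactly one of `x` and `y` is removed, while `z` and `w` are kept. `w` survives because its overlap with every other column is below `MIN_OVERLAP_FOR_CORR_THRESHOLD`, so none of its correlations count as valid.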
 ??? info "6. Discard rows with too many missing values."
    In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[ROWS_NAN_THRESHOLD]` to 1. In other words, we discard time segments (e.g. days) that did not have enough data to be considered reliable. This step is similar to step 2, except that the ratio is computed based on NA values instead of a phone data yield threshold.
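
Step 6 mirrors step 3 but works on rows. A minimal sketch under the same assumptions as the earlier examples (rows as dictionaries, `None` for NA, hypothetical helper and feature names):

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def discard_sparse_rows(rows: List[Row], feature_columns: List[str],
                        rows_nan_threshold: float) -> List[Row]:
    """Keep rows whose ratio of missing feature values does not exceed the threshold."""
    kept = []
    for row in rows:
        missing = sum(row.get(column) is None for column in feature_columns)
        ratio = missing / len(feature_columns) if feature_columns else 0.0
        # A threshold of 1 disables the step because the ratio never exceeds 1.
        if ratio <= rows_nan_threshold:
            kept.append(row)
    return kept


columns = ["f1", "f2", "f3"]  # hypothetical feature names
days = [
    {"f1": 1.0, "f2": 2.0, "f3": 3.0},     # 0/3 missing
    {"f1": 1.0, "f2": None, "f3": 3.0},    # 1/3 missing
    {"f1": None, "f2": None, "f3": 3.0},   # 2/3 missing
]
dense = discard_sparse_rows(days, columns, rows_nan_threshold=0.5)
everything = discard_sparse_rows(days, columns, rows_nan_threshold=1)
```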
 ## Clean sensor features for all participants
 !!! info "File Sequence"
    ```bash
    - data/processed/features/all_participants/all_sensor_features.csv
    - data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}.csv
    ```
 ### RAPIDS provider
 Parameters description and the steps are the same as the above [RAPIDS provider](#rapids-provider) section for individual participants.
--- a/docs/change-log.md
+++ b/docs/change-log.md
@ -1,42 +1,4 @@
 # Change Log
 ## v1.8.0
 - Add data stream for AWARE Micro server
 - Fix the NA bug in PHONE_LOCATIONS BARNETT provider
 - Fix the bug of data type for call_duration field
 - Fix the index bug of heatmap_sensors_per_minute_per_time_segment
 ## v1.7.1
 - Update docs for Git Flow section
 - Update RAPIDS paper information
 ## v1.7.0
 - Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider
 - Update tests for Fitbit steps intraday features
 - Add tests for phone battery features
 - Add a data cleaning module to replace NAs with 0 in selected event-based features, discard unreliable rows and columns, discard columns with zero variance, and discard highly correlated columns
 ## v1.6.0
 - Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events
 - Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes
 - Temporarily revert PHONE_LOCATIONS BARNETT provider to use R script
 - Update the default IGNORE_EPISODES_LONGER_THAN to be 6 hours for screen RAPIDS provider
 - Fix the bug of step intraday features when INCLUDE_ZERO_STEP_ROWS is False
 ## v1.5.0
 - Update Barnett location features with faster Python implementation
 - Fix rounding bug in data yield features
 - Add tests for data yield, Fitbit and accelerometer features
 - Small fixes of documentation
 ## v1.4.1
 - Update home page
 - Add PHONE_MESSAGES tests
 ## v1.4.0
 - Add new Application Foreground episode features and tests
 - Update VSCode setup instructions for our Docker container
 - Add tests for phone calls features
 - Add tests for WiFi features and fix a bug that incorrectly counted the most scanned device within the current time segment instances instead of globally
 - Add tests for phone conversation features
 - Add tests for Bluetooth features and choose the most scanned device alphabetically when ties exist
 - Add tests for Activity Recognition features and fix iOS unknown activity parsing
 - Fix Fitbit bug that parsed date-times with the current time zone in rare cases
 - Update the visualizations to be more precise and robust with different time segments.
 - Fix regression crash of the example analysis workflow
 ## v1.3.0
 - Refactor PHONE_LOCATIONS DORYAB provider. Fix bugs and speed up execution by up to 30x
 - New PHONE_KEYBOARD features
--- a/docs/citation.md
+++ b/docs/citation.md
@ -5,10 +5,14 @@
 ## RAPIDS
-If you used RAPIDS, please cite [this paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823).
+If you used RAPIDS, please cite [this paper](https://preprints.jmir.org/preprint/23246).
 !!! cite "RAPIDS et al. citation"
-    Vega, J., Li, M., Aguillera, K., Goel, N., Joshi, E., Khandekar, K., ... & Low, C. A. (2021). Reproducible Analysis Pipeline for Data Streams (RAPIDS): Open-Source Software to Process Data Collected with Mobile Devices. Frontiers in Digital Health, 168.
+    Vega J, Li M, Aguillera K, Goel N, Joshi E, Durica KC, Kunta AR, Low CA
    RAPIDS: Reproducible Analysis Pipeline for Data Streams Collected with Mobile Devices
    JMIR Preprints. 18/08/2020:23246
    DOI: 10.2196/preprints.23246
    URL: https://preprints.jmir.org/preprint/23246
 ## DBDP (all Empatica sensors)
--- a/docs/datastreams/add-new-data-streams.md
+++ b/docs/datastreams/add-new-data-streams.md
@ -31,6 +31,8 @@ PHONE_DATA_STREAMS:
    DATABASE_GROUP: MY_GROUP # users define this group (user, password, host, etc.) in credentials.yaml
 ```
 Secondly, update `tools/config.schema.yaml` including `[*_DATA_STREAMS][properties][USE][enum]` and `[*_DATA_STREAMS][required]`. This is needed to make sure users do not use invalid values in your data stream's `config.yaml` entry by mistake. Take the other streams' entries as examples or check this [guide](../developers/validation-schema-config.md).
 Then implement one or both of the following functions:
 === "pull_data"
--- a/docs/datastreams/aware-micro-mysql.md
+++ b/docs/datastreams/aware-micro-mysql.md
@ -1,15 +0,0 @@
 # `aware_micro_mysql`
 This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework's](https://awareframework.com/) [AWARE Micro](https://github.com/denzilferreira/aware-micro) server and stored in a MySQL database.
 ## Container
 A MySQL database with a table per sensor, each containing the data for all participants. Sensor data is stored in a JSON field called `data` within each table.
 The script to connect and download data from this container is at:
 ```bash
 src/data/streams/aware_micro_mysql/container.R
 ```
 ## Format
 --8<---- "docs/snippets/aware_format.md"
--- a/docs/datastreams/aware-mysql-split.md
+++ b/docs/datastreams/aware-mysql-split.md
@ -0,0 +1,15 @@
 # `aware_mysql_split`
 This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework](https://awareframework.com/) and stored in a MySQL database. This stream is similar to `aware_mysql` except for the way data is stored in the database tables as explained below.
 ## Container
 A MySQL database with a table per sensor **per participant**. RAPIDS assumes such tables' names follow the format `deviceid_sensorname` (for example `a748ee1a-1d0b-4ae9-9074-279a2b6ba524_accelerometer`); if this is not the case, you can modify the SQL query in this stream's `container.R` script. RAPIDS also assumes that an empty table exists for those participants that don’t have data for a specific sensor.
 The script to connect and download data from this container is at:
 ```bash
 src/data/streams/aware_mysql_split/container.R
 ```
 ## Format
 --8<---- "docs/snippets/aware_format.md"
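As an illustration of the naming convention described under Container, the following Python sketch builds the per-participant table name and a query against it. The actual download logic lives in this stream's `container.R`; the function names and the `SELECT *` query below are hypothetical and only demonstrate the `deviceid_sensorname` assumption.

```python
def split_table_name(device_id: str, sensor: str) -> str:
    """Return the table name this stream is assumed to expect: deviceid_sensorname."""
    return f"{device_id}_{sensor}"


def build_query(device_id: str, sensor: str) -> str:
    """Build an illustrative query for one participant's sensor table (hypothetical helper)."""
    table = split_table_name(device_id, sensor)
    # Device ids contain hyphens, so the identifier has to be quoted with backticks in MySQL.
    if "`" in table:
        raise ValueError("table name must not contain backticks")
    return f"SELECT * FROM `{table}`"


query = build_query("a748ee1a-1d0b-4ae9-9074-279a2b6ba524", "accelerometer")
```

If your tables follow a different pattern, only the name-building step would change, which mirrors the advice to adapt the SQL query in `container.R`.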
--- a/docs/datastreams/data-streams-introduction.md
+++ b/docs/datastreams/data-streams-introduction.md
@ -16,7 +16,7 @@ For reference, these are the data streams we currently support:
 | Data Stream | Device | Format | Container | Docs
 |--|--|--|--|--|
 | `aware_mysql`| Phone | AWARE app | MySQL | [link](../aware-mysql)
-| `aware_micro_mysql`| Phone | AWARE Micro server | MySQL | [link](../aware-micro-mysql)
+| `aware_mysql_split`| Phone | AWARE app | MySQL | [link](../aware-mysql-split)
 | `aware_csv`| Phone | AWARE app | CSV files | [link](../aware-csv)
 | `aware_influxdb` (beta)| Phone | AWARE app | InfluxDB | [link](../aware-influxdb)
 | `fitbitjson_mysql`| Fitbit | JSON (per [Fitbit's API](https://dev.fitbit.com/build/reference/web-api/)) | MySQL | [link](../fitbitjson-mysql)
--- a/docs/developers/git-flow.md
+++ b/docs/developers/git-flow.md
@ -127,9 +127,9 @@ git branch -d release/v[NEW_RELEASE]
 ```
 git checkout master
 git merge --ff-only develop
-git push # Unlock the master branch before merging
+git push
 ```
-1. Release happens automatically after passing the tests
+1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_RELEASE]` (remember to add the change log)
 ## Release a Hotfix
 1. Pull the latest master
@ -156,6 +156,6 @@ git branch -d hotfix/v[NEW_HOTFIX]
 ```
 git checkout master
 git merge --ff-only v[NEW_HOTFIX]
-git push # Unlock the master branch before merging
+git push
 ```
-1. Release happens automatically after passing the tests
+1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_HOTFIX]` (remember to add the change log)
--- a/docs/developers/test-cases.md
+++ b/docs/developers/test-cases.md
@ -7,260 +7,158 @@ The following is a list of the sensors that testing is currently available.
 | Sensor                        | Provider | Periodic | Frequency | Event |
 |-------------------------------|----------|----------|-----------|-------|
-| Phone Accelerometer           | Panda    | Y        | Y         | Y     |
+| Phone Accelerometer           | Panda    | N        | N         | N     |
-| Phone Accelerometer           | RAPIDS   | Y        | Y         | Y     |
+| Phone Accelerometer           | RAPIDS   | N        | N         | N     |
-| Phone Activity Recognition    | RAPIDS   | Y        | Y         | Y     |
+| Phone Activity Recognition    | RAPIDS   | N        | N         | N     |
-| Phone Applications Foreground | RAPIDS   | Y        | Y         | Y     |
+| Phone Applications Foreground | RAPIDS   | N        | N         | N     |
-| Phone Battery                 | RAPIDS   | Y        | Y         | Y     |
+| Phone Battery                 | RAPIDS   | Y        | Y         | N     |
-| Phone Bluetooth               | Doryab   | Y        | Y         | Y     |
+| Phone Bluetooth               | Doryab   | N        | N         | N     |
 | Phone Bluetooth               | RAPIDS   | Y        | Y         | Y     |
-| Phone Calls                   | RAPIDS   | Y        | Y         | Y     |
+| Phone Calls                   | RAPIDS   | Y        | Y         | N     |
-| Phone Conversation            | RAPIDS   | Y        | Y         | Y     |
+| Phone Conversation            | RAPIDS   | Y        | Y         | N     |
-| Phone Data Yield              | RAPIDS   | Y        | Y         | Y     |
+| Phone Data Yield              | RAPIDS   | N        | N         | N     |
-| Phone Light                   | RAPIDS   | Y        | Y         | Y     |
+| Phone Light                   | RAPIDS   | Y        | Y         | N     |
-| Phone Locations               | Doryab   | Y        | Y         | Y     |
+| Phone Locations               | Doryab   | N        | N         | N     |
 | Phone Locations               | Barnett  | N        | N         | N     |
-| Phone Messages                | RAPIDS   | Y        | Y         | Y     |
+| Phone Messages                | RAPIDS   | Y        | Y         | N     |
-| Phone Screen                  | RAPIDS   | Y        | Y         | Y     |
+| Phone Screen                  | RAPIDS   | Y        | N         | N     |
-| Phone WiFi Connected          | RAPIDS   | Y        | Y         | Y     |
+| Phone WiFi Connected          | RAPIDS   | Y        | Y         | N     |
-| Phone WiFi Visible            | RAPIDS   | Y        | Y         | Y     |
+| Phone WiFi Visible            | RAPIDS   | Y        | Y         | N     |
 | Fitbit Calories Intraday      | RAPIDS   | Y        | Y         | Y     |
-| Fitbit Data Yield             | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Data Yield             | RAPIDS   | N        | N         | N     |
-| Fitbit Heart Rate Summary     | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Heart Rate Summary     | RAPIDS   | N        | N         | N     |
-| Fitbit Heart Rate Intraday    | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Heart Rate Intraday    | RAPIDS   | N        | N         | N     |
-| Fitbit Sleep Summary          | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Sleep Summary          | RAPIDS   | N        | N         | N     |
 | Fitbit Sleep Intraday         | RAPIDS   | Y        | Y         | Y     |
 | Fitbit Sleep Intraday         | PRICE    | Y        | Y         | Y     |
-| Fitbit Steps Summary          | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Steps Summary          | RAPIDS   | N        | N         | N     |
-| Fitbit Steps Intraday         | RAPIDS   | Y        | Y         | Y     |
+| Fitbit Steps Intraday         | RAPIDS   | N        | N         | N     |
 ## Accelerometer
 Description
 - The raw accelerometer data file, `phone_accelerometer_raw.csv`, contains data for 4 separate days
 - One episode for each daily segment (night, morning, afternoon and evening)
 - Two episodes located in the same 30-min segment (`Fri 00:15:00` and `Fri 00:21:21`)
 - Two episodes located in the same daily segment (`Fri 00:15:00` and `Fri 18:12:00`)
 - One episode before the time switch (`Sun 00:02:00`) and one episode after the time switch (`Sun 04:18:00`)
 - Multiple episodes within one minute, which cause variance in magnitude (`Fri 00:10:25`, `Fri 00:10:27` and `Fri 00:10:46`)
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android, ios|
 |morning|OK|OK|android, ios|
 |daily|OK|OK|android, ios|
 |threeday|OK|OK|android, ios|
 |weekend|OK|OK|android, ios|
 |beforeMarchEvent|OK|OK|android, ios|
 |beforeNovemberEvent|OK|OK|android, ios|
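To make the accelerometer description above concrete, this Python sketch shows why `Fri 00:15:00` and `Fri 00:21:21` share a 30-min segment while `Fri 00:15:00` and `Fri 18:12:00` only share a daily segment. The helper `segment_start` and the calendar date are assumptions for illustration, not part of the RAPIDS test suite.

```python
from datetime import datetime, timedelta


def segment_start(ts: datetime, minutes: int = 30) -> datetime:
    """Floor a timestamp to the start of its fixed-length segment within the day."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    length = timedelta(minutes=minutes)
    return midnight + ((ts - midnight) // length) * length


# Hypothetical date: any Friday works, 2020-03-06 is used only as an example.
first = datetime(2020, 3, 6, 0, 15, 0)
second = datetime(2020, 3, 6, 0, 21, 21)
evening = datetime(2020, 3, 6, 18, 12, 0)

same_30min = segment_start(first) == segment_start(second)
same_30min_as_evening = segment_start(first) == segment_start(evening)
same_day = first.date() == evening.date()
```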
 ## Messages (SMS)
-Description
+-   The raw message data file contains data for 2 separate days.
-
+-   The data for the first day contains 5 records for every
- The raw message data file, `phone_messages_raw.csv`, contains data for 4 separate days
+    `epoch`.
- One episode for each daily segment (night, morning, afternoon and evening)
+-   The second day\'s data contains 6 records for each of only 2
- Two `sent` episodes locate in the same 30-min segment (`Fri 16:08:03.000` and `Fri 16:19:35.000`)
+    `epoch` (currently `morning` and `evening`)
- Two `received` episodes locate in the same 30-min segment (`Sat 06:45:05.000` and `Fri 06:45:05.000`)
+-   The raw message data contains records for both `message_types`
- Two episodes locate in the same daily segment (`Fri 11:57:56.385` and `Sat 10:54:10.000`)
+    (i.e. `received` and `sent`) in both days in all epochs. The
- One episode before the time switch (`Sun 00:48:01.000`) and one episode after the time switch (`Sun 06:21:01.000`)
+    number of records with each `message_types` per epoch is randomly
-
+    distributed. There is at least one record with each
-Checklist
+    `message_types` per epoch.
-
+-   There is one raw message data file each, as described above, for
-|time segment| single tz | multi tz|platform|
+    testing both iOS and Android data.
-|-|-|-|-|
+-   There is also an additional empty data file for both android and
-|30min|OK|OK|android|
+    iOS for testing empty data files
 |morning|OK|OK|android|
 |daily|OK|OK|android|
 |threeday|OK|OK|android|
 |weekend|OK|OK|android|
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## Calls
-Due to the difference in the format of the raw data for iOS and Android the following is the expected results 
+Due to the difference in the format of the raw call data for iOS and Android, the following describes the expected results in `calls_with_datetime_unified.csv`. This gives a better idea of the use cases being tested since `calls_with_datetime_unified.csv` makes the iOS and Android data comparable.
 the `phone_calls.csv`. 
-Description
+-   The call data would contain data for 2 days.
-
+-   The data for the first day contains 6 records for every `epoch`.
- One missed episode, one outgoing episode and one incoming episode on Friday night, morning, afternoon and evening
+-   The second day\'s data contains 6 records for each of only 2
- There is at least one episode of each type of phone calls on each day
+    `epoch` (currently `morning` and `evening`)
- One incoming episode crossing two 30-mins segments
+-   The call data contains records for all `call_types` (i.e.
- One outgoing episode crossing two 30-mins segments
+    `incoming`, `outgoing` and `missed`) in both days in all epochs.
- One missed episode before, during and after the `event`
+    The number of records with each of the `call_types` per epoch is
- There is one incoming episode before, during or after the `event`
+    randomly distributed. There is at least one record with each
- There is one outgoing episode before, during or after the `event`
+    `call_types` per epoch.
- There is one missed episode before, during or after the `event`
+-   There is one call data file each, as described above, for testing
-
+    both iOS and Android data.
-Data format
+-   There is also an additional empty data file for both android and
-
+    iOS for testing empty data files
 | Device | Missed | Outgoing | Incoming |
 |-|-|-|-|
 |android| 3 | 2 | 1 |
 |ios| 1,4 or 3,4 | 3,2,4 | 1,2,4 |
 Note
 When generating test data, all traces for an iOS device need to be unique; otherwise, the episode with the duplicate trace will be dropped.
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android, iOS|
 |morning|OK|OK|android, iOS|
 |daily|OK|OK|android, iOS|
 |threeday|OK|OK|android, iOS|
 |weekend|OK|OK|android, iOS|
 |beforeMarchEvent|OK|OK|android, iOS|
 |beforeNovemberEvent|OK|OK|android, iOS|
 ## Screen
-Due to the difference in the format of the raw screen data for iOS and Android the following is the expected results the `phone_screen.csv`. 
+Due to the difference in the format of the raw screen data for iOS and Android, the following describes the expected results in `screen_deltas.csv`. This gives a better idea of the use cases being tested since `screen_deltas.csv` makes the iOS and Android data comparable. These files are used to calculate the features for the screen sensor.
-Description
+-   The screen delta data file contains data for 1 day.
-
+-   The screen delta data contains 1 record to represent an `unlock`
 - The screen data file contains data for 4 days.
 - The screen data contains 1 record to represent an `unlock`
    episode that falls within an `epoch` for every `epoch`.
- The screen data contains 1 record to represent an `unlock`
+-   The screen delta data contains 1 record to represent an `unlock`
    episode that falls across the boundary of 2 epochs. Namely the
    `unlock` episode starts in one epoch and ends in the next, thus
    there is a record for `unlock` episodes that fall across `night`
    to `morning`, `morning` to `afternoon` and finally `afternoon` to
    `night`
- One episode that crosses two `30-min` segments
+-   The testing is done for `unlock` episode\_type.
-
+-   There is one screen data file each for testing both iOS and
-Data format
+    Android data formats.
-
+-   There is also an additional empty data file for both android and
-| Device | unlock |
+    iOS for testing empty data files
 |-|-|
 | Android | 3, 0|
 | iOS | 3, 2|
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android, iOS|
 |morning|OK|OK|android, iOS|
 |daily|OK|OK|android, iOS|
 |threeday|OK|OK|android, iOS|
 |weekend|OK|OK|android, iOS|
 |beforeMarchEvent|OK|OK|android, iOS|
 |beforeNovemberEvent|OK|OK|android, iOS|
 ## Battery
-Description
+Due to the difference in the format of the raw battery data for iOS and Android, as well as between versions of iOS, the following describes the expected results in `battery_deltas.csv`. This gives a better idea of the use cases being tested since `battery_deltas.csv` makes the iOS and Android data comparable. These files are used to calculate the features for the battery sensor.
- The 4-day raw data is contained in `phone_battery_raw.csv`
+-   The battery delta data file contains data for 1 day.
- One discharge episode crossing two 30-min time segments (`Fri 05:57:30.123` to `Fri 06:04:32.456`)
+-   The battery delta data contains 1 record each for a `charging` and
- One charging episode crossing two 30-min time segments (`Fri 11:55:58.416` to `Fri 12:08:07.876`)
+    `discharging` episode that falls within an `epoch` for every
- One discharge episode and one charging episode located within the same 30-min time segment (`Fri 21:30:00` to `Fri 22:00:00`)
+    `epoch`. Thus, for the `daily` epoch there would be multiple
- One episode before the time switch (`Sun 00:24:00.000`) and one episode after the time switch (`Sun 21:58:00`)
+    `charging` and `discharging` episodes
- Two episodes locate in the same daily segment
+-   Since either a `charging` episode or a `discharging` episode and
-  
+    not both can occur across epochs, in order to test episodes that
-Checklist
+    occur across epochs alternating episodes of `charging` and
-
+    `discharging` episodes that fall across `night` to `morning`,
-|time segment| single tz | multi tz|platform|
+    `morning` to `afternoon` and finally `afternoon` to `night` are
-|-|-|-|-|
+    present in the battery delta data. This starts with a
-|30min|OK|OK|android|
+    `discharging` episode that begins in `night` and end in `morning`.
-|morning|OK|OK|android|
+-   There is one battery data file each, for testing both iOS and
-|daily|OK|OK|android|
+    Android data formats.
-|threeday|OK|OK|android|
+-   There is also an additional empty data file for both android and
-|weekend|OK|OK|android|
+    iOS for testing empty data files
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## Bluetooth
-Description 
+-   The raw Bluetooth data file contains data for 1 day.
-
+-   The raw Bluetooth data contains at least 2 records for each
- The 4-day raw data is contained in `phone_bluetooth_raw.csv`
+    `epoch`. Each `epoch` has a record with a `timestamp` for the
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+    beginning boundary for that `epoch` and a record with a
- Two episodes locate in the same 30-min segment (`Fri 23:38:45.789` and `Fri 23:59:59.465`)
+    `timestamp` for the ending boundary for that `epoch`. (e.g. For
- Two episodes locate in the same daily segment (`Fri 00:00:00.798` and `Fri 00:49:04.132`)
+    the `morning` epoch there is a record with a `timestamp` for
- One episode before the time switch (`Sun 00:24:00.000`) and one episode after the time switch (`Sun 17:32:00.000`)
+    `6:00AM` and another record with a `timestamp` for `11:59:59AM`.
-
+    These are to test edge cases)
-Checklist
+-   An option of 5 Bluetooth devices are randomly distributed
-
+    throughout the data records.
-|time segment| single tz | multi tz|platform|
+-   There is one raw Bluetooth data file each, for testing both iOS
-|-|-|-|-|
+    and Android data formats.
-|30min|OK|OK|android|
+-   There is also an additional empty data file for both android and
-|morning|OK|OK|android|
+    iOS for testing empty data files.
 |daily|OK|OK|android|
 |threeday|OK|OK|android|
 |weekend|OK|OK|android|
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## WIFI
-There are two wifi features (`phone wifi connected` and `phone wifi visible`). The raw test data are stored separately in `phone_wifi_connected_raw.csv` and `phone_wifi_visible_raw.csv`.
+-   There are 2 data files (`wifi_raw.csv` and `sensor_wifi_raw.csv`)
-
+    for each fake participant for each phone platform. 
-Description 
+-   The raw WIFI data files contain data for 1 day.
-
+-   The `sensor_wifi_raw.csv` data contains at least 2 records for
- One episode for each `epoch` (`night`, `morning`, `afternoon` and `evening`)
+    each `epoch`. Each `epoch` has a record with a `timestamp` for the
- Two episodes in the same time segment (`daily` and `30-min`)
+    beginning boundary for that `epoch` and a record with a
- Two episodes around the transition of `epochs` (e.g. one at the end of `night` and one at the beginning of `morning`) 
+    `timestamp` for the ending boundary for that `epoch`. (e.g. For
- One episode before and after the time switch on Sunday
+    the `morning` epoch there is a record with a `timestamp` for
-
+    `6:00AM` and another record with a `timestamp` for `11:59:59AM`.
-phone wifi connected
+    These are to test edge cases)
-
+-   The `wifi_raw.csv` data contains 3 records with random timestamps
-Checklist
+    for each `epoch` to represent visible broadcasting WIFI network.
-
+    This file is empty for the iOS phone testing data.
-|time segment| single tz | multi tz|platform|
+-   An option of 10 access point devices is randomly distributed
-|-|-|-|-|
+    throughout the data records. 5 each for `sensor_wifi_raw.csv` and
-|30min|OK|OK|android, iOS|
+    `wifi_raw.csv`.
-|morning|OK|OK|android, iOS|
+-   There are data files for testing both iOS and Android data formats.
-|daily|OK|OK|android, iOS|
+-   There are also additional empty data files for both android and
-|threeday|OK|OK|android, iOS|
+    iOS for testing empty data files.
 |weekend|OK|OK|android, iOS|
 |beforeMarchEvent|OK|OK|android, iOS|
 |beforeNovemberEvent|OK|OK|android, iOS|
 phone wifi visible
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android|
 |morning|OK|OK|android|
 |daily|OK|OK|android|
 |threeday|OK|OK|android|
 |weekend|OK|OK|android|
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## Light
-Description
+-   The raw light data file contains data for 1 day.
-
+-   The raw light data contains 3 or 4 rows of data for each `epoch`
- The 4-day raw light data is contained in `phone_light_raw.csv`
+    except `night`. The single row of data for `night` is for testing
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+    features for single values inputs. (Example testing the standard
- Two episodes locate in the same 30-min segment (`Fri 00:07:27.000` and `Fri 00:12:00.000`)
+    deviation of one input value)
- Two episodes locate in the same daily segment (`Fri 01:00:00` and `Fri 03:59:59.654`)
+-   Since light is only available for Android there is only one file
- One episode before the time switch (`Sun 00:08:00.000`) and one episode after the time switch (`Sun 05:36:00.000`)
+    that contains data for Android. All other files (i.e. for iPhone)
-
+    are empty data files.
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android|
 |morning|OK|OK|android|
 |daily|OK|OK|android|
 |threeday|OK|OK|android|
 |weekend|OK|OK|android|
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## Locations
@ -273,81 +171,58 @@ Description
 ## Application Foreground
- The 4-day raw application data is contained in `phone_applications_foreground_raw.csv`
+-   The raw application foreground data file contains data for 1 day.
- One episode for each daily segment (night, morning, afternoon and evening)
+-   The raw application foreground data contains 7 - 9 rows of data
- Two episodes locate in the same 30-min segment (`Fri 10:12:56.385` and `Fri 10:18:48.895`)
+    for each `epoch`. The records for each `epoch` contains apps that
- Two episodes locate in the same daily segment (`Fri 11:57:56.385` and `Fri 12:02:56.385`)
+    are randomly selected from a list of apps that are from the
- One episode before the time switch (`Sun 00:07:48.001`) and one episode after the time switch (`Sun 05:10:30.001`)
+    `MULTIPLE_CATEGORIES` and `SINGLE_CATEGORIES` (See
- Two custom category (`Dating`) episode, one at `Fri 06:05:10.385`, another one at ` Fri 11:53:00.385`
+    [testing\_config.yaml]()). There are also records in each epoch
-
+    that have apps randomly selected from a list of apps that are from
-Checklist:
+    the `EXCLUDED_CATEGORIES` and `EXCLUDED_APPS`. This is to test
-
+    that these apps are actually being excluded from the calculations
-|time segment| single tz | multi tz|platform|
+    of features. There are also records to test `SINGLE_APPS`
-|-|-|-|-|
+    calculations.
-|30min|OK|OK|android|
+-   Since application foreground is only available for Android there
-|morning|OK|OK|android|
+    is only one file that contains data for Android. All other files
-|daily|OK|OK|android|
+    (i.e. for iPhone) are empty data files.
 |threeday|OK|OK|android|
 |weekend|OK|OK|android|
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## Activity Recognition
-Description
+-   The raw Activity Recognition data file contains data for 1 day.
-
+-   The raw Activity Recognition data each `epoch` period contains
- The 4-day raw activity data is contained in `plugin_google_activity_recognition_raw.csv` and `plugin_ios_activity_recognition_raw.csv`.
+    rows that record 2 - 5 different `activity_types`. This is such
- Two episodes locate in the same 30-min segment (`Fri 04:01:54` and `Fri 04:13:52`)
+    that durations of activities can be tested. Additionally, there
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
+    are records that mimic the duration of an activity over the time
- Two episodes locate in the same daily segment (`Fri  05:03:09` and `Fri 05:50:36`)
+    boundary of neighboring epochs. (For example, there is a set of
- Two episodes with the time difference less than `5 mins` threshold (`Fri 07:14:21` and `Fri 07:18:50`)
+    records that mimic the participant `in_vehicle` from `afternoon`
- One episode before the time switch (`Sun 00:46:00`) and one episode after the time switch (`Sun 03:42:00`)
+    into `evening`)
-
+-   There is one file each with raw Activity Recognition data for
-Checklist
+    testing both iOS and Android data formats.
-
+    (plugin\_google\_activity\_recognition\_raw.csv for android and
-|time segment| single tz | multi tz|platform|
+    plugin\_ios\_activity\_recognition\_raw.csv for iOS)
-|-|-|-|-|
+-   There is also an additional empty data file for both android and
-|30min|OK|OK|android, iOS|
+    iOS for testing empty data files.
 |morning|OK|OK|android, iOS|
 |daily|OK|OK|android, iOS|
 |threeday|OK|OK|android, iOS|
 |weekend|OK|OK|android, iOS|
 |beforeMarchEvent|OK|OK|android, iOS|
 |beforeNovemberEvent|OK|OK|android, iOS|
 ## Conversation
-The 4-day raw conversation data is contained in `phone_conversation_raw.csv`. The different `inference` records are 
+-   The raw conversation data file contains data for 2 days.
-randomly distributed throughout the `epoch`. 
+-   The raw conversation data contains records with a sample of both
-
+    `datatypes` (i.e. `voice/noise` = `0`, and `conversation` = `2` )
-Description
+    as well as rows with samples of each of the `inference` values
-
+    (i.e. `silence` = `0`, `noise` = `1`, `voice` = `2`, and `unknown`
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`) on each day
+    = `3`) for each `epoch`. The different `datatype` and `inference`
- Two episodes near the transition of the daily segment, one starts at the end of the afternoon, `Fri 17:10:00` and another one starts at the beginning of the evening, `Fri 18:01:00`
+    records are randomly distributed throughout the `epoch`.
- One episode across two segments, `daily` and `30-mins`, (from `Fri 05:55:00` to `Fri 06:00:41`)
+-   Additionally there are 2 - 5 records for conversations (`datatype`
- Two episodes locate in the same daily segment (`Sat 12:45:36` and `Sat 16:48:22`)
+    = 2, and `inference` = -1) in each `epoch` and for each `epoch`
- One episode before the time switch, `Sun 00:15:06`, and one episode after the time switch, `Sun 06:01:00`
+    except night, there is a conversation record that has a
-
+    `double_convo_start` `timestamp` that is from the previous
-Data format
+    `epoch`. This is to test the calculations of features across
-
+    `epochs`.
-| inference | type |
+-   There is a raw conversation data file for both android and iOS
-| - | - |
+    platforms (`plugin_studentlife_audio_android_raw.csv` and
-| 0 | silence |
+    `plugin_studentlife_audio_raw.csv` respectively).
-| 1 | noise | 
+-   Finally, there are also additional empty data files for both
-| 2 | voice |
+    android and iOS for testing empty data files
 | 3 | unknown | 
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android|
 |morning|OK|OK|android|
 |daily|OK|OK|android|
 |threeday|OK|OK|android|
 |weekend|OK|OK|android|
 |beforeMarchEvent|OK|OK|android|
 |beforeNovemberEvent|OK|OK|android|
 ## Keyboard
@ -368,40 +243,6 @@ Checklist
 - One three-minute episode with a 1-minute row on Sun 08:59:54.65 and 09:00:00, another on Sun 12:01:02 that are considered a single episode in multi-timezone event segments to showcase how
 inferring time zone data for Keyboard from phone data can produce inaccurate results around the tz change. This happens because the device was on LA time until 11:59 and switched to NY time at 12pm, in terms of actual time 09 am LA and 12 pm NY represent the same moment in time so 09:00 LA and 12:01 NY are consecutive minutes.
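The time zone arithmetic in the note above can be checked with a short Python sketch. The fixed UTC offsets and the calendar date are assumptions chosen for illustration; a real pipeline would rely on named time zones rather than hard-coded offsets.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets (standard time), used only to show the 3-hour difference.
LA = timezone(timedelta(hours=-8), "LA")
NY = timezone(timedelta(hours=-5), "NY")

# Hypothetical Sunday; the clock times come from the test description above.
last_la_row = datetime(2020, 3, 1, 9, 0, 0, tzinfo=LA)
first_ny_row = datetime(2020, 3, 1, 12, 1, 2, tzinfo=NY)

# Aware datetimes subtract in absolute time, so the wall-clock gap of three hours disappears.
gap = first_ny_row - last_la_row
same_instant = datetime(2020, 3, 1, 9, 0, tzinfo=LA) == datetime(2020, 3, 1, 12, 0, tzinfo=NY)
```

The gap between the two rows is just over one minute in absolute time, which is why they end up in the same keyboard episode once the time zone change is taken into account.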
 ## Application Episodes
 -   The feature requires raw application foreground data file and raw phone screen data file
 -   The raw data files contain data for 4 days.
 -   The raw conversation data contains records with difference in `timestamp` ranging from milliseconds to minutes.
 -   An app episode starts when an app is launched and ends when another app is launched, marking the episode end of the first one,
 or when the screen locks. Thus, we are taking into account the screen unlock episodes.
 -   There are multiple apps usage within each screen unlock episode to verify creation of different app episodes in each 
 screen unlock session. In the screen unlock episode starting from Fri 05:56:51, Fri 10:00:24, Sat 17:48:01, Sun 22:02:00, and Mon 21:05:00 we have multiple apps, both system and non-system apps, to check this.
 -   The 22 minute chunk starting from Fri 10:03:56 checks app episodes for system apps only.
 -   The screen unlock episode starting from Mon 21:05:00 and Sat 17:48:01 checks if the screen lock marks the end of episode for that particular app which was launched a few milliseconds to 8 mins before the screen lock.
 -   Finally, since application foreground is only for Android devices, this feature is also for Android devices only. All other files are empty data files
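The episode rule described above (an app episode ends at the next app launch or at the next screen lock, whichever comes first) can be sketched in Python as follows. This is not the RAPIDS implementation: the function name, the tuple layout, the use of plain seconds as timestamps, and the choice to drop an episode that has no end event are all assumptions made for illustration.

```python
from bisect import bisect_right


def app_episodes(launches, screen_locks):
    """Derive (app, start, end) episodes.

    launches: list of (timestamp, app) tuples sorted by timestamp.
    screen_locks: sorted list of screen lock timestamps.
    """
    episodes = []
    for i, (start, app) in enumerate(launches):
        candidates = []
        if i + 1 < len(launches):
            candidates.append(launches[i + 1][0])  # next app launch
        k = bisect_right(screen_locks, start)
        if k < len(screen_locks):
            candidates.append(screen_locks[k])  # first lock after this launch
        if candidates:
            episodes.append((app, start, min(candidates)))
        # Assumption: an episode with no later launch or lock is dropped.
    return episodes


# Hypothetical data, timestamps in seconds.
launches = [(0, "mail"), (40, "maps"), (300, "camera")]
screen_locks = [100, 360]
episodes = app_episodes(launches, screen_locks)
```

Here `mail` ends when `maps` is launched, while `maps` and `camera` end at the following screen lock, which is the distinction the multi-app unlock sessions in the test data are meant to exercise.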
 ## Data Yield
 Description
 - Two sensors were picked for testing, `phone_screen` and `phone_light`. `phone_screen` is event based and `phone_light` is sampling at regular frequency
 - A 31-min episode (from `Fri 01:00:00` to `Fri 01:30:00`) in phone_light data, which is considered as a `validyieldedhours`
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|android, ios|
 |morning|OK|OK|android, ios|
 |daily|OK|OK|android, ios|
 |threeday|OK|OK|android, ios|
 |weekend|OK|OK|android, ios|
 |beforeMarchEvent|OK|OK|android, ios|
 |beforeNovemberEvent|OK|OK|android, ios|
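The 31-min episode in the description above can be read as follows: an hour counts as valid when enough of its minutes contain data. The Python sketch below assumes a minute ratio threshold of 0.5 and an "at least" comparison; both are assumptions for illustration, as are the helper names and the calendar date.

```python
from datetime import datetime, timedelta


def minutes_with_data(timestamps):
    """Return the set of distinct minutes that contain at least one row."""
    return {ts.replace(second=0, microsecond=0) for ts in timestamps}


def is_valid_yielded_hour(timestamps, ratio_threshold=0.5):
    """Check one hour of rows; assumes all timestamps fall within the same hour."""
    return len(minutes_with_data(timestamps)) / 60 >= ratio_threshold


# Hypothetical Friday; one row per minute from 01:00 to 01:30 inclusive gives 31 minutes.
start = datetime(2020, 3, 6, 1, 0, 0)
rows_31 = [start + timedelta(minutes=m) for m in range(31)]
rows_29 = rows_31[:29]

valid_31 = is_valid_yielded_hour(rows_31)
valid_29 = is_valid_yielded_hour(rows_29)
```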
 ## Fitbit Calories Intraday
@ -422,31 +263,6 @@ Description
 - A four-minute sedentary episode on Sun 10:01 that will be ignored for November's multi-timezone event segments since the test segment ends at 10am on that weekend.
 - A three-minute very active episode on Sat 16:03. This episode and the one at 16:00 are counted as one for lowmet episodes
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|fitbit|
 |morning|OK|OK|fitbit|
 |daily|OK|OK|fitbit|
 |threeday|OK|OK|fitbit|
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
 ## Fitbit Heartrate intraday 
 Description:
 - The 4-day raw heartrate data is contained in `fitbit_heartrate_intraday_raw.csv`
 - One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
 - Two episodes located in the same 30-min segment (`Fri 00:49:00` and `Fri 00:52:00`)
 - Two different types of heartrate zone episodes located in the same 30-min segment (`Fri 05:49:00 outofrange` and `Fri 05:57:00 fatburn`)
 - Two episodes located in the same daily segment (`Fri 12:02:00` and `Fri 19:38:00`)
 - One episode before the time switch, `Sun 00:08:00`, and one episode after the time switch, `Sun 07:28:00`
 Checklist
 |time segment| single tz | multi tz|platform|
@ -506,82 +322,3 @@ Checklist
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
 ## Fitbit Heartrate Summary
 Description
 - The 4-day raw heartrate summary data is contained in `fitbit_heartrate_summary_raw.csv`.
 - Because heartrate summary data is periodic, it only produces results for periodic time segments; there are no results for frequency and event segments.
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|fitbit|
 |morning|OK|OK|fitbit|
 |daily|OK|OK|fitbit|
 |threeday|OK|OK|fitbit|
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
 ## Fitbit Step Intraday
 Description
 - The 4-day raw steps intraday data is contained in `fitbit_steps_intraday_raw.csv`
 - One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`) on each day
 - Two episodes within the same 30-min segment (`Fri 05:58:00` and `Fri 05:59:00`)
 - A one-min episode at `2020-03-07 09:00:00` that will be converted to New York time `2020-03-07 12:00:00`
 - One episode before the time switch, `Sun 00:19:00`, and one episode after the time switch, `Sun 09:01:00`
 - Episodes that cross two 30-min segments (`Fri 11:59:00` and `Fri 12:00:00`)
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|fitbit|
 |morning|OK|OK|fitbit|
 |daily|OK|OK|fitbit|
 |threeday|OK|OK|fitbit|
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
 ## Fitbit Step Summary
 Description
 - The 4-day raw steps summary data is contained in `fitbit_steps_summary_raw.csv`.
 - Because steps summary data is periodic, it only produces results for periodic time segments; there are no results for frequency and event segments.
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|fitbit|
 |morning|OK|OK|fitbit|
 |daily|OK|OK|fitbit|
 |threeday|OK|OK|fitbit|
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
 ## Fitbit Data Yield
 Checklist
 |time segment| single tz | multi tz|platform|
 |-|-|-|-|
 |30min|OK|OK|fitbit|
 |morning|OK|OK|fitbit|
 |daily|OK|OK|fitbit|
 |threeday|OK|OK|fitbit|
 |weekend|OK|OK|fitbit|
 |beforeMarchEvent|OK|OK|fitbit|
 |beforeNovemberEvent|OK|OK|fitbit|
--- a/docs/features/fitbit-steps-intraday.md
+++ b/docs/features/fitbit-steps-intraday.md
@ -29,7 +29,6 @@ Parameters description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`                | Set to `True` to extract `FITBIT_STEPS_INTRADAY` features from the `RAPIDS` provider|
 |`[FEATURES]`               |         Features to be computed from steps intraday data, see table below           |
 |`[REFERENCE_HOUR]`         | The reference point from which `firststeptime` or `laststeptime` is to be computed, default is midnight |
 |`[THRESHOLD_ACTIVE_BOUT]`  | Every minute with Fitbit steps data will be labelled as `sedentary` if its step count is below this threshold, otherwise, `active`.    |
 |`[INCLUDE_ZERO_STEP_ROWS]` | Whether or not to include time segments with a 0 step count during the whole day.                          |
@ -43,8 +42,6 @@ Features description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
 |minsteps                   |steps          |The minimum step count during a time segment.
 |avgsteps                   |steps          |The average step count during a time segment.
 |stdsteps                   |steps          |The standard deviation of step count during a time segment.
 |firststeptime              |minutes        |Minutes until the first non-zero step count.
 |laststeptime               |minutes        |Minutes until the last non-zero step count.
 |countepisodesedentarybout  |bouts          |Number of sedentary bouts during a time segment.
 |sumdurationsedentarybout   |minutes        |Total duration of all sedentary bouts during a time segment.
 |maxdurationsedentarybout   |minutes        |The maximum duration of any sedentary bout during a time segment.
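The `[THRESHOLD_ACTIVE_BOUT]` labelling and the sedentary bout features listed above can be sketched as follows. This is a minimal illustration, assuming one row per minute with no gaps between rows; `label_bouts` and `sedentary_summary` are hypothetical names and not part of RAPIDS.

```python
from itertools import groupby


def label_bouts(step_counts, threshold):
    """Label each minute and collapse consecutive equal labels into bouts.

    Returns a list of (label, duration_in_minutes) tuples.
    """
    labels = ["sedentary" if steps < threshold else "active" for steps in step_counts]
    return [(label, len(list(group))) for label, group in groupby(labels)]


def sedentary_summary(step_counts, threshold):
    """Summarise sedentary bouts using the feature names from the table above."""
    durations = [n for label, n in label_bouts(step_counts, threshold) if label == "sedentary"]
    return {
        "countepisodesedentarybout": len(durations),
        "sumdurationsedentarybout": sum(durations),
        "maxdurationsedentarybout": max(durations, default=0),
    }


minute_steps = [0, 3, 12, 15, 2, 0, 0, 20]
bouts = label_bouts(minute_steps, threshold=10)
summary = sedentary_summary(minute_steps, threshold=10)
```

With a threshold of 10, the eight minutes above collapse into two sedentary bouts (2 and 3 minutes) and two active bouts.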
--- a/docs/features/phone-activity-recognition.md
+++ b/docs/features/phone-activity-recognition.md
@ -44,7 +44,7 @@ Features description for `[PHONE_ACTIVITY_RECOGNITION][PROVIDERS][RAPIDS]`:
 |count                   |rows             | Number of episodes.
 |mostcommonactivity      |activity type   | The most common activity type (e.g. `still`, `on_foot`, etc.). If there is a tie, the first one is chosen.
 |countuniqueactivities   |activity type   | Number of unique activities.
-|durationstationary      |minutes          | The total duration of `[ACTIVITY_CLASSES][STATIONARY]` episodes of still and tilting activities
+|durationstationary      |minutes          | The total duration of `[ACTIVITY_CLASSES][STATIONARY]` episodes
 |durationmobile          |minutes          | The total duration of `[ACTIVITY_CLASSES][MOBILE]` episodes of on foot, running, and on bicycle activities
 |durationvehicle         |minutes          | The total duration of `[ACTIVITY_CLASSES][VEHICLE]` episodes of on vehicle activity
--- a/docs/features/phone-applications-foreground.md
+++ b/docs/features/phone-applications-foreground.md
@ -33,36 +33,25 @@ Parameters description for `[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]`:
 |Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`| Set to `True` to extract `PHONE_APPLICATIONS_FOREGROUND` features from the `RAPIDS` provider|
 |`[INCLUDE_EPISODE_FEATURES]`| Set to `True` to extract features from application usage episodes using Screen data |
 |`[FEATURES]` |         Features to be computed, see table below
-|`[SINGLE_CATEGORIES]`     | An array of app categories to be *included* in the feature extraction computation. The special keyword `all` represents a category with all the apps from each participant. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
+|`[SINGLE_CATEGORIES]`     | An array of app categories to be *included* in the feature extraction computation. The special keyword `all` represents a category with all the apps from each participant. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
-|`[CUSTOM_CATEGORIES]`   | An array of collections representing your own app categories. The key of each element is the name of the custom category, and the value is an array of the package names (apps) included in that category.
+|`[MULTIPLE_CATEGORIES]`   | An array of collections representing meta-categories (a group of categories). The key of each element is the name of the `meta-category` and the value is an array of member app categories. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
 |`[MULTIPLE_CATEGORIES]`   | An array of collections representing meta-categories (a group of categories). The key of each element is the name of the `meta-category` and the value is an array of member app categories. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
 |`[SINGLE_APPS]`           | An array of apps to be *included* in the feature extraction computation. Use their package name (e.g. `com.google.android.youtube`) or the reserved keyword `top1global` (the most used app by a participant over the whole monitoring study)
-|`[EXCLUDED_CATEGORIES]`   | An array of app categories to be *excluded* from the feature extraction computation. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
+|`[EXCLUDED_CATEGORIES]`   | An array of app categories to be *excluded* from the feature extraction computation. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
 |`[EXCLUDED_APPS]`         | An array of apps to be excluded from the feature extraction computation. Use their package name, for example: `com.google.android.youtube`
 Features description for `[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]`:
 |Feature                    |Units      |Description|
 |-------------------------- |---------- |---------------------------|
-|countevent              |apps      | Number of times a single app or apps within a category were used (i.e. they were brought to the foreground either by tapping their icon or switching to it from another app)
+|count              |apps      | Number of times a single app or apps within a category were used (i.e. they were brought to the foreground either by tapping their icon or switching to it from another app)
 |timeoffirstuse     |minutes   | The time in minutes between 12:00am (midnight) and the first use of a single app or apps within a category during a `time_segment`
 |timeoflastuse      |minutes   | The time in minutes between 12:00am (midnight) and the last use of a single app or apps within a category during a `time_segment`
 |frequencyentropy   |nats      | The entropy of the used apps within a category during a `time_segment` (each app is seen as a unique event, the more apps were used, the higher the entropy). This is especially relevant when computed over all apps. Entropy cannot be obtained for a single app
 |countepisode              |apps      | Number of times a usage episode of a single app or apps within a category was logged. In contrast to `countevent`, if an app was used across more than one time segment (for example, across more than one 30-minute segment), the `countepisode` will be one on each time segment instance.
 |minduration        |minutes   | For a `time_segment`, the minimum duration an application was used in minutes
 |maxduration        |minutes   | For a `time_segment`, the maximum duration an application was used in minutes
 |meanduration       |minutes   | For a `time_segment`, the mean duration of all the applications used in minutes
 |sumduration        |minutes   | For a `time_segment`, the sum duration of all the applications used in minutes
 !!! note "Assumptions/Observations"
-    1. Features can be computed by app, by apps grouped under a single category (genre), by your own categories, or by multiple categories grouped together (meta-categories). For example, we can get features for `Facebook` (single app), for `Social Network` apps (a category including Facebook and other social media apps), for `Traditional Social Media` (a custom category that includes Twitter and Facebook), or for `Social` (a meta-category formed by `Social Network` and `Social Media Tools` categories).
+    Features can be computed by app, by apps grouped under a single category (genre) and by multiple categories grouped together (meta-categories). For example, we can get features for `Facebook` (single app), for `Social Network` apps (a category including Facebook and other social media apps) or for `Social` (a meta-category formed by `Social Network` and `Social Media Tools` categories).
-    2. Apps installed by default like YouTube are considered systems apps on some phones. We do an exact match to exclude apps where "genre" == `EXCLUDED_CATEGORIES` or "package_name" == `EXCLUDED_APPS`.
+    Apps installed by default like YouTube are considered systems apps on some phones. We do an exact match to exclude apps where "genre" == `EXCLUDED_CATEGORIES` or "package_name" == `EXCLUDED_APPS`.
-    3. We provide four ways of classifying an app within a category (genre): a) by automatically scraping its official category from the Google Play Store, b) by using the catalog created by Stachl et al., which we provide in RAPIDS (`data/external/stachl_application_genre_catalogue.csv`), c) by manually creating a personalized catalog, or d) by defining a custom category in `config.yaml`. You can choose a, b, or c by modifying `[APPLICATION_GENRES]` keys and values (see the first table of this page).
+    We provide three ways of classifying an app within a category (genre): a) by automatically scraping its official category from the Google Play Store, b) by using the catalogue created by Stachl et al. which we provide in RAPIDS (`data/external/stachl_application_genre_catalogue.csv`), or c) by manually creating a personalized catalogue. You can choose a, b or c by modifying `[APPLICATION_GENRES]` keys and values (see the Sensor parameters description table above).
    4. We count `episodes` and `events` separately. Events are single app logs (when an app was opened), but episodes span from the time an app was opened until a new app is in the foreground or the screen is locked. Episodes will be chunked across any overlapping time segments. The `top1global` of `episodes` might not be the same as the `top1global` of `events`.
    5. The application episodes are calculated using the application foreground and screen unlock episode data. An application episode starts when the application is launched and ends when a new application is launched, or the screen is locked.
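The `frequencyentropy` feature described in the table above is reported in nats. A minimal sketch, assuming it is the standard Shannon entropy with a natural logarithm over the relative frequency of each app's events (the function name is hypothetical and this is not the RAPIDS code):

```python
import math
from collections import Counter


def frequency_entropy(app_events):
    """Shannon entropy (in nats) of app usage events.

    app_events: iterable of package names, one entry per foreground event.
    """
    counts = Counter(app_events)
    total = sum(counts.values())
    # p * ln(p) summed over every distinct app, negated
    return -sum((c / total) * math.log(c / total) for c in counts.values())


two_apps = frequency_entropy(["com.example.a", "com.example.b"])
skewed = frequency_entropy(["com.example.a"] * 3 + ["com.example.b"])
```

Two equally used apps give ln(2) nats, and a skewed distribution gives less. For a single app this formula degenerates to zero, which is consistent with the note that entropy is not meaningful for a single app.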
--- a/docs/features/phone-bluetooth.md
+++ b/docs/features/phone-bluetooth.md
@ -86,7 +86,6 @@ Features description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:
 !!! note "Assumptions/Observations"
    - Devices are classified as belonging to the participant (`own`) or to other people (`others`) using k-means based on the number of times and the number of days each device was detected across each participant's dataset. See [Doryab et al](../../citation#doryab-bluetooth) for more details.
    - If ownership cannot be computed because all devices were detected on only one day, they are all considered as `other`. Thus `all` and `other` features will be equal. The likelihood of this scenario decreases the more days of data you have.
    - When searching for the most frequent device across 30-minute segments, the search range is equivalent to the sum of all segments of the same time period. For instance, the `countscansmostfrequentdeviceacrosssegments` for the time segment (`Fri 00:00:00, Fri 00:29:59`) will get the count in that segment of the most frequent device found within all (`00:00:00, 00:29:59`) time segments. To find `countscansmostfrequentdeviceacrosssegments` for `other` devices, the search range needs to filter out all `own` devices. There is no need to do so for `countscansmostfrequentdeviceacrossdataset`: the most frequent device across the dataset stays the same for `countscansmostfrequentdeviceacrossdatasetall`, `countscansmostfrequentdeviceacrossdatasetown` and `countscansmostfrequentdeviceacrossdatasetother`. The same rule applies to the least frequent device across the dataset.
    - The most and least frequent devices will be the same across time segment instances and across the entire dataset when every time segment instance covers every hour of a dataset. For example, daily segments (00:00 to 23:59) fall in this category but morning segments (06:00am to 11:59am) or periodic 30-minute segments don't.
    ??? info "Example"
--- a/docs/features/phone-calls.md
+++ b/docs/features/phone-calls.md
@ -26,7 +26,6 @@ Parameters description for `[PHONE_CALLS][PROVIDERS][RAPIDS]`:
 | Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;        | Description |
 |-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 |`[COMPUTE]`| Set to `True` to extract `PHONE_CALLS` features from the `RAPIDS` provider|
 |`[FEATURES_TYPE]`| Set to `EPISODES` to extract features based on call episodes or `EVENTS` to extract features based on events.|
 | `[CALL_TYPES]`   | The particular call_type that will be analyzed. The options for this parameter are incoming, outgoing or missed.                                                                                                                                                 |
 | `[FEATURES]`    | Features to be computed for `outgoing`, `incoming`, and `missed` calls. Note that the same features are available for both incoming and outgoing calls, while missed calls has its own set of features. See the tables below. |
@ -61,4 +60,4 @@ Features description for `[PHONE_CALLS][PROVIDERS][RAPIDS]` missed calls:
 !!! note "Assumptions/Observations"
    1. Traces for iOS calls are unique even for the same contact calling a participant more than once which renders `countmostfrequentcontact` meaningless and `distinctcontacts` equal to the total number of traces. 
    2. `[CALL_TYPES]` and `[FEATURES]` keys in `config.yaml` need to match. For example, `[CALL_TYPES]` `outgoing` matches the `[FEATURES]` key `outgoing`
-    3. iOS calls data is transformed to match Android calls data format.
+    3. iOS calls data is transformed to match Android calls data format. See our [algorithm](algorithms/phone-algorithms.md#phone-calls)
--- a/docs/features/phone-keyboard.md
+++ b/docs/features/phone-keyboard.md
@ -6,12 +6,6 @@ Sensor parameters description for `[PHONE_KEYBOARD]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the keyboard data is stored
 ## RAPIDS provider
 !!! info "Available time segments and platforms"
    - Available for all time segments
    - Available for Android only
 !!! info "File Sequence"
    ```bash
    - data/raw/{pid}/phone_keyboard_raw.csv
--- a/docs/features/phone-locations.md
+++ b/docs/features/phone-locations.md
@ -6,9 +6,8 @@ Sensor parameters description for `[PHONE_LOCATIONS]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the location data is stored
 |`[LOCATIONS_TO_USE]`| Type of location data to use, one of `ALL`, `GPS`, `ALL_RESAMPLED` or `FUSED_RESAMPLED`. This filter is based on the `provider` column of the locations table, `ALL` includes every row, `GPS` only includes rows where the provider is gps, `ALL_RESAMPLED` includes all rows after being resampled, and `FUSED_RESAMPLED` only includes rows where the provider is fused after being resampled.
-|`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]`| If `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled. A location row is resampled to the next valid timestamp (see the Assumptions/Observations below) only if the time difference between them is less or equal than this threshold (in minutes).
+|`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]`| if `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled, a location row is resampled to the next valid timestamp (see the Assumptions/Observations below) only if the time difference between them is less or equal than this threshold (in minutes).
-|`[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`| If `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled. A location row is resampled at most for this long (in minutes).
+|`[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`| if `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled, a location row is resampled at most for this long (in minutes)
 |`[ACCURACY_LIMIT]` | An integer in meters; any location row with an accuracy value higher than or equal to this is dropped. This number means there's a 68% probability the actual location is within this radius.
 !!! note "Assumptions/Observations"
    **Types of location data to use**
@ -17,9 +16,9 @@ Sensor parameters description for `[PHONE_LOCATIONS]`:
    - If you want to use only the GPS provider, set `[LOCATIONS_TO_USE]` to `GPS`
    - If you want to use all providers, set `[LOCATIONS_TO_USE]` to `ALL`
    - If you collected location data from different providers, including the fused API, use `ALL_RESAMPLED`
-    - If your mobile client was configured to use fused location only or want to focus only on this provider, set `[LOCATIONS_TO_USE]` to `FUSED_RESAMPLED`.
+    - If your mobile client was configured to use fused location only or want to focus only on this provider, set `[LOCATIONS_TO_USE]` to `RESAMPLE_FUSED`.
-    `ALL_RESAMPLED` and `FUSED_RESAMPLED` take the original location coordinates and replicate each pair forward in time as long as the phone was sensing data as indicated by the joined timestamps of [`[PHONE_DATA_YIELD][SENSORS]`](../phone-data-yield/). This is done because Google's API only logs a new location coordinate pair when it is sufficiently different in time or space from the previous one and because GPS and network providers can log data at variable rates.
+    `ALL_RESAMPLED` and `RESAMPLE_FUSED` take the original location coordinates and replicate each pair forward in time as long as the phone was sensing data as indicated by the joined timestamps of [`[PHONE_DATA_YIELD][SENSORS]`](../phone-data-yield/). This is done because Google's API only logs a new location coordinate pair when it is sufficiently different in time or space from the previous one and because GPS and network providers can log data at variable rates.
    There are two parameters associated with resampling fused location.
@ -42,7 +41,6 @@ These features are based on the original open-source implementation by [Barnett
    - data/raw/{pid}/phone_locations_raw.csv
    - data/interim/{pid}/phone_locations_processed.csv
    - data/interim/{pid}/phone_locations_processed_with_datetime.csv
    - data/interim/{pid}/phone_locations_barnett_daily.csv
    - data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
    - data/processed/features/{pid}/phone_locations.csv
    ```
@ -54,6 +52,7 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][BARNETT]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `BARNETT` provider|
 |`[FEATURES]` |         Features to be computed, see table below
 |`[ACCURACY_LIMIT]` |   An integer in meters; any location row with an accuracy value higher than this is dropped. This number means there's a 68% probability the actual location is within this radius
 |`[IF_MULTIPLE_TIMEZONES]` |    Currently, `USE_MOST_COMMON` is the only value supported. If the location data for a participant belongs to multiple time zones, we select the most common because Barnett's algorithm can only handle one time zone 
 |`[MINUTES_DATA_USED]` |    Set to `True` to include an extra column in the final location feature file containing the number of minutes used to compute the features on each time segment. Use this for quality control purposes; the more data minutes exist for a period, the more reliable its features should be. For fused location, a single minute can contain more than one coordinate pair if the participant is moving fast enough.
@ -112,9 +111,7 @@ These features are based on the original implementation by [Doryab et al.](../..
    - data/raw/{pid}/phone_locations_raw.csv
    - data/interim/{pid}/phone_locations_processed.csv
    - data/interim/{pid}/phone_locations_processed_with_datetime.csv
-    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv
+    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv
    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled.csv
    - data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv
    - data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
    - data/processed/features/{pid}/phone_locations.csv
    ```
@ -124,8 +121,9 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
 |Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
-|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `DORYAB` provider|
+|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `BARNETT` provider|
 |`[FEATURES]` |         Features to be computed, see table below
 |`[ACCURACY_LIMIT]` |   An integer in meters, any location rows with an accuracy higher than this will be dropped. This number means there's a 68% probability the true location is within this radius
 | `[DBSCAN_EPS]`             | The maximum distance in meters between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
 | `[DBSCAN_MINSAMPLES]`      | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point of a cluster. This includes the point itself.
 | `[THRESHOLD_STATIC]`       | It is the threshold value in km/hr which labels a row as Static or Moving.
@ -145,8 +143,8 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
 |locationvariance                                            |$meters^2$    |The sum of the variances of the latitude and longitude columns. 
 |loglocationvariance                                           | -          | Log of the sum of the variances of the latitude and longitude columns.
 |totaldistance                                                |meters        |Total distance traveled in a time segment using the haversine formula.
-|avgspeed                                                 |km/hr         |Average speed in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment.
+|avgspeed                                                 |km/hr         |Average speed in a time segment considering only the instances labeled as Moving.
-|varspeed                                                      |km/hr         |Speed variance in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment.
+|varspeed                                                      |km/hr         |Speed variance in a time segment considering only the instances labeled as Moving. 
 |{--circadianmovement--}                                      |-             | Deprecated, see Observations below. \ "It encodes the extent to which a person's location patterns follow a 24-hour circadian cycle.\" [Doryab et al.](../../citation#doryab-locations).
 |numberofsignificantplaces                                    |places        |Number of significant locations visited. It is calculated using the DBSCAN/OPTICS clustering algorithm which takes in EPS and MIN_SAMPLES as parameters to identify clusters. Each cluster is a significant place.
 |numberlocationtransitions                                    |transitions   |Number of movements between any two clusters in a time segment.
@ -167,7 +165,7 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
 !!! note "Assumptions/Observations"
    **Significant Locations Identified**
-    Significant locations are determined using `DBSCAN` or `OPTICS` clustering on locations that a participant visited over the course of the period of data collection. The most significant location is the place where the participant stayed for the longest time.
+    Significant locations are determined using DBSCAN clustering on locations that a patient visits over the course of the period of data collection.
    **Circadian Movement Calculation**
    Note Feb 3 2021. It seems the implementation of this feature is not correct; we suggest not to use this feature until a fix is in place. For a detailed description of how this should be calculated, see [Saeb et al](https://pubmed.ncbi.nlm.nih.gov/28344895/).
@ -197,5 +195,4 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
            the candidate will be regarded as the home cluster; otherwise, the home cluster will be the last valid day's cluster.
            If there are no valid clusters before that day, the first home location in the days after is used.
-    **Clustering algorithms**
+
    [`DBSCAN`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) and [`OPTICS`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html#r2c55e37003fe-1) algorithms are available currently. Duplicated locations are discarded while clustering. The `DBSCAN` algorithm takes the time spent at each location into consideration. However, the `OPTICS` algorithm ignores it as it is not supported in the current [scikit-learn](https://github.com/scikit-learn/scikit-learn/issues/12394) implementation.
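The `totaldistance` feature in the `DORYAB` features table is described as using the haversine formula. A minimal sketch of that computation follows, assuming consecutive latitude/longitude pairs in degrees and a mean Earth radius of 6,371 km; the function names and the radius constant are assumptions for illustration, not values taken from RAPIDS.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def total_distance_m(points):
    """Sum of haversine distances between consecutive (lat, lon) points."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
round_trip = total_distance_m([(0.0, 0.0), (0.0, 1.0), (0.0, 0.0)])
```

One degree of longitude along the equator comes out at roughly 111 km, and the round trip is twice that.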
--- a/docs/features/phone-screen.md
+++ b/docs/features/phone-screen.md
@ -32,7 +32,7 @@ Parameters description for `[PHONE_SCREEN][PROVIDERS][RAPIDS]`:
 |`[FEATURES]` |         Features to be computed, see table below
 |`[REFERENCE_HOUR_FIRST_USE]` |  The reference point from which `firstuseafter` is to be computed, default is midnight
 |`[IGNORE_EPISODES_SHORTER_THAN]` |  Ignore episodes that are shorter than this threshold (minutes). Set to 0 to disable this filter.
-|`[IGNORE_EPISODES_LONGER_THAN]` |  Ignore episodes that are longer than this threshold (minutes), default is 6 hours. Set to 0 to disable this filter.
+|`[IGNORE_EPISODES_LONGER_THAN]` |  Ignore episodes that are longer than this threshold (minutes). Set to 0 to disable this filter.
 |`[EPISODE_TYPES]` |  Currently we only support `unlock` episodes (from when the phone is unlocked until the screen is off)
--- a/docs/img/hm-data-yield-participants-absolute-time.html
+++ b/docs/img/hm-data-yield-participants-absolute-time.html
--- a/docs/img/hm-data-yield-participants-absolute-time.png
+++ b/docs/img/hm-data-yield-participants-absolute-time.png
--- a/docs/img/hm-data-yield-participants-relative-time.html
+++ b/docs/img/hm-data-yield-participants-relative-time.html
--- a/docs/img/hm-data-yield-participants-relative-time.png
+++ b/docs/img/hm-data-yield-participants-relative-time.png
--- a/docs/img/hm-feature-correlations.html
+++ b/docs/img/hm-feature-correlations.html
--- a/docs/img/hm-feature-correlations.png
+++ b/docs/img/hm-feature-correlations.png
--- a/docs/img/hm-sensor-rows.html
+++ b/docs/img/hm-sensor-rows.html
--- a/docs/img/hm-sensor-rows.png
+++ b/docs/img/hm-sensor-rows.png
--- a/docs/index.md
+++ b/docs/index.md
@ -1,12 +1,12 @@
 # Welcome to RAPIDS documentation
-Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows. Check out our [paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823)!
+Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](workflow-examples/analysis.md) your analysis into reproducible workflows.
-RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables (the latter in collaboration with the [DBDP](https://dbdp.org/)). 
+RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables in collaboration with the [DBDP](https://dbdp.org/). 
 !!! tip "Where do I start?"
-    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](analysis/minimal)
+    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](workflow-examples/minimal)
    :material-play-speed: [Install](setup/installation), [configure](setup/configuration), and [execute](setup/execution) RAPIDS to [extract](features/feature-introduction.md) and [plot](visualizations/data-quality-visualizations.md) behavioral features
@ -25,7 +25,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
 1. **Consistent analysis**. Every participant sensor dataset is analyzed in the same way and isolated from each other.
 2. **Efficient analysis**. Every analysis step is executed only once. Whenever your data or configuration changes, only the affected files are updated.
-5. **Parallel execution**. Thanks to [Snakemake](https://snakemake.github.io/), your analysis can be executed over multiple cores without changing your code.
+5. **Parallel execution**. Thanks to Snakemake, your analysis can be executed over multiple cores without changing your code.
 6. **Code-free features**. Extract any of the behavioral features offered by RAPIDS without writing any code.
 7. **Extensible code**. You can easily add your own data streams or behavioral features in R or Python, share them with the community, and keep authorship and citations.
 8. **Time zone aware**. Your data is adjusted to one or more time zones per participant.
@ -37,7 +37,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
 ## Users and Contributors
 ??? quote "Community Contributors"
-    Many thanks to the [whole team](./team) and our community contributions:
+    Many thanks to our community contributions and the [whole team](../team):
    - Agam Kumar (CMU)
    - Yasaman S. Sefidgar (University of Washington)
@ -46,7 +46,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
    - Stephen Price (CMU)
    - Neil Singh (University of Virginia)
-    Many thanks to the researchers that made [their work](./citation) open source:
+    Many thanks to the researchers that made [their work](../citation) open source:
    - Panda et al. [paper](https://pubmed.ncbi.nlm.nih.gov/31657854/)
    - Stachl et al. [paper](https://www.pnas.org/content/117/30/17680)
@ -57,10 +57,9 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
 ??? quote "Publications using RAPIDS"
    - Predicting Symptoms of Depression and Anxiety Using Smartphone and Wearable Data [link](https://www.frontiersin.org/articles/10.3389/fpsyt.2021.625247/full)
-    - Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyperparameter Optimization, and Feature Importance Analysis: Exploratory Study [link](https://mhealth.jmir.org/2021/7/e26540)
+    - Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyper-parameter Optimization, and Feature Importance Analysis: An Exploratory Study [link](https://preprints.jmir.org/preprint/26540)
    -  Digital Biomarkers of Symptom Burden Self-Reported by Perioperative Patients Undergoing Pancreatic Surgery: Prospective Longitudinal Study [link](https://cancer.jmir.org/2021/2/e27975/)
-    - An Automated Machine Learning Pipeline for Monitoring and Forecasting Mobile Health Data [link](https://ieeexplore.ieee.org/abstract/document/9483755/)
+    - An Automated Machine Learning Pipeline for Monitoring and Forecasting Mobile Health Data [link](https://edas.info/showManuscript.php?m=1570708269&random=750318666&type=final&ext=pdf&title=PDF+file)
    - Mobile Footprinting: Linking Individual Distinctiveness in Mobility Patterns to Mood, Sleep, and Brain Functional Connectivity [link](https://www.biorxiv.org/content/10.1101/2021.05.17.444568v1.abstract)
 <div class="users">
 <div><img alt="carnegie mellon university" loading="lazy" src="./img/logos/cmu.png" /></div>
--- a/docs/setup/configuration.md
+++ b/docs/setup/configuration.md
@ -476,12 +476,28 @@ Modify the following keys in your `config.yaml` depending on the [data stream](.
        --8<---- "docs/snippets/database.md"
-    === "aware_csv"
+    === "aware_mysql_split"
        | Key                  | Description                                                                                                                |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[DATABASE_GROUP]`   | A database credentials group. Read the instructions below to set it up    |
        --8<---- "docs/snippets/database.md"
    === "aware_csv"
        | Key &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;                  | Description|
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[FOLDER]`   | Folder where you have to place a CSV file **per** phone sensor. Each file has to contain all the data from every participant you want to process.     |
    === "aware_influxdb"
        | Key                  | Description                                                                                                                |
        |---------------------|----------------------------------------------------------------------------------------------------------------------------|
        | `[DATABASE_GROUP]`   | A database credentials group. Read the instructions below to set it up    |
        --8<---- "docs/snippets/database.md"
--- a/docs/setup/installation.md
+++ b/docs/setup/installation.md
@ -35,21 +35,14 @@ You can install RAPIDS using Docker (the fastest), or native instructions for Ma
        ```
    7.  *Optional*. You can edit RAPIDS files with `vim` but we recommend using `Visual Studio Code` and its `Remote Containers` extension
-        ??? info "How to configure the Remote Containers extension"
+        ??? info "How to configure Remote Containers extension"
-            - Make sure RAPIDS Docker container is running
+            - Make sure RAPIDS container is running
-
+            - Install the [Remote - Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
-            - Install VS Code and its [Remote - Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
+            - Go to the `Remote Explorer` panel on the left hand sidebar
-
+            - On the top right dropdown menu choose `Containers`
-            - Click the `Remote Explorer` icon on the left-hand sidebar (the icon is a computer monitor)
+            - Double click on the `moshiresearch/rapids` container in the `CONTAINERS` tree

-
+            - A new VS Code session should open on RAPIDS main folder inside the container.
            - On the top right dropdown menu, choose `Containers`
            - Right-click on the `moshiresearch/rapids` container in the `CONTAINERS` tree and select `Attach to Container`. A new VS Code window should open
            - In the new window, open the `/rapids/` folder via the `File/Open...` menu
            - Run RAPIDS inside a terminal in VS Code. Open one with the `Terminal/New Terminal` menu
    !!! warning
        If you installed RAPIDS using Docker for Windows on Windows 10, the container will have [limits](https://stackoverflow.com/questions/43460770/docker-windows-container-memory-limit) on the amount of RAM it can use. If you find that RAPIDS crashes due to running out of memory, [increase](https://stackoverflow.com/a/56583203/6030343) this limit.
--- a/docs/setup/overview.md
+++ b/docs/setup/overview.md
@ -23,10 +23,10 @@ Let's review some key concepts we use throughout these docs:
    - [Add your own behavioral features](../../features/add-new-features/) (we can include them in RAPIDS if you want to share them with the community)
    - [Add support for new data streams](../../datastreams/add-new-data-streams/) if yours cannot be processed by RAPIDS yet
    - Create visualizations for [data quality control](../../visualizations/data-quality-visualizations/)  and [feature inspection](../../visualizations/feature-visualizations/)
-    - [Extending RAPIDS to organize your analysis](../../analysis/complete-workflow-example/) and publish a code repository along with your code
+    - [Extending RAPIDS to organize your analysis](../../workflow-examples/analysis/) and publish a code repository along with your code
 !!! hint
-    - We recommend you follow the [Minimal Example](../../analysis/minimal/) tutorial to get familiar with RAPIDS
+    - We recommend you follow the [Minimal Example](../../workflow-examples/minimal/) tutorial to get familiar with RAPIDS
    - In order to follow any of the previous tutorials, you will have to [Install](../installation/), [Configure](../configuration/), and learn how to [Execute](../execution/) RAPIDS.
--- a/docs/snippets/parsedfitbit_format.md
+++ b/docs/snippets/parsedfitbit_format.md
@ -43,9 +43,9 @@ All columns are mandatory; however, all except `device_id` and `local_date_time`
            |device_id                              |local_date_time   |heartrate_daily_restinghr |heartrate_daily_caloriesoutofrange  |heartrate_daily_caloriesfatburn  |heartrate_daily_caloriescardio  |heartrate_daily_caloriespeak   |
            |-------------------------------------- |----------------- |------- |-------------- |------------- |------------ |-------|
-            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-07 00:00:00  |72      |1200.6102      |760.3020      |15.2048      |0      |
+            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-07        |72      |1200.6102      |760.3020      |15.2048      |0      |
-            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-08 00:00:00  |70      |1100.1120      |660.0012      |23.7088      |0      |
+            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-08        |70      |1100.1120      |660.0012      |23.7088      |0      |
-            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-09 00:00:00  |69      |750.3615       |734.1516      |131.8579     |0      |
+            |a748ee1a-1d0b-4ae9-9074-279a2b6ba524   |2020-10-09        |69      |750.3615       |734.1516      |131.8579     |0      |
 ??? info "FITBIT_HEARTRATE_INTRADAY"
--- a/docs/team.md
+++ b/docs/team.md
@ -9,12 +9,16 @@ If you are interested in contributing feel free to submit a pull request or cont
 ??? abstract "About"
    Julio Vega is a postdoctoral associate at the Mobile Sensing + Health Institute. He is interested in personalized methodologies to monitor chronic conditions that affect daily human behavior using mobile and wearable data.
    - *vegaju* at *upmc* . *edu*
    - [Personal Website](https://juliovega.info/)
 ### Meng Li
 ??? abstract "About"
    Meng Li received her Master of Science degree in Information Science from the University of Pittsburgh. She is interested in applying machine learning algorithms to the medical field.
    - *lim11* at *upmc* . *edu*
    - [Linkedin Profile](https://www.linkedin.com/in/meng-li-57238414a)
    - [Github Profile](https://github.com/Meng6)
 ###  Abhineeth Reddy Kunta
@ -45,22 +49,10 @@ If you are interested in contributing feel free to submit a pull request or cont
 ### Nikunj Goel
 ??? abstract "About"
-    Nik is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science. He earned his Bachelor of Technology degree in Information Technology from India. He is a Data Enthusiast and passionate about finding the meaning out of raw data. In a long term, his goal is to create a breakthrough in Data Science and Deep Learning.
+    Nik is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science. He earned his Bachelor of Technology degree in Information Technology from India. He is a Data Enthusiasts and passionate about finding the meaning out of raw data. In a long term, his goal is to create a breakthrough in Data Science and Deep Learning.
    - [Linkedin Profile](https://www.linkedin.com/in/nikunjgoel95/)
 ### Kirtiraj Khandekar
 ??? abstract "About"
    Raj is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science.
 ### Weiyu Huang
 ??? abstract "About"
    Weiyu is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science.
    - [Github Profile](https://github.com/ChinW97)
 ## Community Contributors
 ### Agam Kumar
@ -96,19 +88,6 @@ If you are interested in contributing feel free to submit a pull request or cont
 ??? abstract "About"
    University of Virginia
 ###  Ian Barnett
 ??? abstract "About"
    University of Pennsylvania
    - [Profile](https://www.dbei.med.upenn.edu/bio/ian-j-barnett-phd)
 ###  Shirley Anugrah Hayati
 ??? abstract "About"
    University of Pennsylvania
    - [Personal Website](https://www.shirley.id/)
 ## Advisors
 ### Afsaneh Doryab
--- a/docs/visualizations/data-quality-visualizations.md
+++ b/docs/visualizations/data-quality-visualizations.md
@ -20,7 +20,7 @@ These plots can be used as a rough indication of the smartphone monitoring cover
 ## 2. Heatmaps of overall data yield
 These heatmaps are a breakdown per time segment and per participant of [Visualization 1](#1-histograms-of-phone-data-yield). The heatmap's rows represent participants, columns represent time segment instances, and the cells’ color represents the valid yielded minute or hour ratio for a participant during a time segment instance.
-As different participants might join a study on different dates and time segments can be of any length and start on any day, the x-axis can be labelled with the absolute time of each time segment instance or the time delta between each time segment instance and the start of the first instance for each participant. These plots provide a quick study overview of the monitoring coverage per person and per time segment.
+As different participants might join a study on different dates and time segments can be of any length and start on any day, the x-axis can be labelled with the absolute time of the start of each time segment instance or the time delta between the start of each time segment instance minus the start of the first instance. These plots provide a quick study overview of the monitoring coverage per person and per time segment. 
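
The relative x-axis described above amounts to subtracting the start of a participant's first time segment instance from the start of every later instance. A minimal sketch of that subtraction follows; the function name and the dictionary layout are assumptions, and the per-participant grouping is one reading of the text rather than a description of the plotting code.

```python
from datetime import datetime


def relative_segment_starts(starts_by_participant):
    """Map each participant to time deltas measured from their first segment start.

    starts_by_participant: dict of participant id -> list of datetime objects
    (hypothetical layout).
    """
    relative = {}
    for pid, starts in starts_by_participant.items():
        ordered = sorted(starts)
        # An empty list stays empty; otherwise the first delta is always zero.
        relative[pid] = [start - ordered[0] for start in ordered]
    return relative


starts = {
    "example01": [datetime(2020, 4, 23), datetime(2020, 4, 24)],
    "example02": [datetime(2020, 5, 1), datetime(2020, 5, 3)],
}
deltas = relative_segment_starts(starts)
```

With relative deltas, participants who joined the study on different dates line up on the same axis, which is what the relative-time heatmap relies on.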
 The figure below shows the heatmap of the valid yielded minute ratio for participants example01 and example02 on daily segments and, as we inferred from the previous histogram, the lighter (yellow) color on most time segment instances (cells) indicate both phones sensed data without interruptions for most days (except for the first and last ones).
@ -63,7 +63,7 @@ The figure below shows this heatmap for phone sensors collected by participant e
 ## 4. Heatmap of sensor row count
 These heatmaps are a per-sensor breakdown of [Visualization 1](#1-histograms-of-phone-data-yield) and [Visualization 2](#2-heatmaps-of-overall-data-yield). Note that the second row (ratio of valid yielded minutes) of this heatmap matches the respective participant (bottom) row of the screenshot in Visualization 2.
-In these heatmaps rows represent phone or Fitbit sensors, columns represent time segment instances and cell’s color shows the normalized (0 to 1) row count of each sensor within a time segment instance. A grey cell represents missing data in that time segment instance. RAPIDS creates one heatmap per participant and they can be used to judge missing data on a per participant and per sensor basis.
+In these heatmaps rows represent phone or Fitbit sensors, columns represent time segment instances and cell’s color shows the normalized (0 to 1) row count of each sensor within a time segment instance. RAPIDS creates one heatmap per participant and they can be used to judge missing data on a per participant and per sensor basis.
 The figure below shows data for 14 phone sensors (including data yield) of example01’s daily segments. From the top two rows, we can see that the phone was sensing data for most of the monitoring period (as suggested by Figure 3 and Figure 4). We can also infer how phone usage influenced the different sensor streams; there are peaks of screen events during the first day (Apr 23rd), peaks of location coordinates on Apr 26th and Apr 30th, and no sent or received SMS except for Apr 23rd, Apr 29th and Apr 30th (unlabeled row between screen and locations).
--- a/docs/analysis/complete-workflow-example.md
+++ b/docs/analysis/complete-workflow-example.md
@ -1,8 +1,8 @@
 # Analysis Workflow Example
 !!! info "TL;DR"
-    - In addition to using RAPIDS to extract behavioral features, create plots, and clean sensor features, you can structure your data analysis within RAPIDS (i.e. creating ML/statistical models and evaluating your models)
+    - In addition to using RAPIDS to extract behavioral features and create plots, you can structure your data analysis within RAPIDS (i.e. cleaning your features and creating ML/statistical models)
-    - We include an analysis example in RAPIDS that covers raw data processing, feature extraction, cleaning, machine learning modeling, and evaluation
+    - We include an analysis example in RAPIDS that covers raw data processing, cleaning, feature extraction, machine learning modeling, and evaluation
    - Use this example as a guide to structure your own analysis within RAPIDS
    - RAPIDS analysis workflows are compatible with your favorite data science tools and libraries
    - RAPIDS analysis workflows are reproducible and we encourage you to publish them along with your research papers
@ -52,7 +52,7 @@ Note you will see a lot of warning messages, you can ignore them since they happ
 ## Modules of our analysis workflow example
 ??? info "1. Feature extraction"
-    We extract daily behavioral features for data yield, received and sent messages, missed, incoming and outgoing calls, resample fused location data using Doryab provider, activity recognition, battery, Bluetooth, screen, light, applications foreground, conversations, Wi-Fi connected, Wi-Fi visible, Fitbit heart rate summary and intraday data, Fitbit sleep summary data, and Fitbit step summary and intraday data without excluding sleep periods with an active bout threshold of 10 steps. In total, we obtained 245 daily sensor features over 12 days per participant. 
+    We extract daily behavioral features for data yield, received and sent messages, missed, incoming and outgoing calls, resample fused location data using Doryab provider, activity recognition, battery, Bluetooth, screen, light, applications foreground, conversations, Wi-Fi connected, Wi-Fi visible, Fitbit heart rate summary and intraday data, Fitbit sleep summary data, and Fitbit step summary and intraday data without excluding sleep periods with an active bout threshold of 10 steps. In total, we obtained 237 daily sensor features over 12 days per participant. 
 ??? info "2. Extract demographic data."
    It is common to have demographic data in addition to mobile and target (ground truth) data. In this example we include participants’ age, gender and the number of days they spent in hospital after their surgery as features in our model. We extract these three columns from the `data/external/example_workflow/participant_info.csv` file. As these three features remain the same within participants, they are used only on the population model. Refer to the `demographic_features` rule in `rules/models.smk`.
@ -69,12 +69,12 @@ Note you will see a lot of warning messages, you can ignore them since they happ
 ??? info "6. Feature cleaning."
    In this stage we perform four steps to clean our sensor feature file. First, we discard days with a data yield hour ratio less than or equal to 0.75, i.e. we include days with at least 18 hours of data. Second, we drop columns (features) with more than 30% of missing rows. Third, we drop columns with zero variance. Fourth, we drop rows (days) with more than 30% of missing columns (features). In this cleaning stage several parameters are created and exposed in `example_profile/example_config.yaml`. 
-    After this step, we kept 173 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 117 features over 22 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stop researchers from collecting the same number of sensors as on Android phones.
+    After this step, we kept 158 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 106 features over 20 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stop researchers from collecting the same number of sensors as on Android phones.
    Feature cleaning for the individual models is done in the `clean_sensor_features_for_individual_participants` rule and for the population model in the `clean_sensor_features_for_all_participants` rule in `rules/models.smk`.
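
The four cleaning steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code behind the `clean_sensor_features_*` rules: the function name, the `yield_hour_ratio` column name, and the list-of-dicts layout (one dict per day, `None` for missing values) are hypothetical, and the yield comparison follows the "less than or equal to 0.75 is discarded" wording literally.

```python
def clean_features(rows, yield_col="yield_hour_ratio", min_yield=0.75, max_missing=0.3):
    """Apply the four cleaning steps to a list of per-day feature dicts."""

    def missing_ratio(values):
        return sum(v is None for v in values) / len(values)

    # Step 1: discard days whose data yield hour ratio is <= min_yield.
    rows = [r for r in rows if r.get(yield_col) is not None and r[yield_col] > min_yield]
    if not rows:
        return []
    cols = list(rows[0].keys())

    # Step 2: drop columns with more than max_missing of their rows missing.
    cols = [c for c in cols if missing_ratio([r[c] for r in rows]) <= max_missing]

    # Step 3: drop columns with zero variance (a single distinct observed value).
    cols = [c for c in cols if len({r[c] for r in rows if r[c] is not None}) > 1]

    # Step 4: drop days with more than max_missing of the remaining columns missing.
    cleaned = []
    for r in rows:
        values = [r[c] for c in cols]
        if values and missing_ratio(values) <= max_missing:
            cleaned.append({c: r[c] for c in cols})
    return cleaned


days = [
    {"yield_hour_ratio": 0.5, "a": 1, "b": 2, "c": None},
    {"yield_hour_ratio": 0.9, "a": 1, "b": 5, "c": None},
    {"yield_hour_ratio": 1.0, "a": 2, "b": 5, "c": None},
    {"yield_hour_ratio": 0.8, "a": 3, "b": 5, "c": 7},
]
cleaned = clean_features(days)
```

The order matters: columns are judged only on the days that survive step 1, and rows are judged in step 4 only on the columns that survive steps 2 and 3.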
 ??? info "7. Merge features and targets."
-    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/features.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/features.smk`. These two merged files are the input for our individual and population models. 
+    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/models.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/models.smk`. These two merged files are the input for our individual and population models. 
 ??? info "8. Modelling."
    This stage has three phases: model building, training and evaluation. 
--- a/docs/workflow-examples/minimal.md
+++ b/docs/workflow-examples/minimal.md
--- a/donotmakechanges.py
+++ b/donotmakechanges.py
@ -1,39 +0,0 @@
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
 # !
 """
 Please do not make any changes, as RAPIDS is running on tmux server ...
 """
 # !
--- a/environment.yml
+++ b/environment.yml
@ -1,30 +1,109 @@
-name: rapids
+name: rapids202012
 channels:
  - conda-forge
  - defaults
 dependencies:
-    - auto-sklearn
+  - _py-xgboost-mutex=2.0
-    - hmmlearn
+  - appdirs=1.4.4
-    - imbalanced-learn
+  - arrow=0.16.0
-    - jsonschema
+  - asn1crypto=1.4.0
-    - lightgbm
+  - astropy=4.2
-    - matplotlib
+  - attrs=20.3.0
-    - numpy
+  - binaryornot=0.4.4
-    - pandas
+  - blas=1.0
-    - peakutils
+  - brotlipy=0.7.0
-    - pip
+  - bzip2=1.0.8
-    - plotly
+  - ca-certificates=2020.12.8
-    - python-dateutil
+  - certifi=2020.12.5
-    - pytz
+  - cffi=1.14.4
-    - pywavelets
+  - chardet=3.0.4
-    - pyyaml
+  - click=7.1.2
-    - scikit-learn
+  - cookiecutter=1.6.0
-    - scipy
+  - cryptography=3.3.1
-    - seaborn
+  - datrie=0.8.2
-    - setuptools
+  - docutils=0.16
-    - bioconda::snakemake 
+  - future=0.18.2
-    - bioconda::snakemake-minimal
+  - gitdb=4.0.5
-    - tqdm
+  - gitdb2=4.0.2
-    - xgboost
+  - gitpython=3.1.11
-    - pip:
+  - idna=2.10
-        - biosppy
+  - imbalanced-learn=0.6.2
-        - cr_features>=0.2
+  - importlib-metadata=2.0.0
  - importlib_metadata=2.0.0
  - intel-openmp=2019.4
  - jinja2=2.11.2
  - jinja2-time=0.2.0
  - joblib=1.0.0
  - jsonschema=3.2.0
  - libblas=3.8.0
  - libcblas=3.8.0
  - libcxx=10.0.0
  - libedit=3.1.20191231
  - libffi=3.3
  - libgfortran
  - liblapack=3.8.0
  - libxgboost=0.90
  - lightgbm=3.1.1
  - llvm-openmp=10.0.0
  - markupsafe=1.1.1
  - mkl
  - mkl-service=2.3.0
  - mkl_fft=1.2.0
  - mkl_random=1.1.1
  - more-itertools=8.6.0
  - ncurses=6.2
  - numpy=1.19.2
  - numpy-base=1.19.2
  - openssl=1.1.1i
  - pandas=1.1.5
  - pbr=5.5.1
  - pip=20.3.3
  - plotly=4.14.1
  - poyo=0.5.0
  - psutil=5.7.2
  - py-xgboost=0.90
  - pycparser=2.20
  - pyerfa=1.7.1.1
  - pyopenssl=20.0.1
  - pysocks=1.7.1
  - python=3.7.9
  - python-dateutil=2.8.1
  - python_abi=3.7
  - pytz=2020.4
  - pyyaml=5.3.1
  - readline=8.0
  - requests=2.25.0
  - retrying=1.3.3
  - scikit-learn=0.23.2
  - scipy=1.5.2
  - setuptools=51.0.0
  - six=1.15.0
  - smmap=3.0.4
  - smmap2=3.0.1
  - sqlite=3.33.0
  - threadpoolctl=2.1.0
  - tk=8.6.10
  - urllib3=1.25.11
  - wheel=0.36.2
  - whichcraft=0.6.1
  - wrapt=1.12.1
  - xgboost=0.90
  - xz=5.2.5
  - yaml=0.2.5
  - zipp=3.4.0
  - zlib=1.2.11
  - pip:
    - amply==0.1.4
    - configargparse==0.15.1
    - decorator==4.4.2
    - ipython-genutils==0.2.0
    - jupyter-core==4.6.3
    - nbformat==5.0.7
    - pulp==2.4
    - pyparsing==2.4.7
    - pyrsistent==0.15.5
    - ratelimiter==1.2.0.post0
    - snakemake==5.30.2
    - toposort==1.5
    - traitlets==4.3.3
 prefix: /usr/local/Caskroom/miniconda/base/envs/rapids202012
--- a/example_profile/Snakefile
+++ b/example_profile/Snakefile
@ -3,7 +3,7 @@ include: "../rules/common.smk"
 include: "../rules/renv.smk"
 include: "../rules/preprocessing.smk"
 include: "../rules/features.smk"
-include: "../rules/models_example.smk"
+include: "../rules/models.smk"
 include: "../rules/reports.smk"
 import itertools
@ -204,29 +204,15 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys():
            else:
                raise ValueError("Error: Add PHONE_LOCATIONS (and as many PHONE_SENSORS as you have) to [PHONE_DATA_YIELD][SENSORS] in config.yaml. This is necessary to compute phone_yielded_timestamps (time when the smartphone was sensing data) which is used to resample fused location data (ALL_RESAMPLED and RESAMPLED_FUSED)")
        if provider == "BARNETT":
            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_barnett_daily.csv", pid=config["PIDS"]))
        if provider == "DORYAB":
            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/phone_locations_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_home.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_LOCATIONS"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/phone_locations.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
 for provider in config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"].keys():
    if config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_calories_intraday_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_calories_intraday_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/fitbit_calories_intraday_features/fitbit_calories_intraday_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
        files_to_compute.extend(expand("data/processed/features/{pid}/fitbit_calories_intraday.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
 for provider in config["FITBIT_DATA_YIELD"]["PROVIDERS"].keys():
    if config["FITBIT_DATA_YIELD"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_heartrate_intraday_raw.csv", pid=config["PIDS"]))
@ -285,12 +271,6 @@ for provider in config["FITBIT_STEPS_SUMMARY"]["PROVIDERS"].keys():
 for provider in config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"].keys():
    if config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]:
        if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["TIME_BASED"]["EXCLUDE"] or config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
            if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
                files_to_compute.extend(expand("data/raw/{pid}/fitbit_sleep_summary_raw.csv", pid=config["PIDS"]))
            files_to_compute.extend(expand("data/interim/{pid}/fitbit_steps_intraday_with_datetime_exclude_sleep.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_steps_intraday_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/raw/{pid}/fitbit_steps_intraday_with_datetime.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand("data/interim/{pid}/fitbit_steps_intraday_features/fitbit_steps_intraday_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
@ -377,21 +357,11 @@ if config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_sensor_row_count_per_time_segment.html")
 if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
    if not config["PHONE_DATA_YIELD"]["PROVIDERS"]["RAPIDS"]["COMPUTE"]:
        raise ValueError("Error: [PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] must be True in config.yaml to get heatmaps of overall data yield.")
    files_to_compute.append("reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html")
 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
    files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")
 # Data Cleaning
 for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() +".csv", pid=config["PIDS"]))
 for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
        files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +".csv"))
 # Analysis Workflow Example
 models, scalers = [], []
 for model_name in config["PARAMS_FOR_ANALYSIS"]["MODEL_NAMES"]:
@ -409,6 +379,7 @@ files_to_compute.extend(expand("data/raw/{pid}/participant_target_with_datetime.
 files_to_compute.extend(expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]))
 # Individual model
 files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned.csv", pid=config["PIDS"]))
 files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
 files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv", pid=config["PIDS"], cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
 files_to_compute.extend(expand(
@ -421,6 +392,7 @@ files_to_compute.extend(expand(
                            scaler=scalers))
 # Population model
 files_to_compute.append("data/processed/features/all_participants/all_sensor_features_cleaned.csv")
 files_to_compute.append("data/processed/models/population_model/input.csv")
 files_to_compute.extend(expand("data/processed/models/population_model/output_{cv_method}/baselines.csv", cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
 files_to_compute.extend(expand(
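Reviewer note on the Snakefile hunks above: each provider block follows the same pattern. If the provider's COMPUTE flag is set, one path per participant is appended to files_to_compute for every pipeline stage. The sketch below reproduces that pattern without Snakemake. `expand_paths` is a hypothetical stand-in for Snakemake's `expand()` (it is not the library function), and the two-participant config is invented for illustration.

```python
from itertools import product


def expand_paths(pattern, **wildcards):
    """Hypothetical stand-in for Snakemake's expand(): fill the pattern with
    every combination of the given wildcard values."""
    names = list(wildcards)
    # Accept either a single string or a list of values for each wildcard
    value_lists = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [pattern.format(**dict(zip(names, combo))) for combo in product(*value_lists)]


# Invented config for illustration; the real one is loaded from config.yaml
config = {
    "PIDS": ["p01", "p02"],
    "FITBIT_CALORIES_INTRADAY": {"PROVIDERS": {"RAPIDS": {"COMPUTE": True}}},
}

files_to_compute = []
for provider, settings in config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"].items():
    if settings["COMPUTE"]:
        # One raw file and one feature file per participant
        files_to_compute.extend(expand_paths("data/raw/{pid}/fitbit_calories_intraday_raw.csv", pid=config["PIDS"]))
        files_to_compute.extend(expand_paths("data/processed/features/{pid}/fitbit_calories_intraday.csv", pid=config["PIDS"]))
        # The merged file is a single path, so it is appended rather than expanded
        files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
```

Because the same merged path is appended once per enabled provider, the list can contain duplicates. This sketch assumes, as the Snakefile appears to, that duplicate targets are harmless.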
--- a/example_profile/example_config.yaml
+++ b/example_profile/example_config.yaml
@ -84,7 +84,6 @@ PHONE_APPLICATIONS_CRASHES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
    PACKAGE_NAMES_HASHED: False
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
@ -94,13 +93,11 @@ PHONE_APPLICATIONS_FOREGROUND:
  APPLICATION_CATEGORIES:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    PACKAGE_NAMES_HASHED: False
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS:
    RAPIDS:
      COMPUTE: True
      INCLUDE_EPISODE_FEATURES: False
      SINGLE_CATEGORIES: ["all", "email"]
      MULTIPLE_CATEGORIES:
        social: ["socialnetworks", "socialmediatools"]
@ -108,11 +105,7 @@ PHONE_APPLICATIONS_FOREGROUND:
      SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
      EXCLUDED_CATEGORIES: ["system_apps"]
      EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
-      FEATURES:
+      FEATURES: ["count", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
        APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
        APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
      IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
      IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable
      SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py
 # See https://www.rapids.science/latest/features/phone-applications-notifications/
@ -122,7 +115,6 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
    CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
    CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
    UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
    PACKAGE_NAMES_HASHED: False
    SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
@ -206,11 +198,7 @@ PHONE_DATA_YIELD:
 # See https://www.rapids.science/latest/features/phone-keyboard/
 PHONE_KEYBOARD:
  CONTAINER: keyboard
-  PROVIDERS:
+  PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
    RAPIDS:
      COMPUTE: False
      FEATURES: ["sessioncount","averageinterkeydelay","averagesessionlength","changeintextlengthlessthanminusone","changeintextlengthequaltominusone","changeintextlengthequaltoone","changeintextlengthmorethanone","maxtextlength","lastmessagelength","totalkeyboardtouches"]
      SRC_SCRIPT: src/features/phone_keyboard/rapids/main.py
 # See https://www.rapids.science/latest/features/phone-light/
 PHONE_LIGHT:
@ -227,12 +215,12 @@ PHONE_LOCATIONS:
  LOCATIONS_TO_USE: FUSED_RESAMPLED # ALL, GPS, ALL_RESAMPLED, OR FUSED_RESAMPLED
  FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD: 30 # minutes, only replicate location samples to the next sensed bin if the phone did not stop collecting data for more than this threshold
  FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION: 720 # minutes, only replicate location samples to consecutive sensed bins if they were logged within this threshold after a valid location row
  ACCURACY_LIMIT: 51 # meters
  PROVIDERS:
    DORYAB:
      COMPUTE: True
      FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome"]
      ACCURACY_LIMIT: 51 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      DBSCAN_EPS: 10 # meters
      DBSCAN_MINSAMPLES: 5
      THRESHOLD_STATIC : 1 # km/h
@ -248,6 +236,7 @@ PHONE_LOCATIONS:
    BARNETT:
      COMPUTE: False
      FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
      ACCURACY_LIMIT: 51 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
      IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
      MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates grouped by minute) were used to compute features
      SRC_SCRIPT: src/features/phone_locations/barnett/main.R
@ -524,7 +513,7 @@ HEATMAP_SENSORS_PER_MINUTE_PER_TIME_SEGMENT:
 # See https://www.rapids.science/latest/visualizations/data-quality-visualizations/#4-heatmap-of-sensor-row-count
 HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT:
-  PLOT: False
+  PLOT: True
  SENSORS:  [PHONE_ACTIVITY_RECOGNITION, PHONE_APPLICATIONS_FOREGROUND, PHONE_BATTERY, PHONE_BLUETOOTH, PHONE_CALLS, PHONE_CONVERSATION, PHONE_LIGHT, PHONE_LOCATIONS, PHONE_MESSAGES, PHONE_SCREEN, PHONE_WIFI_CONNECTED, PHONE_WIFI_VISIBLE]
 # Features ------
@ -537,46 +526,6 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
  CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}
 ########################################################################################################################
 #                                                    Data Cleaning                                                     #
 ########################################################################################################################
 ALL_CLEANING_INDIVIDUAL:
  PROVIDERS:
    RAPIDS:
      COMPUTE: True
      IMPUTE_SELECTED_EVENT_FEATURES:
        COMPUTE: False
        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
      COLS_VAR_THRESHOLD: True
      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
      DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
      DROP_HIGHLY_CORRELATED_FEATURES:
        COMPUTE: False
        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
        CORR_THRESHOLD: 0.95
      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
 ALL_CLEANING_OVERALL:
  PROVIDERS:
    RAPIDS:
      COMPUTE: True
      IMPUTE_SELECTED_EVENT_FEATURES:
        COMPUTE: False
        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
      COLS_VAR_THRESHOLD: True
      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
      DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
      DROP_HIGHLY_CORRELATED_FEATURES:
        COMPUTE: False
        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
        CORR_THRESHOLD: 0.95
      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
 ########################################################################################################################
 #                                              Analysis Workflow Example                                               #
@ -595,6 +544,12 @@ PARAMS_FOR_ANALYSIS:
    FOLDER: data/external/example_workflow
    CONTAINER: participant_target.csv
  # Cleaning Parameters
  COLS_NAN_THRESHOLD: 0.3
  COLS_VAR_THRESHOLD: True
  ROWS_NAN_THRESHOLD: 0.3
  DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.75
  MODEL_NAMES: [LogReg, kNN , SVM, DT, RF, GB, XGBoost, LightGBM]
  CV_METHODS: [LeaveOneOut]
  RESULT_COMPONENTS: [fold_predictions, fold_metrics, overall_results, fold_feature_importances]
--- a/mkdocs.yml
+++ b/mkdocs.yml
@ -74,7 +74,7 @@ extra_css:
 nav:
  - Home: 'index.md'
  - Overview: setup/overview.md
-  - Minimal Example: analysis/minimal.md
+  - Minimal Example: workflow-examples/minimal.md
  - Citation: citation.md
  - Contributing: contributing.md
  - Setup:
@ -85,7 +85,6 @@ nav:
      - Introduction: datastreams/data-streams-introduction.md
      - Phone:
        - aware_mysql: datastreams/aware-mysql.md
        - aware_micro_mysql: datastreams/aware-micro-mysql.md
        - aware_csv: datastreams/aware-csv.md
        - aware_influxdb (beta): datastreams/aware-influxdb.md
        - Mandatory Phone Format: datastreams/mandatory-phone-format.md
@ -141,9 +140,8 @@ nav:
  - Visualizations:
    - Data Quality: visualizations/data-quality-visualizations.md
    - Features: visualizations/feature-visualizations.md
-  - Analysis:
+  - Analysis Workflows:
-    - Data Cleaning: analysis/data-cleaning.md
+    - Complete Example: workflow-examples/analysis.md
    - Complete Workflow Example: analysis/complete-workflow-example.md
  - Developers:
    - Git Flow: developers/git-flow.md
    - Remote Support: developers/remote-support.md
--- a/problems/app_categories.txt
+++ b/problems/app_categories.txt
@ -1,33 +0,0 @@
 Warning: 1241 parsing failures.
 row           col   expected actual                                                                            file
  1 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
  2 is_system_app an integer  FALSE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
  3 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
  4 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
  5 is_system_app an integer  TRUE  'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
 ... ............. .......... ...... ...............................................................................
 See problems(...) for more details.
 Warning message:
 The following named parsers don't match the column names: application_name
 Error: Problem with `filter()` input `..1`.
 ✖ object 'application_name' not found
 ℹ Input `..1` is `!is.na(application_name)`.
 Backtrace:
     █
  1. ├─`%>%`(...)
  2. ├─dplyr::mutate(...)
  3. ├─utils::head(., -1)
  4. ├─dplyr::select(., -c("timestamp"))
  5. ├─dplyr::filter(., !is.na(application_name))
  6. ├─dplyr:::filter.data.frame(., !is.na(application_name))
  7. │ └─dplyr:::filter_rows(.data, ...)
  8. │   ├─base::withCallingHandlers(...)
  9. │   └─mask$eval_all_filter(dots, env_filter)
 10. └─base::.handleSimpleError(...)
 11.   └─dplyr:::h(simpleError(msg, call))
 Execution halted
 [Mon Dec 13 17:19:06 2021]
 Error in rule app_episodes:
    jobid: 54
    output: data/interim/p011/phone_app_episodes.csv
--- a/problems/locations_barnett.txt
+++ b/problems/locations_barnett.txt
@ -1,5 +0,0 @@
 Warning message:
 In barnett_daily_features(snakemake) :
  Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:
 Location data rows within a daily time segment: 0
 Location data time span in days: 398.6
--- a/renv.lock
+++ b/renv.lock
--- a/renv/activate.R
+++ b/renv/activate.R
@ -15,6 +15,9 @@ local({
  Sys.setenv("RENV_R_INITIALIZING" = "true")
  on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE)
  if(grepl("Darwin", Sys.info()["sysname"], fixed = TRUE) & grepl("ARM64", Sys.info()["version"], fixed = TRUE)) # M1 Macs
    Sys.setenv("TZDIR" = file.path(R.home(), "share", "zoneinfo"))
  # signal that we've consented to use renv
  options(renv.consent = TRUE)
--- a/rules/common.smk
+++ b/rules/common.smk
@ -23,16 +23,10 @@ def get_barnett_daily(wildcards):
 def get_locations_python_input(wildcards):
    if wildcards.provider_key.upper() == "DORYAB":
-        return "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv"
+        return "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv"
    else:
        return "data/interim/{pid}/phone_locations_processed_with_datetime.csv"
 def get_calls_input(wildcards):
    if (wildcards.provider_key.upper() == "RAPIDS") and (config["PHONE_CALLS"]["PROVIDERS"]["RAPIDS"]["FEATURES_TYPE"] == "EPISODES"):
        return "data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv"
    else:
        return "data/raw/{pid}/phone_calls_with_datetime.csv"
 def find_features_files(wildcards):
    feature_files = []
    for provider_key, provider in config[(wildcards.sensor_key).upper()]["PROVIDERS"].items():
@ -40,17 +34,6 @@ def find_features_files(wildcards):
            feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower()))
    return(feature_files)
 def find_joint_non_empatica_sensor_files(wildcards):
    joined_files = []
    for config_key in config.keys():
        if config_key.startswith(("PHONE", "FITBIT")) and "PROVIDERS" in config[config_key] and isinstance(config[config_key]["PROVIDERS"], dict):
            for provider_key, provider in config[config_key]["PROVIDERS"].items():
                if "COMPUTE" in provider.keys() and provider["COMPUTE"]:
                    joined_files.append("data/processed/features/{pid}/" + config_key.lower() + ".csv")
                    break
    return joined_files
 def optional_steps_sleep_input(wildcards):
    if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
        return "data/raw/{pid}/fitbit_sleep_summary_raw.csv"
@ -125,16 +108,7 @@ def input_tzcodes_file(wilcards):
        if not config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"].lower().endswith(".csv"):
            raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, instead you typed: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
        if not Path(config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]).exists():
-            try:
+            raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, the file in the path you typed does not exist: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
                config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
            except KeyError:
                raise ValueError("To create TZCODES_FILE, a list of timezones should be created " +
                                 "with the rule preprocessing.smk/prepare_tzcodes_file " +
                                 "which will create a file specified as config['TIMEZONE']['MULTIPLE']['TZ_FILE']." +
                                 "\n An alternative is to provide the file manually:" +
                                 "[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file," +
                                 "but the file in the path you typed does not exist: " +
                                 config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
        return [config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]]
    return []
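Reviewer note on the input_tzcodes_file hunk above: on the "+" side, the nested try/except is replaced by a direct error, so validation reduces to two checks. The path must end in .csv and the file must exist. A minimal sketch of that behaviour follows, assuming the function only needs the path string. `check_tzcodes_file` is a hypothetical name and not part of rules/common.smk.

```python
from pathlib import Path


def check_tzcodes_file(tzcodes_file):
    """Hypothetical helper mirroring the two checks applied to
    [TIMEZONE][MULTIPLE][TZCODES_FILE] on the '+' side of the hunk."""
    # Check 1: the configured path must name a CSV file
    if not tzcodes_file.lower().endswith(".csv"):
        raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, instead you typed: " + tzcodes_file)
    # Check 2: the file must already exist; nothing is generated on the fly
    if not Path(tzcodes_file).exists():
        raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, the file in the path you typed does not exist: " + tzcodes_file)
    # Snakemake input functions return a list of paths
    return [tzcodes_file]
```

With this version a missing file always stops the workflow with a ValueError. The "-" side instead looked up TZ_FILE so that a separate rule could create the file.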
--- a/rules/features.smk
+++ b/rules/features.smk
@ -125,7 +125,6 @@ rule phone_applications_crashes_r_features:
 rule phone_applications_foreground_python_features:
    input:
        sensor_data = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv",
        episode_data = "data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -139,7 +138,6 @@ rule phone_applications_foreground_python_features:
 rule phone_applications_foreground_r_features:
    input:
        sensor_data = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv",
        episode_data = "data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -264,17 +262,9 @@ rule phone_bluetooth_r_features:
    script:
        "../src/features/entry.R"
-rule calls_episodes:
+rule calls_python_features:
    input:
-        calls = "data/raw/{pid}/phone_calls_raw.csv"
+        sensor_data = "data/raw/{pid}/phone_calls_with_datetime.csv",
    output:
        "data/interim/{pid}/phone_calls_episodes.csv"
    script:
        "../src/features/phone_calls/episodes/calls_episodes.py"
 rule phone_calls_python_features:
    input:
        sensor_data = get_calls_input,
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_CALLS"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -285,9 +275,9 @@ rule phone_calls_python_features:
    script:
        "../src/features/entry.py"
-rule phone_calls_r_features:
+rule calls_r_features:
    input:
-        sensor_data = get_calls_input,
+        sensor_data = "data/raw/{pid}/phone_calls_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_CALLS"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -324,40 +314,6 @@ rule conversation_r_features:
    script:
        "../src/features/entry.R"
 rule preprocess_esm:
    input: "data/raw/{pid}/phone_esm_with_datetime.csv"
    params:
        scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"]["STRAW"]["SCALES"]
    output: "data/interim/{pid}/phone_esm_clean.csv"
    script:
        "../src/features/phone_esm/straw/preprocess.py"
 rule esm_features:
    input:
        sensor_data = "data/interim/{pid}/phone_esm_clean.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()],
        provider_key = "{provider_key}",
        sensor_key = "phone_esm",
        scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()]["SCALES"]
    output: "data/interim/{pid}/phone_esm_features/phone_esm_python_{provider_key}.csv"
    script:
        "../src/features/entry.py"
 rule phone_speech_python_features:
    input:
        sensor_data = "data/raw/{pid}/phone_speech_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
        provider = lambda wildcards: config["PHONE_SPEECH"]["PROVIDERS"][wildcards.provider_key.upper()],
        provider_key = "{provider_key}",
        sensor_key = "phone_speech"
    output: 
        "data/interim/{pid}/phone_speech_features/phone_speech_python_{provider_key}.csv"
    script:
        "../src/features/entry.py"
 rule phone_keyboard_python_features:
    input:
        sensor_data = "data/raw/{pid}/phone_keyboard_with_datetime.csv",
@ -416,7 +372,7 @@ rule phone_locations_add_doryab_extra_columns:
    params:
        provider = config["PHONE_LOCATIONS"]["PROVIDERS"]["DORYAB"]
    output: 
-        "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv"
+        "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv"
    script:
        "../src/features/phone_locations/doryab/add_doryab_extra_columns.py"
@ -492,15 +448,6 @@ rule screen_episodes:
    script:
        "../src/features/phone_screen/episodes/screen_episodes.R"
 rule app_episodes:
    input:
        screen = "data/interim/{pid}/phone_screen_episodes.csv",
        app = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv"
    output:
        "data/interim/{pid}/phone_app_episodes.csv"
    script:
        "../src/features/phone_applications_foreground/episodes/app_episodes.R"
 rule phone_screen_python_features:
    input:
        sensor_episodes = "data/interim/{pid}/phone_screen_episodes_resampled_with_datetime.csv",
@ -795,6 +742,22 @@ rule fitbit_sleep_intraday_r_features:
    script:
        "../src/features/entry.R"
 rule merge_sensor_features_for_individual_participants:
    input:
        feature_files = input_merge_sensor_features_for_individual_participants
    output:
        "data/processed/features/{pid}/all_sensor_features.csv"
    script:
        "../src/features/utils/merge_sensor_features_for_individual_participants.R"
 rule merge_sensor_features_for_all_participants:
    input:
        feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
    output:
        "data/processed/features/all_participants/all_sensor_features.csv"
    script:
        "../src/features/utils/merge_sensor_features_for_all_participants.R"
 rule empatica_accelerometer_python_features:
    input:
        sensor_data = "data/raw/{pid}/empatica_accelerometer_with_datetime.csv",
@ -804,8 +767,7 @@ rule empatica_accelerometer_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_accelerometer"
    output:
-        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv"
        "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"
@ -831,8 +793,7 @@ rule empatica_heartrate_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_heartrate"
    output:
-        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv"
        "data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"
@ -858,8 +819,7 @@ rule empatica_temperature_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_temperature"
    output:
-        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv"
        "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"
@ -885,8 +845,7 @@ rule empatica_electrodermal_activity_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_electrodermal_activity"
    output:
-        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv"
        "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"
@ -912,8 +871,7 @@ rule empatica_blood_volume_pulse_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_blood_volume_pulse"
    output:
-        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv"
        "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"
@ -939,8 +897,7 @@ rule empatica_inter_beat_interval_python_features:
        provider_key = "{provider_key}",
        sensor_key = "empatica_inter_beat_interval"
    output:
-        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv",
+        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv"
        "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}_windows.csv"
    script:
        "../src/features/entry.py"
@ -982,48 +939,3 @@ rule empatica_tags_r_features:
        "data/interim/{pid}/empatica_tags_features/empatica_tags_r_{provider_key}.csv"
    script:
        "../src/features/entry.R"
 rule merge_sensor_features_for_individual_participants:
    input:
        feature_files = input_merge_sensor_features_for_individual_participants
    output:
        "data/processed/features/{pid}/all_sensor_features.csv"
    script:
        "../src/features/utils/merge_sensor_features_for_individual_participants.R"
 rule merge_sensor_features_for_all_participants:
    input:
        feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
    output:
        "data/processed/features/all_participants/all_sensor_features.csv"
    script:
        "../src/features/utils/merge_sensor_features_for_all_participants.R"
 rule clean_sensor_features_for_individual_participants:
    input:
        sensor_data = rules.merge_sensor_features_for_individual_participants.output
    wildcard_constraints:
        pid = "("+"|".join(config["PIDS"])+")"
    params:
        provider = lambda wildcards: config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][wildcards.provider_key.upper()],
        provider_key = "{provider_key}",
        script_extension = "{script_extension}",
        sensor_key = "all_cleaning_individual" 
    output:
        "data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv" 
    script:
        "../src/features/entry.{params.script_extension}"
 rule clean_sensor_features_for_all_participants:
    input:
        sensor_data = rules.merge_sensor_features_for_all_participants.output
    params:
        provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()],
        provider_key = "{provider_key}",
        script_extension = "{script_extension}",
        sensor_key = "all_cleaning_overall",
        target = "{target}"
    output:
        "data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}_({target}).csv"
    script:
        "../src/features/entry.{params.script_extension}"
--- a/rules/models.smk
+++ b/rules/models.smk
@ -1,52 +1,165 @@
-rule merge_baseline_data:
+rule download_demographic_data:
    input:
        data = expand(config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/{container}", container=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["CONTAINER"])
    output:
        "data/raw/baseline_merged.csv"
    script:
        "../src/data/merge_baseline_data.py"
 rule download_baseline_data:
    input:
        participant_file = "data/external/participant_files/{pid}.yaml",
-        data = "data/raw/baseline_merged.csv"
+        data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
    output:
-        "data/raw/{pid}/participant_baseline_raw.csv"
+        "data/raw/{pid}/participant_info_raw.csv"
    script:
-        "../src/data/download_baseline_data.py"
+        "../src/data/workflow_example/download_demographic_data.R"
-rule baseline_features:
+rule demographic_features:
    input:
-        "data/raw/{pid}/participant_baseline_raw.csv"
+        participant_info = "data/raw/{pid}/participant_info_raw.csv"
    params:
-        pid="{pid}",
+        pid = "{pid}",
-        features=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FEATURES"],
+        features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
        question_filename=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["QUESTION_LIST"]
    output:
-        interim="data/interim/{pid}/baseline_questionnaires.csv",
+        "data/processed/features/{pid}/demographic_features.csv"
        features="data/processed/features/{pid}/baseline_features.csv"
    script:
-        "../src/data/baseline_features.py"
+        "../src/features/workflow_example/demographic_features.py"
-rule select_target:
+rule download_target_data:
    input:
-        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv"
+        participant_file = "data/external/participant_files/{pid}.yaml",
        data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
    output:
        "data/raw/{pid}/participant_target_raw.csv"
    script:
        "../src/data/workflow_example/download_target_data.R"
 rule target_readable_datetime:
    input:
        sensor_input = "data/raw/{pid}/participant_target_raw.csv",
        time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
        pid_file = "data/external/participant_files/{pid}.yaml",
        tzcodes_file = input_tzcodes_file,
    params:
-        target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"]
+        device_type = "fitbit",
        timezone_parameters = config["TIMEZONE"],
        pid = "{pid}",
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
    output:
        "data/raw/{pid}/participant_target_with_datetime.csv"
    script:
        "../src/data/datetime/readable_datetime.R"
 rule parse_targets:
    input:
        targets = "data/raw/{pid}/participant_target_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    output:
        "data/processed/targets/{pid}/parsed_targets.csv"
    script:
        "../src/models/workflow_example/parse_targets.py"
 rule clean_sensor_features_for_individual_participants:
    input:
        rules.merge_sensor_features_for_individual_participants.output
    params:
        cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
        cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
        rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
        data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
    output:
        "data/processed/features/{pid}/all_sensor_features_cleaned.csv"
    script:
        "../src/models/workflow_example/clean_sensor_features.R"
 rule clean_sensor_features_for_all_participants:
    input:
        rules.merge_sensor_features_for_all_participants.output
    params:
        cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
        cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
        rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
        data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
    output:
        "data/processed/features/all_participants/all_sensor_features_cleaned.csv"
    script:
        "../src/models/workflow_example/clean_sensor_features.R"
 rule merge_features_and_targets_for_individual_model:
    input:
        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned.csv",
        targets = "data/processed/targets/{pid}/parsed_targets.csv",
    output:
        "data/processed/models/individual_model/{pid}/input.csv"
    script:
-        "../src/models/select_targets.py"
+        "../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
 rule merge_features_and_targets_for_population_model:
    input:
-        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_({target}).csv",
+        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned.csv",
-        demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]),
+        demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
-    params:
+        targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
        target_variable="{target}"
    output:
-        "data/processed/models/population_model/input_{target}.csv"
+        "data/processed/models/population_model/input.csv"
    script:
-        "../src/models/merge_features_and_targets_for_population_model.py"
+        "../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
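Note on the `expand(...)` inputs in the rule above: each call turns one path pattern into one concrete path per participant in `config["PIDS"]`. The sketch below is a plain-Python stand-in for that behaviour, not Snakemake's implementation. The helper name `expand_paths` and the participant ids `p01`/`p02` are hypothetical and chosen only for illustration.

```python
from itertools import product


def expand_paths(pattern, **wildcards):
    """Fill a path pattern with every combination of the given wildcard values.

    Hypothetical helper that imitates the default (product) behaviour of
    Snakemake's expand(); it is not part of Snakemake itself.
    """
    keys = list(wildcards)
    combos = product(*(wildcards[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


# One demographic feature file per participant (ids are made up).
demographic_files = expand_paths(
    "data/processed/features/{pid}/demographic_features.csv",
    pid=["p01", "p02"],
)
print(demographic_files)
```

With two participants the population-model rule therefore waits for two demographic feature files and two parsed target files before it can run.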
 rule baselines_for_individual_model:
    input:
        "data/processed/models/individual_model/{pid}/input.csv"
    params:
        cv_method = "{cv_method}",
        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
    output:
        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
    log:
        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
    script:
        "../src/models/workflow_example/baselines.py"
 rule baselines_for_population_model:
    input:
        "data/processed/models/population_model/input.csv"
    params:
        cv_method = "{cv_method}",
        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
    output:
        "data/processed/models/population_model/output_{cv_method}/baselines.csv"
    log:
        "data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
    script:
        "../src/models/workflow_example/baselines.py"
 rule modelling_for_individual_participants:
    input:
        data = "data/processed/models/individual_model/{pid}/input.csv"
    params:
        model = "{model}",
        cv_method = "{cv_method}",
        scaler = "{scaler}",
        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
    output:
        fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
        fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
        overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
        fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
    log:
        "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
    script:
        "../src/models/workflow_example/modelling.py"
 rule modelling_for_all_participants:
    input:
        data = "data/processed/models/population_model/input.csv"
    params:
        model = "{model}",
        cv_method = "{cv_method}",
        scaler = "{scaler}",
        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
    output:
        fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
        fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
        overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
        fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
    log:
        "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
    script:
        "../src/models/workflow_example/modelling.py"
--- a/rules/models_example.smk
+++ b/rules/models_example.smk
@ -1,139 +0,0 @@
 rule download_demographic_data:
    input:
        participant_file = "data/external/participant_files/{pid}.yaml",
        data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
    output:
        "data/raw/{pid}/participant_info_raw.csv"
    script:
        "../src/data/workflow_example/download_demographic_data.R"
 rule demographic_features:
    input:
        participant_info = "data/raw/{pid}/participant_info_raw.csv"
    params:
        pid = "{pid}",
        features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
    output:
        "data/processed/features/{pid}/demographic_features.csv"
    script:
        "../src/features/workflow_example/demographic_features.py"
 rule download_target_data:
    input:
        participant_file = "data/external/participant_files/{pid}.yaml",
        data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
    output:
        "data/raw/{pid}/participant_target_raw.csv"
    script:
        "../src/data/workflow_example/download_target_data.R"
 rule target_readable_datetime:
    input:
        sensor_input = "data/raw/{pid}/participant_target_raw.csv",
        time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
        pid_file = "data/external/participant_files/{pid}.yaml",
        tzcodes_file = input_tzcodes_file,
    params:
        device_type = "fitbit",
        timezone_parameters = config["TIMEZONE"],
        pid = "{pid}",
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
    output:
        "data/raw/{pid}/participant_target_with_datetime.csv"
    script:
        "../src/data/datetime/readable_datetime.R"
 rule parse_targets:
    input:
        targets = "data/raw/{pid}/participant_target_with_datetime.csv",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    output:
        "data/processed/targets/{pid}/parsed_targets.csv"
    script:
        "../src/models/workflow_example/parse_targets.py"
 rule merge_features_and_targets_for_individual_model:
    input:
        cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_rapids.csv",
        targets = "data/processed/targets/{pid}/parsed_targets.csv",
    output:
        "data/processed/models/individual_model/{pid}/input.csv"
    script:
        "../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
 rule merge_features_and_targets_for_population_model:
    input:
        cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_rapids.csv",
        demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
        targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
    output:
        "data/processed/models/population_model/input.csv"
    script:
        "../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
 rule baselines_for_individual_model:
    input:
        "data/processed/models/individual_model/{pid}/input.csv"
    params:
        cv_method = "{cv_method}",
        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
    output:
        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
    log:
        "data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
    script:
        "../src/models/workflow_example/baselines.py"
 rule baselines_for_population_model:
    input:
        "data/processed/models/population_model/input.csv"
    params:
        cv_method = "{cv_method}",
        colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
    output:
        "data/processed/models/population_model/output_{cv_method}/baselines.csv"
    log:
        "data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
    script:
        "../src/models/workflow_example/baselines.py"
 rule modelling_for_individual_participants:
    input:
        data = "data/processed/models/individual_model/{pid}/input.csv"
    params:
        model = "{model}",
        cv_method = "{cv_method}",
        scaler = "{scaler}",
        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
    output:
        fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
        fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
        overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
        fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
    log:
        "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
    script:
        "../src/models/workflow_example/modelling.py"
 rule modelling_for_all_participants:
    input:
        data = "data/processed/models/population_model/input.csv"
    params:
        model = "{model}",
        cv_method = "{cv_method}",
        scaler = "{scaler}",
        categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
        categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
        model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
    output:
        fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
        fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
        overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
        fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
    log:
        "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
    script:
        "../src/models/workflow_example/modelling.py"
--- a/rules/preprocessing.smk
+++ b/rules/preprocessing.smk
@ -4,36 +4,6 @@ rule create_example_participant_files:
    shell:
        "echo 'PHONE:\n  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n  PLATFORMS: [android]\n  LABEL: test-01\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\nFITBIT:\n  DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n  LABEL: test-01\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example01.yaml && echo 'PHONE:\n  DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n  PLATFORMS: [ios]\n  LABEL: test-02\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\nFITBIT:\n  DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n  LABEL: test-02\n  START_DATE: 2020-04-23 00:00:00\n  END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example02.yaml"
 # rule query_usernames_device_empatica_ids:
 #     params:
 #         baseline_folder = "/mnt/e/STRAWbaseline/"
 #     output:
 #         usernames_file = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"],
 #         timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
 #     script:
 #         "../../participants/prepare_usernames_file.py"
 rule prepare_tzcodes_file:
    input:
        timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
    output:
        tzcodes_file = config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]
    script:
        "../tools/create_multi_timezones_file.py"
 rule prepare_participants_csv:
    input:
        username_list = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"]
    params:
        data_configuration = config["PHONE_DATA_STREAMS"][config["PHONE_DATA_STREAMS"]["USE"]],
        participants_table = "participants",
        device_id_table = "esm",
        start_end_date_table = "esm"
    output:
        participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
    script:
        "../src/data/translate_usernames_into_participants_data.R"
 rule create_participants_files:
    input:
        participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"] 
@ -128,8 +98,7 @@ rule process_phone_locations_types:
    params:
        consecutive_threshold = config["PHONE_LOCATIONS"]["FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD"],
        time_since_valid_location = config["PHONE_LOCATIONS"]["FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION"],
-        locations_to_use = config["PHONE_LOCATIONS"]["LOCATIONS_TO_USE"],
+        locations_to_use = config["PHONE_LOCATIONS"]["LOCATIONS_TO_USE"]
        accuracy_limit = config["PHONE_LOCATIONS"]["ACCURACY_LIMIT"]
    output:
        "data/interim/{pid}/phone_locations_processed.csv"
    script:
@ -177,6 +146,7 @@ rule resample_episodes_with_datetime:
    script:
        "../src/data/datetime/readable_datetime.R"
 rule phone_application_categories:
    input:
        "data/raw/{pid}/phone_applications_{type}_with_datetime.csv"
@ -247,33 +217,5 @@ rule empatica_readable_datetime:
        include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
    output:
        "data/raw/{pid}/empatica_{sensor}_with_datetime.csv"
    resources:
        mem_mb=50000
    script:
        "../src/data/datetime/readable_datetime.R"
 rule extract_event_information_from_esm:
    input:
        esm_raw_input = "data/raw/{pid}/phone_esm_raw.csv",
        pid_file = "data/external/participant_files/{pid}.yaml"
    params:
        stage = "extract",
        pid = "{pid}"
    output:
        "data/raw/ers/{pid}_ers.csv",
        "data/raw/ers/{pid}_stress_event_targets.csv"
    script:
        "../src/features/phone_esm/straw/process_user_event_related_segments.py"
 rule merge_event_related_segments_files:
    input:
        ers_files = expand("data/raw/ers/{pid}_ers.csv", pid=config["PIDS"]),
        se_files = expand("data/raw/ers/{pid}_stress_event_targets.csv", pid=config["PIDS"])
    params:
        stage = "merge"
    output:
        "data/external/straw_events.csv",
        "data/external/stress_event_targets.csv"
    script:
        "../src/features/phone_esm/straw/process_user_event_related_segments.py"
--- a/rules/reports.smk
+++ b/rules/reports.smk
@ -1,8 +1,6 @@
 rule histogram_phone_data_yield:
    input:
        "data/processed/features/all_participants/all_sensor_features.csv"
    params:
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/data_exploration/histogram_phone_data_yield.html"
    script:
@ -14,8 +12,7 @@ rule heatmap_sensors_per_minute_per_time_segment:
        participant_file = "data/external/participant_files/{pid}.yaml",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
-        pid = "{pid}",
+        pid = "{pid}"
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/interim/{pid}/heatmap_sensors_per_minute_per_time_segment.html"
    script:
@ -36,9 +33,7 @@ rule heatmap_sensor_row_count_per_time_segment:
        participant_file = "data/external/participant_files/{pid}.yaml",
        time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
    params:
-        pid = "{pid}",
+        pid = "{pid}"
        sensor_names = config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["SENSORS"],
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/interim/{pid}/heatmap_sensor_row_count_per_time_segment.html"
    script:
@ -54,13 +49,11 @@ rule merge_heatmap_sensor_row_count_per_time_segment:
 rule heatmap_phone_data_yield_per_participant_per_time_segment:
    input:
-        participant_files = expand("data/external/participant_files/{pid}.yaml", pid=config["PIDS"]),
+        phone_data_yield = expand("data/processed/features/{pid}/phone_data_yield.csv", pid=config["PIDS"]),
-        time_segments_file = config["TIME_SEGMENTS"]["FILE"],
+        participant_file = expand("data/external/participant_files/{pid}.yaml", pid=config["PIDS"]),
-        phone_data_yield = "data/processed/features/all_participants/all_sensor_features.csv",
+        time_segments_labels = expand("data/interim/time_segments/{pid}_time_segments_labels.csv", pid=config["PIDS"])
    params:
-        pids = config["PIDS"],
+        time = config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["TIME"]
        time = config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["TIME"],
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
    output:
        "reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html"
    script:
@ -70,7 +63,6 @@ rule heatmap_feature_correlation_matrix:
    input:
        all_sensor_features = "data/processed/features/all_participants/all_sensor_features.csv" # before data cleaning
    params:
        time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
        min_rows_ratio = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["MIN_ROWS_RATIO"],
        corr_threshold = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["CORR_THRESHOLD"],
        corr_method = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["CORR_METHOD"]
--- a/src/data/baseline_features.py
+++ b/src/data/baseline_features.py
@ -1,182 +0,0 @@
 import numpy as np
 import pandas as pd
 pid = snakemake.params["pid"]
 requested_features = snakemake.params["features"]
 baseline_interim = pd.DataFrame(columns=["qid", "question", "score_original", "score"])
 baseline_features = pd.DataFrame(columns=requested_features)
 question_filename = snakemake.params["question_filename"]
 JCQ_DEMAND = "JobEisen"
 JCQ_CONTROL = "JobControle"
 dict_JCQ_demand_control_reverse = {
    JCQ_DEMAND: {
        3: " [Od mene se ne zahteva,",
        4: " [Imam dovolj časa, da končam",
        5: " [Pri svojem delu se ne srečujem s konfliktnimi",
    },
    JCQ_CONTROL: {
        2: " |Moje delo vključuje veliko ponavljajočega",
        6: " [Pri svojem delu imam zelo malo svobode",
    },
 }
 LIMESURVEY_JCQ_MIN = 1
 LIMESURVEY_JCQ_MAX = 4
 DEMAND_CONTROL_RATIO_MIN = 5 / (9 * 4)
 DEMAND_CONTROL_RATIO_MAX = (4 * 5) / 9
 JCQ_NORMS = {
    "F": {
        0: DEMAND_CONTROL_RATIO_MIN,
        1: 0.45,
        2: 0.52,
        3: 0.62,
        4: DEMAND_CONTROL_RATIO_MAX,
    },
    "M": {
        0: DEMAND_CONTROL_RATIO_MIN,
        1: 0.41,
        2: 0.48,
        3: 0.56,
        4: DEMAND_CONTROL_RATIO_MAX,
    },
 }
 participant_info = pd.read_csv(snakemake.input[0], parse_dates=["date_of_birth"])
 if not participant_info.empty:
    if "age" in requested_features:
        now = pd.Timestamp("now")
        baseline_features.loc[0, "age"] = (
            now - participant_info.loc[0, "date_of_birth"]
        ).days / 365.25245
    if "gender" in requested_features:
        baseline_features.loc[0, "gender"] = participant_info.loc[0, "gender"]
    if "startlanguage" in requested_features:
        baseline_features.loc[0, "startlanguage"] = participant_info.loc[
            0, "startlanguage"
        ]
    if (
        ("limesurvey_demand" in requested_features)
        or ("limesurvey_control" in requested_features)
        or ("limesurvey_demand_control_ratio" in requested_features)
    ):
        participant_info_t = participant_info.T
        rows_baseline = participant_info_t.index
        if ("limesurvey_demand" in requested_features) or (
            "limesurvey_demand_control_ratio" in requested_features
        ):
            # Find questions about demand, but disregard time (duration of filling in questionnaire)
            rows_demand = rows_baseline.str.startswith(
                JCQ_DEMAND
            ) & ~rows_baseline.str.endswith("Time")
            limesurvey_demand = (
                participant_info_t[rows_demand]
                .reset_index()
                .rename(columns={"index": "question", 0: "score_original"})
            )
            # Extract question IDs from names such as JobEisen[3]
            limesurvey_demand["qid"] = (
                limesurvey_demand["question"].str.extract(r"\[(\d+)\]").astype(int)
            )
            limesurvey_demand["score"] = limesurvey_demand["score_original"]
            # Identify rows that include questions to be reversed.
            rows_demand_reverse = limesurvey_demand["qid"].isin(
                dict_JCQ_demand_control_reverse[JCQ_DEMAND].keys()
            )
            # Reverse the score, so that the maximum value becomes the minimum etc.
            limesurvey_demand.loc[rows_demand_reverse, "score"] = (
                LIMESURVEY_JCQ_MAX
                + LIMESURVEY_JCQ_MIN
                - limesurvey_demand.loc[rows_demand_reverse, "score_original"]
            )
            baseline_interim = pd.concat([baseline_interim, limesurvey_demand], axis=0, ignore_index=True)
            if "limesurvey_demand" in requested_features:
                baseline_features.loc[0, "limesurvey_demand"] = limesurvey_demand[
                    "score"
                ].sum()
        if ("limesurvey_control" in requested_features) or (
            "limesurvey_demand_control_ratio" in requested_features
        ):
            # Find questions about control, but disregard time (duration of filling in questionnaire)
            rows_control = rows_baseline.str.startswith(
                JCQ_CONTROL
            ) & ~rows_baseline.str.endswith("Time")
            limesurvey_control = (
                participant_info_t[rows_control]
                .reset_index()
                .rename(columns={"index": "question", 0: "score_original"})
            )
            # Extract question IDs from names such as JobControle[3]
            limesurvey_control["qid"] = (
                limesurvey_control["question"].str.extract(r"\[(\d+)\]").astype(int)
            )
            limesurvey_control["score"] = limesurvey_control["score_original"]
            # Identify rows that include questions to be reversed.
            rows_control_reverse = limesurvey_control["qid"].isin(
                dict_JCQ_demand_control_reverse[JCQ_CONTROL].keys()
            )
            # Reverse the score, so that the maximum value becomes the minimum etc.
            limesurvey_control.loc[rows_control_reverse, "score"] = (
                LIMESURVEY_JCQ_MAX
                + LIMESURVEY_JCQ_MIN
                - limesurvey_control.loc[rows_control_reverse, "score_original"]
            )
            baseline_interim = pd.concat([baseline_interim, limesurvey_control], axis=0, ignore_index=True)
            if "limesurvey_control" in requested_features:
                baseline_features.loc[0, "limesurvey_control"] = limesurvey_control[
                    "score"
                ].sum()
        if "limesurvey_demand_control_ratio" in requested_features:
            if limesurvey_control["score"].sum():
                limesurvey_demand_control_ratio = (
                        limesurvey_demand["score"].sum() / limesurvey_control["score"].sum()
                )
            else:
                limesurvey_demand_control_ratio = 0
            if (
                JCQ_NORMS[participant_info.loc[0, "gender"]][0]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][1]
            ):
                limesurvey_quartile = 1
            elif (
                JCQ_NORMS[participant_info.loc[0, "gender"]][1]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][2]
            ):
                limesurvey_quartile = 2
            elif (
                JCQ_NORMS[participant_info.loc[0, "gender"]][2]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][3]
            ):
                limesurvey_quartile = 3
            elif (
                JCQ_NORMS[participant_info.loc[0, "gender"]][3]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][4]
            ):
                limesurvey_quartile = 4
            else:
                limesurvey_quartile = np.nan
            baseline_features.loc[
                0, "limesurvey_demand_control_ratio"
            ] = limesurvey_demand_control_ratio
            baseline_features.loc[
                0, "limesurvey_demand_control_ratio_quartile"
            ] = limesurvey_quartile
 if not baseline_interim.empty:
    baseline_interim.to_csv(snakemake.output["interim"], index=False, encoding="utf-8")
 baseline_features.to_csv(snakemake.output["features"], index=False, encoding="utf-8")
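Note on the removed `baseline_features.py`: its two core steps are reverse-scoring selected questionnaire items and binning the demand/control ratio into quartiles. The sketch below restates both steps in isolation. The function names `reverse_score` and `quartile` are hypothetical; the scale bounds and the cut-offs for "F" are copied from the constants in the script.

```python
import math

# Likert scale bounds, as in LIMESURVEY_JCQ_MIN / LIMESURVEY_JCQ_MAX.
SCALE_MIN, SCALE_MAX = 1, 4


def reverse_score(score, lo=SCALE_MIN, hi=SCALE_MAX):
    """Mirror a score within [lo, hi], so the maximum becomes the minimum."""
    return hi + lo - score


def quartile(ratio, cutoffs):
    """Return i for the half-open bin [cutoffs[i-1], cutoffs[i]) holding ratio.

    Returns NaN when the ratio falls outside every bin, matching the
    final else branch of the script.
    """
    for i in range(1, len(cutoffs)):
        if cutoffs[i - 1] <= ratio < cutoffs[i]:
            return i
    return math.nan


# Cut-offs for "F", copied from JCQ_NORMS.
cutoffs_f = [5 / (9 * 4), 0.45, 0.52, 0.62, (4 * 5) / 9]

print(reverse_score(1), reverse_score(4))  # 4 1
print(quartile(0.50, cutoffs_f))           # 2
```

Because every bin is half-open, a ratio exactly equal to the upper bound `(4 * 5) / 9` falls into no bin and yields NaN, as does the fallback ratio of 0 that the script uses when the control sum is zero.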
--- a/src/data/create_participants_files.R
+++ b/src/data/create_participants_files.R
@ -1,6 +1,6 @@
 source("renv/activate.R")
-#library(RMariaDB)
+library(RMariaDB)
 library(stringr)
 library(purrr)
 library(readr)
@ -58,7 +58,7 @@ participants %>%
      lines <- append(lines, empty_fitbit)
    if(add_empatica_section == TRUE && !is.na(row[empatica_device_id_column])){
-      lines <- append(lines, c("EMPATICA:", paste0("  DEVICE_IDS: [",row$label,"]"),
+      lines <- append(lines, c("EMPATICA:", paste0("  DEVICE_IDS: [",row[empatica_device_id_column],"]"),
                               paste("  LABEL:",row$label), paste("  START_DATE:", start_date), paste("  END_DATE:", end_date)))
    } else
      lines <- append(lines, empty_empatica)
@ -73,7 +73,7 @@ participants %>%
 file_lines <-readLines("./config.yaml")
 for (i in 1:length(file_lines)){
  if(startsWith(file_lines[i], "PIDS:")){
-    file_lines[i] <- paste0("PIDS: ['", paste(participants$pid, collapse = "', '"), "']")
+    file_lines[i] <- paste0("PIDS: [", paste(participants$pid, collapse = ", "), "]")
  }
 }
 writeLines(file_lines, con = "./config.yaml") 
--- a/src/data/datetime/assign_to_time_segment.R
+++ b/src/data/datetime/assign_to_time_segment.R
@ -5,16 +5,13 @@ options(scipen=999)
 assign_rows_to_segments <- function(data, segments){
  # This function is used by all segment types, we use data.tables because they are fast
  data <- data.table::as.data.table(data)
  data[, assigned_segments := ""]
  for(i in seq_len(nrow(segments))) {
    segment <- segments[i,]
    data[segment$segment_start_ts<= timestamp & segment$segment_end_ts >= timestamp,
         assigned_segments := stringi::stri_c(assigned_segments, segment$segment_id, sep = "|")]
  }
  data[,assigned_segments:=substring(assigned_segments, 2)]
  data
 }
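The loop in `assign_rows_to_segments` appends the ID of every segment whose inclusive `[segment_start_ts, segment_end_ts]` range contains a row's timestamp, joined with `|`. A small Python sketch of that rule follows; the tuple layout of `segments` is an assumption made for illustration, not the structure RAPIDS uses:

```python
def assign_rows_to_segments(timestamps, segments):
    """Return one "|"-joined string of matching segment IDs per timestamp.

    `segments` is a hypothetical iterable of
    (segment_id, segment_start_ts, segment_end_ts) tuples, bounds inclusive.
    """
    assigned = []
    for ts in timestamps:
        ids = [seg_id for seg_id, start, end in segments if start <= ts <= end]
        assigned.append("|".join(ids))
    return assigned


segments = [("a", 0, 10), ("b", 10, 20), ("c", 0, 30)]
print(assign_rows_to_segments([5, 15, 25, 40], segments))
```

Rows that fall in no segment end up with an empty string, which mirrors what the R version produces after it strips the leading separator.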
--- a/src/data/download_baseline_data.py
+++ b/src/data/download_baseline_data.py
@ -1,14 +0,0 @@
 import pandas as pd
 import yaml
 filename = snakemake.input["data"]
 baseline = pd.read_csv(filename)
 with open(snakemake.input["participant_file"], "r") as file:
    participant = yaml.safe_load(file)
 username = participant["PHONE"]["LABEL"]
 baseline[baseline["username"] == username].to_csv(snakemake.output[0],
                                                  index=False,
                                                  encoding="utf-8",)
--- a/src/data/merge_baseline_data.py
+++ b/src/data/merge_baseline_data.py
@ -1,30 +0,0 @@
 import pandas as pd
 VARIABLES_TO_TRANSLATE = {
    "Gebruikersnaam": "username",
    "Geslacht": "gender",
    "Geboortedatum": "date_of_birth",
 }
 filenames = snakemake.input["data"]
 baseline_dfs = []
 for fn in filenames:
    baseline_dfs.append(pd.read_csv(fn,
                                    parse_dates=["Geboortedatum"],
                                    infer_datetime_format=True,
                                    cache_dates=True,
                                    ))
 baseline = (
    pd.concat(baseline_dfs, join="inner")
    .reset_index(drop=True)
 )
 baseline.rename(columns=VARIABLES_TO_TRANSLATE, copy=False, inplace=True)
 baseline.to_csv(snakemake.output[0],
                index=False,
                encoding="utf-8",)
--- a/src/data/process_location_types.R
+++ b/src/data/process_location_types.R
@ -6,10 +6,9 @@ library(tidyr)
 consecutive_threshold <- snakemake@params[["consecutive_threshold"]]
 time_since_valid_location <- snakemake@params[["time_since_valid_location"]]
 locations_to_use <- snakemake@params[["locations_to_use"]]
 accuracy_limit <- snakemake@params[["accuracy_limit"]]
 locations <- read.csv(snakemake@input[["locations"]]) %>% 
-            filter(double_latitude != 0 & double_longitude != 0 & accuracy < accuracy_limit) %>% 
+            filter(double_latitude != 0 & double_longitude != 0) %>% 
            drop_na(double_longitude, double_latitude) %>% 
            group_by(timestamp) %>% # keep only the row with the best accuracy if two or more have the same timestamp
            filter(accuracy == min(accuracy, na.rm=TRUE)) %>%  
@ -64,7 +63,7 @@ if(locations_to_use == "ALL"){
            # you can think of consecutive_threshold as the period a location row is valid for
            mutate(limit = pmin(lead(timestamp, default = 9999999999999) - 1, limit + (1000 * 60 * consecutive_threshold)),
                    n_resample = (limit - timestamp)%/%60001,
-                    n_resample = n_resample + 1) %>% 
+                    n_resample = if_else(n_resample == 0, 1, n_resample)) %>% 
            drop_na(double_longitude, double_latitude) %>%
            uncount(weights = n_resample, .id = "id") %>% 
            mutate(provider = if_else(id > 1, "resampled", provider),
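The two variants of `n_resample` in this hunk differ only in how they turn the number of whole resampling steps into a row count. The sketch below compares them, treating `limit` as already computed (its earlier definition is not part of this hunk) and using a hypothetical `always_add_one` flag to switch between the two lines of the diff:

```python
def n_resample(timestamp, limit, always_add_one=False):
    """Number of copies of a location row, following the two diff variants.

    Timestamps are in milliseconds; 60001 is the step used in the R code.
    `always_add_one` is a hypothetical switch added for this comparison.
    """
    n = (limit - timestamp) // 60001
    if always_add_one:
        return n + 1          # variant: n_resample + 1
    return 1 if n == 0 else n  # variant: if_else(n_resample == 0, 1, n_resample)


for limit in (30000, 120002):
    print(limit, n_resample(0, limit), n_resample(0, limit, always_add_one=True))
```

Both variants keep at least the original row; they diverge once one or more whole steps fit before `limit`.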
--- a/src/data/streams/aware_mysql_split/container.R
+++ b/src/data/streams/aware_mysql_split/container.R
@ -63,14 +63,7 @@ infer_device_os <- function(stream_parameters, device){
 pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
-
+  query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM `", device, "_", sensor_container, "` WHERE ", columns$DEVICE_ID ," = '", device,"'")
  select_items <- c()
  for (column in columns) {
    select_items <- append(select_items, paste0("data->>'$.", column, "' ", column))
  }
  query <- paste0("SELECT ", paste(select_items, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
  # Letting the user know what we are doing
  message(paste0("Executing the following query to download data: ", query)) 
  sensor_data <- dbGetQuery(dbEngine, query)
--- a/src/data/streams/aware_mysql_split/format.yaml
+++ b/src/data/streams/aware_mysql_split/format.yaml
--- a/src/data/streams/aware_postgresql/container.R
+++ b/src/data/streams/aware_postgresql/container.R
@ -1,212 +0,0 @@
 # if you need a new package, you should add it with renv::install(package) so your renv venv is updated
 library(RPostgres)
 # Needs libpq-dev for compiling from source.
 # Error installing package 'RPostgres':
 #   =====================================
 #   
 #   * installing *source* package 'RPostgres' ...
 # ** package 'RPostgres' successfully unpacked and MD5 sums checked
 # ** using staged installation
 # Using PKG_CFLAGS=
 #   Using PKG_LIBS=-lpq
 # Using PKG_PLOGR=
 #   ------------------------- ANTICONF ERROR ---------------------------
 #   Configuration failed because libpq was not found. Try installing:
 #   * deb: libpq-dev (Debian, Ubuntu, etc)
 # * rpm: postgresql-devel (Fedora, EPEL)
 # * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
 # * csw: postgresql_dev (Solaris)
 # * brew: libpq (OSX)
 # If libpq is already installed, check that either:
 #   (i)  'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains
 # a libpq.pc file; or
 # (ii) 'pg_config' is in your PATH.
 # If neither can detect , you can set INCLUDE_DIR
 # and LIB_DIR manually via:
 #   R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
 # --------------------------[ ERROR MESSAGE ]----------------------------
 #   <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
 # compilation terminated.
 library(dbplyr)
 library(yaml)
 #' @description
 #' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
 #' You can reuse most of this function if you are connecting to a DB or Web API.
 #' It's OK to delete this function if you don't need credentials, e.g., if you are pulling data from a CSV.
 #' @param group the yaml key containing the credentials to connect to a database
 #' @return dbEngine a database engine (connection) ready to perform queries
 get_db_engine <- function(group){
  # The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
  credentials <- read_yaml("./credentials.yaml")
  if(!group %in% names(credentials))
    stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_postgresql][DATABASE_GROUP] in config.yaml?"))
  dbEngine <- dbConnect(Postgres(), db = credentials[[group]][["database"]],
                        user = credentials[[group]][["user"]],
                        password = credentials[[group]][["password"]],
                        host = credentials[[group]][["host"]],
                        port = credentials[[group]][["port"]])
  return(dbEngine)
 }
 # This file gets executed for each PHONE_SENSOR of each participant
 # If you are connecting to a database the env file containing its credentials is available at "./.env"
 # If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
 # You are not bound to databases or files, you can query a web API or whatever data source you need.
 #' @description
 #' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
 #' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
 #'  
 #' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed, 
 #' and they have to assign the OS manually in the participant file
 #' 
 #' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
 #' @param device A device ID string
 #' @return The OS the device ran, "android" or "ios"
 infer_device_os <- function(stream_parameters, device){
  #dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
  #query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
  #message(paste0("Executing the following query to infer phone OS: ", query)) 
  #os <- dbGetQuery(dbEngine, query)
  #dbDisconnect(dbEngine)
  #if(nrow(os) > 0)
  #  return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
  #else
  stop(paste0("We cannot infer the OS of device '", device, "' because the aware_device table does not exist."))
  #return(os)
 }
 #' @description
 #' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
 #' 
 #' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
 #' @param device A device ID string
 #' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
 #' @param columns the columns needed from this sensor (we recommend returning only these columns instead of every column in sensor_container)
 #' @return A dataframe with the sensor data for device
 pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
  query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
  # Letting the user know what we are doing
  message(paste0("Executing the following query to download data: ", query)) 
  sensor_data <- dbGetQuery(dbEngine, query)
  dbDisconnect(dbEngine)
  if(nrow(sensor_data) == 0)
    warning(paste0("The device '", device, "' did not have data in ", sensor_container))
  return(sensor_data)
 }
 #' @description
 #' Gets participants' IDs for specified usernames.
 #'
 #' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
 #' @param usernames A vector of usernames
 #' @param participants_container The name of the database table containing participants data, such as their username.
 #' @return A dataframe with participant IDs matching usernames
 pull_participants_ids <- function(stream_parameters, usernames, participants_container) {
  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
  query_participant_id <- tbl(dbEngine, participants_container) %>% 
    filter(username %in% !!usernames) %>% 
    select(username, id)
  message(paste0("Executing the following query to get participants' IDs: \n", sql_render(query_participant_id)))
  participant_data <- query_participant_id %>% collect()
  dbDisconnect(dbEngine)
  if(nrow(participant_data) == 0)
    warning(paste0("We could not find requested usernames (", paste(usernames, collapse = ", "), ") in ", participants_container))
  return(participant_data)
 }
 #' @description
 #' Gets the distinct device IDs for specified participant IDs.
 #'
 #' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
 #' @param participants_ids A vector of numeric participant IDs
 #' @param device_id_container The name of the database table which will be used to determine distinct device ID. Ideally, a table that reliably contains data, but not too much.
 #' @return A dataframe with a row matching each distinct device ID with a participant ID
 pull_participants_device_ids <- function(stream_parameters, participants_ids, device_id_container) {
  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
  query_device_id <- tbl(dbEngine, device_id_container) %>%
    filter(participant_id %in% !!participants_ids) %>% 
    group_by(participant_id) %>% 
    distinct(device_id, .keep_all = FALSE)
  message(
    paste0(
      "Executing the following query to get the distinct device IDs: \n",
      sql_render(query_device_id),
      "\n NOTE: This might take a long time."
    )
  )
  device_ids <- query_device_id %>% collect()
  dbDisconnect(dbEngine)
  if(nrow(device_ids) == 0)
    warning(paste0("We could not find device IDs for requested participant IDs (", paste(participants_ids, collapse = ", "), ") in ", device_id_container))
  return(device_ids)
 }
 #' @description
 #' Gets start and end datetimes for specified participant IDs.
 #'
 #' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
 #' @param participants_ids A vector of numeric participant IDs
 #' @param start_end_date_container The name of the database table which will be used to determine when a participant started and ended their participation. Briefing and debriefing EMAs can be meaningfully used here.
 #' @return A dataframe relating participant IDs with their start and end datetimes.
 pull_participants_start_end_dates <- function(stream_parameters, participants_ids, start_end_date_container) {
  dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
  query_timestamps <- tbl(dbEngine, start_end_date_container) %>% 
    filter(
      participant_id %in% !!participants_ids,
      double_esm_user_answer_timestamp > 0
    ) %>% 
    group_by(participant_id) %>% 
    summarise(
      timestamp_min = min(double_esm_user_answer_timestamp, na.rm = TRUE),
      timestamp_max = max(double_esm_user_answer_timestamp, na.rm = TRUE)
    ) %>% 
    select(participant_id, timestamp_min, timestamp_max)
  message(paste0("Executing the following query to get the starting and ending datetimes: \n", sql_render(query_timestamps)))
  start_end_timestamps <- query_timestamps %>% collect()
  if(nrow(start_end_timestamps) == 0)
    warning(paste0("We could not find datetimes for requested participant IDs (", paste(participants_ids, collapse = ", "), ") in ", start_end_date_container))
  start_end_times <- start_end_timestamps %>% 
    mutate(    
      datetime_start = lubridate::as_datetime(timestamp_min/1000, tz = "UTC"),
      datetime_end = lubridate::as_datetime(timestamp_max/1000, tz = "UTC")
    ) %>% 
    select(-c(timestamp_min, timestamp_max))
  dbDisconnect(dbEngine)
  return(start_end_times)
 }
--- a/src/data/streams/aware_postgresql/format.yaml
+++ b/src/data/streams/aware_postgresql/format.yaml
@ -1,372 +0,0 @@
 PHONE_ACCELEROMETER:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_VALUES_0: double_values_0
      DOUBLE_VALUES_1: double_values_1
      DOUBLE_VALUES_2: double_values_2
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_VALUES_0: double_values_0
      DOUBLE_VALUES_1: double_values_1
      DOUBLE_VALUES_2: double_values_2
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_ACTIVITY_RECOGNITION:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      ACTIVITY_NAME: activity_name
      ACTIVITY_TYPE: activity_type
      CONFIDENCE: confidence
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      ACTIVITY_NAME: FLAG_TO_MUTATE
      ACTIVITY_TYPE: FLAG_TO_MUTATE
      CONFIDENCE: FLAG_TO_MUTATE
    MUTATION:
      COLUMN_MAPPINGS:
        ACTIVITIES: activities
        CONFIDENCE: confidence
      SCRIPTS: # List any python or r scripts that mutate your raw data
        - "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
 PHONE_APPLICATIONS_CRASHES:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      PACKAGE_NAME: package_name
      APPLICATION_NAME: application_name
      APPLICATION_VERSION: application_version
      ERROR_SHORT: error_short
      ERROR_LONG: error_long
      ERROR_CONDITION: error_condition
      IS_SYSTEM_APP: is_system_app
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_APPLICATIONS_FOREGROUND:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      PACKAGE_NAME: package_hash
      APPLICATION_NAME: FLAG_TO_MUTATE
      IS_SYSTEM_APP: is_system_app
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS:
        - src/data/streams/mutations/phone/straw/app_add_name.R
 PHONE_APPLICATIONS_NOTIFICATIONS:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      PACKAGE_NAME: package_hash
      APPLICATION_NAME: FLAG_TO_MUTATE
      SOUND: sound
      VIBRATE: vibrate
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS:
        - src/data/streams/mutations/phone/straw/app_add_name.R
 PHONE_BATTERY:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      BATTERY_STATUS: battery_status
      BATTERY_LEVEL: battery_level
      BATTERY_SCALE: battery_scale
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      BATTERY_STATUS: FLAG_TO_MUTATE
      BATTERY_LEVEL: battery_level
      BATTERY_SCALE: battery_scale
    MUTATION:
      COLUMN_MAPPINGS:
        BATTERY_STATUS: battery_status
      SCRIPTS:
        - "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
 PHONE_BLUETOOTH:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      BT_ADDRESS: bt_address
      BT_NAME: bt_name
      BT_RSSI: bt_rssi
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      BT_ADDRESS: bt_address
      BT_NAME: bt_name
      BT_RSSI: bt_rssi
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_CALLS:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      CALL_TYPE: call_type
      CALL_DURATION: call_duration
      TRACE: trace
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      CALL_TYPE: FLAG_TO_MUTATE
      CALL_DURATION: call_duration
      TRACE: trace
    MUTATION:
      COLUMN_MAPPINGS:
        CALL_TYPE: call_type
      SCRIPTS:
        - "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
 PHONE_CONVERSATION:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_ENERGY: double_energy
      INFERENCE: inference
      DOUBLE_CONVO_START: double_convo_start
      DOUBLE_CONVO_END: double_convo_end
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_ENERGY: double_energy
      INFERENCE: inference
      DOUBLE_CONVO_START: double_convo_start
      DOUBLE_CONVO_END: double_convo_end
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
        - "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
 PHONE_ESM:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: double_esm_user_answer_timestamp
      DEVICE_ID: device_id
      ESM_STATUS: esm_status
      ESM_USER_ANSWER: esm_user_answer
      ESM_JSON: esm_json
      ESM_TRIGGER: esm_trigger
      ESM_SESSION: esm_session
      ESM_NOTIFICATION_ID: esm_notification_id
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS:
 PHONE_KEYBOARD:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      PACKAGE_NAME: package_name
      BEFORE_TEXT: before_text
      CURRENT_TEXT: current_text
      IS_PASSWORD: is_password
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_LIGHT:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_LIGHT_LUX: double_light_lux
      ACCURACY: accuracy
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_LOCATIONS:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_LATITUDE: double_latitude
      DOUBLE_LONGITUDE: double_longitude
      DOUBLE_BEARING: double_bearing
      DOUBLE_SPEED: double_speed
      DOUBLE_ALTITUDE: double_altitude
      PROVIDER: provider
      ACCURACY: accuracy
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      DOUBLE_LATITUDE: double_latitude
      DOUBLE_LONGITUDE: double_longitude
      DOUBLE_BEARING: double_bearing
      DOUBLE_SPEED: double_speed
      DOUBLE_ALTITUDE: double_altitude
      PROVIDER: provider
      ACCURACY: accuracy
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_LOG:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      LOG_MESSAGE: log_message
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      LOG_MESSAGE: log_message
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_MESSAGES:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      MESSAGE_TYPE: message_type
      TRACE: trace
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_SCREEN:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      SCREEN_STATUS: screen_status
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      SCREEN_STATUS: FLAG_TO_MUTATE
    MUTATION:
      COLUMN_MAPPINGS:
        SCREEN_STATUS: screen_status
      SCRIPTS: # List any python or r scripts that mutate your raw data
        - "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
 PHONE_WIFI_CONNECTED:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      MAC_ADDRESS: mac_address
      SSID: ssid
      BSSID: bssid
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      MAC_ADDRESS: mac_address
      SSID: ssid
      BSSID: bssid
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_WIFI_VISIBLE:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      SSID: ssid
      BSSID: bssid
      SECURITY: security
      FREQUENCY: frequency
      RSSI: rssi
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      SSID: ssid
      BSSID: bssid
      SECURITY: security
      FREQUENCY: frequency
      RSSI: rssi
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
 PHONE_SPEECH:
  ANDROID:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      SPEECH_PROPORTION: speech_proportion
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
  IOS:
    RAPIDS_COLUMN_MAPPINGS:
      TIMESTAMP: timestamp
      DEVICE_ID: device_id
      SPEECH_PROPORTION: speech_proportion
    MUTATION:
      COLUMN_MAPPINGS:
      SCRIPTS: # List any python or r scripts that mutate your raw data
--- a/src/data/streams/empatica_zip/container.py
+++ b/src/data/streams/empatica_zip/container.py
@ -2,16 +2,11 @@ from zipfile import ZipFile
 import warnings
 from pathlib import Path
 import pandas as pd
 import numpy as np
 from pandas.core import indexing
 import yaml
 import csv
 from collections import OrderedDict
 from io import BytesIO, StringIO
 import sys, os
 from cr_features.hrv import get_HRV_features, get_patched_ibi_with_bvp
 from cr_features.helper_functions import empatica1d_to_array, empatica2d_to_array
 def processAcceleration(x, y, z):
    x = float(x)
@ -57,8 +52,6 @@ def extract_empatica_data(data,  sensor):
        df = pd.DataFrame.from_dict(ddict, orient='index', columns=[column])
        df[column] = df[column].astype(float)
        df.index.name = 'timestamp'
        if df.empty:
            return df
    elif sensor == 'EMPATICA_ACCELEROMETER':
        ddict = readFile(sensor_data_file, sensor)
@ -67,16 +60,9 @@ def extract_empatica_data(data,  sensor):
        df['y'] = df['y'].astype(float)
        df['z'] = df['z'].astype(float)
        df.index.name = 'timestamp'
        if df.empty:
            return df
    elif sensor == 'EMPATICA_INTER_BEAT_INTERVAL':
-
+        df = pd.read_csv(sensor_data_file, names=['timestamp', column], header=None)
        df = pd.read_csv(sensor_data_file, names=['timings', column], header=None)
        df['timestamp'] = df['timings']
        if df.empty:
            df = df.set_index('timestamp')
            return df
        timestampstart = float(df['timestamp'][0])
        df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
        df = df.drop([0])
@ -98,10 +84,6 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
    participant_data = pd.DataFrame(columns=columns_to_download.values())
    participant_data.set_index('timestamp', inplace=True)
    with open('config.yaml', 'r') as stream:
        config = yaml.load(stream, Loader=yaml.FullLoader)
    cr_ibi_provider = config['EMPATICA_INTER_BEAT_INTERVAL']['PROVIDERS']['CR']
    available_zipfiles = list((Path(data_configuration["FOLDER"]) / Path(device)).rglob("*.zip"))
    if len(available_zipfiles) == 0:
        warnings.warn("There were no zip files in: {}. If you were expecting data for this participant the [EMPATICA][DEVICE_IDS] key in their participant file is missing the pid".format((Path(data_configuration["FOLDER"]) / Path(device))))
@ -112,13 +94,7 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
            listOfFileNames = zipFile.namelist()
            for fileName in listOfFileNames:
                if fileName == sensor_csv:
-                    if sensor == "EMPATICA_INTER_BEAT_INTERVAL" and cr_ibi_provider.get('PATCH_WITH_BVP', False):
+                    participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName),  sensor)], axis=0)
                        participant_data = \
                            pd.concat([participant_data, patch_ibi_with_bvp(zipFile.read('IBI.csv'), zipFile.read('BVP.csv'))], axis=0)
                        #print("patch with ibi")
                    else:
                        participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName), sensor)], axis=0)
                        #print("no patching")
                    warning = False
            if warning:
                warnings.warn("We could not find a zipped file for {} in {} (we tried to find {})".format(sensor, zipFile, sensor_csv))
@ -129,54 +105,4 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
    participant_data["device_id"] = device
    return(participant_data)
 def patch_ibi_with_bvp(ibi_data, bvp_data):
    ibi_data_file = BytesIO(ibi_data).getvalue().decode('utf-8')
    ibi_data_file = StringIO(ibi_data_file)
    # Begin with the cr-features part
    try:
        ibi_data, ibi_start_timestamp = empatica2d_to_array(ibi_data_file)
    except (IndexError, KeyError) as e:
        # Checks whether IBI.csv is empty
        # It may raise a KeyError if df is empty here: startTimeStamp = df.time[0]
        df_test = pd.read_csv(ibi_data_file, names=['timings', 'inter_beat_interval'], header=None)
        if df_test.empty:
            df_test['timestamp'] = df_test['timings']
            df_test = df_test.set_index('timestamp')
            return df_test
        else:
            raise IndexError("Something went wrong with indices. Error that was previously caught:\n", repr(e))
    bvp_data_file = BytesIO(bvp_data).getvalue().decode('utf-8')
    bvp_data_file = StringIO(bvp_data_file)
    bvp_data, bvp_start_timestamp, sample_rate = empatica1d_to_array(bvp_data_file)
    hrv_time_and_freq_features, sample, bvp_rr, bvp_timings, peak_indx = \
        get_HRV_features(bvp_data, ma=False, 
                        detrend=False, m_deternd=False, low_pass=False, winsorize=True, 
                        winsorize_value=25, hampel_fiter=False, median_filter=False, 
                        mod_z_score_filter=True, sampling=64, feature_names=['meanHr'])
    ibi_timings, ibi_rr = get_patched_ibi_with_bvp(ibi_data[0], ibi_data[1], bvp_timings, bvp_rr)
    df = \
        pd.DataFrame(np.array([ibi_timings, ibi_rr]).transpose(), columns=['timestamp', 'inter_beat_interval'])
    df.loc[-1] = [ibi_start_timestamp, 'IBI']  # adding a row
    df.index = df.index + 1  # shifting index
    df = df.sort_index()  # sorting by index
    # Repeated as in extract_empatica_data for IBI
    df['timings'] = df['timestamp']
    timestampstart = float(df['timestamp'][0])
    df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart        
    df = df.drop([0])
    df['inter_beat_interval'] = df['inter_beat_interval'].astype(float)
    df = df.set_index('timestamp')
    # format timestamps
    df.index *= 1000
    df.index = df.index.astype(int)
    return(df)
 # print(pull_data({'FOLDER': 'data/external/empatica'}, "e01", "EMPATICA_accelerometer", {'TIMESTAMP': 'timestamp', 'DEVICE_ID': 'device_id', 'DOUBLE_VALUES_0': 'x', 'DOUBLE_VALUES_1': 'y', 'DOUBLE_VALUES_2': 'z'}))
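Both `extract_empatica_data` and `patch_ibi_with_bvp` treat the first row of `IBI.csv` as the recording's start time in seconds and every later row as an offset from that start, then convert the result to integer milliseconds. A standalone sketch of that conversion follows, using only the standard library; the function name and the sample text are illustrative, not part of the repository:

```python
import csv
from io import StringIO


def parse_ibi_csv(text):
    """Return (timestamp_ms, inter_beat_interval) pairs from IBI.csv text.

    Row 0 holds the start time in seconds (plus a label); each later row
    holds an offset in seconds and the inter-beat interval.
    """
    rows = list(csv.reader(StringIO(text)))
    if not rows:
        return []  # empty file: nothing to convert
    start = float(rows[0][0])
    return [
        (int((start + float(offset)) * 1000), float(ibi))
        for offset, ibi in rows[1:]
    ]


sample = "1600000000.0, IBI\n12.5,0.8\n13.25,0.75\n"  # made-up sample
print(parse_ibi_csv(sample))
```

Handling the empty file up front plays the same role as the `df.empty` checks in the pandas version.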
--- a/src/data/streams/empatica_zip/format.yaml
+++ b/src/data/streams/empatica_zip/format.yaml
@ -50,7 +50,6 @@ EMPATICA_INTER_BEAT_INTERVAL:
    TIMESTAMP: timestamp
    DEVICE_ID: device_id
    INTER_BEAT_INTERVAL: inter_beat_interval
    TIMINGS: timings
  MUTATION:
    COLUMN_MAPPINGS:
    SCRIPTS: # List any python or r scripts that mutate your raw data
--- a/src/data/streams/mutations/fitbit/parse_calories_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_calories_intraday_json.py
@ -18,7 +18,7 @@ def parseCaloriesData(calories_data):
            dataset = record["activities-calories-intraday"]["dataset"]
            for data in dataset:
                d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
-                d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
+                d_datetime = datetime.combine(curr_date, d_time)
                row_intraday = (device_id, data["level"], data["mets"], data["value"], d_datetime, 0)
                records_intraday.append(row_intraday)
--- a/src/data/streams/mutations/fitbit/parse_heartrate_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_heartrate_intraday_json.py
@ -32,13 +32,13 @@ def parseHeartrateZones(heartrate_data):
 def parseHeartrateIntradayData(records_intraday, dataset, device_id, curr_date, heartrate_zones_range):
    for data in dataset:
        d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
-        d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
+        d_datetime = datetime.combine(curr_date, d_time)
        d_hr =  data["value"]
        # Get heartrate zone by range: min <= heartrate < max
        d_hrzone = None
        for hrzone, hrrange in heartrate_zones_range.items():
-            if d_hr >= hrrange[0] and d_hr <= hrrange[1]:
+            if d_hr >= hrrange[0] and d_hr < hrrange[1]:
                d_hrzone = hrzone
                break
--- a/src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py
+++ b/src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py
@ -1,5 +1,6 @@
 import json
 import pandas as pd
 from datetime import datetime
 HR_SUMMARY_COLUMNS = ("device_id",
@ -54,7 +55,7 @@ def parseHeartrateData(heartrate_data):
    for record in heartrate_data.json_fitbit_column:
        record = json.loads(record)  # Parse text into JSON
        if "activities-heart" in record:
-            curr_date = record["activities-heart"][0]["dateTime"] + " 00:00:00"
+            curr_date = datetime.strptime(record["activities-heart"][0]["dateTime"], "%Y-%m-%d")
            record_summary = record["activities-heart"][0]
            row_summary = parseHeartrateSummaryData(record_summary, device_id, curr_date)
--- a/src/data/streams/mutations/fitbit/parse_sleep_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_sleep_intraday_json.py
@ -64,7 +64,7 @@ def parseOneRecordForV1(record, device_id, d_is_main_sleep, records_intraday, ty
        d_time = datetime.strptime(data["dateTime"], '%H:%M:%S').time()
        if is_before_midnight and d_time.hour == 0:
            curr_date = end_date
-        d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
+        d_datetime = datetime.combine(curr_date, d_time)
        # API 1.2 stores original_level as strings, so we convert original_levels of API 1 to strings too
        # (1: "asleep", 2: "restless", 3: "awake")
@ -86,7 +86,7 @@ def parseOneRecordForV12(record, device_id, d_is_main_sleep, records_intraday, t
    if sleep_record_type == "classic":
        for data in record["levels"]["data"]:
-            d_datetime = data["dateTime"][:19].replace("T", " ")
+            d_datetime = dateutil.parser.parse(data["dateTime"])
            row_intraday = (device_id, type_episode_id, data["seconds"],
                data["level"], d_is_main_sleep, sleep_record_type,
@ -95,10 +95,9 @@ def parseOneRecordForV12(record, device_id, d_is_main_sleep, records_intraday, t
    else:
        # For sleep type "stages"
        for data in mergeLongAndShortData(record["levels"]):
            d_datetime = data[0].strftime("%Y-%m-%d %H:%M:%S")
            row_intraday = (device_id, type_episode_id, 30,
                data[1], d_is_main_sleep, sleep_record_type,
-                d_datetime, 0)
+                data[0], 0)
            records_intraday.append(row_intraday)
--- a/src/data/streams/mutations/fitbit/parse_sleep_summary_json.py
+++ b/src/data/streams/mutations/fitbit/parse_sleep_summary_json.py
@ -1,5 +1,8 @@
-import json
+import json, yaml
 import pandas as pd
 import numpy as np
 from datetime import datetime, timedelta
 import dateutil.parser
 SLEEP_SUMMARY_COLUMNS = ("device_id", "efficiency",
                                "minutes_after_wakeup", "minutes_asleep", "minutes_awake", "minutes_to_fall_asleep", "minutes_in_bed",
@ -13,8 +16,8 @@ def parseOneSleepRecord(record, device_id, d_is_main_sleep, records_summary, epi
    sleep_record_type = episode_type
-    d_start_datetime = record["startTime"][:19].replace("T", " ")
+    d_start_datetime = datetime.strptime(record["startTime"][:19], "%Y-%m-%dT%H:%M:%S")
-    d_end_datetime = record["endTime"][:19].replace("T", " ")
+    d_end_datetime = datetime.strptime(record["endTime"][:19], "%Y-%m-%dT%H:%M:%S")
    # Summary data
    row_summary = (device_id, record["efficiency"],
                    record["minutesAfterWakeup"], record["minutesAsleep"], record["minutesAwake"], record["minutesToFallAsleep"], record["timeInBed"],
--- a/src/data/streams/mutations/fitbit/parse_steps_intraday_json.py
+++ b/src/data/streams/mutations/fitbit/parse_steps_intraday_json.py
@ -23,7 +23,7 @@ def parseStepsData(steps_data):
                dataset = record["activities-steps-intraday"]["dataset"]
                for data in dataset:
                    d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
-                    d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
+                    d_datetime = datetime.combine(curr_date, d_time)
                    row_intraday = (device_id,
                        data["value"],
--- a/src/data/streams/mutations/fitbit/parse_steps_summary_json.py
+++ b/src/data/streams/mutations/fitbit/parse_steps_summary_json.py
@ -1,5 +1,6 @@
 import json
 import pandas as pd
 from datetime import datetime
 STEPS_COLUMNS = ("device_id", "steps", "local_date_time", "timestamp")
@ -15,7 +16,7 @@ def parseStepsData(steps_data):
    for record in steps_data.json_fitbit_column:
        record = json.loads(record)  # Parse text into JSON
        if "activities-steps" in record.keys():
-            curr_date = record["activities-steps"][0]["dateTime"] + " 00:00:00"
+            curr_date = datetime.strptime(record["activities-steps"][0]["dateTime"], "%Y-%m-%d")
            row_summary = (device_id,
                record["activities-steps"][0]["value"],
--- a/src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R
+++ b/src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R
@ -36,8 +36,7 @@ unify_ios_activity_recognition <- function(ios_gar){
                                         activities == "cycling" ~ "on_bicycle",
                                         activities == "walking" ~ "walking",
                                         activities == "running" ~ "running",
-                                         activities == "stationary" ~ "still",
+                                         activities == "stationary" ~ "still"),
-                                         activities == "unknown" ~ "unknown"),
               activity_type = case_when(activities == "automotive" ~ 0,
                                         activities == "cycling" ~ 1,
                                         activities == "walking" ~ 7,
--- a/src/data/streams/mutations/phone/aware/calls_ios_unification.R
+++ b/src/data/streams/mutations/phone/aware/calls_ios_unification.R
@ -39,7 +39,7 @@ unify_ios_calls <- function(ios_calls){
                        assigned_segments = first(assigned_segments))
        }
        else {
-            ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(as.numeric(call_duration)),  timestamp = first(timestamp), device_id = first(device_id))
+            ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(call_duration),  timestamp = first(timestamp), device_id = first(device_id))
        }
        ios_calls <- ios_calls %>% mutate(call_type = case_when(
            call_type_sequence == "1,2,4" | call_type_sequence == "2,1,4" ~ 1, # incoming
--- a/src/data/streams/mutations/phone/straw/app_add_name.R
+++ b/src/data/streams/mutations/phone/straw/app_add_name.R
@ -1,8 +0,0 @@
 source("renv/activate.R") # needed to use RAPIDS renv environment
 library(dplyr)
 main <- function(data, stream_parameters){
    data <- data %>%
      mutate(application_name = "hashed")
    return(data)
 }
--- a/src/data/streams/mutations/phone/straw/app_add_name.py
+++ b/src/data/streams/mutations/phone/straw/app_add_name.py
@ -1,5 +0,0 @@
 import pandas as pd
 def main(data, stream_parameters):
    data["application_name"] = "hashed"
    return(data)
--- a/src/data/streams/rapids_columns.yaml
+++ b/src/data/streams/rapids_columns.yaml
@ -35,8 +35,11 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
  - DEVICE_ID
  - PACKAGE_NAME
  - APPLICATION_NAME
  - TEXT
  - SOUND
  - VIBRATE
  - DEFAULTS
  - FLAGS
 PHONE_BATTERY:
  - TIMESTAMP
@ -67,16 +70,6 @@ PHONE_CONVERSATION:
  - DOUBLE_CONVO_START
  - DOUBLE_CONVO_END
 PHONE_ESM:
  - TIMESTAMP
  - DEVICE_ID
  - ESM_STATUS
  - ESM_USER_ANSWER
  - ESM_JSON
  - ESM_TRIGGER
  - ESM_SESSION
  - ESM_NOTIFICATION_ID
 PHONE_KEYBOARD:
  - TIMESTAMP
  - DEVICE_ID
@ -118,11 +111,6 @@ PHONE_SCREEN:
  - DEVICE_ID
  - SCREEN_STATUS
 PHONE_SPEECH:
  - TIMESTAMP
  - DEVICE_ID
  - SPEECH_PROPORTION
 PHONE_WIFI_CONNECTED:
  - TIMESTAMP
  - DEVICE_ID
@ -232,7 +220,6 @@ EMPATICA_INTER_BEAT_INTERVAL:
  - TIMESTAMP
  - DEVICE_ID
  - INTER_BEAT_INTERVAL
  - TIMINGS
 EMPATICA_TAGS:
  - TIMESTAMP
--- a/src/data/translate_usernames_into_participants_data.R
+++ b/src/data/translate_usernames_into_participants_data.R
@ -1,62 +0,0 @@
 source("renv/activate.R")
 source("src/data/streams/aware_postgresql/container.R")
 library(RPostgres)
 library(magrittr)
 library(tidyverse)
 library(lubridate)
 prepare_participants_file <- function() {
  username_list_csv_location <- snakemake@input[["username_list"]]
  data_configuration <- snakemake@params[["data_configuration"]]
  participants_container <- snakemake@params[["participants_table"]]
  device_id_container <- snakemake@params[["device_id_table"]]
  start_end_date_container <- snakemake@params[["start_end_date_table"]]
  output_data_file <- snakemake@output[["participants_file"]]
  platform <- "android"
  pid_format <- "p%03d"
  datetime_format <- "%Y-%m-%d %H:%M:%S"
  participant_data <- read_csv(username_list_csv_location, col_types = "cc", progress = FALSE)
  usernames <- participant_data$label
  participant_ids <- pull_participants_ids(data_configuration, usernames, participants_container)
  participant_data %<>%
    left_join(participant_ids, by = c("label" = "username")) %>%
    rename(participant_id = id)
  device_ids <- pull_participants_device_ids(data_configuration, participant_data$participant_id, device_id_container)
  device_ids %<>%
    filter(device_id != "") %>%
    group_by(participant_id) %>%
    summarise(device_ids = list(unique(device_id)))
  participant_data %<>%
    left_join(device_ids, by = "participant_id")
  start_end_datetimes <- pull_participants_start_end_dates(data_configuration, participant_data$participant_id, start_end_date_container)
  participant_data %<>%
    left_join(start_end_datetimes, by = "participant_id")
  participant_data %<>%
  mutate(
    pid = sprintf(pid_format, participant_id),
    start_date = strftime(datetime_start, format=datetime_format, tz = "UTC", usetz = FALSE), #TODO Check what timezone is expected
    end_date = strftime(datetime_end, format=datetime_format, tz = "UTC", usetz = FALSE),
    device_id = map_chr(device_ids, str_c, collapse = ";"),
    number_of_devices = map_int(device_ids, length),
    fitbit_id = ""
    ) %>%
  rowwise() %>%
  mutate(platform = str_c(replicate(number_of_devices, platform), collapse = ";")) %>%
  ungroup() %>%
  arrange(pid) %>%
  select(pid, label, start_date, end_date, empatica_id, device_id, platform, fitbit_id)
  write_csv(participant_data, output_data_file)
 }
 prepare_participants_file()
--- a/src/features/init.py
+++ b/src/features/init.py
--- a/src/features/all_cleaning_individual/rapids/main.R
+++ b/src/features/all_cleaning_individual/rapids/main.R
@ -1,89 +0,0 @@
 source("renv/activate.R")
 library(tidyr)
 library("dplyr", warn.conflicts = F)
 library(tidyverse)
 library(caret)
 library(corrr)
 rapids_cleaning <- function(sensor_data_files, provider){
    clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
    impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
    cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
    drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
    rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
    data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
    data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
    data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
    drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
    # Impute selected event features
    if(as.logical(impute_selected_event_features$COMPUTE)){
        if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
            stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
        }
        column_names <- colnames(clean_features)
        selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
        selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
        selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
        selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
        selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
        selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
        selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
        selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
        clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
    }
    # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
    if(!data_yield_column %in% colnames(clean_features)){
        stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
    }
    if (data_yield_ratio_threshold > 0) {
        clean_features <- clean_features %>% 
        filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
    }
    # Drop columns with a percentage of NA values above cols_nan_threshold
    if(nrow(clean_features))
        clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
    # Drop columns with zero variance
    if(drop_zero_variance_columns)
         clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
    # Drop highly correlated features
    if(as.logical(drop_highly_correlated_features$COMPUTE)){
        min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
        corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
        features_for_corr <- clean_features %>% 
            select_if(is.numeric) %>% 
            select_if(sapply(., n_distinct, na.rm = T) > 1)
        valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
        if((nrow(features_for_corr) != 0) & (ncol(features_for_corr) != 0)){
            highly_correlated_features <- features_for_corr %>% 
                correlate(use = "pairwise.complete.obs", method = "spearman") %>% 
                column_to_rownames(., var = "term") %>% 
                as.matrix() %>% 
                replace(!valid_pairs | is.na(.), 0) %>% 
                findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
            clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
        }
    }
    # Drop rows with a percentage of NA values above rows_nan_threshold
    clean_features <- clean_features %>% 
        mutate(percentage_na =  rowSums(is.na(.)) / ncol(.)) %>% 
        filter(percentage_na <= rows_nan_threshold) %>% 
        select(-percentage_na)
    return(clean_features)
 }
--- a/src/features/all_cleaning_individual/straw/init.py
+++ b/src/features/all_cleaning_individual/straw/init.py
--- a/src/features/all_cleaning_individual/straw/main.py
+++ b/src/features/all_cleaning_individual/straw/main.py
@ -1,180 +0,0 @@
 import pandas as pd
 import numpy as np
 import math, sys, random
 import yaml
 from sklearn.impute import KNNImputer
 from sklearn.preprocessing import StandardScaler
 import matplotlib.pyplot as plt
 import seaborn as sns
 sys.path.append('/rapids/')
 from src.features import empatica_data_yield as edy
 pd.set_option('display.max_columns', 20)
 def straw_cleaning(sensor_data_files, provider):
    features = pd.read_csv(sensor_data_files["sensor_data"][0])
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
    with open('config.yaml', 'r') as stream:
        config = yaml.load(stream, Loader=yaml.FullLoader)
    excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
    # (1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
    if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
        target = config['PARAMS_FOR_ANALYSIS']['TARGET']['LABEL'] # get target label from config
        if 'phone_esm_straw_' + target in features:
            features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
        else:
            return features
    # (2.1) QUALITY CHECK (DATA YIELD COLUMN) deletes the rows where E4 or phone data is low quality
    phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
    phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
    if features.empty:
        return features
    features = edy.calculate_empatica_data_yield(features)
    if not phone_data_yield_column in features.columns and not "empatica_data_yield" in features.columns:
        raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
    # Drop rows where phone data yield is less than the given threshold
    if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
        features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
    # Drop rows where empatica data yield is less than the given threshold
    if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
        features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
    if features.empty:
        return features
    # (2.2) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
    min_count =  math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
    features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
    # (3) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows)
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
    features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
    # Preserve esm cols if deleted (has to come after drop cols operations)
    for esm in esm_cols:
        if esm not in features:
            features[esm] = esm_cols[esm]
    # (4) CONTEXTUAL IMPUTATION
    # Impute selected phone features with a high number
    impute_w_hn = [col for col in features.columns if \
        "timeoffirstuse" in col or
        "timeoflastuse" in col or
        "timefirstcall" in col or
        "timelastcall" in col or
        "firstuseafter" in col or
        "timefirstmessages" in col or
        "timelastmessages" in col]
    features[impute_w_hn] = features[impute_w_hn].fillna(1500)
    # Impute special case (mostcommonactivity) and (homelabel)
    impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
    features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
    impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
    features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
    impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
    features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - very low value for log location variance
    # Impute selected phone features with 0
    impute_zero = [col for col in features if \
        col.startswith('phone_applications_foreground_rapids_') or
        col.startswith('phone_battery_rapids_') or
        col.startswith('phone_bluetooth_rapids_') or
        col.startswith('phone_light_rapids_') or
        col.startswith('phone_calls_rapids_') or
        col.startswith('phone_messages_rapids_') or
        col.startswith('phone_screen_rapids_') or
        col.startswith('phone_wifi_visible')]
    features[impute_zero+list(esm_cols.columns)] = features[impute_zero+list(esm_cols.columns)].fillna(0)
    ## (5) STANDARDIZATION 
    if provider["STANDARDIZATION"]:
        features.loc[:, ~features.columns.isin(excluded_columns)] = StandardScaler().fit_transform(features.loc[:, ~features.columns.isin(excluded_columns)])
    # (6) IMPUTATION: IMPUTE DATA WITH KNN METHOD
    impute_cols = [col for col in features.columns if col not in excluded_columns]
    features.reset_index(drop=True, inplace=True)
    features[impute_cols] = impute(features[impute_cols], method="knn")
    # (7) REMOVE COLS WHERE VARIANCE IS 0
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
    if provider["COLS_VAR_THRESHOLD"]:
        features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
    fe5 = features.copy()
    # (8) DROP HIGHLY CORRELATED FEATURES
    drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
    if drop_corr_features["COMPUTE"] and features.shape[0]: # If small amount of segments (rows) is present, do not execute correlation check
        numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
        # Remove columns where NaN count threshold is passed
        valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
        corr_matrix = valid_features.corr().abs()
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
        features.drop(to_drop, axis=1, inplace=True)
    # Preserve esm cols if deleted (has to come after drop cols operations)
    for esm in esm_cols:
        if esm not in features:
            features[esm] = esm_cols[esm]
    # (9) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
    if features.isna().any().any():
        raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
    return features
 def k_nearest(df):
    pd.set_option('display.max_columns', None)
    imputer = KNNImputer(n_neighbors=3)
    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
 def impute(df, method='zero'):
    return {
        'zero': df.fillna(0),
        'high_number': df.fillna(1500),
        'mean': df.fillna(df.mean()),
        'median': df.fillna(df.median()),
        'knn': k_nearest(df) 
    }[method]
 def graph_bf_af(features, phase_name, plt_flag=False):
    if plt_flag:
        sns.set(rc={"figure.figsize":(16, 8)})
        sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
        plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
    print(f"\n-------------{phase_name}-------------")
    print("Rows number:", features.shape[0])
    print("Columns number:", len(features.columns))
    print("---------------------------------------------\n")
--- a/src/features/all_cleaning_overall/rapids/main.R
+++ b/src/features/all_cleaning_overall/rapids/main.R
@ -1,89 +0,0 @@
 source("renv/activate.R")
 library(tidyr)
 library("dplyr", warn.conflicts = F)
 library(tidyverse)
 library(caret)
 library(corrr)
 rapids_cleaning <- function(sensor_data_files, provider){
    clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
    impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
    cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
    drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
    rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
    data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
    data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
    data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
    drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
    # Impute selected event features
    if(as.logical(impute_selected_event_features$COMPUTE)){
        if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
            stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
        }
        column_names <- colnames(clean_features)
        selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
        selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
        selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
        selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
        selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
        selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
        selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
        selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
        clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
    }
    # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
    if(!data_yield_column %in% colnames(clean_features)){
        stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
    }
    if (data_yield_ratio_threshold > 0) {
        clean_features <- clean_features %>% 
        filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
    }
    # Drop columns with a percentage of NA values above cols_nan_threshold
    if(nrow(clean_features))
        clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
    # Drop columns with zero variance
    if(drop_zero_variance_columns)
        clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
    # Drop highly correlated features
    if(as.logical(drop_highly_correlated_features$COMPUTE)){
        min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
        corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
        features_for_corr <- clean_features %>% 
            select_if(is.numeric) %>% 
            select_if(sapply(., n_distinct, na.rm = T) > 1)
        valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
        if((nrow(features_for_corr) != 0) & (ncol(features_for_corr) != 0)){
            highly_correlated_features <- features_for_corr %>% 
                correlate(use = "pairwise.complete.obs", method = "spearman") %>% 
                column_to_rownames(., var = "term") %>% 
                as.matrix() %>% 
                replace(!valid_pairs | is.na(.), 0) %>% 
                findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
            clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
        }
    }
    # Drop rows with a percentage of NA values above rows_nan_threshold
    clean_features <- clean_features %>% 
        mutate(percentage_na =  rowSums(is.na(.)) / ncol(.)) %>% 
        filter(percentage_na <= rows_nan_threshold) %>% 
        select(-percentage_na)
    return(clean_features)
 }
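The final step of the R function above keeps only rows whose share of NA values does not exceed `rows_nan_threshold`. A minimal pandas sketch of the same row filter, for comparison with the Python provider that follows. The function name `drop_sparse_rows` and the demo data are hypothetical and are not part of RAPIDS.

```python
import numpy as np
import pandas as pd

def drop_sparse_rows(df, rows_nan_threshold):
    """Keep rows whose fraction of NaN values is at most rows_nan_threshold."""
    # Fraction of missing cells per row, computed over all columns.
    na_ratio = df.isna().sum(axis=1) / df.shape[1]
    return df[na_ratio <= rows_nan_threshold].reset_index(drop=True)

# Hypothetical demo data: the middle row is 75% NaN and is removed at a 0.5 threshold.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
    "d": [1.0, 2.0, 3.0],
})
kept = drop_sparse_rows(demo, rows_nan_threshold=0.5)
```

Comparing the ratio directly avoids converting the threshold into a minimum count of non-NaN cells, which can be sensitive to floating-point rounding.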
--- a/src/features/all_cleaning_overall/straw/init.py
+++ b/src/features/all_cleaning_overall/straw/init.py
--- a/src/features/all_cleaning_overall/straw/main.py
+++ b/src/features/all_cleaning_overall/straw/main.py
@ -1,275 +0,0 @@
 import pandas as pd
 import numpy as np
 import math, sys, random, warnings, yaml
 from sklearn.impute import KNNImputer
 from sklearn.preprocessing import StandardScaler, minmax_scale 
 import matplotlib.pyplot as plt
 import seaborn as sns
 sys.path.append('/rapids/')
 from src.features import empatica_data_yield as edy
 def straw_cleaning(sensor_data_files, provider, target):
    features = pd.read_csv(sensor_data_files["sensor_data"][0])
    with open('config.yaml', 'r') as stream:
        config = yaml.load(stream, Loader=yaml.FullLoader)
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
    excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
    graph_bf_af(features, "1target_rows_before")
    # (1.0) OVERRIDE STRESSFULNESS EVENT TARGETS IF ERS SEGMENTING_METHOD IS "STRESS_EVENT"
    if config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] == "stress_event": 
        stress_events_targets = pd.read_csv("data/external/stress_event_targets.csv")   
        if "appraisal_stressfulness_event_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
            features.drop(columns=['phone_esm_straw_appraisal_stressfulness_event_mean'], inplace=True)
            features = features.merge(stress_events_targets[["label", "appraisal_stressfulness_event"]] \
                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
                        .rename(columns={'appraisal_stressfulness_event': 'phone_esm_straw_appraisal_stressfulness_event_mean'})
        if "appraisal_threat_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
            features.drop(columns=['phone_esm_straw_appraisal_threat_mean'], inplace=True)
            features = features.merge(stress_events_targets[["label", "appraisal_threat"]] \
                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
                        .rename(columns={'appraisal_threat': 'phone_esm_straw_appraisal_threat_mean'})
        if "appraisal_challenge_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
            features.drop(columns=['phone_esm_straw_appraisal_challenge_mean'], inplace=True)
            features = features.merge(stress_events_targets[["label", "appraisal_challenge"]] \
                        .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
                        .rename(columns={'appraisal_challenge': 'phone_esm_straw_appraisal_challenge_mean'})
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
    # (1.1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
    if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
        features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
    if features.empty:
        return pd.DataFrame(columns=excluded_columns)
    graph_bf_af(features, "2target_rows_after")
    # (2) QUALITY CHECK (DATA YIELD COLUMN) drops the rows where E4 or phone data is low quality
    phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
    phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
    features = edy.calculate_empatica_data_yield(features)
    if phone_data_yield_column not in features.columns or "empatica_data_yield" not in features.columns:
        raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
    hist = features[["empatica_data_yield", phone_data_yield_column]].hist()
    plt.savefig(f'phone_E4_histogram.png', bbox_inches='tight')
    # Drop rows where phone data yield is less than the given threshold
    if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
        hist = features[phone_data_yield_column].hist(bins=5)
        plt.close()
        features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
    # Drop rows where empatica data yield is less than the given threshold
    if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
        features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
    if features.empty:
        return pd.DataFrame(columns=excluded_columns)
    graph_bf_af(features, "3data_yield_drop_rows")
    # (3) CONTEXTUAL IMPUTATION
    # Impute selected phone features with a high number
    impute_w_hn = [col for col in features.columns if \
        "timeoffirstuse" in col or
        "timeoflastuse" in col or
        "timefirstcall" in col or
        "timelastcall" in col or
        "firstuseafter" in col or
        "timefirstmessages" in col or
        "timelastmessages" in col]
    features[impute_w_hn] = features[impute_w_hn].fillna(1500)
    # Impute special case (mostcommonactivity) and (homelabel)
    impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
    features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
    impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
    features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
    impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
    features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
    # Impute location features
    impute_locations = [col for col in features \
        if col.startswith('phone_locations_doryab_') and
        'radiusgyration' not in col    
    ]
    # Impute selected phone, location, and esm features with 0
    impute_zero = [col for col in features if \
        col.startswith('phone_applications_foreground_rapids_') or
        col.startswith('phone_activity_recognition_') or
        col.startswith('phone_battery_rapids_') or
        col.startswith('phone_bluetooth_rapids_') or
        col.startswith('phone_light_rapids_') or
        col.startswith('phone_calls_rapids_') or
        col.startswith('phone_messages_rapids_') or
        col.startswith('phone_screen_rapids_') or
        col.startswith('phone_bluetooth_doryab_') or
        col.startswith('phone_wifi_visible')
        ]
    features[impute_zero+impute_locations+list(esm_cols.columns)] = features[impute_zero+impute_locations+list(esm_cols.columns)].fillna(0)
    pd.set_option('display.max_rows', None)
    graph_bf_af(features, "4context_imp")
    # (4) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows)
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
    features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
    graph_bf_af(features, "5too_much_nans_cols")
    # (5) REMOVE COLS WHERE VARIANCE IS 0
    if provider["COLS_VAR_THRESHOLD"]:
        col_std = features.std(numeric_only=True) # Compute the per-column standard deviation once
        features.drop(columns=col_std[col_std == 0].index, inplace=True)
    graph_bf_af(features, "6variance_drop")
    # Preserve esm cols if deleted (has to come after drop cols operations)
    for esm in esm_cols:
        if esm not in features:
            features[esm] = esm_cols[esm]
    # (6) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
    min_count =  math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
    features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
    graph_bf_af(features, "7too_much_nans_rows")
    if features.empty:
        return pd.DataFrame(columns=excluded_columns)
    # (7) STANDARDIZATION
    if provider["STANDARDIZATION"]:
        nominal_cols = [col for col in features.columns if "mostcommonactivity" in col or "homelabel" in col] # Excluded nominal features
        # Expected warning within this code block
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=RuntimeWarning)
            if provider["TARGET_STANDARDIZATION"]:
                features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols)] = \
                    features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols)].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
            else:
                features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols + ['phone_esm_straw_' + target])] = \
                    features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols + ['phone_esm_straw_' + target])].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
    graph_bf_af(features, "8standardization")
    # (8) IMPUTATION: IMPUTE DATA WITH KNN METHOD
    features.reset_index(drop=True, inplace=True)
    impute_cols = [col for col in features.columns if col not in excluded_columns and col != "pid"]
    features[impute_cols] = impute(features[impute_cols], method="knn")
    graph_bf_af(features, "9knn_after")
    # (9) DROP HIGHLY CORRELATED FEATURES
    esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
    drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
    if drop_corr_features["COMPUTE"] and features.shape[0] > 5: # If small amount of segments (rows) is present, do not execute correlation check
        numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
        # Remove columns where NaN count threshold is passed
        valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
        corr_matrix = valid_features.corr().abs()
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
        # sns.heatmap(corr_matrix, cmap="YlGnBu")
        # plt.savefig(f'correlation_matrix.png', bbox_inches='tight')
        # plt.close()
        # s = corr_matrix.unstack()
        # so = s.sort_values(ascending=False)
        # pd.set_option('display.max_rows', None)
        # sorted_upper = upper.unstack().sort_values(ascending=False)
        # print(sorted_upper[sorted_upper > drop_corr_features["CORR_THRESHOLD"]])
        features.drop(to_drop, axis=1, inplace=True)
    # Preserve esm cols if deleted (has to come after drop cols operations)
    for esm in esm_cols:
        if esm not in features:
            features[esm] = esm_cols[esm]
    graph_bf_af(features, "10correlation_drop")
    # Transform categorical columns to category dtype
    cat1 = [col for col in features.columns if "mostcommonactivity" in col]
    if cat1: # Transform columns to category dtype (mostcommonactivity)
        features[cat1] = features[cat1].astype(int).astype('category')
    cat2 = [col for col in features.columns if "homelabel" in col]
    if cat2: # Transform columns to category dtype (homelabel)
        features[cat2] = features[cat2].astype(int).astype('category')
    # (10) DROP ALL WINDOW RELATED COLUMNS
    win_count_cols = [col for col in features if "SO_windowsCount" in col]
    if win_count_cols:
        features.drop(columns=win_count_cols, inplace=True)
    # (11) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
    if features.isna().any().any():
        raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
    return features
 def k_nearest(df):
    imputer = KNNImputer(n_neighbors=3)
    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
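 def impute(df, method='zero'):
    # Map each method to a callable so that only the requested imputation runs
    # (an eagerly built dict would also fit the KNN imputer on every call).
    methods = {
        'zero': lambda: df.fillna(0),
        'high_number': lambda: df.fillna(1500),
        'mean': lambda: df.fillna(df.mean()),
        'median': lambda: df.fillna(df.median()),
        'knn': lambda: k_nearest(df)
    }
    return methods[method]()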
 def graph_bf_af(features, phase_name, plt_flag=False):
    if plt_flag:
        sns.set(rc={"figure.figsize":(16, 8)})
        sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
        plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
    print(f"\n-------------{phase_name}-------------")
    print("Rows number:", features.shape[0])
    print("Columns number:", len(features.columns))
    print("NaN values:", features.isna().sum().sum())
    print("---------------------------------------------\n")
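Step (9) of `straw_cleaning` drops highly correlated features by inspecting only the upper triangle of the absolute correlation matrix, so each pair is checked once and the later column of a correlated pair is removed. A minimal, self-contained sketch of that technique is shown below. The function name `drop_highly_correlated` and the demo data are hypothetical, and the sketch leaves out the NaN-overlap filter and the ESM column preservation that the original applies.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, corr_threshold):
    """Drop every column whose absolute Pearson correlation with an earlier column exceeds corr_threshold."""
    corr = df.corr().abs()
    # Keep only the entries strictly above the diagonal; everything else becomes NaN.
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    # A column is dropped if any earlier column correlates with it above the threshold.
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical demo data: "x_scaled" is a linear copy of "x", "noise" is only weakly related.
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})
reduced, dropped = drop_highly_correlated(demo, corr_threshold=0.95)
```

Because the comparison runs over the upper triangle, the first column of a correlated pair is always the one that survives, so the result depends on column order.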
--- a/src/features/cr_features_helper_methods.py
+++ b/src/features/cr_features_helper_methods.py
@ -1,59 +0,0 @@
 import pandas as pd
 import numpy as np
 import math as m
 import sys
 def extract_second_order_features(intraday_features, so_features_names, prefix=""):
    if prefix:
        groupby_cols = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
    else:
        groupby_cols = ['local_segment']
    if not intraday_features.empty:
        so_features = pd.DataFrame()
        #print(intraday_features.drop("level_1", axis=1).groupby(["local_segment"]).nsmallest())
        if "mean" in so_features_names:
            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).mean(numeric_only=True).add_suffix("_SO_mean")], axis=1)
        if "median" in so_features_names:
            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).median(numeric_only=True).add_suffix("_SO_median")], axis=1)
        if "sd" in so_features_names:
            so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std(numeric_only=True).fillna(0).add_suffix("_SO_sd")], axis=1)
        if "nlargest" in so_features_names: # largest 5 -- maybe there is a faster groupby solution?
            for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
                so_features[column+"_SO_nlargest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nlargest(5).mean())
        if "nsmallest" in so_features_names: # smallest 5 -- maybe there is a faster groupby solution?
            for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
                so_features[column+"_SO_nsmallest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nsmallest(5).mean())
        if "count_windows" in so_features_names:
            so_features["SO_windowsCount"] = intraday_features.groupby(groupby_cols).count()[prefix+"level_1"]
        # numPeaksNonZero specialized for EDA sensor
        if "eda_num_peaks_non_zero" in so_features_names and prefix+"numPeaks" in intraday_features.columns:
            so_features[prefix+"SO_numPeaksNonZero"] = intraday_features.groupby(groupby_cols)[prefix+"numPeaks"].apply(lambda x: (x!=0).sum())
        # numWindowsNonNaN specialized for BVP and IBI sensors
        if "hrv_num_windows_non_nan" in so_features_names and prefix+"meanHr" in intraday_features.columns:
            so_features[prefix+"SO_numWindowsNonNaN"] = intraday_features.groupby(groupby_cols)[prefix+"meanHr"].apply(lambda x: (~np.isnan(x)).sum())
        so_features.reset_index(inplace=True)
    else:
        so_features = pd.DataFrame(columns=groupby_cols)
    return so_features
 def get_sample_rate(data): # To-Do get the sample rate information from the file's metadata
    try:
        timestamps_diff = data['timestamp'].diff().dropna().mean()
        print("Timestamp diff:", timestamps_diff)
    except Exception as err:
        raise Exception("Error occurred while trying to get the mean sample rate from the data.") from err
    return m.ceil(1000/timestamps_diff)
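`get_sample_rate` derives the sample rate from the mean gap between consecutive timestamps. Dividing 1000 by that gap suggests the timestamps are in milliseconds, which is an assumption here and is not stated in the source. A minimal sketch under that assumption follows. The name `estimate_sample_rate` and the explicit guard against empty or non-increasing input are hypothetical additions.

```python
import math
import pandas as pd

def estimate_sample_rate(timestamps_ms):
    """Estimate the sample rate in Hz from timestamps assumed to be in milliseconds."""
    mean_gap_ms = pd.Series(timestamps_ms, dtype="float64").diff().dropna().mean()
    # With fewer than two samples the mean gap is NaN; a non-positive gap is equally unusable.
    if not mean_gap_ms > 0:
        raise ValueError("At least two increasing timestamps are needed to estimate a sample rate.")
    return math.ceil(1000 / mean_gap_ms)

# Hypothetical demo: one sample every 250 ms.
rate = estimate_sample_rate([0, 250, 500, 750, 1000])
```

Rounding up with `math.ceil` mirrors the original function, so a slightly irregular gap never yields a rate below the nominal one.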
--- a/src/features/empatica_accelerometer/cr/main.py
+++ b/src/features/empatica_accelerometer/cr/main.py
@ -1,75 +0,0 @@
 import pandas as pd
 from scipy.stats import entropy
 from cr_features.helper_functions import convert_to2d, accelerometer_features, frequency_features
 from cr_features.calculate_features_old import calculateFeatures
 from cr_features.calculate_features import calculate_features
 from cr_features_helper_methods import extract_second_order_features
 import sys
 def extract_acc_features_from_intraday_data(acc_intraday_data, features, window_length, time_segment, filter_data_by_segment):
    acc_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
    if not acc_intraday_data.empty:   
        sample_rate = 32
        acc_intraday_data = filter_data_by_segment(acc_intraday_data, time_segment)
        if not acc_intraday_data.empty:
            acc_intraday_features = pd.DataFrame()
            # apply methods from calculate features module
            if window_length is None:
                acc_intraday_features = \
                    acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
                    convert_to2d(x['double_values_0'], x.shape[0]), \
                    convert_to2d(x['double_values_1'], x.shape[0]), \
                    convert_to2d(x['double_values_2'], x.shape[0]), \
                    fs=sample_rate, feature_names=features, show_progress=False)) 
            else:
                acc_intraday_features = \
                    acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
                    convert_to2d(x['double_values_0'], window_length*sample_rate), \
                    convert_to2d(x['double_values_1'], window_length*sample_rate), \
                    convert_to2d(x['double_values_2'], window_length*sample_rate), \
                    fs=sample_rate, feature_names=features, show_progress=False)) 
            acc_intraday_features.reset_index(inplace=True)
    return acc_intraday_features
 def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'double_values_0': 'float64',
                    'double_values_1': 'float64', 'double_values_2': 'float64', 'local_date_time': 'str', 'local_date': "str",
                    'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
    acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)    
    requested_intraday_features = provider["FEATURES"]
    calc_windows = kwargs.get('calc_windows', False)
    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
    else:
        requested_window_length = None
    # name of the features this function can compute
    base_intraday_features_names = accelerometer_features + frequency_features
    # the subset of requested features this function can compute
    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
    # extract features from intraday data
    acc_intraday_features = extract_acc_features_from_intraday_data(acc_intraday_data, intraday_features_to_compute, 
                                                                requested_window_length, time_segment, filter_data_by_segment)
    if calc_windows:
        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
        acc_second_order_features = extract_second_order_features(acc_intraday_features, so_features_names)
        return acc_intraday_features, acc_second_order_features
    return acc_intraday_features
--- a/Show More
+++ b/Show More
@ -1,2 +1,2 @@
 label,length
-fiveminutes,5
+thirtyminutes,30