Compare commits
16 Commits
master...data_clean
Author | SHA1 | Date |
---|---|---|
Meng Li | 0e112d4f68 | |
Meng Li | 10a57be839 | |
Meng Li | 3e2473c4c0 | |
Weiyu | be9ea60b61 | |
Meng Li | 8250416a7f | |
Meng Li | 512355ca01 | |
Meng Li | 3e7b9260d2 | |
Meng Li | 9f0db3bedd | |
Meng Li | 9607d8d673 | |
Meng Li | 5f85dd8ca4 | |
Meng Li | 9cbc67c8ec | |
Meng Li | 50ed8a9536 | |
Meng Li | 871cdbbcd3 | |
Meng Li | 4a7989c058 | |
Meng Li | 8e3d5eb98c | |
Meng Li | fc121863ff |
|
@ -1,7 +0,0 @@
|
|||
# We'll let Git's auto-detection algorithm infer if a file is text. If it is,
|
||||
# enforce LF line endings regardless of OS or git configurations.
|
||||
* text=auto eol=lf
|
||||
|
||||
# Isolate binary files in case the auto-detection algorithm fails and
|
||||
# marks them as text files (which could brick them).
|
||||
*.{png,jpg,jpeg,gif,webp,woff,woff2} binary
|
|
@ -93,17 +93,10 @@ packrat/*
|
|||
|
||||
# exclude data from source control by default
|
||||
data/external/*
|
||||
!/data/external/empatica/empatica1/E4 Data.zip
|
||||
!/data/external/.gitkeep
|
||||
!/data/external/stachl_application_genre_catalogue.csv
|
||||
!/data/external/timesegments*.csv
|
||||
!/data/external/wiki_tz.csv
|
||||
!/data/external/main_study_usernames.csv
|
||||
!/data/external/timezone.csv
|
||||
!/data/external/play_store_application_genre_catalogue.csv
|
||||
!/data/external/play_store_categories_count.csv
|
||||
|
||||
|
||||
data/raw/*
|
||||
!/data/raw/.gitkeep
|
||||
data/interim/*
|
||||
|
@ -121,12 +114,3 @@ settings.dcf
|
|||
tests/fakedata_generation/
|
||||
site/
|
||||
credentials.yaml
|
||||
|
||||
# Docker container and other files
|
||||
.devcontainer
|
||||
|
||||
# Calculating features module
|
||||
calculatingfeatures/
|
||||
|
||||
# Temp folder for rapids data/external
|
||||
rapids_temp_data/
|
||||
|
|
188
README.md
188
README.md
|
@ -11,191 +11,3 @@
|
|||
For more information refer to our [documentation](http://www.rapids.science)
|
||||
|
||||
By [MoSHI](https://www.moshi.pitt.edu/), [University of Pittsburgh](https://www.pitt.edu/)
|
||||
|
||||
## Installation
|
||||
|
||||
For RAPIDS installation, refer to the [documentation](https://www.rapids.science/1.8/setup/installation/)
|
||||
|
||||
### For the installation of the Docker version
|
||||
|
||||
1. Follow the [instructions](https://www.rapids.science/1.8/setup/installation/) to set up RAPIDS via Docker (from scratch).
|
||||
|
||||
2. Delete the current contents of the /rapids/ folder while in a container session.
|
||||
```
|
||||
cd ..
|
||||
rm -rf rapids/{*,.*}
|
||||
cd rapids
|
||||
```
|
||||
|
||||
3. Clone the RAPIDS workspace from Git and check out a specific branch.
|
||||
```
|
||||
git clone "https://repo.ijs.si/junoslukan/rapids.git" .
|
||||
git checkout <branch_name>
|
||||
```
|
||||
|
||||
4. Install the missing `libpq-dev` dependency with bash.
|
||||
```
|
||||
apt-get update -y
|
||||
apt-get install -y libpq-dev
|
||||
```
|
||||
|
||||
5. Restore R venv.
|
||||
Type `R` to start an interactive R session and then run:
|
||||
```
|
||||
renv::restore()
|
||||
```
|
||||
|
||||
6. Install the cr-features module.
|
||||
From https://repo.ijs.si/matjazbostic/calculatingfeatures.git, branch `master`.
|
||||
Then follow the "cr-features module" section below.
|
||||
|
||||
7. Install all required packages from environment.yml; the `--prune` option also deletes conda packages that are not present in the environment file.
|
||||
```
|
||||
conda env update --file environment.yml --prune
|
||||
```
|
||||
|
||||
8. If you wish to update your R or Python virtual environments:
|
||||
```
|
||||
# In an interactive R session:
|
||||
renv::snapshot()
|
||||
# Python:
|
||||
conda env export --no-builds | sed 's/^.*libgfortran.*$/ - libgfortran/' | sed 's/^.*mkl=.*$/ - mkl/' > environment.yml
|
||||
```
|
||||
|
||||
### cr-features module
|
||||
|
||||
This RAPIDS extension uses the cr-features library, accessible [here](https://repo.ijs.si/matjazbostic/calculatingfeatures).
|
||||
|
||||
To use cr-features library:
|
||||
|
||||
- Follow the installation instructions in the [README.md](https://repo.ijs.si/matjazbostic/calculatingfeatures/-/blob/master/README.md).
|
||||
|
||||
- Copy the built calculatingfeatures folder into the RAPIDS workspace.
|
||||
|
||||
- Install the cr-features package by:
|
||||
```
|
||||
pip install path/to/the/calculatingfeatures/folder
|
||||
# e.g., pip install ./calculatingfeatures if the folder is copied to the main parent directory
|
||||
# The cr-features package has to be rebuilt and reinstalled every time to get the newest version,
|
||||
# or the newest version of the Docker image must be used.
|
||||
```
|
||||
|
||||
## Updating RAPIDS
|
||||
|
||||
To update RAPIDS, first pull and merge [origin](https://github.com/carissalow/rapids), for example:
|
||||
|
||||
```commandline
|
||||
git fetch --progress "origin" refs/heads/master
|
||||
git merge --no-ff origin/master
|
||||
```
|
||||
|
||||
Next, update the conda and R virtual environments.
|
||||
|
||||
```bash
|
||||
R -e 'renv::restore(repos = c(CRAN = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))'
|
||||
```
|
||||
|
||||
## Custom configuration
|
||||
### Credentials
|
||||
|
||||
As mentioned under [Database in RAPIDS documentation](https://www.rapids.science/1.6/snippets/database/), a `credentials.yaml` file is needed to connect to a database.
|
||||
It should contain:
|
||||
|
||||
```yaml
|
||||
PSQL_STRAW:
|
||||
database: staw
|
||||
host: 212.235.208.113
|
||||
password: password
|
||||
port: 5432
|
||||
user: staw_db
|
||||
```
|
||||
|
||||
where the `password` value needs to be specified as well.
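To verify these credentials before running the pipeline, a quick check along the following lines can help. This is only a sketch; it assumes the `psycopg2` and `PyYAML` Python packages are available, which are not necessarily part of the RAPIDS environment:

```python
# Minimal connection check for credentials.yaml (assumes psycopg2 and PyYAML are installed).
import yaml
import psycopg2

with open("credentials.yaml") as f:
    creds = yaml.safe_load(f)["PSQL_STRAW"]

conn = psycopg2.connect(
    dbname=creds["database"],
    user=creds["user"],
    password=creds["password"],
    host=creds["host"],
    port=creds["port"],
)
with conn.cursor() as cur:
    cur.execute("SELECT 1;")  # trivial query; succeeds only if the connection is valid
    print("Connection OK:", cur.fetchone())
conn.close()
```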
|
||||
|
||||
## Possible installation issues
|
||||
### Missing dependencies for RPostgres
|
||||
|
||||
When installing the `RPostgres` R package (used to connect to the PostgreSQL database), an error might occur:
|
||||
|
||||
```text
|
||||
------------------------- ANTICONF ERROR ---------------------------
|
||||
Configuration failed because libpq was not found. Try installing:
|
||||
* deb: libpq-dev (Debian, Ubuntu, etc)
|
||||
* rpm: postgresql-devel (Fedora, EPEL)
|
||||
* rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
|
||||
* csw: postgresql_dev (Solaris)
|
||||
* brew: libpq (OSX)
|
||||
If libpq is already installed, check that either:
|
||||
(i) 'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains a libpq.pc file; or
|
||||
(ii) 'pg_config' is in your PATH.
|
||||
If neither can detect libpq, you can set INCLUDE_DIR
|
||||
and LIB_DIR manually via:
|
||||
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
|
||||
--------------------------[ ERROR MESSAGE ]----------------------------
|
||||
<stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
|
||||
compilation terminated.
|
||||
```
|
||||
|
||||
The library requires `libpq` for compiling from source, so install accordingly.
|
||||
|
||||
### Timezone environment variable for tidyverse (relevant for WSL2)
|
||||
|
||||
One of the R packages, `tidyverse`, might need access to the `TZ` environment variable during installation.
|
||||
On Ubuntu 20.04 on WSL2 this triggers the following error:
|
||||
|
||||
```text
|
||||
> install.packages('tidyverse')
|
||||
|
||||
ERROR: configuration failed for package ‘xml2’
|
||||
System has not been booted with systemd as init system (PID 1). Can't operate.
|
||||
Failed to create bus connection: Host is down
|
||||
Warning in system("timedatectl", intern = TRUE) :
|
||||
running command 'timedatectl' had status 1
|
||||
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
|
||||
namespace ‘xml2’ 1.3.1 is already loaded, but >= 1.3.2 is required
|
||||
Calls: <Anonymous> ... namespaceImportFrom -> asNamespace -> loadNamespace
|
||||
Execution halted
|
||||
ERROR: lazy loading failed for package ‘tidyverse’
|
||||
```
|
||||
|
||||
This happens because WSL2 does not use the `timedatectl` service, which provides this variable.
|
||||
|
||||
```bash
|
||||
~$ timedatectl
|
||||
System has not been booted with systemd as init system (PID 1). Can't operate.
|
||||
Failed to create bus connection: Host is down
|
||||
```
|
||||
|
||||
and later
|
||||
|
||||
```bash
|
||||
Warning message:
|
||||
In system("timedatectl", intern = TRUE) :
|
||||
running command 'timedatectl' had status 1
|
||||
Execution halted
|
||||
```
|
||||
|
||||
This can be amended by setting the environment variable manually before attempting to install `tidyverse`:
|
||||
|
||||
```bash
|
||||
export TZ='Europe/Ljubljana'
|
||||
```
|
||||
|
||||
Note: if this is needed to avoid runtime issues, you need to either define this environment variable in each new terminal window or (better) define it in your `~/.bashrc` or `~/.bash_profile`.
|
||||
|
||||
## Possible runtime issues
|
||||
### Unix end of line characters
|
||||
|
||||
Upon running RAPIDS, an error might occur:
|
||||
|
||||
```bash
|
||||
/usr/bin/env: ‘python3\r’: No such file or directory
|
||||
```
|
||||
|
||||
This is due to Windows-style end-of-line characters.
|
||||
To amend this, I added a `.gitattributes` file to force `git` to check out `rapids` using Unix EOL characters.
|
||||
If this still fails, `dos2unix` can be used to change them.
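If you want to see which files still contain CRLF endings before resorting to `dos2unix`, a small sketch like this can help (run it from the repository root; the listed extensions are only an example):

```python
# Report files that contain CRLF line endings (example extensions only).
from pathlib import Path

for path in Path(".").rglob("*"):
    if path.is_file() and path.suffix in {".py", ".R", ".smk", ".sh", ".yaml"}:
        if b"\r\n" in path.read_bytes():
            print(f"CRLF found in {path}")
```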
|
||||
|
||||
### System has not been booted with systemd as init system (PID 1)
|
||||
|
||||
See [the installation issue above](#Timezone-environment-variable-for-tidyverse-(relevant-for-WSL2)).
|
||||
|
|
45
Snakefile
45
Snakefile
|
@ -5,7 +5,6 @@ include: "rules/common.smk"
|
|||
include: "rules/renv.smk"
|
||||
include: "rules/preprocessing.smk"
|
||||
include: "rules/features.smk"
|
||||
include: "rules/models.smk"
|
||||
include: "rules/reports.smk"
|
||||
|
||||
import itertools
|
||||
|
@ -164,25 +163,6 @@ for provider in config["PHONE_CONVERSATION"]["PROVIDERS"].keys():
|
|||
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
|
||||
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
|
||||
|
||||
for provider in config["PHONE_ESM"]["PROVIDERS"].keys():
|
||||
if config["PHONE_ESM"]["PROVIDERS"][provider]["COMPUTE"]:
|
||||
files_to_compute.extend(expand("data/raw/{pid}/phone_esm_raw.csv",pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/raw/{pid}/phone_esm_with_datetime.csv",pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/interim/{pid}/phone_esm_clean.csv",pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/interim/{pid}/phone_esm_features/phone_esm_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_ESM"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/phone_esm.csv", pid=config["PIDS"]))
|
||||
# files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv",pid=config["PIDS"]))
|
||||
# files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
|
||||
|
||||
for provider in config["PHONE_SPEECH"]["PROVIDERS"].keys():
|
||||
if config["PHONE_SPEECH"]["PROVIDERS"][provider]["COMPUTE"]:
|
||||
files_to_compute.extend(expand("data/raw/{pid}/phone_speech_raw.csv",pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/raw/{pid}/phone_speech_with_datetime.csv",pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/interim/{pid}/phone_speech_features/phone_speech_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_SPEECH"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/phone_speech.csv", pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
|
||||
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
|
||||
|
||||
# We can delete these if's as soon as we add feature PROVIDERS to any of these sensors
|
||||
if isinstance(config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"], dict):
|
||||
for provider in config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"].keys():
|
||||
|
@ -417,31 +397,10 @@ if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
|
|||
# Data Cleaning
|
||||
for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
|
||||
if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
|
||||
if provider == "STRAW":
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"]))
|
||||
else:
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"]))
|
||||
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() +".csv", pid=config["PIDS"]))
|
||||
for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
|
||||
if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
|
||||
if provider == "STRAW":
|
||||
for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
|
||||
files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_py_(" + target + ").csv"))
|
||||
else:
|
||||
files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv"))
|
||||
|
||||
# Baseline features
|
||||
if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]:
|
||||
files_to_compute.extend(expand("data/raw/baseline_merged.csv"))
|
||||
files_to_compute.extend(expand("data/raw/{pid}/participant_baseline_raw.csv", pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/interim/{pid}/baseline_questionnaires.csv", pid=config["PIDS"]))
|
||||
files_to_compute.extend(expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]))
|
||||
|
||||
# Targets (labels)
|
||||
if config["PARAMS_FOR_ANALYSIS"]["TARGET"]["COMPUTE"]:
|
||||
files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
|
||||
for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
|
||||
files_to_compute.extend(expand("data/processed/models/population_model/input_" + target + ".csv"))
|
||||
files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +".csv"))
|
||||
|
||||
rule all:
|
||||
input:
|
||||
|
|
|
@ -1,57 +0,0 @@
|
|||
from pprint import pprint
|
||||
import sklearn.metrics
|
||||
import autosklearn.regression
|
||||
|
||||
import datetime
|
||||
import importlib
|
||||
import os
|
||||
import sys
|
||||
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
import pandas as pd
|
||||
import seaborn as sns
|
||||
import yaml
|
||||
|
||||
from sklearn import linear_model, svm, kernel_ridge, gaussian_process
|
||||
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split
|
||||
from sklearn.metrics import mean_squared_error, r2_score
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
model_input = pd.read_csv("data/processed/models/population_model/input_PANAS_negative_affect_mean.csv") # Standardized data
|
||||
|
||||
model_input.dropna(axis=1, how="all", inplace=True)
|
||||
model_input.dropna(axis=0, how="any", subset=["target"], inplace=True)
|
||||
|
||||
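# Impute missing categorical values with each column's mode, one-hot encode them, and concatenate them back with the numerical features.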
categorical_feature_colnames = ["gender", "startlanguage"]
|
||||
categorical_feature_colnames += [col for col in model_input.columns if "mostcommonactivity" in col or "homelabel" in col]
|
||||
categorical_features = model_input[categorical_feature_colnames].copy()
|
||||
mode_categorical_features = categorical_features.mode().iloc[0]
|
||||
categorical_features = categorical_features.fillna(mode_categorical_features)
|
||||
categorical_features = categorical_features.apply(lambda col: col.astype("category"))
|
||||
if not categorical_features.empty:
|
||||
categorical_features = pd.get_dummies(categorical_features)
|
||||
numerical_features = model_input.drop(categorical_feature_colnames, axis=1)
|
||||
model_in = pd.concat([numerical_features, categorical_features], axis=1)
|
||||
|
||||
index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]
|
||||
model_in.set_index(index_columns, inplace=True)
|
||||
|
||||
X_train, X_test, y_train, y_test = train_test_split(model_in.drop(["target", "pid"], axis=1), model_in["target"], test_size=0.30)
|
||||
|
||||
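# Auto-sklearn regression: 2-hour total search budget, at most 120 seconds per candidate model.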
automl = autosklearn.regression.AutoSklearnRegressor(
|
||||
time_left_for_this_task=7200,
|
||||
per_run_time_limit=120
|
||||
)
|
||||
automl.fit(X_train, y_train, dataset_name='straw')
|
||||
|
||||
print(automl.leaderboard())
|
||||
pprint(automl.show_models(), indent=4)
|
||||
|
||||
train_predictions = automl.predict(X_train)
|
||||
print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
|
||||
test_predictions = automl.predict(X_test)
|
||||
print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
|
||||
|
||||
import sys
|
||||
sys.exit()
|
272
config.yaml
272
config.yaml
|
@ -3,17 +3,16 @@
|
|||
########################################################################################################################
|
||||
|
||||
# See https://www.rapids.science/latest/setup/configuration/#participant-files
|
||||
PIDS: ['p031', 'p032', 'p033', 'p034', 'p035', 'p036', 'p037', 'p038', 'p039', 'p040', 'p042', 'p043', 'p044', 'p045', 'p046', 'p049', 'p050', 'p052', 'p053', 'p054', 'p055', 'p057', 'p058', 'p059', 'p060', 'p061', 'p062', 'p064', 'p067', 'p068', 'p069', 'p070', 'p071', 'p072', 'p073', 'p074', 'p075', 'p076', 'p077', 'p078', 'p079', 'p080', 'p081', 'p082', 'p083', 'p084', 'p085', 'p086', 'p088', 'p089', 'p090', 'p091', 'p092', 'p093', 'p106', 'p107']
|
||||
PIDS: [test01]
|
||||
|
||||
# See https://www.rapids.science/latest/setup/configuration/#automatic-creation-of-participant-files
|
||||
CREATE_PARTICIPANT_FILES:
|
||||
USERNAMES_CSV: "data/external/main_study_usernames.csv"
|
||||
CSV_FILE_PATH: "data/external/main_study_participants.csv" # see docs for required format
|
||||
CSV_FILE_PATH: "data/external/example_participants.csv" # see docs for required format
|
||||
PHONE_SECTION:
|
||||
ADD: True
|
||||
IGNORED_DEVICE_IDS: []
|
||||
FITBIT_SECTION:
|
||||
ADD: False
|
||||
ADD: True
|
||||
IGNORED_DEVICE_IDS: []
|
||||
EMPATICA_SECTION:
|
||||
ADD: True
|
||||
|
@ -21,25 +20,19 @@ CREATE_PARTICIPANT_FILES:
|
|||
|
||||
# See https://www.rapids.science/latest/setup/configuration/#time-segments
|
||||
TIME_SEGMENTS: &time_segments
|
||||
TYPE: EVENT # FREQUENCY, PERIODIC, EVENT
|
||||
FILE: "data/external/straw_events.csv"
|
||||
INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs
|
||||
TAILORED_EVENTS: # Only relevant if TYPE=EVENT
|
||||
COMPUTE: True
|
||||
SEGMENTING_METHOD: "30_before" # 30_before, 90_before, stress_event
|
||||
INTERVAL_OF_INTEREST: 10 # duration of event of interest [minutes]
|
||||
IOI_ERROR_TOLERANCE: 5 # interval of interest error tolerance (before and after IOI) [minutes]
|
||||
TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT
|
||||
FILE: "data/external/timesegments_periodic.csv"
|
||||
INCLUDE_PAST_PERIODIC_SEGMENTS: FALSE # Only relevant if TYPE=PERIODIC, see docs
|
||||
|
||||
# See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study
|
||||
TIMEZONE:
|
||||
TYPE: MULTIPLE
|
||||
TYPE: SINGLE
|
||||
SINGLE:
|
||||
TZCODE: Europe/Ljubljana
|
||||
TZCODE: America/New_York
|
||||
MULTIPLE:
|
||||
TZ_FILE: data/external/timezone.csv
|
||||
TZCODES_FILE: data/external/multiple_timezones.csv
|
||||
IF_MISSING_TZCODE: USE_DEFAULT
|
||||
DEFAULT_TZCODE: Europe/Ljubljana
|
||||
TZCODES_FILE: data/external/multiple_timezones_example.csv
|
||||
IF_MISSING_TZCODE: STOP
|
||||
DEFAULT_TZCODE: America/New_York
|
||||
FITBIT:
|
||||
ALLOW_MULTIPLE_TZ_PER_DEVICE: False
|
||||
INFER_FROM_SMARTPHONE_TZ: False
|
||||
|
@ -50,15 +43,12 @@ TIMEZONE:
|
|||
|
||||
# See https://www.rapids.science/latest/setup/configuration/#data-stream-configuration
|
||||
PHONE_DATA_STREAMS:
|
||||
USE: aware_postgresql
|
||||
USE: aware_mysql
|
||||
|
||||
# AVAILABLE:
|
||||
aware_mysql:
|
||||
DATABASE_GROUP: MY_GROUP
|
||||
|
||||
aware_postgresql:
|
||||
DATABASE_GROUP: PSQL_STRAW
|
||||
|
||||
aware_csv:
|
||||
FOLDER: data/external/aware_csv
|
||||
|
||||
|
@ -75,6 +65,7 @@ PHONE_ACCELEROMETER:
|
|||
COMPUTE: False
|
||||
FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
|
||||
SRC_SCRIPT: src/features/phone_accelerometer/rapids/main.py
|
||||
|
||||
PANDA:
|
||||
COMPUTE: False
|
||||
VALID_SENSED_MINUTES: False
|
||||
|
@ -86,12 +77,12 @@ PHONE_ACCELEROMETER:
|
|||
# See https://www.rapids.science/latest/features/phone-activity-recognition/
|
||||
PHONE_ACTIVITY_RECOGNITION:
|
||||
CONTAINER:
|
||||
ANDROID: google_ar
|
||||
ANDROID: plugin_google_activity_recognition
|
||||
IOS: plugin_ios_activity_recognition
|
||||
EPISODE_THRESHOLD_BETWEEN_ROWS: 5 # minutes. Max time difference for two consecutive rows to be considered within the same AR episode.
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: ["count", "mostcommonactivity", "countuniqueactivities", "durationstationary", "durationmobile", "durationvehicle"]
|
||||
ACTIVITY_CLASSES:
|
||||
STATIONARY: ["still", "tilting"]
|
||||
|
@ -104,42 +95,33 @@ PHONE_APPLICATIONS_CRASHES:
|
|||
CONTAINER: applications_crashes
|
||||
APPLICATION_CATEGORIES:
|
||||
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
|
||||
CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
|
||||
|
||||
# See https://www.rapids.science/latest/features/phone-applications-foreground/
|
||||
PHONE_APPLICATIONS_FOREGROUND:
|
||||
CONTAINER: applications
|
||||
CONTAINER: applications_foreground
|
||||
APPLICATION_CATEGORIES:
|
||||
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
|
||||
CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
|
||||
# Refer to data/external/play_store_categories_count.csv for a list of categories (genres) and their frequency.
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
INCLUDE_EPISODE_FEATURES: True
|
||||
SINGLE_CATEGORIES: ["Productivity", "Tools", "Communication", "Education", "Social"]
|
||||
COMPUTE: False
|
||||
INCLUDE_EPISODE_FEATURES: False
|
||||
SINGLE_CATEGORIES: ["all", "email"]
|
||||
MULTIPLE_CATEGORIES:
|
||||
games: ["Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing"]
|
||||
social: ["Communication", "Social", "Dating"]
|
||||
productivity: ["Tools", "Productivity", "Finance", "Education", "News & Magazines", "Business", "Books & Reference"]
|
||||
health: ["Health & Fitness", "Lifestyle", "Food & Drink", "Sports", "Medical", "Parenting"]
|
||||
entertainment: ["Shopping", "Music & Audio", "Entertainment", "Travel & Local", "Photography", "Video Players & Editors", "Personalization", "House & Home", "Art & Design", "Auto & Vehicles", "Entertainment,Music & Video",
|
||||
"Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing" # Add all games.
|
||||
]
|
||||
maps_weather: ["Maps & Navigation", "Weather"]
|
||||
social: ["socialnetworks", "socialmediatools"]
|
||||
entertainment: ["entertainment", "gamingknowledge", "gamingcasual", "gamingadventure", "gamingstrategy", "gamingtoolscommunity", "gamingroleplaying", "gamingaction", "gaminglogic", "gamingsports", "gamingsimulation"]
|
||||
CUSTOM_CATEGORIES:
|
||||
SINGLE_APPS: []
|
||||
EXCLUDED_CATEGORIES: ["System", "STRAW"]
|
||||
# Note: A special option here is "is_system_app".
|
||||
# This excludes applications that have is_system_app = TRUE, which is a separate column in the table.
|
||||
# However, all of these applications have been assigned System category.
|
||||
# I will therefore filter by that category, which is a superset and is more complete. JL
|
||||
EXCLUDED_APPS: []
|
||||
social_media: ["com.google.android.youtube", "com.snapchat.android", "com.instagram.android", "com.zhiliaoapp.musically", "com.facebook.katana"]
|
||||
dating: ["com.tinder", "com.relance.happycouple", "com.kiwi.joyride"]
|
||||
SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
|
||||
EXCLUDED_CATEGORIES: []
|
||||
EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
|
||||
FEATURES:
|
||||
APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
|
||||
APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
|
||||
|
@ -149,7 +131,7 @@ PHONE_APPLICATIONS_FOREGROUND:
|
|||
|
||||
# See https://www.rapids.science/latest/features/phone-applications-notifications/
|
||||
PHONE_APPLICATIONS_NOTIFICATIONS:
|
||||
CONTAINER: notifications
|
||||
CONTAINER: applications_notifications
|
||||
APPLICATION_CATEGORIES:
|
||||
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
|
||||
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
|
||||
|
@ -163,7 +145,7 @@ PHONE_BATTERY:
|
|||
EPISODE_THRESHOLD_BETWEEN_ROWS: 30 # minutes. Max time difference for two consecutive rows to be considered within the same battery episode.
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: ["countdischarge", "sumdurationdischarge", "countcharge", "sumdurationcharge", "avgconsumptionrate", "maxconsumptionrate"]
|
||||
SRC_SCRIPT: src/features/phone_battery/rapids/main.py
|
||||
|
||||
|
@ -177,7 +159,7 @@ PHONE_BLUETOOTH:
|
|||
SRC_SCRIPT: src/features/phone_bluetooth/rapids/main.R
|
||||
|
||||
DORYAB:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES:
|
||||
ALL:
|
||||
DEVICES: ["countscans", "uniquedevices", "meanscans", "stdscans"]
|
||||
|
@ -195,10 +177,10 @@ PHONE_BLUETOOTH:
|
|||
|
||||
# See https://www.rapids.science/latest/features/phone-calls/
|
||||
PHONE_CALLS:
|
||||
CONTAINER: call
|
||||
CONTAINER: calls
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES_TYPE: EPISODES # EVENTS or EPISODES
|
||||
CALL_TYPES: [missed, incoming, outgoing]
|
||||
FEATURES:
|
||||
|
@ -208,7 +190,7 @@ PHONE_CALLS:
|
|||
SRC_SCRIPT: src/features/phone_calls/rapids/main.R
|
||||
|
||||
# See https://www.rapids.science/latest/features/phone-conversation/
|
||||
PHONE_CONVERSATION: # TODO Adapt for speech
|
||||
PHONE_CONVERSATION:
|
||||
CONTAINER:
|
||||
ANDROID: plugin_studentlife_audio_android
|
||||
IOS: plugin_studentlife_audio
|
||||
|
@ -227,35 +209,14 @@ PHONE_CONVERSATION: # TODO Adapt for speech
|
|||
|
||||
# See https://www.rapids.science/latest/features/phone-data-yield/
|
||||
PHONE_DATA_YIELD:
|
||||
SENSORS: [#PHONE_ACCELEROMETER,
|
||||
PHONE_ACTIVITY_RECOGNITION,
|
||||
PHONE_APPLICATIONS_FOREGROUND,
|
||||
PHONE_APPLICATIONS_NOTIFICATIONS,
|
||||
PHONE_BATTERY,
|
||||
PHONE_BLUETOOTH,
|
||||
PHONE_CALLS,
|
||||
PHONE_LIGHT,
|
||||
PHONE_LOCATIONS,
|
||||
PHONE_MESSAGES,
|
||||
PHONE_SCREEN,
|
||||
PHONE_WIFI_VISIBLE]
|
||||
SENSORS: []
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: [ratiovalidyieldedminutes, ratiovalidyieldedhours]
|
||||
MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: 0.5 # 0 to 1, minimum percentage of valid minutes in an hour to be considered valid.
|
||||
SRC_SCRIPT: src/features/phone_data_yield/rapids/main.R
|
||||
|
||||
PHONE_ESM:
|
||||
CONTAINER: esm
|
||||
PROVIDERS:
|
||||
STRAW:
|
||||
COMPUTE: True
|
||||
SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support",
|
||||
"appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
|
||||
FEATURES: [mean]
|
||||
SRC_SCRIPT: src/features/phone_esm/straw/main.py
|
||||
|
||||
# See https://www.rapids.science/latest/features/phone-keyboard/
|
||||
PHONE_KEYBOARD:
|
||||
CONTAINER: keyboard
|
||||
|
@ -267,10 +228,10 @@ PHONE_KEYBOARD:
|
|||
|
||||
# See https://www.rapids.science/latest/features/phone-light/
|
||||
PHONE_LIGHT:
|
||||
CONTAINER: light_sensor
|
||||
CONTAINER: light
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"]
|
||||
SRC_SCRIPT: src/features/phone_light/rapids/main.py
|
||||
|
||||
|
@ -284,7 +245,7 @@ PHONE_LOCATIONS:
|
|||
|
||||
PROVIDERS:
|
||||
DORYAB:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome", "homelabel"]
|
||||
DBSCAN_EPS: 100 # meters
|
||||
DBSCAN_MINSAMPLES: 5
|
||||
|
@ -299,7 +260,7 @@ PHONE_LOCATIONS:
|
|||
SRC_SCRIPT: src/features/phone_locations/doryab/main.py
|
||||
|
||||
BARNETT:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
|
||||
IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
|
||||
MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates gruped by minute) were used to compute features
|
||||
|
@ -314,10 +275,10 @@ PHONE_LOG:
|
|||
|
||||
# See https://www.rapids.science/latest/features/phone-messages/
|
||||
PHONE_MESSAGES:
|
||||
CONTAINER: sms
|
||||
CONTAINER: messages
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
MESSAGES_TYPES : [received, sent]
|
||||
FEATURES:
|
||||
received: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact]
|
||||
|
@ -329,7 +290,7 @@ PHONE_SCREEN:
|
|||
CONTAINER: screen
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
REFERENCE_HOUR_FIRST_USE: 0
|
||||
IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
|
||||
IGNORE_EPISODES_LONGER_THAN: 360 # in minutes, set to 0 to disable
|
||||
|
@ -337,15 +298,6 @@ PHONE_SCREEN:
|
|||
EPISODE_TYPES: ["unlock"]
|
||||
SRC_SCRIPT: src/features/phone_screen/rapids/main.py
|
||||
|
||||
# Custom added sensor
|
||||
PHONE_SPEECH:
|
||||
CONTAINER: speech
|
||||
PROVIDERS:
|
||||
STRAW:
|
||||
COMPUTE: True
|
||||
FEATURES: ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
|
||||
SRC_SCRIPT: src/features/phone_speech/straw/main.py
|
||||
|
||||
# See https://www.rapids.science/latest/features/phone-wifi-connected/
|
||||
PHONE_WIFI_CONNECTED:
|
||||
CONTAINER: sensor_wifi
|
||||
|
@ -360,7 +312,7 @@ PHONE_WIFI_VISIBLE:
|
|||
CONTAINER: wifi
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: True
|
||||
COMPUTE: False
|
||||
FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
|
||||
SRC_SCRIPT: src/features/phone_wifi_visible/rapids/main.R
|
||||
|
||||
|
@ -463,6 +415,7 @@ FITBIT_SLEEP_INTRADAY:
|
|||
UNIFIED: [awake, asleep]
|
||||
SLEEP_TYPES: [main, nap, all]
|
||||
SRC_SCRIPT: src/features/fitbit_sleep_intraday/rapids/main.py
|
||||
|
||||
PRICE:
|
||||
COMPUTE: False
|
||||
FEATURES: [avgduration, avgratioduration, avgstarttimeofepisodemain, avgendtimeofepisodemain, avgmidpointofepisodemain, stdstarttimeofepisodemain, stdendtimeofepisodemain, stdmidpointofepisodemain, socialjetlag, rmssdmeanstarttimeofepisodemain, rmssdmeanendtimeofepisodemain, rmssdmeanmidpointofepisodemain, rmssdmedianstarttimeofepisodemain, rmssdmedianendtimeofepisodemain, rmssdmedianmidpointofepisodemain]
|
||||
|
@ -498,15 +451,13 @@ FITBIT_STEPS_INTRADAY:
|
|||
RAPIDS:
|
||||
COMPUTE: False
|
||||
FEATURES:
|
||||
STEPS: ["sum", "max", "min", "avg", "std", "firststeptime", "laststeptime"]
|
||||
STEPS: ["sum", "max", "min", "avg", "std"]
|
||||
SEDENTARY_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
|
||||
ACTIVE_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
|
||||
REFERENCE_HOUR: 0
|
||||
THRESHOLD_ACTIVE_BOUT: 10 # steps
|
||||
INCLUDE_ZERO_STEP_ROWS: False
|
||||
SRC_SCRIPT: src/features/fitbit_steps_intraday/rapids/main.py
|
||||
|
||||
|
||||
########################################################################################################################
|
||||
# EMPATICA #
|
||||
########################################################################################################################
|
||||
|
@ -528,15 +479,6 @@ EMPATICA_ACCELEROMETER:
|
|||
COMPUTE: False
|
||||
FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
|
||||
SRC_SCRIPT: src/features/empatica_accelerometer/dbdp/main.py
|
||||
CR:
|
||||
COMPUTE: True
|
||||
FEATURES: ["totalMagnitudeBand", "absoluteMeanBand", "varianceBand"] # Acc features
|
||||
WINDOWS:
|
||||
COMPUTE: True
|
||||
WINDOW_LENGTH: 15 # specify window length in seconds
|
||||
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
|
||||
SRC_SCRIPT: src/features/empatica_accelerometer/cr/main.py
|
||||
|
||||
|
||||
# See https://www.rapids.science/latest/features/empatica-heartrate/
|
||||
EMPATICA_HEARTRATE:
|
||||
|
@ -555,15 +497,6 @@ EMPATICA_TEMPERATURE:
|
|||
COMPUTE: False
|
||||
FEATURES: ["maxtemp", "mintemp", "avgtemp", "mediantemp", "modetemp", "stdtemp", "diffmaxmodetemp", "diffminmodetemp", "entropytemp"]
|
||||
SRC_SCRIPT: src/features/empatica_temperature/dbdp/main.py
|
||||
CR:
|
||||
COMPUTE: True
|
||||
FEATURES: ["maximum", "minimum", "meanAbsChange", "longestStrikeAboveMean", "longestStrikeBelowMean",
|
||||
"stdDev", "median", "meanChange", "sumSquared", "squareSumOfComponent", "sumOfSquareComponents"]
|
||||
WINDOWS:
|
||||
COMPUTE: True
|
||||
WINDOW_LENGTH: 300 # specify window length in seconds
|
||||
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
|
||||
SRC_SCRIPT: src/features/empatica_temperature/cr/main.py
|
||||
|
||||
# See https://www.rapids.science/latest/features/empatica-electrodermal-activity/
|
||||
EMPATICA_ELECTRODERMAL_ACTIVITY:
|
||||
|
@ -573,19 +506,6 @@ EMPATICA_ELECTRODERMAL_ACTIVITY:
|
|||
COMPUTE: False
|
||||
FEATURES: ["maxeda", "mineda", "avgeda", "medianeda", "modeeda", "stdeda", "diffmaxmodeeda", "diffminmodeeda", "entropyeda"]
|
||||
SRC_SCRIPT: src/features/empatica_electrodermal_activity/dbdp/main.py
|
||||
CR:
|
||||
COMPUTE: True
|
||||
FEATURES: ['mean', 'std', 'q25', 'q75', 'qd', 'deriv', 'power', 'numPeaks', 'ratePeaks', 'powerPeaks', 'sumPosDeriv', 'propPosDeriv', 'derivTonic',
|
||||
'sigTonicDifference', 'freqFeats','maxPeakAmplitudeChangeBefore', 'maxPeakAmplitudeChangeAfter', 'avgPeakAmplitudeChangeBefore',
|
||||
'avgPeakAmplitudeChangeAfter', 'avgPeakChangeRatio', 'maxPeakIncreaseTime', 'maxPeakDecreaseTime', 'maxPeakDuration', 'maxPeakChangeRatio',
|
||||
'avgPeakIncreaseTime', 'avgPeakDecreaseTime', 'avgPeakDuration', 'signalOverallChange', 'changeDuration', 'changeRate', 'significantIncrease',
|
||||
'significantDecrease']
|
||||
WINDOWS:
|
||||
COMPUTE: True
|
||||
WINDOW_LENGTH: 60 # specify window length in seconds
|
||||
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', count_windows, eda_num_peaks_non_zero]
|
||||
IMPUTE_NANS: True
|
||||
SRC_SCRIPT: src/features/empatica_electrodermal_activity/cr/main.py
|
||||
|
||||
# See https://www.rapids.science/latest/features/empatica-blood-volume-pulse/
|
||||
EMPATICA_BLOOD_VOLUME_PULSE:
|
||||
|
@ -595,15 +515,6 @@ EMPATICA_BLOOD_VOLUME_PULSE:
|
|||
COMPUTE: False
|
||||
FEATURES: ["maxbvp", "minbvp", "avgbvp", "medianbvp", "modebvp", "stdbvp", "diffmaxmodebvp", "diffminmodebvp", "entropybvp"]
|
||||
SRC_SCRIPT: src/features/empatica_blood_volume_pulse/dbdp/main.py
|
||||
CR:
|
||||
COMPUTE: False
|
||||
FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
|
||||
'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
|
||||
WINDOWS:
|
||||
COMPUTE: True
|
||||
WINDOW_LENGTH: 300 # specify window length in seconds
|
||||
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
|
||||
SRC_SCRIPT: src/features/empatica_blood_volume_pulse/cr/main.py
|
||||
|
||||
# See https://www.rapids.science/latest/features/empatica-inter-beat-interval/
|
||||
EMPATICA_INTER_BEAT_INTERVAL:
|
||||
|
@ -613,16 +524,6 @@ EMPATICA_INTER_BEAT_INTERVAL:
|
|||
COMPUTE: False
|
||||
FEATURES: ["maxibi", "minibi", "avgibi", "medianibi", "modeibi", "stdibi", "diffmaxmodeibi", "diffminmodeibi", "entropyibi"]
|
||||
SRC_SCRIPT: src/features/empatica_inter_beat_interval/dbdp/main.py
|
||||
CR:
|
||||
COMPUTE: True
|
||||
FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
|
||||
'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
|
||||
PATCH_WITH_BVP: True
|
||||
WINDOWS:
|
||||
COMPUTE: True
|
||||
WINDOW_LENGTH: 300 # specify window length in seconds
|
||||
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
|
||||
SRC_SCRIPT: src/features/empatica_inter_beat_interval/cr/main.py
|
||||
|
||||
# See https://www.rapids.science/latest/features/empatica-tags/
|
||||
EMPATICA_TAGS:
|
||||
|
@ -673,86 +574,33 @@ ALL_CLEANING_INDIVIDUAL:
|
|||
RAPIDS:
|
||||
COMPUTE: False
|
||||
IMPUTE_SELECTED_EVENT_FEATURES:
|
||||
COMPUTE: False
|
||||
COMPUTE: True
|
||||
MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
|
||||
COLS_NAN_THRESHOLD: 1 # set to 1 to disable
|
||||
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
COLS_VAR_THRESHOLD: True
|
||||
ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
|
||||
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
|
||||
DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
|
||||
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
|
||||
DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
|
||||
DROP_HIGHLY_CORRELATED_FEATURES:
|
||||
COMPUTE: True
|
||||
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
|
||||
CORR_THRESHOLD: 0.95
|
||||
SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
|
||||
STRAW:
|
||||
COMPUTE: True
|
||||
PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
|
||||
PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
|
||||
EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
|
||||
ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
|
||||
COLS_NAN_THRESHOLD: 0.9 # set to 1 to remove only columns that contain all (100% of) NaN
|
||||
COLS_VAR_THRESHOLD: True
|
||||
DROP_HIGHLY_CORRELATED_FEATURES:
|
||||
COMPUTE: True
|
||||
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
|
||||
CORR_THRESHOLD: 0.95
|
||||
STANDARDIZATION: True
|
||||
SRC_SCRIPT: src/features/all_cleaning_individual/straw/main.py
|
||||
|
||||
ALL_CLEANING_OVERALL:
|
||||
PROVIDERS:
|
||||
RAPIDS:
|
||||
COMPUTE: False
|
||||
IMPUTE_SELECTED_EVENT_FEATURES:
|
||||
COMPUTE: False
|
||||
COMPUTE: True
|
||||
MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
|
||||
COLS_NAN_THRESHOLD: 1 # set to 1 to disable
|
||||
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
COLS_VAR_THRESHOLD: True
|
||||
ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
|
||||
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
|
||||
DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
|
||||
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
|
||||
DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
|
||||
DROP_HIGHLY_CORRELATED_FEATURES:
|
||||
COMPUTE: True
|
||||
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
|
||||
CORR_THRESHOLD: 0.95
|
||||
SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
|
||||
STRAW:
|
||||
COMPUTE: True
|
||||
PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
|
||||
PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
|
||||
EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
|
||||
ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
|
||||
COLS_NAN_THRESHOLD: 0.8 # set to 1 to remove only columns that contain all (100% of) NaN
|
||||
COLS_VAR_THRESHOLD: True
|
||||
DROP_HIGHLY_CORRELATED_FEATURES:
|
||||
COMPUTE: True
|
||||
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
|
||||
CORR_THRESHOLD: 0.95
|
||||
STANDARDIZATION: True
|
||||
TARGET_STANDARDIZATION: False
|
||||
SRC_SCRIPT: src/features/all_cleaning_overall/straw/main.py
|
||||
|
||||
|
||||
########################################################################################################################
|
||||
# Baseline #
|
||||
########################################################################################################################
|
||||
|
||||
PARAMS_FOR_ANALYSIS:
|
||||
BASELINE:
|
||||
COMPUTE: True
|
||||
FOLDER: data/external/baseline
|
||||
CONTAINER: [results-survey637813_final.csv, # Slovenia
|
||||
results-survey358134_final.csv, # Belgium 1
|
||||
results-survey413767_final.csv # Belgium 2
|
||||
]
|
||||
QUESTION_LIST: survey637813+question_text.csv
|
||||
FEATURES: [age, gender, startlanguage, limesurvey_demand, limesurvey_control, limesurvey_demand_control_ratio, limesurvey_demand_control_ratio_quartile]
|
||||
CATEGORICAL_FEATURES: [gender]
|
||||
|
||||
TARGET:
|
||||
COMPUTE: True
|
||||
LABEL: appraisal_stressfulness_event_mean
|
||||
ALL_LABELS: [PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, JCQ_coworker_support_mean, appraisal_stressfulness_period_mean]
|
||||
# PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean,
|
||||
# JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean
|
||||
|
|
|
@ -1,9 +0,0 @@
|
|||
"_id","timestamp","device_id","call_type","call_duration","trace"
|
||||
1,1587663260695,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,14,"d5e84f8af01b2728021d4f43f53a163c0c90000c"
|
||||
2,1587739118007,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"47c125dc7bd163b8612cdea13724a814917b6e93"
|
||||
5,1587746544891,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,95,"9cc793ffd6e88b1d850ce540b5d7e000ef5650d4"
|
||||
6,1587911379859,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,63,"51fb9344e988049a3fec774c7ca622358bf80264"
|
||||
7,1587992647361,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"2a862a7730cfdfaf103a9487afe3e02935fd6e02"
|
||||
8,1588020039448,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",1,11,"a2c53f6a086d98622c06107780980cf1bb4e37bd"
|
||||
11,1588176189024,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,65,"56589df8c830c70e330b644921ed38e08d8fd1f3"
|
||||
12,1588197745079,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"cab458018a8ed3b626515e794c70b6f415318adc"
|
|
Binary file not shown.
|
@ -1,57 +0,0 @@
|
|||
label,empatica_id
|
||||
uploader_79170,A0245B
|
||||
uploader_89788,A02731
|
||||
uploader_68294,A02705
|
||||
uploader_92856,A024AF
|
||||
uploader_23726,A0231C
|
||||
uploader_66620,A02305
|
||||
uploader_58435,A026B5
|
||||
uploader_87801,A022A8
|
||||
uploader_96055,A027BA
|
||||
uploader_69549,A0226C
|
||||
uploader_26363,A0263D
|
||||
uploader_72010,A023FA
|
||||
uploader_13997,A024AF
|
||||
uploader_31156,A02305
|
||||
uploader_63187,A027BA
|
||||
uploader_94821,A022A8
|
||||
uploader_65413,A023F1;A023FA
|
||||
uploader_36488,A02713
|
||||
uploader_91087,A0231C
|
||||
uploader_35174,A025D1
|
||||
uploader_73880,A02705
|
||||
uploader_78650,A02731
|
||||
uploader_70578,A0245B
|
||||
uploader_88313,A02736
|
||||
uploader_58482,A0261A
|
||||
uploader_80601,A027BA
|
||||
uploader_93729,A0226C
|
||||
uploader_61663,A0245B
|
||||
uploader_80848,A025D1
|
||||
uploader_57312,A023F9;A02361;A027A0
|
||||
uploader_52087,A02666
|
||||
uploader_98770,A02953
|
||||
uploader_51327,A0245F
|
||||
uploader_11737,A02732
|
||||
uploader_77440,A0264E
|
||||
uploader_57277,A02422
|
||||
uploader_13098,A026E5
|
||||
uploader_80719,A023C8
|
||||
uploader_54698,A02953
|
||||
uploader_95571,A02853
|
||||
uploader_21880,A024DC
|
||||
uploader_92905,A02920
|
||||
uploader_12108,A023F4
|
||||
uploader_17436,A026E5
|
||||
uploader_58440,A0273F
|
||||
uploader_22172,A0245F
|
||||
uploader_39250,A02422
|
||||
uploader_15311,A023F9
|
||||
uploader_45766,A02920
|
||||
uploader_23096,A02361
|
||||
uploader_78243,A02422
|
||||
uploader_58777,A0245F
|
||||
uploader_82941,A02666
|
||||
uploader_89606,A023F4
|
||||
uploader_82969,A023C8
|
||||
uploader_53573,A024DC;A02361
|
|
|
@ -1,11 +0,0 @@
|
|||
PHONE:
|
||||
DEVICE_IDS: [4b62a655-cbf0-4ac0-a448-06726f45b56a]
|
||||
PLATFORMS: [android]
|
||||
LABEL: uploader_53573
|
||||
START_DATE: 2021-05-21 09:21:24
|
||||
END_DATE: 2021-07-12 17:32:07
|
||||
EMPATICA:
|
||||
DEVICE_IDS: [uploader_53573]
|
||||
LABEL: uploader_53573
|
||||
START_DATE: 2021-05-21 09:21:24
|
||||
END_DATE: 2021-07-12 17:32:07
|
File diff suppressed because it is too large
|
@ -1,45 +0,0 @@
|
|||
genre,n
|
||||
System,261
|
||||
Tools,96
|
||||
Productivity,71
|
||||
Health & Fitness,60
|
||||
Finance,54
|
||||
Communication,39
|
||||
Music & Audio,39
|
||||
Shopping,38
|
||||
Lifestyle,33
|
||||
Education,28
|
||||
News & Magazines,24
|
||||
Maps & Navigation,23
|
||||
Entertainment,21
|
||||
Business,18
|
||||
Travel & Local,18
|
||||
Books & Reference,16
|
||||
Social,16
|
||||
Weather,16
|
||||
Food & Drink,14
|
||||
Sports,14
|
||||
Other,13
|
||||
Photography,13
|
||||
Puzzle,13
|
||||
Video Players & Editors,12
|
||||
Card,9
|
||||
Casual,9
|
||||
Personalization,8
|
||||
Medical,7
|
||||
Board,5
|
||||
Strategy,4
|
||||
House & Home,3
|
||||
Trivia,3
|
||||
Word,3
|
||||
Adventure,2
|
||||
Art & Design,2
|
||||
Auto & Vehicles,2
|
||||
Dating,2
|
||||
Role Playing,2
|
||||
STRAW,2
|
||||
Simulation,2
|
||||
"Board,Brain Games",1
|
||||
"Entertainment,Music & Video",1
|
||||
Parenting,1
|
||||
Racing,1
|
|
|
@ -1,3 +0,0 @@
|
|||
label,start_time,length,repeats_on,repeats_value
|
||||
daily,04:00:00,23H 59M 59S,every_day,0
|
||||
working_day,04:00:00,18H 00M 00S,every_day,0
|
|
|
@ -1,2 +1,2 @@
|
|||
label,length
|
||||
fiveminutes,5
|
||||
thirtyminutes,30
|
|
|
@ -1,2 +1,9 @@
|
|||
label,start_time,length,repeats_on,repeats_value
|
||||
daily,00:00:00,23H 59M 59S,every_day,0
|
||||
threeday,00:00:00,2D 23H 59M 59S,every_day,0
|
||||
daily, 00:00:00,23H 59M 59S, every_day, 0
|
||||
morning,06:00:00,5H 59M 59S,every_day,0
|
||||
afternoon,12:00:00,5H 59M 59S,every_day,0
|
||||
evening,18:00:00,5H 59M 59S,every_day,0
|
||||
night,00:00:00,5H 59M 59S,every_day,0
|
||||
two_weeks_overlapping,00:00:00,13D 23H 59M 59S,every_day,0
|
||||
weekends,00:00:00,2D 23H 59M 59S,wday,5
|
||||
|
|
|
File diff suppressed because it is too large
|
@ -1,7 +1,7 @@
|
|||
Data Cleaning
|
||||
=============
|
||||
|
||||
The goal of this module is to perform basic clean tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. If you are interested in creating individual models (using each participant's features independently of the others) use [`ALL_CLEANING_INDIVIDUAL`]. If you are interested in creating population models (using everyone's data in the same model) use [`ALL_CLEANING_OVERALL`]
|
||||
The goal of this module is to perform basic clean tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. For example, if we have enough data for a participant, we can train a model for that participant. By setting parameters in `[ALL_CLEANING_INDIVIDUAL]` section, we can get cleaned data for that participant. Similarly, if we want to train a model for all participants, we can get the cleaned data for all participants by setting parameters in `[ALL_CLEANING_OVERALL]` section.
|
||||
|
||||
## Clean sensor features for individual participants
|
||||
|
||||
|
@ -22,8 +22,8 @@ Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
|
|||
|`[COLS_NAN_THRESHOLD]` | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable
|
||||
|`[COLS_VAR_THRESHOLD]` | Set to `True` to discard columns with zero variance
|
||||
|`[ROWS_NAN_THRESHOLD]` | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable
|
||||
|`[DATA_YIELD_FEATURE]` | `RATIO_VALID_YIELDED_HOURS` or `RATIO_VALID_YIELDED_MINUTES`
|
||||
|`[DATA_YIELD_RATIO_THRESHOLD]` | Discard rows with `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature name is determined by `[DATA_YIELD_FEATURE]` parameter. Set to 0 to disable
|
||||
|`[DATA_YIELD_UNIT]` | `HOURS` or `MINUTES`. Set to `HOURS` to denote `ratiovalidyieldedhours` feature; set to `MINUTES` to denote `ratiovalidyieldedminutes` feature.
|
||||
|`[DATA_YIELD_RATIO_THRESHOLD]` | Discard rows with `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature name is determined by `[DATA_YIELD_UNIT]` parameter. Set to 0 to disable
|
||||
|`DROP_HIGHLY_CORRELATED_FEATURES` | Discard highly correlated features, see table below
|
||||
|
||||
Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
|
||||
|
@ -59,7 +59,7 @@ Steps to clean sensor features for individual participants. It only considers th
|
|||
- WiFi: all connected and visible features.
|
||||
|
||||
??? info "2. Discard unreliable rows."
|
||||
Extracted features might be not reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_HOURS`) for longer segments. We do not recommend you to skip this step, but you can do it by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
|
||||
Extracted features might be not reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_UNIT]` to `MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_UNIT]` to `HOURS`) for longer segments. We do not recommend you to skip this step, but you can do it by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
|
||||
|
||||
??? info "3. Discard columns (features) with too many missing values."
|
||||
In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend you to skip this step, but you can do it by setting `[COLS_NAN_THRESHOLD]` to 1.
|
||||
|
|
|
@ -1,17 +1,8 @@
|
|||
# Change Log
|
||||
## v1.8.0
|
||||
- Add data stream for AWARE Micro server
|
||||
- Fix the NA bug in PHONE_LOCATIONS BARNETT provider
|
||||
- Fix the bug of data type for call_duration field
|
||||
- Fix the index bug of heatmap_sensors_per_minute_per_time_segment
|
||||
## v1.7.1
|
||||
- Update docs for Git Flow section
|
||||
- Update RAPIDS paper information
|
||||
## v1.7.0
|
||||
- Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider
|
||||
- Update tests for Fitbit steps intraday features
|
||||
## v.1.7.0
|
||||
- Add tests for phone battery features
|
||||
- Add a data cleaning module to replace NAs with 0 in selected event-based features, discard unreliable rows and columns, discard columns with zero variance, and discard highly correlated columns
|
||||
- Replace NA with 0 for selected event-based features. Done in each feature extraction script and the data cleaning module
|
||||
- Refactor data cleaning module: update the structure and add dropping highly correlated features section
|
||||
## v1.6.0
|
||||
- Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events
|
||||
- Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes
|
||||
|
|
|
@ -5,10 +5,14 @@
|
|||
|
||||
## RAPIDS
|
||||
|
||||
If you used RAPIDS, please cite [this paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823).
|
||||
If you used RAPIDS, please cite [this paper](https://preprints.jmir.org/preprint/23246).
|
||||
|
||||
!!! cite "RAPIDS et al. citation"
|
||||
Vega, J., Li, M., Aguillera, K., Goel, N., Joshi, E., Khandekar, K., ... & Low, C. A. (2021). Reproducible Analysis Pipeline for Data Streams (RAPIDS): Open-Source Software to Process Data Collected with Mobile Devices. Frontiers in Digital Health, 168.
|
||||
Vega J, Li M, Aguillera K, Goel N, Joshi E, Durica KC, Kunta AR, Low CA
|
||||
RAPIDS: Reproducible Analysis Pipeline for Data Streams Collected with Mobile Devices
|
||||
JMIR Preprints. 18/08/2020:23246
|
||||
DOI: 10.2196/preprints.23246
|
||||
URL: https://preprints.jmir.org/preprint/23246
|
||||
|
||||
## DBDP (all Empatica sensors)
|
||||
|
||||
|
|
|
@ -1,15 +0,0 @@
|
|||
# `aware_micro_mysql`

This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework's](https://awareframework.com/) [AWARE Micro](https://github.com/denzilferreira/aware-micro) server and stored in a MySQL database.

## Container

A MySQL database with a table per sensor, each table containing the data for all participants. Sensor data is stored in a JSON field called `data` within each table.

The script to connect and download data from this container is at:

```bash
src/data/streams/aware_micro_mysql/container.R
```

## Format

--8<---- "docs/snippets/aware_format.md"
|
|
@ -16,7 +16,6 @@ For reference, these are the data streams we currently support:
|
|||
| Data Stream | Device | Format | Container | Docs
|--|--|--|--|--|
| `aware_mysql`| Phone | AWARE app | MySQL | [link](../aware-mysql)
| `aware_micro_mysql`| Phone | AWARE Micro server | MySQL | [link](../aware-micro-mysql)
| `aware_csv`| Phone | AWARE app | CSV files | [link](../aware-csv)
| `aware_influxdb` (beta)| Phone | AWARE app | InfluxDB | [link](../aware-influxdb)
| `fitbitjson_mysql`| Fitbit | JSON (per [Fitbit's API](https://dev.fitbit.com/build/reference/web-api/)) | MySQL | [link](../fitbitjson-mysql)
|
||||
|
|
|
@ -127,9 +127,9 @@ git branch -d release/v[NEW_RELEASE]
|
|||
```
|
||||
git checkout master
|
||||
git merge --ff-only develop
|
||||
git push # Unlock the master branch before merging
|
||||
git push
|
||||
```
|
||||
1. Release happens automatically after passing the tests
|
||||
1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_RELEASE]` (remember to add the change log)
|
||||
|
||||
## Release a Hotfix
|
||||
1. Pull the latest master
|
||||
|
@ -156,6 +156,6 @@ git branch -d hotfix/v[NEW_HOTFIX]
|
|||
```
|
||||
git checkout master
|
||||
git merge --ff-only v[NEW_HOTFIX]
|
||||
git push # Unlock the master branch before merging
|
||||
git push
|
||||
```
|
||||
1. Release happens automatically after passing the tests
|
||||
1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_HOTFIX]` (remember to add the change log)
|
||||
|
|
|
@ -29,7 +29,6 @@ Parameters description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
|
|||
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]` | Set to `True` to extract `FITBIT_STEPS_INTRADAY` features from the `RAPIDS` provider|
|`[FEATURES]` | Features to be computed from steps intraday data, see table below |
|`[REFERENCE_HOUR]` | The reference point from which `firststeptime` or `laststeptime` is computed; the default is midnight |
|`[THRESHOLD_ACTIVE_BOUT]` | Every minute with Fitbit steps data will be labelled as `sedentary` if its step count is below this threshold, otherwise `active`. |
|`[INCLUDE_ZERO_STEP_ROWS]` | Whether or not to include time segments with a 0 step count during the whole day. |
|
||||
|
||||
|
@ -43,8 +42,6 @@ Features description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
|
|||
|minsteps |steps |The minimum step count during a time segment.
|avgsteps |steps |The average step count during a time segment.
|stdsteps |steps |The standard deviation of step count during a time segment.
|firststeptime |minutes |Minutes until the first non-zero step count.
|laststeptime |minutes |Minutes until the last non-zero step count.
|countepisodesedentarybout |bouts |Number of sedentary bouts during a time segment.
|sumdurationsedentarybout |minutes |Total duration of all sedentary bouts during a time segment.
|maxdurationsedentarybout |minutes |The maximum duration of any sedentary bout during a time segment.
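The time-based and bout-based features above can be sketched in a few lines of pandas. The snippet below assumes a per-minute dataframe with illustrative `local_time` and `steps` columns and hypothetical parameter values; it is an informal sketch, not the RAPIDS provider implementation.

```python
import pandas as pd

REFERENCE_HOUR = 0           # [REFERENCE_HOUR]: midnight by default
THRESHOLD_ACTIVE_BOUT = 10   # [THRESHOLD_ACTIVE_BOUT]: steps per minute

def steps_intraday_sketch(minutes: pd.DataFrame) -> dict:
    # Minutes elapsed since the reference hour (assumes rows fall at or after it).
    minutes_since_ref = (
        minutes["local_time"].dt.hour * 60
        + minutes["local_time"].dt.minute
        - REFERENCE_HOUR * 60
    )
    nonzero = minutes["steps"] > 0
    # Label every minute, then group runs of identical labels into bouts.
    label = minutes["steps"].lt(THRESHOLD_ACTIVE_BOUT).map({True: "sedentary", False: "active"})
    bout_id = (label != label.shift()).cumsum()
    bouts = pd.DataFrame({"label": label, "bout_id": bout_id})
    sedentary = bouts[bouts["label"] == "sedentary"].groupby("bout_id").size()
    return {
        "firststeptime": minutes_since_ref[nonzero].min(),
        "laststeptime": minutes_since_ref[nonzero].max(),
        "countepisodesedentarybout": len(sedentary),
        "sumdurationsedentarybout": int(sedentary.sum()),
        "maxdurationsedentarybout": int(sedentary.max()) if len(sedentary) else 0,
    }
```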
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# Welcome to RAPIDS documentation
|
||||
|
||||
Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows. Check out our [paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823)!
|
||||
Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows.
|
||||
|
||||
RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables (the latter in collaboration with the [DBDP](https://dbdp.org/)).
|
||||
|
||||
|
|
|
@ -9,12 +9,16 @@ If you are interested in contributing feel free to submit a pull request or cont
|
|||
??? abstract "About"
|
||||
Julio Vega is a postdoctoral associate at the Mobile Sensing + Health Institute. He is interested in personalized methodologies to monitor chronic conditions that affect daily human behavior using mobile and wearable data.
|
||||
|
||||
- *vegaju* at *upmc* . *edu*
|
||||
- [Personal Website](https://juliovega.info/)
|
||||
|
||||
### Meng Li
|
||||
|
||||
??? abstract "About"
|
||||
Meng Li received her Master of Science degree in Information Science from the University of Pittsburgh. She is interested in applying machine learning algorithms to the medical field.
|
||||
|
||||
- *lim11* at *upmc* . *edu*
|
||||
- [Linkedin Profile](https://www.linkedin.com/in/meng-li-57238414a)
|
||||
- [Github Profile](https://github.com/Meng6)
|
||||
|
||||
### Abhineeth Reddy Kunta
|
||||
|
|
|
@ -1,39 +0,0 @@
|
|||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
||||
# !
|
||||
"""
|
||||
Please do not make any changes, as RAPIDS is running on tmux server ...
|
||||
"""
|
||||
# !
|
134
environment.yml
|
@ -1,30 +1,112 @@
|
|||
name: rapids
|
||||
name: rapids202108
|
||||
channels:
|
||||
- conda-forge
|
||||
- defaults
|
||||
dependencies:
|
||||
- auto-sklearn
|
||||
- hmmlearn
|
||||
- imbalanced-learn
|
||||
- jsonschema
|
||||
- lightgbm
|
||||
- matplotlib
|
||||
- numpy
|
||||
- pandas
|
||||
- peakutils
|
||||
- pip
|
||||
- plotly
|
||||
- python-dateutil
|
||||
- pytz
|
||||
- pywavelets
|
||||
- pyyaml
|
||||
- scikit-learn
|
||||
- scipy
|
||||
- seaborn
|
||||
- setuptools
|
||||
- bioconda::snakemake
|
||||
- bioconda::snakemake-minimal
|
||||
- tqdm
|
||||
- xgboost
|
||||
- _py-xgboost-mutex=2.0
|
||||
- appdirs=1.4.4
|
||||
- arrow=0.16.0
|
||||
- asn1crypto=1.4.0
|
||||
- astropy=4.2.1
|
||||
- attrs=20.3.0
|
||||
- binaryornot=0.4.4
|
||||
- blas=1.0
|
||||
- brotlipy=0.7.0
|
||||
- bzip2=1.0.8
|
||||
- ca-certificates=2021.7.5
|
||||
- certifi=2021.5.30
|
||||
- cffi=1.14.4
|
||||
- chardet=3.0.4
|
||||
- click=7.1.2
|
||||
- cookiecutter=1.6.0
|
||||
- cryptography=3.3.1
|
||||
- datrie=0.8.2
|
||||
- docutils=0.16
|
||||
- future=0.18.2
|
||||
- gitdb=4.0.5
|
||||
- gitdb2=4.0.2
|
||||
- gitpython=3.1.11
|
||||
- idna=2.10
|
||||
- imbalanced-learn=0.6.2
|
||||
- importlib-metadata=2.0.0
|
||||
- importlib_metadata=2.0.0
|
||||
- intel-openmp=2019.4
|
||||
- jinja2=2.11.2
|
||||
- jinja2-time=0.2.0
|
||||
- joblib=1.0.0
|
||||
- jsonschema=3.2.0
|
||||
- libblas=3.8.0
|
||||
- libcblas=3.8.0
|
||||
- libcxx=10.0.0
|
||||
- libedit=3.1.20191231
|
||||
- libffi=3.3
|
||||
- libgfortran
|
||||
- liblapack=3.8.0
|
||||
- libopenblas=0.3.10
|
||||
- libxgboost=0.90
|
||||
- lightgbm=3.1.1
|
||||
- llvm-openmp=10.0.0
|
||||
- markupsafe=1.1.1
|
||||
- mkl
|
||||
- mkl-service=2.3.0
|
||||
- mkl_fft=1.2.0
|
||||
- mkl_random=1.1.1
|
||||
- more-itertools=8.6.0
|
||||
- ncurses=6.2
|
||||
- numpy=1.19.2
|
||||
- numpy-base=1.19.2
|
||||
- openblas=0.3.4
|
||||
- openssl=1.1.1k
|
||||
- pandas=1.1.5
|
||||
- pbr=5.5.1
|
||||
- pip=20.3.3
|
||||
- plotly=4.14.1
|
||||
- poyo=0.5.0
|
||||
- psutil=5.7.2
|
||||
- py-xgboost=0.90
|
||||
- pycparser=2.20
|
||||
- pyerfa=1.7.1.1
|
||||
- pyopenssl=20.0.1
|
||||
- pysocks=1.7.1
|
||||
- python=3.7.9
|
||||
- python-dateutil=2.8.1
|
||||
- python_abi=3.7
|
||||
- pytz=2020.4
|
||||
- pyyaml=5.3.1
|
||||
- readline=8.0
|
||||
- requests=2.25.0
|
||||
- retrying=1.3.3
|
||||
- scikit-learn=0.23.2
|
||||
- scipy=1.5.2
|
||||
- setuptools=51.0.0
|
||||
- six=1.15.0
|
||||
- smmap=3.0.4
|
||||
- smmap2=3.0.1
|
||||
- sqlite=3.33.0
|
||||
- threadpoolctl=2.1.0
|
||||
- tk=8.6.10
|
||||
- tqdm=4.62.0
|
||||
- urllib3=1.25.11
|
||||
- wheel=0.36.2
|
||||
- whichcraft=0.6.1
|
||||
- wrapt=1.12.1
|
||||
- xgboost=0.90
|
||||
- xz=5.2.5
|
||||
- yaml=0.2.5
|
||||
- zipp=3.4.0
|
||||
- zlib=1.2.11
|
||||
- pip:
|
||||
- biosppy
|
||||
- cr_features>=0.2
|
||||
- amply==0.1.4
|
||||
- configargparse==0.15.1
|
||||
- decorator==4.4.2
|
||||
- ipython-genutils==0.2.0
|
||||
- jupyter-core==4.6.3
|
||||
- nbformat==5.0.7
|
||||
- pulp==2.4
|
||||
- pyparsing==2.4.7
|
||||
- pyrsistent==0.15.5
|
||||
- ratelimiter==1.2.0.post0
|
||||
- snakemake==5.30.2
|
||||
- toposort==1.5
|
||||
- traitlets==4.3.3
|
||||
prefix: /usr/local/Caskroom/miniconda/base/envs/rapids202108
|
||||
|
|
|
@ -3,7 +3,7 @@ include: "../rules/common.smk"
|
|||
include: "../rules/renv.smk"
|
||||
include: "../rules/preprocessing.smk"
|
||||
include: "../rules/features.smk"
|
||||
include: "../rules/models_example.smk"
|
||||
include: "../rules/models.smk"
|
||||
include: "../rules/reports.smk"
|
||||
|
||||
import itertools
|
||||
|
|
|
@ -84,7 +84,6 @@ PHONE_APPLICATIONS_CRASHES:
|
|||
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
|
||||
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
PACKAGE_NAMES_HASHED: False
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
|
||||
|
||||
|
@ -94,7 +93,6 @@ PHONE_APPLICATIONS_FOREGROUND:
|
|||
APPLICATION_CATEGORIES:
|
||||
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
|
||||
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
|
||||
PACKAGE_NAMES_HASHED: False
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
PROVIDERS:
|
||||
|
@ -122,7 +120,6 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
|
|||
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
|
||||
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
|
||||
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
|
||||
PACKAGE_NAMES_HASHED: False
|
||||
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
|
||||
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
|
||||
|
||||
|
@ -524,7 +521,7 @@ HEATMAP_SENSORS_PER_MINUTE_PER_TIME_SEGMENT:
|
|||
|
||||
# See https://www.rapids.science/latest/visualizations/data-quality-visualizations/#4-heatmap-of-sensor-row-count
|
||||
HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT:
|
||||
PLOT: False
|
||||
PLOT: True
|
||||
SENSORS: [PHONE_ACTIVITY_RECOGNITION, PHONE_APPLICATIONS_FOREGROUND, PHONE_BATTERY, PHONE_BLUETOOTH, PHONE_CALLS, PHONE_CONVERSATION, PHONE_LIGHT, PHONE_LOCATIONS, PHONE_MESSAGES, PHONE_SCREEN, PHONE_WIFI_CONNECTED, PHONE_WIFI_VISIBLE]
|
||||
|
||||
# Features ------
|
||||
|
@ -551,7 +548,7 @@ ALL_CLEANING_INDIVIDUAL:
|
|||
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
COLS_VAR_THRESHOLD: True
|
||||
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
|
||||
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
|
||||
DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
|
||||
DROP_HIGHLY_CORRELATED_FEATURES:
|
||||
COMPUTE: False
|
||||
|
@ -569,7 +566,7 @@ ALL_CLEANING_OVERALL:
|
|||
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
COLS_VAR_THRESHOLD: True
|
||||
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
|
||||
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
|
||||
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
|
||||
DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
|
||||
DROP_HIGHLY_CORRELATED_FEATURES:
|
||||
COMPUTE: False
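The two remaining cleaning options shown here, `COLS_VAR_THRESHOLD` and `DROP_HIGHLY_CORRELATED_FEATURES`, drop constant columns and one column out of every highly correlated pair. A hedged pandas sketch of that idea follows; the 0.95 cut-off and the function name are illustrative assumptions, not values taken from this config.

```python
import numpy as np
import pandas as pd

CORR_THRESHOLD = 0.95  # hypothetical cut-off, configure to taste

def drop_constant_and_correlated(features: pd.DataFrame) -> pd.DataFrame:
    # COLS_VAR_THRESHOLD: drop columns whose values never change.
    features = features.loc[:, features.nunique(dropna=True) > 1]
    # DROP_HIGHLY_CORRELATED_FEATURES: for each pair of numeric columns whose
    # absolute correlation exceeds the threshold, drop one of the two.
    corr = features.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > CORR_THRESHOLD).any()]
    return features.drop(columns=to_drop)
```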
|
||||
|
|
|
@ -85,7 +85,6 @@ nav:
|
|||
- Introduction: datastreams/data-streams-introduction.md
|
||||
- Phone:
|
||||
- aware_mysql: datastreams/aware-mysql.md
|
||||
- aware_micro_mysql: datastreams/aware-micro-mysql.md
|
||||
- aware_csv: datastreams/aware-csv.md
|
||||
- aware_influxdb (beta): datastreams/aware-influxdb.md
|
||||
- Mandatory Phone Format: datastreams/mandatory-phone-format.md
|
||||
|
|
|
@ -1,33 +0,0 @@
|
|||
Warning: 1241 parsing failures.
|
||||
row col expected actual file
|
||||
1 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
|
||||
2 is_system_app an integer FALSE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
|
||||
3 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
|
||||
4 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
|
||||
5 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
|
||||
... ............. .......... ...... ...............................................................................
|
||||
See problems(...) for more details.
|
||||
|
||||
Warning message:
|
||||
The following named parsers don't match the column names: application_name
|
||||
Error: Problem with `filter()` input `..1`.
|
||||
✖ object 'application_name' not found
|
||||
ℹ Input `..1` is `!is.na(application_name)`.
|
||||
Backtrace:
|
||||
█
|
||||
1. ├─`%>%`(...)
|
||||
2. ├─dplyr::mutate(...)
|
||||
3. ├─utils::head(., -1)
|
||||
4. ├─dplyr::select(., -c("timestamp"))
|
||||
5. ├─dplyr::filter(., !is.na(application_name))
|
||||
6. ├─dplyr:::filter.data.frame(., !is.na(application_name))
|
||||
7. │ └─dplyr:::filter_rows(.data, ...)
|
||||
8. │ ├─base::withCallingHandlers(...)
|
||||
9. │ └─mask$eval_all_filter(dots, env_filter)
|
||||
10. └─base::.handleSimpleError(...)
|
||||
11. └─dplyr:::h(simpleError(msg, call))
|
||||
Execution halted
|
||||
[Mon Dec 13 17:19:06 2021]
|
||||
Error in rule app_episodes:
|
||||
jobid: 54
|
||||
output: data/interim/p011/phone_app_episodes.csv
|
|
@ -1,5 +0,0 @@
|
|||
Warning message:
|
||||
In barnett_daily_features(snakemake) :
|
||||
Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:
|
||||
Location data rows within a daily time segment: 0
|
||||
Location data time span in days: 398.6
|
|
@ -15,6 +15,9 @@ local({
|
|||
Sys.setenv("RENV_R_INITIALIZING" = "true")
|
||||
on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE)
|
||||
|
||||
if(grepl("Darwin", Sys.info()["sysname"], fixed = TRUE) & grepl("ARM64", Sys.info()["version"], fixed = TRUE)) # M1 Macs
|
||||
Sys.setenv("TZDIR" = file.path(R.home(), "share", "zoneinfo"))
|
||||
|
||||
# signal that we've consented to use renv
|
||||
options(renv.consent = TRUE)
|
||||
|
||||
|
|
|
@ -40,17 +40,6 @@ def find_features_files(wildcards):
|
|||
feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower()))
|
||||
return(feature_files)
|
||||
|
||||
def find_joint_non_empatica_sensor_files(wildcards):
|
||||
joined_files = []
|
||||
for config_key in config.keys():
|
||||
if config_key.startswith(("PHONE", "FITBIT")) and "PROVIDERS" in config[config_key] and isinstance(config[config_key]["PROVIDERS"], dict):
|
||||
for provider_key, provider in config[config_key]["PROVIDERS"].items():
|
||||
if "COMPUTE" in provider.keys() and provider["COMPUTE"]:
|
||||
joined_files.append("data/processed/features/{pid}/" + config_key.lower() + ".csv")
|
||||
break
|
||||
return joined_files
|
||||
|
||||
|
||||
def optional_steps_sleep_input(wildcards):
|
||||
if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
|
||||
return "data/raw/{pid}/fitbit_sleep_summary_raw.csv"
|
||||
|
@ -125,16 +114,7 @@ def input_tzcodes_file(wilcards):
|
|||
if not config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"].lower().endswith(".csv"):
|
||||
raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, instead you typed: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
|
||||
if not Path(config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]).exists():
|
||||
try:
|
||||
config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
|
||||
except KeyError:
|
||||
raise ValueError("To create TZCODES_FILE, a list of timezones should be created " +
|
||||
"with the rule preprocessing.smk/prepare_tzcodes_file " +
|
||||
"which will create a file specified as config['TIMEZONE']['MULTIPLE']['TZ_FILE']." +
|
||||
"\n An alternative is to provide the file manually:" +
|
||||
"[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file," +
|
||||
"but the file in the path you typed does not exist: " +
|
||||
config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
|
||||
raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, the file in the path you typed does not exist: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
|
||||
return [config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]]
|
||||
return []
|
||||
|
||||
|
|
|
@ -324,40 +324,6 @@ rule conversation_r_features:
|
|||
script:
|
||||
"../src/features/entry.R"
|
||||
|
||||
rule preprocess_esm:
|
||||
input: "data/raw/{pid}/phone_esm_with_datetime.csv"
|
||||
params:
|
||||
scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"]["STRAW"]["SCALES"]
|
||||
output: "data/interim/{pid}/phone_esm_clean.csv"
|
||||
script:
|
||||
"../src/features/phone_esm/straw/preprocess.py"
|
||||
|
||||
rule esm_features:
|
||||
input:
|
||||
sensor_data = "data/interim/{pid}/phone_esm_clean.csv",
|
||||
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
|
||||
params:
|
||||
provider = lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()],
|
||||
provider_key = "{provider_key}",
|
||||
sensor_key = "phone_esm",
|
||||
scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()]["SCALES"]
|
||||
output: "data/interim/{pid}/phone_esm_features/phone_esm_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
rule phone_speech_python_features:
|
||||
input:
|
||||
sensor_data = "data/raw/{pid}/phone_speech_with_datetime.csv",
|
||||
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
|
||||
params:
|
||||
provider = lambda wildcards: config["PHONE_SPEECH"]["PROVIDERS"][wildcards.provider_key.upper()],
|
||||
provider_key = "{provider_key}",
|
||||
sensor_key = "phone_speech"
|
||||
output:
|
||||
"data/interim/{pid}/phone_speech_features/phone_speech_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
rule phone_keyboard_python_features:
|
||||
input:
|
||||
sensor_data = "data/raw/{pid}/phone_keyboard_with_datetime.csv",
|
||||
|
@ -804,8 +770,7 @@ rule empatica_accelerometer_python_features:
|
|||
provider_key = "{provider_key}",
|
||||
sensor_key = "empatica_accelerometer"
|
||||
output:
|
||||
"data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv",
|
||||
"data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}_windows.csv"
|
||||
"data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
|
@ -831,8 +796,7 @@ rule empatica_heartrate_python_features:
|
|||
provider_key = "{provider_key}",
|
||||
sensor_key = "empatica_heartrate"
|
||||
output:
|
||||
"data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv",
|
||||
"data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}_windows.csv"
|
||||
"data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
|
@ -858,8 +822,7 @@ rule empatica_temperature_python_features:
|
|||
provider_key = "{provider_key}",
|
||||
sensor_key = "empatica_temperature"
|
||||
output:
|
||||
"data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv",
|
||||
"data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}_windows.csv"
|
||||
"data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
|
@ -885,8 +848,7 @@ rule empatica_electrodermal_activity_python_features:
|
|||
provider_key = "{provider_key}",
|
||||
sensor_key = "empatica_electrodermal_activity"
|
||||
output:
|
||||
"data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv",
|
||||
"data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}_windows.csv"
|
||||
"data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
|
@ -912,8 +874,7 @@ rule empatica_blood_volume_pulse_python_features:
|
|||
provider_key = "{provider_key}",
|
||||
sensor_key = "empatica_blood_volume_pulse"
|
||||
output:
|
||||
"data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv",
|
||||
"data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}_windows.csv"
|
||||
"data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
|
@ -939,8 +900,7 @@ rule empatica_inter_beat_interval_python_features:
|
|||
provider_key = "{provider_key}",
|
||||
sensor_key = "empatica_inter_beat_interval"
|
||||
output:
|
||||
"data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv",
|
||||
"data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}_windows.csv"
|
||||
"data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.py"
|
||||
|
||||
|
@ -1007,12 +967,11 @@ rule clean_sensor_features_for_individual_participants:
|
|||
params:
|
||||
provider = lambda wildcards: config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][wildcards.provider_key.upper()],
|
||||
provider_key = "{provider_key}",
|
||||
script_extension = "{script_extension}",
|
||||
sensor_key = "all_cleaning_individual"
|
||||
output:
|
||||
"data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv"
|
||||
"data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.{params.script_extension}"
|
||||
"../src/features/entry.R"
|
||||
|
||||
rule clean_sensor_features_for_all_participants:
|
||||
input:
|
||||
|
@ -1020,10 +979,9 @@ rule clean_sensor_features_for_all_participants:
|
|||
params:
|
||||
provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()],
|
||||
provider_key = "{provider_key}",
|
||||
script_extension = "{script_extension}",
|
||||
sensor_key = "all_cleaning_overall",
|
||||
target = "{target}"
|
||||
sensor_key = "all_cleaning_overall"
|
||||
output:
|
||||
"data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}_({target}).csv"
|
||||
"data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}.csv"
|
||||
script:
|
||||
"../src/features/entry.{params.script_extension}"
|
||||
"../src/features/entry.R"
|
||||
|
||||
|
|
147
rules/models.smk
|
@ -1,52 +1,139 @@
|
|||
rule merge_baseline_data:
|
||||
input:
|
||||
data = expand(config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/{container}", container=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["CONTAINER"])
|
||||
output:
|
||||
"data/raw/baseline_merged.csv"
|
||||
script:
|
||||
"../src/data/merge_baseline_data.py"
|
||||
|
||||
rule download_baseline_data:
|
||||
rule download_demographic_data:
|
||||
input:
|
||||
participant_file = "data/external/participant_files/{pid}.yaml",
|
||||
data = "data/raw/baseline_merged.csv"
|
||||
data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
|
||||
output:
|
||||
"data/raw/{pid}/participant_baseline_raw.csv"
|
||||
"data/raw/{pid}/participant_info_raw.csv"
|
||||
script:
|
||||
"../src/data/download_baseline_data.py"
|
||||
"../src/data/workflow_example/download_demographic_data.R"
|
||||
|
||||
rule baseline_features:
|
||||
rule demographic_features:
|
||||
input:
|
||||
"data/raw/{pid}/participant_baseline_raw.csv"
|
||||
participant_info = "data/raw/{pid}/participant_info_raw.csv"
|
||||
params:
|
||||
pid="{pid}",
|
||||
features=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FEATURES"],
|
||||
question_filename=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["QUESTION_LIST"]
|
||||
pid = "{pid}",
|
||||
features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
|
||||
output:
|
||||
interim="data/interim/{pid}/baseline_questionnaires.csv",
|
||||
features="data/processed/features/{pid}/baseline_features.csv"
|
||||
"data/processed/features/{pid}/demographic_features.csv"
|
||||
script:
|
||||
"../src/data/baseline_features.py"
|
||||
"../src/features/workflow_example/demographic_features.py"
|
||||
|
||||
rule select_target:
|
||||
rule download_target_data:
|
||||
input:
|
||||
cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv"
|
||||
participant_file = "data/external/participant_files/{pid}.yaml",
|
||||
data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
|
||||
output:
|
||||
"data/raw/{pid}/participant_target_raw.csv"
|
||||
script:
|
||||
"../src/data/workflow_example/download_target_data.R"
|
||||
|
||||
rule target_readable_datetime:
|
||||
input:
|
||||
sensor_input = "data/raw/{pid}/participant_target_raw.csv",
|
||||
time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
|
||||
pid_file = "data/external/participant_files/{pid}.yaml",
|
||||
tzcodes_file = input_tzcodes_file,
|
||||
params:
|
||||
target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"]
|
||||
device_type = "fitbit",
|
||||
timezone_parameters = config["TIMEZONE"],
|
||||
pid = "{pid}",
|
||||
time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
|
||||
include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
|
||||
output:
|
||||
"data/raw/{pid}/participant_target_with_datetime.csv"
|
||||
script:
|
||||
"../src/data/datetime/readable_datetime.R"
|
||||
|
||||
rule parse_targets:
|
||||
input:
|
||||
targets = "data/raw/{pid}/participant_target_with_datetime.csv",
|
||||
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
|
||||
output:
|
||||
"data/processed/targets/{pid}/parsed_targets.csv"
|
||||
script:
|
||||
"../src/models/workflow_example/parse_targets.py"
|
||||
|
||||
rule merge_features_and_targets_for_individual_model:
|
||||
input:
|
||||
cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_rapids.csv",
|
||||
targets = "data/processed/targets/{pid}/parsed_targets.csv",
|
||||
output:
|
||||
"data/processed/models/individual_model/{pid}/input.csv"
|
||||
script:
|
||||
"../src/models/select_targets.py"
|
||||
"../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
|
||||
|
||||
rule merge_features_and_targets_for_population_model:
|
||||
input:
|
||||
cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_({target}).csv",
|
||||
demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]),
|
||||
params:
|
||||
target_variable="{target}"
|
||||
cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_rapids.csv",
|
||||
demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
|
||||
targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
|
||||
output:
|
||||
"data/processed/models/population_model/input_{target}.csv"
|
||||
"data/processed/models/population_model/input.csv"
|
||||
script:
|
||||
"../src/models/merge_features_and_targets_for_population_model.py"
|
||||
"../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
|
||||
|
||||
rule baselines_for_individual_model:
|
||||
input:
|
||||
"data/processed/models/individual_model/{pid}/input.csv"
|
||||
params:
|
||||
cv_method = "{cv_method}",
|
||||
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
|
||||
output:
|
||||
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
|
||||
log:
|
||||
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/baselines.py"
|
||||
|
||||
rule baselines_for_population_model:
|
||||
input:
|
||||
"data/processed/models/population_model/input.csv"
|
||||
params:
|
||||
cv_method = "{cv_method}",
|
||||
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
|
||||
output:
|
||||
"data/processed/models/population_model/output_{cv_method}/baselines.csv"
|
||||
log:
|
||||
"data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/baselines.py"
|
||||
|
||||
rule modelling_for_individual_participants:
|
||||
input:
|
||||
data = "data/processed/models/individual_model/{pid}/input.csv"
|
||||
params:
|
||||
model = "{model}",
|
||||
cv_method = "{cv_method}",
|
||||
scaler = "{scaler}",
|
||||
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
|
||||
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
|
||||
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
|
||||
output:
|
||||
fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
|
||||
fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
|
||||
overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
|
||||
fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
|
||||
log:
|
||||
"data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/modelling.py"
|
||||
|
||||
rule modelling_for_all_participants:
|
||||
input:
|
||||
data = "data/processed/models/population_model/input.csv"
|
||||
params:
|
||||
model = "{model}",
|
||||
cv_method = "{cv_method}",
|
||||
scaler = "{scaler}",
|
||||
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
|
||||
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
|
||||
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
|
||||
output:
|
||||
fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
|
||||
fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
|
||||
overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
|
||||
fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
|
||||
log:
|
||||
"data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/modelling.py"
|
||||
|
|
|
@ -1,139 +0,0 @@
|
|||
rule download_demographic_data:
|
||||
input:
|
||||
participant_file = "data/external/participant_files/{pid}.yaml",
|
||||
data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
|
||||
output:
|
||||
"data/raw/{pid}/participant_info_raw.csv"
|
||||
script:
|
||||
"../src/data/workflow_example/download_demographic_data.R"
|
||||
|
||||
rule demographic_features:
|
||||
input:
|
||||
participant_info = "data/raw/{pid}/participant_info_raw.csv"
|
||||
params:
|
||||
pid = "{pid}",
|
||||
features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
|
||||
output:
|
||||
"data/processed/features/{pid}/demographic_features.csv"
|
||||
script:
|
||||
"../src/features/workflow_example/demographic_features.py"
|
||||
|
||||
rule download_target_data:
|
||||
input:
|
||||
participant_file = "data/external/participant_files/{pid}.yaml",
|
||||
data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
|
||||
output:
|
||||
"data/raw/{pid}/participant_target_raw.csv"
|
||||
script:
|
||||
"../src/data/workflow_example/download_target_data.R"
|
||||
|
||||
rule target_readable_datetime:
|
||||
input:
|
||||
sensor_input = "data/raw/{pid}/participant_target_raw.csv",
|
||||
time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
|
||||
pid_file = "data/external/participant_files/{pid}.yaml",
|
||||
tzcodes_file = input_tzcodes_file,
|
||||
params:
|
||||
device_type = "fitbit",
|
||||
timezone_parameters = config["TIMEZONE"],
|
||||
pid = "{pid}",
|
||||
time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
|
||||
include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
|
||||
output:
|
||||
"data/raw/{pid}/participant_target_with_datetime.csv"
|
||||
script:
|
||||
"../src/data/datetime/readable_datetime.R"
|
||||
|
||||
rule parse_targets:
|
||||
input:
|
||||
targets = "data/raw/{pid}/participant_target_with_datetime.csv",
|
||||
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
|
||||
output:
|
||||
"data/processed/targets/{pid}/parsed_targets.csv"
|
||||
script:
|
||||
"../src/models/workflow_example/parse_targets.py"
|
||||
|
||||
rule merge_features_and_targets_for_individual_model:
|
||||
input:
|
||||
cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_rapids.csv",
|
||||
targets = "data/processed/targets/{pid}/parsed_targets.csv",
|
||||
output:
|
||||
"data/processed/models/individual_model/{pid}/input.csv"
|
||||
script:
|
||||
"../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
|
||||
|
||||
rule merge_features_and_targets_for_population_model:
|
||||
input:
|
||||
cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_rapids.csv",
|
||||
demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
|
||||
targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
|
||||
output:
|
||||
"data/processed/models/population_model/input.csv"
|
||||
script:
|
||||
"../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
|
||||
|
||||
rule baselines_for_individual_model:
|
||||
input:
|
||||
"data/processed/models/individual_model/{pid}/input.csv"
|
||||
params:
|
||||
cv_method = "{cv_method}",
|
||||
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
|
||||
output:
|
||||
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
|
||||
log:
|
||||
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/baselines.py"
|
||||
|
||||
rule baselines_for_population_model:
|
||||
input:
|
||||
"data/processed/models/population_model/input.csv"
|
||||
params:
|
||||
cv_method = "{cv_method}",
|
||||
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
|
||||
output:
|
||||
"data/processed/models/population_model/output_{cv_method}/baselines.csv"
|
||||
log:
|
||||
"data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/baselines.py"
|
||||
|
||||
rule modelling_for_individual_participants:
|
||||
input:
|
||||
data = "data/processed/models/individual_model/{pid}/input.csv"
|
||||
params:
|
||||
model = "{model}",
|
||||
cv_method = "{cv_method}",
|
||||
scaler = "{scaler}",
|
||||
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
|
||||
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
|
||||
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
|
||||
output:
|
||||
fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
|
||||
fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
|
||||
overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
|
||||
fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
|
||||
log:
|
||||
"data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/modelling.py"
|
||||
|
||||
rule modelling_for_all_participants:
|
||||
input:
|
||||
data = "data/processed/models/population_model/input.csv"
|
||||
params:
|
||||
model = "{model}",
|
||||
cv_method = "{cv_method}",
|
||||
scaler = "{scaler}",
|
||||
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
|
||||
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
|
||||
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
|
||||
output:
|
||||
fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
|
||||
fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
|
||||
overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
|
||||
fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
|
||||
log:
|
||||
"data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
|
||||
script:
|
||||
"../src/models/workflow_example/modelling.py"
|
|
@ -4,36 +4,6 @@ rule create_example_participant_files:
|
|||
shell:
|
||||
"echo 'PHONE:\n DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n PLATFORMS: [android]\n LABEL: test-01\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\nFITBIT:\n DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n LABEL: test-01\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example01.yaml && echo 'PHONE:\n DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n PLATFORMS: [ios]\n LABEL: test-02\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\nFITBIT:\n DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n LABEL: test-02\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example02.yaml"
|
||||
|
||||
# rule query_usernames_device_empatica_ids:
|
||||
# params:
|
||||
# baseline_folder = "/mnt/e/STRAWbaseline/"
|
||||
# output:
|
||||
# usernames_file = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"],
|
||||
# timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
|
||||
# script:
|
||||
# "../../participants/prepare_usernames_file.py"
|
||||
|
||||
rule prepare_tzcodes_file:
|
||||
input:
|
||||
timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
|
||||
output:
|
||||
tzcodes_file = config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]
|
||||
script:
|
||||
"../tools/create_multi_timezones_file.py"
|
||||
|
||||
rule prepare_participants_csv:
|
||||
input:
|
||||
username_list = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"]
|
||||
params:
|
||||
data_configuration = config["PHONE_DATA_STREAMS"][config["PHONE_DATA_STREAMS"]["USE"]],
|
||||
participants_table = "participants",
|
||||
device_id_table = "esm",
|
||||
start_end_date_table = "esm"
|
||||
output:
|
||||
participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
|
||||
script:
|
||||
"../src/data/translate_usernames_into_participants_data.R"
|
||||
|
||||
rule create_participants_files:
|
||||
input:
|
||||
participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
|
||||
|
@ -177,6 +147,7 @@ rule resample_episodes_with_datetime:
|
|||
script:
|
||||
"../src/data/datetime/readable_datetime.R"
|
||||
|
||||
|
||||
rule phone_application_categories:
|
||||
input:
|
||||
"data/raw/{pid}/phone_applications_{type}_with_datetime.csv"
|
||||
|
@ -247,33 +218,5 @@ rule empatica_readable_datetime:
|
|||
include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
|
||||
output:
|
||||
"data/raw/{pid}/empatica_{sensor}_with_datetime.csv"
|
||||
resources:
|
||||
mem_mb=50000
|
||||
script:
|
||||
"../src/data/datetime/readable_datetime.R"
|
||||
|
||||
|
||||
rule extract_event_information_from_esm:
|
||||
input:
|
||||
esm_raw_input = "data/raw/{pid}/phone_esm_raw.csv",
|
||||
pid_file = "data/external/participant_files/{pid}.yaml"
|
||||
params:
|
||||
stage = "extract",
|
||||
pid = "{pid}"
|
||||
output:
|
||||
"data/raw/ers/{pid}_ers.csv",
|
||||
"data/raw/ers/{pid}_stress_event_targets.csv"
|
||||
script:
|
||||
"../src/features/phone_esm/straw/process_user_event_related_segments.py"
|
||||
|
||||
rule merge_event_related_segments_files:
|
||||
input:
|
||||
ers_files = expand("data/raw/ers/{pid}_ers.csv", pid=config["PIDS"]),
|
||||
se_files = expand("data/raw/ers/{pid}_stress_event_targets.csv", pid=config["PIDS"])
|
||||
params:
|
||||
stage = "merge"
|
||||
output:
|
||||
"data/external/straw_events.csv",
|
||||
"data/external/stress_event_targets.csv"
|
||||
script:
|
||||
"../src/features/phone_esm/straw/process_user_event_related_segments.py"
|
|
@ -1,182 +0,0 @@
|
|||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
pid = snakemake.params["pid"]
|
||||
requested_features = snakemake.params["features"]
|
||||
baseline_interim = pd.DataFrame(columns=["qid", "question", "score_original", "score"])
|
||||
baseline_features = pd.DataFrame(columns=requested_features)
|
||||
question_filename = snakemake.params["question_filename"]
|
||||
|
||||
JCQ_DEMAND = "JobEisen"
|
||||
JCQ_CONTROL = "JobControle"
|
||||
|
||||
dict_JCQ_demand_control_reverse = {
|
||||
JCQ_DEMAND: {
|
||||
3: " [Od mene se ne zahteva,",
|
||||
4: " [Imam dovolj časa, da končam",
|
||||
5: " [Pri svojem delu se ne srečujem s konfliktnimi",
|
||||
},
|
||||
JCQ_CONTROL: {
|
||||
2: " |Moje delo vključuje veliko ponavljajočega",
|
||||
6: " [Pri svojem delu imam zelo malo svobode",
|
||||
},
|
||||
}
|
||||
|
||||
LIMESURVEY_JCQ_MIN = 1
|
||||
LIMESURVEY_JCQ_MAX = 4
|
||||
|
||||
DEMAND_CONTROL_RATIO_MIN = 5 / (9 * 4)
|
||||
DEMAND_CONTROL_RATIO_MAX = (4 * 5) / 9
|
||||
|
||||
JCQ_NORMS = {
|
||||
"F": {
|
||||
0: DEMAND_CONTROL_RATIO_MIN,
|
||||
1: 0.45,
|
||||
2: 0.52,
|
||||
3: 0.62,
|
||||
4: DEMAND_CONTROL_RATIO_MAX,
|
||||
},
|
||||
"M": {
|
||||
0: DEMAND_CONTROL_RATIO_MIN,
|
||||
1: 0.41,
|
||||
2: 0.48,
|
||||
3: 0.56,
|
||||
4: DEMAND_CONTROL_RATIO_MAX,
|
||||
},
|
||||
}
|
||||
|
||||
participant_info = pd.read_csv(snakemake.input[0], parse_dates=["date_of_birth"])
|
||||
|
||||
if not participant_info.empty:
|
||||
if "age" in requested_features:
|
||||
now = pd.Timestamp("now")
|
||||
baseline_features.loc[0, "age"] = (
|
||||
now - participant_info.loc[0, "date_of_birth"]
|
||||
).days / 365.25245
|
||||
if "gender" in requested_features:
|
||||
baseline_features.loc[0, "gender"] = participant_info.loc[0, "gender"]
|
||||
if "startlanguage" in requested_features:
|
||||
baseline_features.loc[0, "startlanguage"] = participant_info.loc[
|
||||
0, "startlanguage"
|
||||
]
|
||||
if (
|
||||
("limesurvey_demand" in requested_features)
|
||||
or ("limesurvey_control" in requested_features)
|
||||
or ("limesurvey_demand_control_ratio" in requested_features)
|
||||
):
|
||||
participant_info_t = participant_info.T
|
||||
rows_baseline = participant_info_t.index
|
||||
|
||||
if ("limesurvey_demand" in requested_features) or (
|
||||
"limesurvey_demand_control_ratio" in requested_features
|
||||
):
|
||||
# Find questions about demand, but disregard time (duration of filling in questionnaire)
|
||||
rows_demand = rows_baseline.str.startswith(
|
||||
JCQ_DEMAND
|
||||
) & ~rows_baseline.str.endswith("Time")
|
||||
limesurvey_demand = (
|
||||
participant_info_t[rows_demand]
|
||||
.reset_index()
|
||||
.rename(columns={"index": "question", 0: "score_original"})
|
||||
)
|
||||
# Extract question IDs from names such as JobEisen[3]
|
||||
limesurvey_demand["qid"] = (
|
||||
limesurvey_demand["question"].str.extract(r"\[(\d+)\]").astype(int)
|
||||
)
|
||||
limesurvey_demand["score"] = limesurvey_demand["score_original"]
|
||||
# Identify rows that include questions to be reversed.
|
||||
rows_demand_reverse = limesurvey_demand["qid"].isin(
|
||||
dict_JCQ_demand_control_reverse[JCQ_DEMAND].keys()
|
||||
)
|
||||
# Reverse the score, so that the maximum value becomes the minimum etc.
|
||||
limesurvey_demand.loc[rows_demand_reverse, "score"] = (
|
||||
LIMESURVEY_JCQ_MAX
|
||||
+ LIMESURVEY_JCQ_MIN
|
||||
- limesurvey_demand.loc[rows_demand_reverse, "score_original"]
|
||||
)
|
||||
baseline_interim = pd.concat([baseline_interim, limesurvey_demand], axis=0, ignore_index=True)
|
||||
if "limesurvey_demand" in requested_features:
|
||||
baseline_features.loc[0, "limesurvey_demand"] = limesurvey_demand[
|
||||
"score"
|
||||
].sum()
|
||||
|
||||
if ("limesurvey_control" in requested_features) or (
|
||||
"limesurvey_demand_control_ratio" in requested_features
|
||||
):
|
||||
# Find questions about control, but disregard time (duration of filling in questionnaire)
|
||||
rows_control = rows_baseline.str.startswith(
|
||||
JCQ_CONTROL
|
||||
) & ~rows_baseline.str.endswith("Time")
|
||||
limesurvey_control = (
|
||||
participant_info_t[rows_control]
|
||||
.reset_index()
|
||||
.rename(columns={"index": "question", 0: "score_original"})
|
||||
)
|
||||
# Extract question IDs from names such as JobControle[3]
|
||||
limesurvey_control["qid"] = (
|
||||
limesurvey_control["question"].str.extract(r"\[(\d+)\]").astype(int)
|
||||
)
|
||||
limesurvey_control["score"] = limesurvey_control["score_original"]
|
||||
# Identify rows that include questions to be reversed.
|
||||
rows_control_reverse = limesurvey_control["qid"].isin(
|
||||
dict_JCQ_demand_control_reverse[JCQ_CONTROL].keys()
|
||||
)
|
||||
# Reverse the score, so that the maximum value becomes the minimum etc.
|
||||
limesurvey_control.loc[rows_control_reverse, "score"] = (
|
||||
LIMESURVEY_JCQ_MAX
|
||||
+ LIMESURVEY_JCQ_MIN
|
||||
- limesurvey_control.loc[rows_control_reverse, "score_original"]
|
||||
)
|
||||
|
||||
baseline_interim = pd.concat([baseline_interim, limesurvey_control], axis=0, ignore_index=True)
|
||||
|
||||
if "limesurvey_control" in requested_features:
|
||||
baseline_features.loc[0, "limesurvey_control"] = limesurvey_control[
|
||||
"score"
|
||||
].sum()
|
||||
|
||||
if "limesurvey_demand_control_ratio" in requested_features:
|
||||
if limesurvey_control["score"].sum():
|
||||
limesurvey_demand_control_ratio = (
|
||||
limesurvey_demand["score"].sum() / limesurvey_control["score"].sum()
|
||||
)
|
||||
else:
|
||||
limesurvey_demand_control_ratio = 0
|
||||
if (
|
||||
JCQ_NORMS[participant_info.loc[0, "gender"]][0]
|
||||
<= limesurvey_demand_control_ratio
|
||||
< JCQ_NORMS[participant_info.loc[0, "gender"]][1]
|
||||
):
|
||||
limesurvey_quartile = 1
|
||||
elif (
|
||||
JCQ_NORMS[participant_info.loc[0, "gender"]][1]
|
||||
<= limesurvey_demand_control_ratio
|
||||
< JCQ_NORMS[participant_info.loc[0, "gender"]][2]
|
||||
):
|
||||
limesurvey_quartile = 2
|
||||
elif (
|
||||
JCQ_NORMS[participant_info.loc[0, "gender"]][2]
|
||||
<= limesurvey_demand_control_ratio
|
||||
< JCQ_NORMS[participant_info.loc[0, "gender"]][3]
|
||||
):
|
||||
limesurvey_quartile = 3
|
||||
elif (
|
||||
JCQ_NORMS[participant_info.loc[0, "gender"]][3]
|
||||
<= limesurvey_demand_control_ratio
|
||||
< JCQ_NORMS[participant_info.loc[0, "gender"]][4]
|
||||
):
|
||||
limesurvey_quartile = 4
|
||||
else:
|
||||
limesurvey_quartile = np.nan
|
||||
|
||||
baseline_features.loc[
|
||||
0, "limesurvey_demand_control_ratio"
|
||||
] = limesurvey_demand_control_ratio
|
||||
baseline_features.loc[
|
||||
0, "limesurvey_demand_control_ratio_quartile"
|
||||
] = limesurvey_quartile
|
||||
|
||||
if not baseline_interim.empty:
|
||||
baseline_interim.to_csv(snakemake.output["interim"], index=False, encoding="utf-8")
|
||||
|
||||
baseline_features.to_csv(snakemake.output["features"], index=False, encoding="utf-8")
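The gender-specific quartile assignment above is a cut-point lookup against `JCQ_NORMS`. The sketch below reproduces that logic compactly using the cut-points defined in the script; it is illustration only, not part of the pipeline.

```python
from bisect import bisect_right

import numpy as np

# Inner cut-points copied from JCQ_NORMS in the script above.
JCQ_CUTPOINTS = {"F": [0.45, 0.52, 0.62], "M": [0.41, 0.48, 0.56]}
DEMAND_CONTROL_RATIO_MIN = 5 / (9 * 4)
DEMAND_CONTROL_RATIO_MAX = (4 * 5) / 9

def jcq_quartile(gender: str, ratio: float) -> float:
    """Return the demand/control ratio quartile (1-4), or NaN if out of range."""
    if not (DEMAND_CONTROL_RATIO_MIN <= ratio < DEMAND_CONTROL_RATIO_MAX):
        return np.nan
    return bisect_right(JCQ_CUTPOINTS[gender], ratio) + 1

# Example: for gender "F", a ratio of 0.55 lies between 0.52 and 0.62, i.e. quartile 3.
assert jcq_quartile("F", 0.55) == 3
```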
|
|
@ -1,6 +1,6 @@
|
|||
source("renv/activate.R")
|
||||
|
||||
#library(RMariaDB)
|
||||
library(RMariaDB)
|
||||
library(stringr)
|
||||
library(purrr)
|
||||
library(readr)
|
||||
|
@ -58,7 +58,7 @@ participants %>%
|
|||
lines <- append(lines, empty_fitbit)
|
||||
|
||||
if(add_empatica_section == TRUE && !is.na(row[empatica_device_id_column])){
|
||||
lines <- append(lines, c("EMPATICA:", paste0(" DEVICE_IDS: [",row$label,"]"),
|
||||
lines <- append(lines, c("EMPATICA:", paste0(" DEVICE_IDS: [",row[empatica_device_id_column],"]"),
|
||||
paste(" LABEL:",row$label), paste(" START_DATE:", start_date), paste(" END_DATE:", end_date)))
|
||||
} else
|
||||
lines <- append(lines, empty_empatica)
|
||||
|
|
|
@ -5,16 +5,13 @@ options(scipen=999)
|
|||
|
||||
assign_rows_to_segments <- function(data, segments){
|
||||
# This function is used by all segment types, we use data.tables because they are fast
|
||||
|
||||
data <- data.table::as.data.table(data)
|
||||
data[, assigned_segments := ""]
|
||||
for(i in seq_len(nrow(segments))) {
|
||||
segment <- segments[i,]
|
||||
|
||||
data[segment$segment_start_ts<= timestamp & segment$segment_end_ts >= timestamp,
|
||||
assigned_segments := stringi::stri_c(assigned_segments, segment$segment_id, sep = "|")]
|
||||
}
|
||||
|
||||
data[,assigned_segments:=substring(assigned_segments, 2)]
|
||||
data
|
||||
}
|
||||
|
|
|
@ -1,14 +0,0 @@
|
|||
import pandas as pd
|
||||
import yaml
|
||||
|
||||
filename = snakemake.input["data"]
|
||||
baseline = pd.read_csv(filename)
|
||||
|
||||
with open(snakemake.input["participant_file"], "r") as file:
|
||||
participant = yaml.safe_load(file)
|
||||
|
||||
username = participant["PHONE"]["LABEL"]
|
||||
|
||||
baseline[baseline["username"] == username].to_csv(snakemake.output[0],
|
||||
index=False,
|
||||
encoding="utf-8",)
|
|
@ -1,30 +0,0 @@
|
|||
import pandas as pd
|
||||
|
||||
VARIABLES_TO_TRANSLATE = {
|
||||
"Gebruikersnaam": "username",
|
||||
"Geslacht": "gender",
|
||||
"Geboortedatum": "date_of_birth",
|
||||
}
|
||||
|
||||
filenames = snakemake.input["data"]
|
||||
|
||||
baseline_dfs = []
|
||||
|
||||
for fn in filenames:
|
||||
baseline_dfs.append(pd.read_csv(fn,
|
||||
parse_dates=["Geboortedatum"],
|
||||
infer_datetime_format=True,
|
||||
cache_dates=True,
|
||||
))
|
||||
|
||||
baseline = (
|
||||
pd.concat(baseline_dfs, join="inner")
|
||||
.reset_index()
|
||||
.drop(columns="index")
|
||||
)
|
||||
|
||||
baseline.rename(columns=VARIABLES_TO_TRANSLATE, copy=False, inplace=True)
|
||||
|
||||
baseline.to_csv(snakemake.output[0],
|
||||
index=False,
|
||||
encoding="utf-8",)
|
|
@ -1,85 +0,0 @@
|
|||
# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
|
||||
library(RMariaDB)
|
||||
library(yaml)
|
||||
|
||||
#' @description
|
||||
#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
|
||||
#' You can reuse most of this function if you are connecting to a DB or Web API.
|
||||
#' It's OK to delete this function if you don't need credentials, e.g., you are pulling data from a CSV for example.
|
||||
#' @param group the yaml key containing the credentials to connect to a database
|
||||
#' @return dbEngine a database engine (connection) ready to perform queries
|
||||
get_db_engine <- function(group){
|
||||
# The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
|
||||
credentials <- read_yaml("./credentials.yaml")
|
||||
if(!group %in% names(credentials))
|
||||
stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_mysql][DATABASE_GROUP] in config.yaml?"))
|
||||
dbEngine <- dbConnect(MariaDB(), db = credentials[[group]][["database"]],
|
||||
username = credentials[[group]][["user"]],
|
||||
password = credentials[[group]][["password"]],
|
||||
host = credentials[[group]][["host"]],
|
||||
port = credentials[[group]][["port"]])
|
||||
return(dbEngine)
|
||||
}
|
||||
|
||||
# This file gets executed for each PHONE_SENSOR of each participant
|
||||
# If you are connecting to a database the env file containing its credentials is available at "./.env"
|
||||
# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
|
||||
# You are not bound to databases or files, you can query a web API or whatever data source you need.
|
||||
|
||||
#' @description
|
||||
#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
|
||||
#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
|
||||
#'
|
||||
#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed,
|
||||
#' and they have to assign the OS manually in the participant file
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param device A device ID string
|
||||
#' @return The OS the device ran, "android" or "ios"
|
||||
|
||||
infer_device_os <- function(stream_parameters, device){
|
||||
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
|
||||
message(paste0("Executing the following query to infer phone OS: ", query))
|
||||
os <- dbGetQuery(dbEngine, query)
|
||||
dbDisconnect(dbEngine)
|
||||
|
||||
if(nrow(os) > 0)
|
||||
return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
|
||||
else
|
||||
stop(paste("We cannot infer the OS of the following device id because it does not exist in the aware_device table:", device))
|
||||
|
||||
return(os)
|
||||
}
|
||||
|
||||
#' @description
|
||||
#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param device A device ID string
|
||||
#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
|
||||
#' @param columns the columns needed from this sensor (we recommend to only return these columns instead of every column in sensor_container)
|
||||
#' @return A dataframe with the sensor data for device
|
||||
|
||||
pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
|
||||
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
|
||||
select_items <- c()
|
||||
for (column in columns) {
|
||||
select_items <- append(select_items, paste0("data->>'$.", column, "' ", column))
|
||||
}
|
||||
|
||||
query <- paste0("SELECT ", paste(select_items, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
|
||||
|
||||
# Letting the user know what we are doing
|
||||
message(paste0("Executing the following query to download data: ", query))
|
||||
sensor_data <- dbGetQuery(dbEngine, query)
|
||||
|
||||
dbDisconnect(dbEngine)
|
||||
|
||||
if(nrow(sensor_data) == 0)
|
||||
warning(paste("The device '", device,"' did not have data in ", sensor_container))
|
||||
|
||||
return(sensor_data)
|
||||
}
|
||||
|
|
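For reference, the query that pull_data() above assembles for the aware_mysql stream extracts each requested column from the JSON `data` column of an AWARE table. The sketch below reuses the same string-building logic; the container name and device id are made up for illustration.

```r
columns <- list(TIMESTAMP = "timestamp", DEVICE_ID = "device_id", BATTERY_LEVEL = "battery_level")
sensor_container <- "phone_battery"                      # assumed container name
device <- "a748ee1a-1d0b-4ae9-9074-279a2b6ba524"         # made-up device id

select_items <- c()
for (column in columns) {
  select_items <- append(select_items, paste0("data->>'$.", column, "' ", column))
}
query <- paste0("SELECT ", paste(select_items, collapse = ","),
                " FROM ", sensor_container,
                " WHERE ", columns$DEVICE_ID, " = '", device, "'")
cat(query)
# (wrapped here for readability)
# SELECT data->>'$.timestamp' timestamp,data->>'$.device_id' device_id,
#        data->>'$.battery_level' battery_level FROM phone_battery WHERE device_id = '...'
```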
@ -1,337 +0,0 @@
|
|||
PHONE_ACCELEROMETER:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_VALUES_0: double_values_0
|
||||
DOUBLE_VALUES_1: double_values_1
|
||||
DOUBLE_VALUES_2: double_values_2
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_VALUES_0: double_values_0
|
||||
DOUBLE_VALUES_1: double_values_1
|
||||
DOUBLE_VALUES_2: double_values_2
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_ACTIVITY_RECOGNITION:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
ACTIVITY_NAME: activity_name
|
||||
ACTIVITY_TYPE: activity_type
|
||||
CONFIDENCE: confidence
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
ACTIVITY_NAME: FLAG_TO_MUTATE
|
||||
ACTIVITY_TYPE: FLAG_TO_MUTATE
|
||||
CONFIDENCE: FLAG_TO_MUTATE
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
ACTIVITIES: activities
|
||||
CONFIDENCE: confidence
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
- "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
|
||||
|
||||
PHONE_APPLICATIONS_CRASHES:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_name
|
||||
APPLICATION_NAME: application_name
|
||||
APPLICATION_VERSION: application_version
|
||||
ERROR_SHORT: error_short
|
||||
ERROR_LONG: error_long
|
||||
ERROR_CONDITION: error_condition
|
||||
IS_SYSTEM_APP: is_system_app
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_APPLICATIONS_FOREGROUND:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_name
|
||||
APPLICATION_NAME: application_name
|
||||
IS_SYSTEM_APP: is_system_app
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_APPLICATIONS_NOTIFICATIONS:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_name
|
||||
APPLICATION_NAME: application_name
|
||||
TEXT: text
|
||||
SOUND: sound
|
||||
VIBRATE: vibrate
|
||||
DEFAULTS: defaults
|
||||
FLAGS: flags
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_BATTERY:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BATTERY_STATUS: battery_status
|
||||
BATTERY_LEVEL: battery_level
|
||||
BATTERY_SCALE: battery_scale
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BATTERY_STATUS: FLAG_TO_MUTATE
|
||||
BATTERY_LEVEL: battery_level
|
||||
BATTERY_SCALE: battery_scale
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
BATTERY_STATUS: battery_status
|
||||
SCRIPTS:
|
||||
- "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
|
||||
|
||||
PHONE_BLUETOOTH:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BT_ADDRESS: bt_address
|
||||
BT_NAME: bt_name
|
||||
BT_RSSI: bt_rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BT_ADDRESS: bt_address
|
||||
BT_NAME: bt_name
|
||||
BT_RSSI: bt_rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_CALLS:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
CALL_TYPE: call_type
|
||||
CALL_DURATION: call_duration
|
||||
TRACE: trace
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
CALL_TYPE: FLAG_TO_MUTATE
|
||||
CALL_DURATION: call_duration
|
||||
TRACE: trace
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
CALL_TYPE: call_type
|
||||
SCRIPTS:
|
||||
- "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
|
||||
|
||||
PHONE_CONVERSATION:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_ENERGY: double_energy
|
||||
INFERENCE: inference
|
||||
DOUBLE_CONVO_START: double_convo_start
|
||||
DOUBLE_CONVO_END: double_convo_end
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_ENERGY: double_energy
|
||||
INFERENCE: inference
|
||||
DOUBLE_CONVO_START: double_convo_start
|
||||
DOUBLE_CONVO_END: double_convo_end
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
- "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
|
||||
|
||||
PHONE_KEYBOARD:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_name
|
||||
BEFORE_TEXT: before_text
|
||||
CURRENT_TEXT: current_text
|
||||
IS_PASSWORD: is_password
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_LIGHT:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_LIGHT_LUX: double_light_lux
|
||||
ACCURACY: accuracy
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_LOCATIONS:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_LATITUDE: double_latitude
|
||||
DOUBLE_LONGITUDE: double_longitude
|
||||
DOUBLE_BEARING: double_bearing
|
||||
DOUBLE_SPEED: double_speed
|
||||
DOUBLE_ALTITUDE: double_altitude
|
||||
PROVIDER: provider
|
||||
ACCURACY: accuracy
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_LATITUDE: double_latitude
|
||||
DOUBLE_LONGITUDE: double_longitude
|
||||
DOUBLE_BEARING: double_bearing
|
||||
DOUBLE_SPEED: double_speed
|
||||
DOUBLE_ALTITUDE: double_altitude
|
||||
PROVIDER: provider
|
||||
ACCURACY: accuracy
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_LOG:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
LOG_MESSAGE: log_message
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
LOG_MESSAGE: log_message
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_MESSAGES:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
MESSAGE_TYPE: message_type
|
||||
TRACE: trace
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_SCREEN:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SCREEN_STATUS: screen_status
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SCREEN_STATUS: FLAG_TO_MUTATE
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCREEN_STATUS: screen_status
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
- "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
|
||||
|
||||
PHONE_WIFI_CONNECTED:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
MAC_ADDRESS: mac_address
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
MAC_ADDRESS: mac_address
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_WIFI_VISIBLE:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
SECURITY: security
|
||||
FREQUENCY: frequency
|
||||
RSSI: rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
SECURITY: security
|
||||
FREQUENCY: frequency
|
||||
RSSI: rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
|
@ -1,212 +0,0 @@
|
|||
# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
|
||||
library(RPostgres)
|
||||
# Needs libpq-dev for compiling from source.
|
||||
|
||||
# Error installing package 'RPostgres':
|
||||
# =====================================
|
||||
#
|
||||
# * installing *source* package 'RPostgres' ...
|
||||
# ** package 'RPostgres' successfully unpacked and MD5 sums checked
|
||||
# ** using staged installation
|
||||
# Using PKG_CFLAGS=
|
||||
# Using PKG_LIBS=-lpq
|
||||
# Using PKG_PLOGR=
|
||||
# ------------------------- ANTICONF ERROR ---------------------------
|
||||
# Configuration failed because libpq was not found. Try installing:
|
||||
# * deb: libpq-dev (Debian, Ubuntu, etc)
|
||||
# * rpm: postgresql-devel (Fedora, EPEL)
|
||||
# * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
|
||||
# * csw: postgresql_dev (Solaris)
|
||||
# * brew: libpq (OSX)
|
||||
# If libpq is already installed, check that either:
|
||||
# (i) 'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains
|
||||
# a libpq.pc file; or
|
||||
# (ii) 'pg_config' is in your PATH.
|
||||
# If neither can detect libpq, you can set INCLUDE_DIR
|
||||
# and LIB_DIR manually via:
|
||||
# R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
|
||||
# --------------------------[ ERROR MESSAGE ]----------------------------
|
||||
# <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
|
||||
# compilation terminated.
|
||||
|
||||
library(dbplyr)
|
||||
library(yaml)
|
||||
|
||||
#' @description
|
||||
#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
|
||||
#' You can reuse most of this function if you are connecting to a DB or Web API.
|
||||
#' It's OK to delete this function if you don't need credentials, e.g., if you are pulling data from a CSV file.
|
||||
#' @param group the yaml key containing the credentials to connect to a database
|
||||
#' @return dbEngine a database engine (connection) ready to perform queries
|
||||
get_db_engine <- function(group){
|
||||
# The working dir is always the RAPIDS root folder, so your credentials file is always /credentials.yaml
|
||||
credentials <- read_yaml("./credentials.yaml")
|
||||
if(!group %in% names(credentials))
|
||||
stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_mysql][DATABASE_GROUP] in config.yaml?"))
|
||||
dbEngine <- dbConnect(Postgres(), db = credentials[[group]][["database"]],
|
||||
user = credentials[[group]][["user"]],
|
||||
password = credentials[[group]][["password"]],
|
||||
host = credentials[[group]][["host"]],
|
||||
port = credentials[[group]][["port"]])
|
||||
return(dbEngine)
|
||||
}
|
||||
|
||||
# This file gets executed for each PHONE_SENSOR of each participant
|
||||
# If you are connecting to a database the env file containing its credentials is available at "./.env"
|
||||
# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
|
||||
# You are not bound to databases or files, you can query a web API or whatever data source you need.
|
||||
|
||||
#' @description
|
||||
#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
|
||||
#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
|
||||
#'
|
||||
#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed,
|
||||
#' and they have to assign the OS manually in the participant file
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param device A device ID string
|
||||
#' @return The OS the device ran, "android" or "ios"
|
||||
|
||||
infer_device_os <- function(stream_parameters, device){
|
||||
#dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
#query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
|
||||
#message(paste0("Executing the following query to infer phone OS: ", query))
|
||||
#os <- dbGetQuery(dbEngine, query)
|
||||
#dbDisconnect(dbEngine)
|
||||
|
||||
#if(nrow(os) > 0)
|
||||
# return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
|
||||
#else
|
||||
stop(paste("We cannot infer the OS of the following device id because the aware_device table does not exist."))
|
||||
|
||||
#return(os)
|
||||
}
|
||||
|
||||
#' @description
|
||||
#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param device A device ID string
|
||||
#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
|
||||
#' @param columns the columns needed from this sensor (we recommend to only return these columns instead of every column in sensor_container)
|
||||
#' @return A dataframe with the sensor data for device
|
||||
|
||||
pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
|
||||
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
|
||||
# Letting the user know what we are doing
|
||||
message(paste0("Executing the following query to download data: ", query))
|
||||
sensor_data <- dbGetQuery(dbEngine, query)
|
||||
|
||||
dbDisconnect(dbEngine)
|
||||
|
||||
if(nrow(sensor_data) == 0)
|
||||
warning(paste("The device '", device,"' did not have data in ", sensor_container))
|
||||
|
||||
return(sensor_data)
|
||||
}
|
||||
|
||||
#' @description
|
||||
#' Gets participants' IDs for specified usernames.
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param usernames A vector of usernames
|
||||
#' @param participants_container The name of the database table containing participants data, such as their username.
|
||||
#' @return A dataframe with participant IDs matching usernames
|
||||
|
||||
pull_participants_ids <- function(stream_parameters, usernames, participants_container) {
|
||||
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
|
||||
query_participant_id <- tbl(dbEngine, participants_container) %>%
|
||||
filter(username %in% usernames) %>%
|
||||
select(username, id)
|
||||
|
||||
message(paste0("Executing the following query to get participants' IDs: \n", sql_render(query_participant_id)))
|
||||
|
||||
participant_data <- query_participant_id %>% collect()
|
||||
|
||||
dbDisconnect(dbEngine)
|
||||
|
||||
if(nrow(participant_data) == 0)
|
||||
warning(paste("We could not find requested usernames (", usernames, ") in ", participants_container))
|
||||
|
||||
return(participant_data)
|
||||
}
|
||||
|
||||
#' @description
|
||||
#' Gets participants' IDs for specified participant IDs
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param participants_ids A vector of numeric participant IDs
|
||||
#' @param device_id_container The name of the database table which will be used to determine distinct device ID. Ideally, a table that reliably contains data, but not too much.
|
||||
#' @return A dataframe with a row matching each distinct device ID with a participant ID
|
||||
|
||||
pull_participants_device_ids <- function(stream_parameters, participants_ids, device_id_container) {
|
||||
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
|
||||
query_device_id <- tbl(dbEngine, device_id_container) %>%
|
||||
filter(participant_id %in% !!participants_ids) %>%
|
||||
group_by(participant_id) %>%
|
||||
distinct(device_id, .keep_all = FALSE)
|
||||
|
||||
message(
|
||||
paste0(
|
||||
"Executing the following query to get the distinct device IDs: \n",
|
||||
sql_render(query_device_id),
|
||||
"\n NOTE: This might take a long time."
|
||||
)
|
||||
)
|
||||
|
||||
device_ids <- query_device_id %>% collect()
|
||||
|
||||
dbDisconnect(dbEngine)
|
||||
|
||||
if(nrow(device_ids) == 0)
|
||||
warning(paste("We could not find device IDs for requested participant IDs (", participants_ids, ") in ", device_id_container))
|
||||
|
||||
return(device_ids)
|
||||
}
|
||||
|
||||
#' @description
|
||||
#' Gets start and end datetimes for specified participant IDs.
|
||||
#'
|
||||
#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
|
||||
#' @param participants_ids A vector of numeric participant IDs
|
||||
#' @param start_end_date_container The name of the database table which will be used to determine when a participant started and ended their participation. Briefing and debriefing EMAs can be meaningfully used here.
|
||||
#' @return A dataframe relating participant IDs with their start and end datetimes.
|
||||
|
||||
pull_participants_start_end_dates <- function(stream_parameters, participants_ids, start_end_date_container) {
|
||||
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
|
||||
|
||||
query_timestamps <- tbl(dbEngine, start_end_date_container) %>%
|
||||
filter(
|
||||
participant_id %in% !!participants_ids,
|
||||
double_esm_user_answer_timestamp > 0
|
||||
) %>%
|
||||
group_by(participant_id) %>%
|
||||
summarise(
|
||||
timestamp_min = min(double_esm_user_answer_timestamp, na.rm = TRUE),
|
||||
timestamp_max = max(double_esm_user_answer_timestamp, na.rm = TRUE)
|
||||
) %>%
|
||||
select(participant_id, timestamp_min, timestamp_max)
|
||||
|
||||
message(paste0("Executing the following query to get the starting and ending datetimes: \n", sql_render(query_timestamps)))
|
||||
|
||||
start_end_timestamps <- query_timestamps %>% collect()
|
||||
|
||||
if(nrow(start_end_timestamps) == 0)
|
||||
warning(paste("We could not find datetimes for requested participant IDs (", participants_ids, ") in ", start_end_date_container))
|
||||
|
||||
start_end_times <- start_end_timestamps %>%
|
||||
mutate(
|
||||
datetime_start = as_datetime(timestamp_min/1000, tz = "UTC"),
|
||||
datetime_end = as_datetime(timestamp_max/1000, tz = "UTC")
|
||||
) %>%
|
||||
select(-c(timestamp_min, timestamp_max))
|
||||
|
||||
dbDisconnect(dbEngine)
|
||||
|
||||
return(start_end_times)
|
||||
}
|
||||
|
||||
|
|
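One detail worth showing from pull_participants_start_end_dates() above: ESM answer timestamps are stored in milliseconds, so they are divided by 1000 before conversion to UTC datetimes. A minimal sketch with made-up participant ids and timestamps:

```r
library(dplyr)
library(lubridate)

# Made-up participant ids and millisecond timestamps
start_end_timestamps <- tibble(
  participant_id = c(1L, 2L),
  timestamp_min  = c(1614556800000, 1617235200000),
  timestamp_max  = c(1617235200000, 1619827200000)
)

start_end_times <- start_end_timestamps %>%
  mutate(
    datetime_start = as_datetime(timestamp_min / 1000, tz = "UTC"),
    datetime_end   = as_datetime(timestamp_max / 1000, tz = "UTC")
  ) %>%
  select(-c(timestamp_min, timestamp_max))

print(start_end_times)  # e.g. 1614556800000 ms -> "2021-03-01 00:00:00 UTC"
```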
@ -1,372 +0,0 @@
|
|||
PHONE_ACCELEROMETER:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_VALUES_0: double_values_0
|
||||
DOUBLE_VALUES_1: double_values_1
|
||||
DOUBLE_VALUES_2: double_values_2
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_VALUES_0: double_values_0
|
||||
DOUBLE_VALUES_1: double_values_1
|
||||
DOUBLE_VALUES_2: double_values_2
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_ACTIVITY_RECOGNITION:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
ACTIVITY_NAME: activity_name
|
||||
ACTIVITY_TYPE: activity_type
|
||||
CONFIDENCE: confidence
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
ACTIVITY_NAME: FLAG_TO_MUTATE
|
||||
ACTIVITY_TYPE: FLAG_TO_MUTATE
|
||||
CONFIDENCE: FLAG_TO_MUTATE
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
ACTIVITIES: activities
|
||||
CONFIDENCE: confidence
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
- "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
|
||||
|
||||
PHONE_APPLICATIONS_CRASHES:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_name
|
||||
APPLICATION_NAME: application_name
|
||||
APPLICATION_VERSION: application_version
|
||||
ERROR_SHORT: error_short
|
||||
ERROR_LONG: error_long
|
||||
ERROR_CONDITION: error_condition
|
||||
IS_SYSTEM_APP: is_system_app
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_APPLICATIONS_FOREGROUND:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_hash
|
||||
APPLICATION_NAME: FLAG_TO_MUTATE
|
||||
IS_SYSTEM_APP: is_system_app
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS:
|
||||
- src/data/streams/mutations/phone/straw/app_add_name.R
|
||||
|
||||
PHONE_APPLICATIONS_NOTIFICATIONS:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_hash
|
||||
APPLICATION_NAME: FLAG_TO_MUTATE
|
||||
SOUND: sound
|
||||
VIBRATE: vibrate
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS:
|
||||
- src/data/streams/mutations/phone/straw/app_add_name.R
|
||||
|
||||
PHONE_BATTERY:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BATTERY_STATUS: battery_status
|
||||
BATTERY_LEVEL: battery_level
|
||||
BATTERY_SCALE: battery_scale
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BATTERY_STATUS: FLAG_TO_MUTATE
|
||||
BATTERY_LEVEL: battery_level
|
||||
BATTERY_SCALE: battery_scale
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
BATTERY_STATUS: battery_status
|
||||
SCRIPTS:
|
||||
- "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
|
||||
|
||||
PHONE_BLUETOOTH:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BT_ADDRESS: bt_address
|
||||
BT_NAME: bt_name
|
||||
BT_RSSI: bt_rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
BT_ADDRESS: bt_address
|
||||
BT_NAME: bt_name
|
||||
BT_RSSI: bt_rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_CALLS:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
CALL_TYPE: call_type
|
||||
CALL_DURATION: call_duration
|
||||
TRACE: trace
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
CALL_TYPE: FLAG_TO_MUTATE
|
||||
CALL_DURATION: call_duration
|
||||
TRACE: trace
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
CALL_TYPE: call_type
|
||||
SCRIPTS:
|
||||
- "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
|
||||
|
||||
PHONE_CONVERSATION:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_ENERGY: double_energy
|
||||
INFERENCE: inference
|
||||
DOUBLE_CONVO_START: double_convo_start
|
||||
DOUBLE_CONVO_END: double_convo_end
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_ENERGY: double_energy
|
||||
INFERENCE: inference
|
||||
DOUBLE_CONVO_START: double_convo_start
|
||||
DOUBLE_CONVO_END: double_convo_end
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
- "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
|
||||
|
||||
PHONE_ESM:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: double_esm_user_answer_timestamp
|
||||
DEVICE_ID: device_id
|
||||
ESM_STATUS: esm_status
|
||||
ESM_USER_ANSWER: esm_user_answer
|
||||
ESM_JSON: esm_json
|
||||
ESM_TRIGGER: esm_trigger
|
||||
ESM_SESSION: esm_session
|
||||
ESM_NOTIFICATION_ID: esm_notification_id
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS:
|
||||
|
||||
PHONE_KEYBOARD:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
PACKAGE_NAME: package_name
|
||||
BEFORE_TEXT: before_text
|
||||
CURRENT_TEXT: current_text
|
||||
IS_PASSWORD: is_password
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_LIGHT:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_LIGHT_LUX: double_light_lux
|
||||
ACCURACY: accuracy
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_LOCATIONS:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_LATITUDE: double_latitude
|
||||
DOUBLE_LONGITUDE: double_longitude
|
||||
DOUBLE_BEARING: double_bearing
|
||||
DOUBLE_SPEED: double_speed
|
||||
DOUBLE_ALTITUDE: double_altitude
|
||||
PROVIDER: provider
|
||||
ACCURACY: accuracy
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
DOUBLE_LATITUDE: double_latitude
|
||||
DOUBLE_LONGITUDE: double_longitude
|
||||
DOUBLE_BEARING: double_bearing
|
||||
DOUBLE_SPEED: double_speed
|
||||
DOUBLE_ALTITUDE: double_altitude
|
||||
PROVIDER: provider
|
||||
ACCURACY: accuracy
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_LOG:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
LOG_MESSAGE: log_message
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
LOG_MESSAGE: log_message
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_MESSAGES:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
MESSAGE_TYPE: message_type
|
||||
TRACE: trace
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_SCREEN:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SCREEN_STATUS: screen_status
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SCREEN_STATUS: FLAG_TO_MUTATE
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCREEN_STATUS: screen_status
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
- "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
|
||||
|
||||
PHONE_WIFI_CONNECTED:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
MAC_ADDRESS: mac_address
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
MAC_ADDRESS: mac_address
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_WIFI_VISIBLE:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
SECURITY: security
|
||||
FREQUENCY: frequency
|
||||
RSSI: rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SSID: ssid
|
||||
BSSID: bssid
|
||||
SECURITY: security
|
||||
FREQUENCY: frequency
|
||||
RSSI: rssi
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
PHONE_SPEECH:
|
||||
ANDROID:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SPEECH_PROPORTION: speech_proportion
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
IOS:
|
||||
RAPIDS_COLUMN_MAPPINGS:
|
||||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
SPEECH_PROPORTION: speech_proportion
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
||||
|
||||
|
||||
|
|
@ -2,16 +2,11 @@ from zipfile import ZipFile
|
|||
import warnings
|
||||
from pathlib import Path
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
from pandas.core import indexing
|
||||
import yaml
|
||||
import csv
|
||||
from collections import OrderedDict
|
||||
from io import BytesIO, StringIO
|
||||
import sys, os
|
||||
|
||||
from cr_features.hrv import get_HRV_features, get_patched_ibi_with_bvp
|
||||
from cr_features.helper_functions import empatica1d_to_array, empatica2d_to_array
|
||||
|
||||
def processAcceleration(x, y, z):
|
||||
x = float(x)
|
||||
|
@ -57,8 +52,6 @@ def extract_empatica_data(data, sensor):
|
|||
df = pd.DataFrame.from_dict(ddict, orient='index', columns=[column])
|
||||
df[column] = df[column].astype(float)
|
||||
df.index.name = 'timestamp'
|
||||
if df.empty:
|
||||
return df
|
||||
|
||||
elif sensor == 'EMPATICA_ACCELEROMETER':
|
||||
ddict = readFile(sensor_data_file, sensor)
|
||||
|
@ -67,16 +60,9 @@ def extract_empatica_data(data, sensor):
|
|||
df['y'] = df['y'].astype(float)
|
||||
df['z'] = df['z'].astype(float)
|
||||
df.index.name = 'timestamp'
|
||||
if df.empty:
|
||||
return df
|
||||
|
||||
elif sensor == 'EMPATICA_INTER_BEAT_INTERVAL':
|
||||
|
||||
df = pd.read_csv(sensor_data_file, names=['timings', column], header=None)
|
||||
df['timestamp'] = df['timings']
|
||||
if df.empty:
|
||||
df = df.set_index('timestamp')
|
||||
return df
|
||||
df = pd.read_csv(sensor_data_file, names=['timestamp', column], header=None)
|
||||
timestampstart = float(df['timestamp'][0])
|
||||
df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
|
||||
df = df.drop([0])
|
||||
|
@ -98,10 +84,6 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
|
|||
participant_data = pd.DataFrame(columns=columns_to_download.values())
|
||||
participant_data.set_index('timestamp', inplace=True)
|
||||
|
||||
with open('config.yaml', 'r') as stream:
|
||||
config = yaml.load(stream, Loader=yaml.FullLoader)
|
||||
cr_ibi_provider = config['EMPATICA_INTER_BEAT_INTERVAL']['PROVIDERS']['CR']
|
||||
|
||||
available_zipfiles = list((Path(data_configuration["FOLDER"]) / Path(device)).rglob("*.zip"))
|
||||
if len(available_zipfiles) == 0:
|
||||
warnings.warn("There were no zip files in: {}. If you were expecting data for this participant the [EMPATICA][DEVICE_IDS] key in their participant file is missing the pid".format((Path(data_configuration["FOLDER"]) / Path(device))))
|
||||
|
@ -112,13 +94,7 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
|
|||
listOfFileNames = zipFile.namelist()
|
||||
for fileName in listOfFileNames:
|
||||
if fileName == sensor_csv:
|
||||
if sensor == "EMPATICA_INTER_BEAT_INTERVAL" and cr_ibi_provider.get('PATCH_WITH_BVP', False):
|
||||
participant_data = \
|
||||
pd.concat([participant_data, patch_ibi_with_bvp(zipFile.read('IBI.csv'), zipFile.read('BVP.csv'))], axis=0)
|
||||
#print("patch with ibi")
|
||||
else:
|
||||
participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName), sensor)], axis=0)
|
||||
#print("no patching")
|
||||
warning = False
|
||||
if warning:
|
||||
warnings.warn("We could not find a zipped file for {} in {} (we tried to find {})".format(sensor, zipFile, sensor_csv))
|
||||
|
@ -129,54 +105,4 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
|
|||
participant_data["device_id"] = device
|
||||
return(participant_data)
|
||||
|
||||
def patch_ibi_with_bvp(ibi_data, bvp_data):
|
||||
ibi_data_file = BytesIO(ibi_data).getvalue().decode('utf-8')
|
||||
ibi_data_file = StringIO(ibi_data_file)
|
||||
|
||||
# Begin with the cr-features part
|
||||
try:
|
||||
ibi_data, ibi_start_timestamp = empatica2d_to_array(ibi_data_file)
|
||||
except (IndexError, KeyError) as e:
|
||||
# Checks whether IBI.csv is empty
|
||||
# It may raise a KeyError if df is empty here: startTimeStamp = df.time[0]
|
||||
df_test = pd.read_csv(ibi_data_file, names=['timings', 'inter_beat_interval'], header=None)
|
||||
if df_test.empty:
|
||||
df_test['timestamp'] = df_test['timings']
|
||||
df_test = df_test.set_index('timestamp')
|
||||
return df_test
|
||||
else:
|
||||
raise IndexError("Something went wrong with indices. Error that was previously caught:\n", repr(e))
|
||||
|
||||
bvp_data_file = BytesIO(bvp_data).getvalue().decode('utf-8')
|
||||
bvp_data_file = StringIO(bvp_data_file)
|
||||
|
||||
bvp_data, bvp_start_timestamp, sample_rate = empatica1d_to_array(bvp_data_file)
|
||||
|
||||
hrv_time_and_freq_features, sample, bvp_rr, bvp_timings, peak_indx = \
|
||||
get_HRV_features(bvp_data, ma=False,
|
||||
detrend=False, m_deternd=False, low_pass=False, winsorize=True,
|
||||
winsorize_value=25, hampel_fiter=False, median_filter=False,
|
||||
mod_z_score_filter=True, sampling=64, feature_names=['meanHr'])
|
||||
|
||||
ibi_timings, ibi_rr = get_patched_ibi_with_bvp(ibi_data[0], ibi_data[1], bvp_timings, bvp_rr)
|
||||
|
||||
df = \
|
||||
pd.DataFrame(np.array([ibi_timings, ibi_rr]).transpose(), columns=['timestamp', 'inter_beat_interval'])
|
||||
df.loc[-1] = [ibi_start_timestamp, 'IBI'] # adding a row
|
||||
df.index = df.index + 1 # shifting index
|
||||
df = df.sort_index() # sorting by index
|
||||
|
||||
# Repeated as in extract_empatica_data for IBI
|
||||
df['timings'] = df['timestamp']
|
||||
timestampstart = float(df['timestamp'][0])
|
||||
df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
|
||||
df = df.drop([0])
|
||||
df['inter_beat_interval'] = df['inter_beat_interval'].astype(float)
|
||||
df = df.set_index('timestamp')
|
||||
|
||||
# format timestamps
|
||||
df.index *= 1000
|
||||
df.index = df.index.astype(int)
|
||||
return(df)
|
||||
|
||||
# print(pull_data({'FOLDER': 'data/external/empatica'}, "e01", "EMPATICA_accelerometer", {'TIMESTAMP': 'timestamp', 'DEVICE_ID': 'device_id', 'DOUBLE_VALUES_0': 'x', 'DOUBLE_VALUES_1': 'y', 'DOUBLE_VALUES_2': 'z'}))
|
|
@ -50,7 +50,6 @@ EMPATICA_INTER_BEAT_INTERVAL:
|
|||
TIMESTAMP: timestamp
|
||||
DEVICE_ID: device_id
|
||||
INTER_BEAT_INTERVAL: inter_beat_interval
|
||||
TIMINGS: timings
|
||||
MUTATION:
|
||||
COLUMN_MAPPINGS:
|
||||
SCRIPTS: # List any python or r scripts that mutate your raw data
|
||||
|
|
|
@ -39,7 +39,7 @@ unify_ios_calls <- function(ios_calls){
                                             assigned_segments = first(assigned_segments))
    }
    else {
      ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(as.numeric(call_duration)), timestamp = first(timestamp), device_id = first(device_id))
      ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(call_duration), timestamp = first(timestamp), device_id = first(device_id))
    }
    ios_calls <- ios_calls %>% mutate(call_type = case_when(
      call_type_sequence == "1,2,4" | call_type_sequence == "2,1,4" ~ 1, # incoming
@ -1,8 +0,0 @@
source("renv/activate.R") # needed to use RAPIDS renv environment
library(dplyr)

main <- function(data, stream_parameters){
  data <- data %>%
    mutate(application_name = "hashed")
  return(data)
}
@ -1,5 +0,0 @@
import pandas as pd

def main(data, stream_parameters):
    data["application_name"] = "hashed"
    return(data)
@ -35,8 +35,11 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
|
|||
- DEVICE_ID
|
||||
- PACKAGE_NAME
|
||||
- APPLICATION_NAME
|
||||
- TEXT
|
||||
- SOUND
|
||||
- VIBRATE
|
||||
- DEFAULTS
|
||||
- FLAGS
|
||||
|
||||
PHONE_BATTERY:
|
||||
- TIMESTAMP
|
||||
|
@ -67,16 +70,6 @@ PHONE_CONVERSATION:
|
|||
- DOUBLE_CONVO_START
|
||||
- DOUBLE_CONVO_END
|
||||
|
||||
PHONE_ESM:
|
||||
- TIMESTAMP
|
||||
- DEVICE_ID
|
||||
- ESM_STATUS
|
||||
- ESM_USER_ANSWER
|
||||
- ESM_JSON
|
||||
- ESM_TRIGGER
|
||||
- ESM_SESSION
|
||||
- ESM_NOTIFICATION_ID
|
||||
|
||||
PHONE_KEYBOARD:
|
||||
- TIMESTAMP
|
||||
- DEVICE_ID
|
||||
|
@ -118,11 +111,6 @@ PHONE_SCREEN:
|
|||
- DEVICE_ID
|
||||
- SCREEN_STATUS
|
||||
|
||||
PHONE_SPEECH:
|
||||
- TIMESTAMP
|
||||
- DEVICE_ID
|
||||
- SPEECH_PROPORTION
|
||||
|
||||
PHONE_WIFI_CONNECTED:
|
||||
- TIMESTAMP
|
||||
- DEVICE_ID
|
||||
|
@ -232,7 +220,6 @@ EMPATICA_INTER_BEAT_INTERVAL:
|
|||
- TIMESTAMP
|
||||
- DEVICE_ID
|
||||
- INTER_BEAT_INTERVAL
|
||||
- TIMINGS
|
||||
|
||||
EMPATICA_TAGS:
|
||||
- TIMESTAMP
|
||||
|
|
|
@ -1,62 +0,0 @@
|
|||
source("renv/activate.R")
|
||||
source("src/data/streams/aware_postgresql/container.R")
|
||||
|
||||
library(RPostgres)
|
||||
library(magrittr)
|
||||
library(tidyverse)
|
||||
library(lubridate)
|
||||
|
||||
prepare_participants_file <- function() {
|
||||
|
||||
username_list_csv_location <- snakemake@input[["username_list"]]
|
||||
|
||||
data_configuration <- snakemake@params[["data_configuration"]]
|
||||
participants_container <- snakemake@params[["participants_table"]]
|
||||
device_id_container <- snakemake@params[["device_id_table"]]
|
||||
start_end_date_container <- snakemake@params[["start_end_date_table"]]
|
||||
|
||||
output_data_file <- snakemake@output[["participants_file"]]
|
||||
|
||||
platform <- "android"
|
||||
pid_format <- "p%03d"
|
||||
datetime_format <- "%Y-%m-%d %H:%M:%S"
|
||||
|
||||
participant_data <- read_csv(username_list_csv_location, col_types = "cc", progress = FALSE)
|
||||
usernames <- participant_data$label
|
||||
|
||||
participant_ids <- pull_participants_ids(data_configuration, usernames, participants_container)
|
||||
participant_data %<>%
|
||||
left_join(participant_ids, by = c("label" = "username")) %>%
|
||||
rename(participant_id = id)
|
||||
|
||||
device_ids <- pull_participants_device_ids(data_configuration, participant_data$participant_id, device_id_container)
|
||||
device_ids %<>%
|
||||
filter(device_id != "") %>%
|
||||
group_by(participant_id) %>%
|
||||
summarise(device_ids = list(unique(device_id)))
|
||||
participant_data %<>%
|
||||
left_join(device_ids, by = "participant_id")
|
||||
|
||||
start_end_datetimes <- pull_participants_start_end_dates(data_configuration, participant_data$participant_id, start_end_date_container)
|
||||
participant_data %<>%
|
||||
left_join(start_end_datetimes, by = "participant_id")
|
||||
|
||||
participant_data %<>%
|
||||
mutate(
|
||||
pid = sprintf(pid_format, participant_id),
|
||||
start_date = strftime(datetime_start, format=datetime_format, tz = "UTC", usetz = FALSE), #TODO Check what timezone is expected
|
||||
end_date = strftime(datetime_end, format=datetime_format, tz = "UTC", usetz = FALSE),
|
||||
device_id = map_chr(device_ids, str_c, collapse = ";"),
|
||||
number_of_devices = map_int(device_ids, length),
|
||||
fitbit_id = ""
|
||||
) %>%
|
||||
rowwise() %>%
|
||||
mutate(platform = str_c(replicate(number_of_devices, platform), collapse = ";")) %>%
|
||||
ungroup() %>%
|
||||
arrange(pid) %>%
|
||||
select(pid, label, start_date, end_date, empatica_id, device_id, platform, fitbit_id)
|
||||
|
||||
write_csv(participant_data, output_data_file)
|
||||
}
|
||||
|
||||
prepare_participants_file()
|
|
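As a small illustration of the formatting logic above: pids are zero-padded with sprintf and multiple device ids per participant are collapsed with ";". The participant ids and device ids below are toy values.

```r
library(tidyverse)

participant_data <- tibble(
  participant_id = c(7L, 12L),                      # made-up ids
  device_ids     = list("dev-a", c("dev-b", "dev-c"))
)

participant_data %>%
  mutate(
    pid               = sprintf("p%03d", participant_id),
    device_id         = map_chr(device_ids, str_c, collapse = ";"),
    number_of_devices = map_int(device_ids, length)
  )
# p007 -> "dev-a"; p012 -> "dev-b;dev-c"
```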
@ -12,7 +12,7 @@ rapids_cleaning <- function(sensor_data_files, provider){
  cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
  drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
  rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
  data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
  data_yield_unit <- tolower(provider[["DATA_YIELD_UNIT"]])
  data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
  data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
  drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]

@ -39,18 +39,16 @@ rapids_cleaning <- function(sensor_data_files, provider){
  if(!data_yield_column %in% colnames(clean_features)){
    stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
  }
  if (data_yield_ratio_threshold > 0) {
    clean_features <- clean_features %>%
      filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
  }

  # Drop columns with a percentage of NA values above cols_nan_threshold
  if(nrow(clean_features))
    clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
    clean_features <- clean_features %>% select_if(~ sum(is.na(.)) / length(.) <= cols_nan_threshold )

  # Drop columns with zero variance
  if(drop_zero_variance_columns)
    clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
    clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime",names(.)) | sapply(., n_distinct, na.rm = T) > 1)

  # Drop highly correlated features
  if(as.logical(drop_highly_correlated_features$COMPUTE)){
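To make the two column filters above concrete, here is a toy sketch (made-up feature names and threshold) of how the NA-ratio and zero-variance filters behave while segment and ESM columns are kept; it mirrors the select()/select_if() calls shown in the hunk.

```r
library(dplyr)

cols_nan_threshold <- 0.3
clean_features <- tibble(
  local_segment      = c("s1", "s2", "s3"),
  phone_esm_straw_pa = c(2, 3, 1),
  mostly_missing     = c(NA, NA, 1),  # 2/3 NA -> dropped by the NA-ratio filter
  constant_feature   = c(5, 5, 5)     # zero variance -> dropped by the variance filter
)

# Drop columns with too many NAs, but always keep ESM (target) columns
clean_features <- clean_features %>%
  select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold), starts_with("phone_esm"))

# Drop zero-variance columns, but always keep id/segment/ESM columns
clean_features <- clean_features %>%
  select_if(grepl("pid|local_segment|phone_esm", names(.)) | sapply(., n_distinct, na.rm = TRUE) > 1)

print(clean_features)  # only local_segment and phone_esm_straw_pa remain
```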
@ -1,180 +0,0 @@
|
|||
import pandas as pd
|
||||
import numpy as np
|
||||
import math, sys, random
|
||||
import yaml
|
||||
|
||||
from sklearn.impute import KNNImputer
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
import matplotlib.pyplot as plt
|
||||
import seaborn as sns
|
||||
|
||||
sys.path.append('/rapids/')
|
||||
from src.features import empatica_data_yield as edy
|
||||
|
||||
pd.set_option('display.max_columns', 20)
|
||||
|
||||
def straw_cleaning(sensor_data_files, provider):
|
||||
|
||||
features = pd.read_csv(sensor_data_files["sensor_data"][0])
|
||||
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
|
||||
|
||||
with open('config.yaml', 'r') as stream:
|
||||
config = yaml.load(stream, Loader=yaml.FullLoader)
|
||||
|
||||
excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
|
||||
|
||||
# (1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
|
||||
if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
|
||||
target = config['PARAMS_FOR_ANALYSIS']['TARGET']['LABEL'] # get target label from config
|
||||
if 'phone_esm_straw_' + target in features:
|
||||
features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
|
||||
else:
|
||||
return features
|
||||
|
||||
# (2.1) QUALITY CHECK (DATA YIELD COLUMN) deletes the rows where E4 or phone data is low quality
|
||||
phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
|
||||
phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
|
||||
|
||||
if features.empty:
|
||||
return features
|
||||
|
||||
features = edy.calculate_empatica_data_yield(features)
|
||||
|
||||
if not phone_data_yield_column in features.columns and not "empatica_data_yield" in features.columns:
|
||||
raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{data_yield_unit}' in [FEATURES].")
|
||||
|
||||
# Drop rows where phone data yield is less than the given threshold
|
||||
if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
|
||||
features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
|
||||
|
||||
# Drop rows where empatica data yield is less than the given threshold
|
||||
if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
|
||||
features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
|
||||
|
||||
if features.empty:
|
||||
return features
|
||||
|
||||
# (2.2) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
|
||||
min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
|
||||
features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
|
||||
|
||||
# (3) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (use <= if columns consisting entirely of NaNs must be preserved; as written, all-NaN columns are dropped)
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
|
||||
|
||||
features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
|
||||
|
||||
# Preserve esm cols if deleted (has to come after drop cols operations)
|
||||
for esm in esm_cols:
|
||||
if esm not in features:
|
||||
features[esm] = esm_cols[esm]
|
||||
|
||||
# (4) CONTEXTUAL IMPUTATION
|
||||
|
||||
# Impute selected phone features with a high number
|
||||
impute_w_hn = [col for col in features.columns if \
|
||||
"timeoffirstuse" in col or
|
||||
"timeoflastuse" in col or
|
||||
"timefirstcall" in col or
|
||||
"timelastcall" in col or
|
||||
"firstuseafter" in col or
|
||||
"timefirstmessages" in col or
|
||||
"timelastmessages" in col]
|
||||
features[impute_w_hn] = features[impute_w_hn].fillna(1500)
|
||||
|
||||
|
||||
# Impute special case (mostcommonactivity) and (homelabel)
|
||||
impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
|
||||
features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
|
||||
|
||||
impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
|
||||
features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
|
||||
|
||||
impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
|
||||
features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocationvariance
|
||||
|
||||
|
||||
# Impute selected phone features with 0
|
||||
impute_zero = [col for col in features if \
|
||||
col.startswith('phone_applications_foreground_rapids_') or
|
||||
col.startswith('phone_battery_rapids_') or
|
||||
col.startswith('phone_bluetooth_rapids_') or
|
||||
col.startswith('phone_light_rapids_') or
|
||||
col.startswith('phone_calls_rapids_') or
|
||||
col.startswith('phone_messages_rapids_') or
|
||||
col.startswith('phone_screen_rapids_') or
|
||||
col.startswith('phone_wifi_visible')]
|
||||
|
||||
features[impute_zero+list(esm_cols.columns)] = features[impute_zero+list(esm_cols.columns)].fillna(0)
|
||||
|
||||
## (5) STANDARDIZATION
|
||||
if provider["STANDARDIZATION"]:
|
||||
features.loc[:, ~features.columns.isin(excluded_columns)] = StandardScaler().fit_transform(features.loc[:, ~features.columns.isin(excluded_columns)])
|
||||
|
||||
# (6) IMPUTATION: IMPUTE DATA WITH KNN METHOD
|
||||
impute_cols = [col for col in features.columns if col not in excluded_columns]
|
||||
features.reset_index(drop=True, inplace=True)
|
||||
features[impute_cols] = impute(features[impute_cols], method="knn")
|
||||
|
||||
# (7) REMOVE COLS WHERE VARIANCE IS 0
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
|
||||
|
||||
if provider["COLS_VAR_THRESHOLD"]:
|
||||
features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
|
||||
|
||||
fe5 = features.copy()
|
||||
|
||||
# (8) DROP HIGHLY CORRELATED FEATURES
|
||||
drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
|
||||
if drop_corr_features["COMPUTE"] and features.shape[0]: # If small amount of segments (rows) is present, do not execute correlation check
|
||||
|
||||
numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
|
||||
|
||||
# Remove columns where NaN count threshold is passed
|
||||
valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
|
||||
|
||||
corr_matrix = valid_features.corr().abs()
|
||||
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
|
||||
to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
|
||||
|
||||
features.drop(to_drop, axis=1, inplace=True)
|
||||
|
||||
# Preserve esm cols if deleted (has to come after drop cols operations)
|
||||
for esm in esm_cols:
|
||||
if esm not in features:
|
||||
features[esm] = esm_cols[esm]
|
||||
|
||||
# (9) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
|
||||
if features.isna().any().any():
|
||||
raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
|
||||
|
||||
return features
|
||||
|
||||
|
||||
def k_nearest(df):
|
||||
pd.set_option('display.max_columns', None)
|
||||
imputer = KNNImputer(n_neighbors=3)
|
||||
return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
|
||||
|
||||
|
||||
def impute(df, method='zero'):
|
||||
|
||||
return {
|
||||
'zero': df.fillna(0),
|
||||
'high_number': df.fillna(1500),
|
||||
'mean': df.fillna(df.mean()),
|
||||
'median': df.fillna(df.median()),
|
||||
'knn': k_nearest(df)
|
||||
}[method]
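# (Editor's note) The dictionary above evaluates every branch eagerly, so k_nearest(df)
# runs even when a cheaper method such as 'zero' is requested. A minimal lazy-dispatch
# sketch (an illustrative alternative, not part of the original pipeline) could be:
def impute_lazy(df, method='zero'):
    dispatch = {
        'zero': lambda d: d.fillna(0),
        'high_number': lambda d: d.fillna(1500),
        'mean': lambda d: d.fillna(d.mean()),
        'median': lambda d: d.fillna(d.median()),
        'knn': k_nearest,  # defined above; runs only when explicitly requested
    }
    return dispatch[method](df)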
|
||||
|
||||
|
||||
def graph_bf_af(features, phase_name, plt_flag=False):
|
||||
if plt_flag:
|
||||
sns.set(rc={"figure.figsize":(16, 8)})
|
||||
sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
|
||||
plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
|
||||
|
||||
print(f"\n-------------{phase_name}-------------")
|
||||
print("Rows number:", features.shape[0])
|
||||
print("Columns number:", len(features.columns))
|
||||
print("---------------------------------------------\n")
|
|
@ -12,7 +12,7 @@ rapids_cleaning <- function(sensor_data_files, provider){
|
|||
cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
|
||||
drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
|
||||
rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
|
||||
data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
|
||||
data_yield_unit <- tolower(provider[["DATA_YIELD_UNIT"]])
|
||||
data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
|
||||
data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
|
||||
drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
|
||||
|
@ -39,18 +39,16 @@ rapids_cleaning <- function(sensor_data_files, provider){
|
|||
if(!data_yield_column %in% colnames(clean_features)){
|
||||
stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
|
||||
}
|
||||
if (data_yield_ratio_threshold > 0) {
|
||||
clean_features <- clean_features %>%
|
||||
filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
|
||||
}
|
||||
|
||||
# Drop columns with a percentage of NA values above cols_nan_threshold
|
||||
if(nrow(clean_features))
|
||||
clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
|
||||
clean_features <- clean_features %>% select_if(~ sum(is.na(.)) / length(.) <= cols_nan_threshold )
|
||||
|
||||
# Drop columns with zero variance
|
||||
if(drop_zero_variance_columns)
|
||||
clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
|
||||
clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
|
||||
|
||||
# Drop highly correlated features
|
||||
if(as.logical(drop_highly_correlated_features$COMPUTE)){
|
||||
|
|
|
@ -1,275 +0,0 @@
|
|||
import pandas as pd
|
||||
import numpy as np
|
||||
import math, sys, random, warnings, yaml
|
||||
|
||||
from sklearn.impute import KNNImputer
|
||||
from sklearn.preprocessing import StandardScaler, minmax_scale
|
||||
import matplotlib.pyplot as plt
|
||||
import seaborn as sns
|
||||
|
||||
sys.path.append('/rapids/')
|
||||
from src.features import empatica_data_yield as edy
|
||||
|
||||
def straw_cleaning(sensor_data_files, provider, target):
|
||||
|
||||
features = pd.read_csv(sensor_data_files["sensor_data"][0])
|
||||
|
||||
with open('config.yaml', 'r') as stream:
|
||||
config = yaml.load(stream, Loader=yaml.FullLoader)
|
||||
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
|
||||
|
||||
excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
|
||||
|
||||
graph_bf_af(features, "1target_rows_before")
|
||||
|
||||
# (1.0) OVERRIDE STRESSFULNESS EVENT TARGETS IF ERS SEGMENTING_METHOD IS "STRESS_EVENT"
|
||||
if config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] == "stress_event":
|
||||
|
||||
stress_events_targets = pd.read_csv("data/external/stress_event_targets.csv")
|
||||
|
||||
if "appraisal_stressfulness_event_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
|
||||
features.drop(columns=['phone_esm_straw_appraisal_stressfulness_event_mean'], inplace=True)
|
||||
features = features.merge(stress_events_targets[["label", "appraisal_stressfulness_event"]] \
|
||||
.rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
|
||||
.rename(columns={'appraisal_stressfulness_event': 'phone_esm_straw_appraisal_stressfulness_event_mean'})
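# (Editor's note) In other words, the aggregated ESM mean target is replaced by the
# per-event appraisal value from stress_event_targets.csv, joined on the segment label.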
|
||||
|
||||
if "appraisal_threat_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
|
||||
features.drop(columns=['phone_esm_straw_appraisal_threat_mean'], inplace=True)
|
||||
features = features.merge(stress_events_targets[["label", "appraisal_threat"]] \
|
||||
.rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
|
||||
.rename(columns={'appraisal_threat': 'phone_esm_straw_appraisal_threat_mean'})
|
||||
|
||||
if "appraisal_challenge_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
|
||||
features.drop(columns=['phone_esm_straw_appraisal_challenge_mean'], inplace=True)
|
||||
features = features.merge(stress_events_targets[["label", "appraisal_challenge"]] \
|
||||
.rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
|
||||
.rename(columns={'appraisal_challenge': 'phone_esm_straw_appraisal_challenge_mean'})
|
||||
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
|
||||
|
||||
# (1.1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
|
||||
if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
|
||||
features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
|
||||
|
||||
if features.empty:
|
||||
return pd.DataFrame(columns=excluded_columns)
|
||||
|
||||
graph_bf_af(features, "2target_rows_after")
|
||||
|
||||
# (2) QUALITY CHECK (DATA YIELD COLUMN) drops the rows where E4 or phone data is low quality
|
||||
phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
|
||||
phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
|
||||
|
||||
features = edy.calculate_empatica_data_yield(features)
|
||||
|
||||
if phone_data_yield_column not in features.columns or "empatica_data_yield" not in features.columns:
|
||||
raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{data_yield_unit}' in [FEATURES].")
|
||||
|
||||
hist = features[["empatica_data_yield", phone_data_yield_column]].hist()
|
||||
plt.savefig(f'phone_E4_histogram.png', bbox_inches='tight')
|
||||
|
||||
# Drop rows where phone data yield is less than the given threshold
|
||||
if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
|
||||
hist = features[phone_data_yield_column].hist(bins=5)
|
||||
plt.close()
|
||||
features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
|
||||
|
||||
# Drop rows where empatica data yield is less than the given threshold
|
||||
if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
|
||||
features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
|
||||
|
||||
if features.empty:
|
||||
return pd.DataFrame(columns=excluded_columns)
|
||||
|
||||
graph_bf_af(features, "3data_yield_drop_rows")
|
||||
|
||||
if features.empty:
|
||||
return pd.DataFrame(columns=excluded_columns)
|
||||
|
||||
|
||||
# (3) CONTEXTUAL IMPUTATION
|
||||
|
||||
# Impute selected phone features with a high number
|
||||
impute_w_hn = [col for col in features.columns if \
|
||||
"timeoffirstuse" in col or
|
||||
"timeoflastuse" in col or
|
||||
"timefirstcall" in col or
|
||||
"timelastcall" in col or
|
||||
"firstuseafter" in col or
|
||||
"timefirstmessages" in col or
|
||||
"timelastmessages" in col]
|
||||
features[impute_w_hn] = features[impute_w_hn].fillna(1500)
|
||||
|
||||
# Impute special case (mostcommonactivity) and (homelabel)
|
||||
impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
|
||||
features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
|
||||
|
||||
impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
|
||||
features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
|
||||
|
||||
impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
|
||||
features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
|
||||
|
||||
# Location features to be imputed with 0 below (radius of gyration excluded)
|
||||
impute_locations = [col for col in features \
|
||||
if col.startswith('phone_locations_doryab_') and
|
||||
'radiusgyration' not in col
|
||||
]
|
||||
|
||||
# Impute selected phone, location, and esm features with 0
|
||||
impute_zero = [col for col in features if \
|
||||
col.startswith('phone_applications_foreground_rapids_') or
|
||||
col.startswith('phone_activity_recognition_') or
|
||||
col.startswith('phone_battery_rapids_') or
|
||||
col.startswith('phone_bluetooth_rapids_') or
|
||||
col.startswith('phone_light_rapids_') or
|
||||
col.startswith('phone_calls_rapids_') or
|
||||
col.startswith('phone_messages_rapids_') or
|
||||
col.startswith('phone_screen_rapids_') or
|
||||
col.startswith('phone_bluetooth_doryab_') or
|
||||
col.startswith('phone_wifi_visible')
|
||||
]
|
||||
|
||||
features[impute_zero+impute_locations+list(esm_cols.columns)] = features[impute_zero+impute_locations+list(esm_cols.columns)].fillna(0)
|
||||
|
||||
pd.set_option('display.max_rows', None)
|
||||
|
||||
graph_bf_af(features, "4context_imp")
|
||||
|
||||
# (4) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (use <= if all-NaN columns must be preserved; as written, columns made up entirely of NaNs are dropped)
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
|
||||
|
||||
features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
|
||||
|
||||
graph_bf_af(features, "5too_much_nans_cols")
|
||||
# (5) REMOVE COLS WHERE VARIANCE IS 0
|
||||
|
||||
if provider["COLS_VAR_THRESHOLD"]:
|
||||
features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
|
||||
|
||||
graph_bf_af(features, "6variance_drop")
|
||||
|
||||
# Preserve esm cols if deleted (has to come after drop cols operations)
|
||||
for esm in esm_cols:
|
||||
if esm not in features:
|
||||
features[esm] = esm_cols[esm]
|
||||
|
||||
# (6) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
|
||||
min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimum number of non-NaN values required per row
|
||||
features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
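# (Editor's note) Example: with ROWS_NAN_THRESHOLD = 0.3 and 100 feature columns,
# min_count = ceil(0.7 * 100) = 70, so rows with fewer than 70 non-NaN values are dropped.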
|
||||
|
||||
graph_bf_af(features, "7too_much_nans_rows")
|
||||
|
||||
if features.empty:
|
||||
return pd.DataFrame(columns=excluded_columns)
|
||||
|
||||
# (7) STANDARDIZATION
|
||||
if provider["STANDARDIZATION"]:
|
||||
nominal_cols = [col for col in features.columns if "mostcommonactivity" in col or "homelabel" in col] # Excluded nominal features
|
||||
# Expected warning within this code block
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter("ignore", category=RuntimeWarning)
|
||||
if provider["TARGET_STANDARDIZATION"]:
|
||||
features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols)] = \
|
||||
features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols)].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
|
||||
else:
|
||||
features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols + ['phone_esm_straw_' + target])] = \
|
||||
features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols + ['phone_esm_straw_' + target])].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
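# (Editor's note) The groupby('pid').transform(...) above z-scores every remaining feature
# within each participant separately; since StandardScaler uses the population standard
# deviation, this is equivalent to (x - x.mean()) / x.std(ddof=0) per pid, column by column.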
|
||||
|
||||
graph_bf_af(features, "8standardization")
|
||||
|
||||
# (8) IMPUTATION: IMPUTE DATA WITH KNN METHOD
|
||||
features.reset_index(drop=True, inplace=True)
|
||||
impute_cols = [col for col in features.columns if col not in excluded_columns and col != "pid"]
|
||||
|
||||
features[impute_cols] = impute(features[impute_cols], method="knn")
|
||||
|
||||
graph_bf_af(features, "9knn_after")
|
||||
|
||||
|
||||
# (9) DROP HIGHLY CORRELATED FEATURES
|
||||
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
|
||||
|
||||
drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
|
||||
if drop_corr_features["COMPUTE"] and features.shape[0] > 5: # If small amount of segments (rows) is present, do not execute correlation check
|
||||
|
||||
numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
|
||||
|
||||
# Remove columns where NaN count threshold is passed
|
||||
valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
|
||||
|
||||
corr_matrix = valid_features.corr().abs()
|
||||
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
|
||||
to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
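# (Editor's note) Only the upper triangle is inspected so each correlated pair is counted
# once; e.g., if corr(f1, f2) = 0.97 exceeds CORR_THRESHOLD, the later column f2 is
# dropped while f1 is kept.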
|
||||
|
||||
# sns.heatmap(corr_matrix, cmap="YlGnBu")
|
||||
# plt.savefig(f'correlation_matrix.png', bbox_inches='tight')
|
||||
# plt.close()
|
||||
|
||||
# s = corr_matrix.unstack()
|
||||
# so = s.sort_values(ascending=False)
|
||||
|
||||
# pd.set_option('display.max_rows', None)
|
||||
# sorted_upper = upper.unstack().sort_values(ascending=False)
|
||||
# print(sorted_upper[sorted_upper > drop_corr_features["CORR_THRESHOLD"]])
|
||||
|
||||
features.drop(to_drop, axis=1, inplace=True)
|
||||
|
||||
# Preserve esm cols if deleted (has to come after drop cols operations)
|
||||
for esm in esm_cols:
|
||||
if esm not in features:
|
||||
features[esm] = esm_cols[esm]
|
||||
|
||||
graph_bf_af(features, "10correlation_drop")
|
||||
|
||||
# Transform categorical columns to category dtype
|
||||
|
||||
cat1 = [col for col in features.columns if "mostcommonactivity" in col]
|
||||
if cat1: # Transform columns to category dtype (mostcommonactivity)
|
||||
features[cat1] = features[cat1].astype(int).astype('category')
|
||||
|
||||
cat2 = [col for col in features.columns if "homelabel" in col]
|
||||
if cat2: # Transform columns to category dtype (homelabel)
|
||||
features[cat2] = features[cat2].astype(int).astype('category')
|
||||
|
||||
# (10) DROP ALL WINDOW RELATED COLUMNS
|
||||
win_count_cols = [col for col in features if "SO_windowsCount" in col]
|
||||
if win_count_cols:
|
||||
features.drop(columns=win_count_cols, inplace=True)
|
||||
|
||||
# (11) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
|
||||
if features.isna().any().any():
|
||||
raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
|
||||
|
||||
|
||||
return features
|
||||
|
||||
|
||||
def k_nearest(df):
|
||||
imputer = KNNImputer(n_neighbors=3)
|
||||
return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
|
||||
|
||||
|
||||
def impute(df, method='zero'):
|
||||
|
||||
return {
|
||||
'zero': df.fillna(0),
|
||||
'high_number': df.fillna(1500),
|
||||
'mean': df.fillna(df.mean()),
|
||||
'median': df.fillna(df.median()),
|
||||
'knn': k_nearest(df)
|
||||
}[method]
|
||||
|
||||
|
||||
def graph_bf_af(features, phase_name, plt_flag=False):
|
||||
if plt_flag:
|
||||
sns.set(rc={"figure.figsize":(16, 8)})
|
||||
sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
|
||||
plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
|
||||
|
||||
print(f"\n-------------{phase_name}-------------")
|
||||
print("Rows number:", features.shape[0])
|
||||
print("Columns number:", len(features.columns))
|
||||
print("NaN values:", features.isna().sum().sum())
|
||||
print("---------------------------------------------\n")
|
|
@ -1,59 +0,0 @@
|
|||
import pandas as pd
|
||||
import numpy as np
|
||||
import math as m
|
||||
|
||||
import sys
|
||||
|
||||
def extract_second_order_features(intraday_features, so_features_names, prefix=""):
|
||||
|
||||
if prefix:
|
||||
groupby_cols = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
|
||||
else:
|
||||
groupby_cols = ['local_segment']
|
||||
|
||||
if not intraday_features.empty:
|
||||
so_features = pd.DataFrame()
|
||||
#print(intraday_features.drop("level_1", axis=1).groupby(["local_segment"]).nsmallest())
|
||||
if "mean" in so_features_names:
|
||||
so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).mean(numeric_only=True).add_suffix("_SO_mean")], axis=1)
|
||||
|
||||
if "median" in so_features_names:
|
||||
so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).median(numeric_only=True).add_suffix("_SO_median")], axis=1)
|
||||
|
||||
if "sd" in so_features_names:
|
||||
so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std(numeric_only=True).fillna(0).add_suffix("_SO_sd")], axis=1)
|
||||
|
||||
if "nlargest" in so_features_names: # largest 5 -- maybe there is a faster groupby solution?
|
||||
for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
|
||||
so_features[column+"_SO_nlargest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nlargest(5).mean())
|
||||
|
||||
if "nsmallest" in so_features_names: # smallest 5 -- maybe there is a faster groupby solution?
|
||||
for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
|
||||
so_features[column+"_SO_nsmallest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nsmallest(5).mean())
|
||||
|
||||
if "count_windows" in so_features_names:
|
||||
so_features["SO_windowsCount"] = intraday_features.groupby(groupby_cols).count()[prefix+"level_1"]
|
||||
|
||||
# numPeaksNonZero specialized for EDA sensor
|
||||
if "eda_num_peaks_non_zero" in so_features_names and prefix+"numPeaks" in intraday_features.columns:
|
||||
so_features[prefix+"SO_numPeaksNonZero"] = intraday_features.groupby(groupby_cols)[prefix+"numPeaks"].apply(lambda x: (x!=0).sum())
|
||||
|
||||
# numWindowsNonZero specialized for BVP and IBI sensors
|
||||
if "hrv_num_windows_non_nan" in so_features_names and prefix+"meanHr" in intraday_features.columns:
|
||||
so_features[prefix+"SO_numWindowsNonNaN"] = intraday_features.groupby(groupby_cols)[prefix+"meanHr"].apply(lambda x: (~np.isnan(x)).sum())
|
||||
|
||||
so_features.reset_index(inplace=True)
|
||||
|
||||
else:
|
||||
so_features = pd.DataFrame(columns=groupby_cols)
|
||||
|
||||
return so_features
|
||||
|
||||
def get_sample_rate(data): # To-Do get the sample rate information from the file's metadata
|
||||
try:
|
||||
timestamps_diff = data['timestamp'].diff().dropna().mean()
|
||||
print("Timestamp diff:", timestamps_diff)
|
||||
except Exception:
|
||||
raise Exception("Error occured while trying to get the mean sample rate from the data.")
|
||||
|
||||
return m.ceil(1000/timestamps_diff)
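# (Editor's note) Example: millisecond timestamps spaced ~15.625 ms apart on average give
# ceil(1000 / 15.625) = 64 Hz, which matches the Empatica E4 BVP sampling rate used above.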
|
|
@ -1,75 +0,0 @@
|
|||
import pandas as pd
|
||||
from scipy.stats import entropy
|
||||
|
||||
from cr_features.helper_functions import convert_to2d, accelerometer_features, frequency_features
|
||||
from cr_features.calculate_features_old import calculateFeatures
|
||||
from cr_features.calculate_features import calculate_features
|
||||
from cr_features_helper_methods import extract_second_order_features
|
||||
|
||||
import sys
|
||||
|
||||
def extract_acc_features_from_intraday_data(acc_intraday_data, features, window_length, time_segment, filter_data_by_segment):
|
||||
acc_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
|
||||
|
||||
if not acc_intraday_data.empty:
|
||||
sample_rate = 32
|
||||
|
||||
acc_intraday_data = filter_data_by_segment(acc_intraday_data, time_segment)
|
||||
|
||||
if not acc_intraday_data.empty:
|
||||
|
||||
acc_intraday_features = pd.DataFrame()
|
||||
|
||||
# apply methods from calculate features module
|
||||
if window_length is None:
|
||||
acc_intraday_features = \
|
||||
acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
|
||||
convert_to2d(x['double_values_0'], x.shape[0]), \
|
||||
convert_to2d(x['double_values_1'], x.shape[0]), \
|
||||
convert_to2d(x['double_values_2'], x.shape[0]), \
|
||||
fs=sample_rate, feature_names=features, show_progress=False))
|
||||
else:
|
||||
acc_intraday_features = \
|
||||
acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
|
||||
convert_to2d(x['double_values_0'], window_length*sample_rate), \
|
||||
convert_to2d(x['double_values_1'], window_length*sample_rate), \
|
||||
convert_to2d(x['double_values_2'], window_length*sample_rate), \
|
||||
fs=sample_rate, feature_names=features, show_progress=False))
|
||||
|
||||
acc_intraday_features.reset_index(inplace=True)
|
||||
|
||||
return acc_intraday_features
|
||||
|
||||
|
||||
|
||||
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
|
||||
|
||||
data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'double_values_0': 'float64',
|
||||
'double_values_1': 'float64', 'double_values_2': 'float64', 'local_date_time': 'str', 'local_date': "str",
|
||||
'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
|
||||
acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
|
||||
|
||||
requested_intraday_features = provider["FEATURES"]
|
||||
|
||||
calc_windows = kwargs.get('calc_windows', False)
|
||||
|
||||
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
|
||||
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
|
||||
else:
|
||||
requested_window_length = None
|
||||
|
||||
# name of the features this function can compute
|
||||
base_intraday_features_names = accelerometer_features + frequency_features
|
||||
# the subset of requested features this function can compute
|
||||
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
|
||||
|
||||
# extract features from intraday data
|
||||
acc_intraday_features = extract_acc_features_from_intraday_data(acc_intraday_data, intraday_features_to_compute,
|
||||
requested_window_length, time_segment, filter_data_by_segment)
|
||||
|
||||
if calc_windows:
|
||||
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
|
||||
acc_second_order_features = extract_second_order_features(acc_intraday_features, so_features_names)
|
||||
return acc_intraday_features, acc_second_order_features
|
||||
|
||||
return acc_intraday_features
|
|
@ -1,73 +0,0 @@
|
|||
import pandas as pd
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
from cr_features.helper_functions import convert_to2d, hrv_features
|
||||
from cr_features.hrv import extract_hrv_features_2d_wrapper
|
||||
from cr_features_helper_methods import extract_second_order_features
|
||||
|
||||
import sys
|
||||
|
||||
# pd.set_option('display.max_rows', 1000)
|
||||
pd.set_option('display.max_columns', None)
|
||||
|
||||
def extract_bvp_features_from_intraday_data(bvp_intraday_data, features, window_length, time_segment, filter_data_by_segment):
|
||||
bvp_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
|
||||
|
||||
if not bvp_intraday_data.empty:
|
||||
sample_rate = 64
|
||||
|
||||
bvp_intraday_data = filter_data_by_segment(bvp_intraday_data, time_segment)
|
||||
|
||||
if not bvp_intraday_data.empty:
|
||||
|
||||
bvp_intraday_features = pd.DataFrame()
|
||||
|
||||
# apply methods from calculate features module
|
||||
if window_length is None:
|
||||
bvp_intraday_features = \
|
||||
bvp_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x:
|
||||
extract_hrv_features_2d_wrapper(
|
||||
convert_to2d(x['blood_volume_pulse'], x.shape[0]),
|
||||
sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
|
||||
|
||||
else:
|
||||
bvp_intraday_features = \
|
||||
bvp_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x:
|
||||
extract_hrv_features_2d_wrapper(
|
||||
convert_to2d(x['blood_volume_pulse'], window_length*sample_rate),
|
||||
sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
|
||||
|
||||
bvp_intraday_features.reset_index(inplace=True)
|
||||
|
||||
return bvp_intraday_features
|
||||
|
||||
|
||||
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
|
||||
bvp_intraday_data = pd.read_csv(sensor_data_files["sensor_data"])
|
||||
|
||||
requested_intraday_features = provider["FEATURES"]
|
||||
|
||||
calc_windows = kwargs.get('calc_windows', False)
|
||||
|
||||
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
|
||||
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
|
||||
else:
|
||||
requested_window_length = None
|
||||
|
||||
# name of the features this function can compute
|
||||
base_intraday_features_names = hrv_features
|
||||
# the subset of requested features this function can compute
|
||||
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
|
||||
|
||||
# extract features from intraday data
|
||||
bvp_intraday_features = extract_bvp_features_from_intraday_data(bvp_intraday_data, intraday_features_to_compute,
|
||||
requested_window_length, time_segment, filter_data_by_segment)
|
||||
|
||||
if calc_windows:
|
||||
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
|
||||
bvp_second_order_features = extract_second_order_features(bvp_intraday_features, so_features_names)
|
||||
return bvp_intraday_features, bvp_second_order_features
|
||||
|
||||
return bvp_intraday_features
|
|
@ -1,32 +0,0 @@
|
|||
import pandas as pd
|
||||
import numpy as np
|
||||
from datetime import datetime
|
||||
|
||||
import sys, yaml
|
||||
|
||||
def calculate_empatica_data_yield(features): # TODO
|
||||
|
||||
# Get time segment duration in seconds from all segments in features dataframe
|
||||
datetime_start = pd.to_datetime(features['local_segment_start_datetime'], format='%Y-%m-%d %H:%M:%S')
|
||||
datetime_end = pd.to_datetime(features['local_segment_end_datetime'], format='%Y-%m-%d %H:%M:%S')
|
||||
tseg_duration = (datetime_end - datetime_start).dt.total_seconds()
|
||||
|
||||
with open('config.yaml', 'r') as stream:
|
||||
config = yaml.load(stream, Loader=yaml.FullLoader)
|
||||
|
||||
sensors = ["EMPATICA_ACCELEROMETER", "EMPATICA_TEMPERATURE", "EMPATICA_ELECTRODERMAL_ACTIVITY", "EMPATICA_INTER_BEAT_INTERVAL"]
|
||||
for sensor in sensors:
|
||||
features[f"{sensor.lower()}_data_yield"] = \
|
||||
(features[f"{sensor.lower()}_cr_SO_windowsCount"] * config[sensor]["PROVIDERS"]["CR"]["WINDOWS"]["WINDOW_LENGTH"]) / tseg_duration \
|
||||
if f'{sensor.lower()}_cr_SO_windowsCount' in features else 0
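# (Editor's note) Illustrative example: assuming WINDOW_LENGTH = 90 s, a 30-minute
# segment (1800 s) covered by 18 windows yields 18 * 90 / 1800 = 0.9 for that sensor.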
|
||||
|
||||
empatica_data_yield_cols = [sensor.lower() + "_data_yield" for sensor in sensors]
|
||||
pd.set_option('display.max_rows', None)
|
||||
|
||||
# Cap values that exceed 1 at 1 (possible when windows are not completely filled)
|
||||
features[empatica_data_yield_cols] = features[empatica_data_yield_cols].apply(lambda x: [y if y <= 1 or np.isnan(y) else 1 for y in x])
|
||||
|
||||
features["empatica_data_yield"] = features[empatica_data_yield_cols].mean(axis=1, numeric_only=True).fillna(0)
|
||||
features.drop(empatica_data_yield_cols, axis=1, inplace=True) # Drop the per-sensor columns; keep them if more advanced aggregation (e.g., a weighted average) is needed later
|
||||
|
||||
return features
|
|
@ -1,82 +0,0 @@
|
|||
import pandas as pd
|
||||
import numpy as np
|
||||
from scipy.stats import entropy
|
||||
|
||||
from cr_features.helper_functions import convert_to2d, gsr_features
|
||||
from cr_features.calculate_features import calculate_features
|
||||
from cr_features.gsr import extractGsrFeatures2D
|
||||
from cr_features_helper_methods import extract_second_order_features
|
||||
|
||||
import sys
|
||||
|
||||
#pd.set_option('display.max_columns', None)
|
||||
#pd.set_option('display.max_rows', None)
|
||||
#np.seterr(invalid='ignore')
|
||||
|
||||
|
||||
def extract_eda_features_from_intraday_data(eda_intraday_data, features, window_length, time_segment, filter_data_by_segment):
|
||||
eda_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
|
||||
|
||||
if not eda_intraday_data.empty:
|
||||
sample_rate = 4
|
||||
|
||||
eda_intraday_data = filter_data_by_segment(eda_intraday_data, time_segment)
|
||||
|
||||
if not eda_intraday_data.empty:
|
||||
|
||||
eda_intraday_features = pd.DataFrame()
|
||||
|
||||
# apply methods from calculate features module
|
||||
if window_length is None:
|
||||
eda_intraday_features = \
|
||||
eda_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], x.shape[0]), sampleRate=sample_rate, featureNames=features,
|
||||
threshold=.01, offset=1, riseTime=5, decayTime=15))
|
||||
else:
|
||||
eda_intraday_features = \
|
||||
eda_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], window_length*sample_rate), sampleRate=sample_rate, featureNames=features,
|
||||
threshold=.01, offset=1, riseTime=5, decayTime=15))
|
||||
|
||||
eda_intraday_features.reset_index(inplace=True)
|
||||
|
||||
return eda_intraday_features
|
||||
|
||||
|
||||
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
|
||||
|
||||
data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'electrodermal_activity': 'float64', 'local_date_time': 'str',
|
||||
'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
|
||||
|
||||
eda_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
|
||||
|
||||
requested_intraday_features = provider["FEATURES"]
|
||||
|
||||
calc_windows = kwargs.get('calc_windows', False)
|
||||
|
||||
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
|
||||
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
|
||||
else:
|
||||
requested_window_length = None
|
||||
|
||||
# name of the features this function can compute
|
||||
base_intraday_features_names = gsr_features
|
||||
# the subset of requested features this function can compute
|
||||
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
|
||||
|
||||
# extract features from intraday data
|
||||
eda_intraday_features = extract_eda_features_from_intraday_data(eda_intraday_data, intraday_features_to_compute,
|
||||
requested_window_length, time_segment, filter_data_by_segment)
|
||||
|
||||
if calc_windows:
|
||||
if provider["WINDOWS"]["IMPUTE_NANS"]:
|
||||
eda_intraday_features[eda_intraday_features["numPeaks"] == 0] = \
|
||||
eda_intraday_features[eda_intraday_features["numPeaks"] == 0].fillna(0)
|
||||
pd.set_option('display.max_columns', None)
|
||||
|
||||
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
|
||||
eda_second_order_features = extract_second_order_features(eda_intraday_features, so_features_names)
|
||||
|
||||
return eda_intraday_features, eda_second_order_features
|
||||
|
||||
return eda_intraday_features
|
|
@ -1,83 +0,0 @@
|
|||
import pandas as pd
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
import numpy as np
|
||||
|
||||
from cr_features.helper_functions import convert_ibi_to2d_time, hrv_features
|
||||
from cr_features.hrv import extract_hrv_features_2d_wrapper, get_HRV_features
|
||||
from cr_features_helper_methods import extract_second_order_features
|
||||
|
||||
import math
|
||||
import sys
|
||||
|
||||
# pd.set_option('display.max_rows', 1000)
|
||||
pd.set_option('display.max_columns', None)
|
||||
|
||||
|
||||
def extract_ibi_features_from_intraday_data(ibi_intraday_data, features, window_length, time_segment, filter_data_by_segment):
|
||||
ibi_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
|
||||
|
||||
if not ibi_intraday_data.empty:
|
||||
|
||||
ibi_intraday_data = filter_data_by_segment(ibi_intraday_data, time_segment)
|
||||
|
||||
if not ibi_intraday_data.empty:
|
||||
|
||||
ibi_intraday_features = pd.DataFrame()
|
||||
|
||||
# apply methods from calculate features module
|
||||
if window_length is None:
|
||||
ibi_intraday_features = \
|
||||
ibi_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x:
|
||||
extract_hrv_features_2d_wrapper(
|
||||
signal_2D = \
|
||||
convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], math.ceil(x['timings'].iloc[-1]))[0],
|
||||
ibi_timings = \
|
||||
convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], math.ceil(x['timings'].iloc[-1]))[1],
|
||||
sampling=None, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
|
||||
else:
|
||||
ibi_intraday_features = \
|
||||
ibi_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x:
|
||||
extract_hrv_features_2d_wrapper(
|
||||
signal_2D = convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], window_length)[0],
|
||||
ibi_timings = convert_ibi_to2d_time(x[['timings', 'inter_beat_interval']], window_length)[1],
|
||||
sampling=None, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
|
||||
|
||||
ibi_intraday_features.reset_index(inplace=True)
|
||||
|
||||
return ibi_intraday_features
|
||||
|
||||
|
||||
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
|
||||
|
||||
data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'inter_beat_interval': 'float64', 'timings': 'float64', 'local_date_time': 'str',
|
||||
'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
|
||||
|
||||
ibi_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
|
||||
|
||||
requested_intraday_features = provider["FEATURES"]
|
||||
|
||||
calc_windows = kwargs.get('calc_windows', False)
|
||||
|
||||
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
|
||||
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
|
||||
else:
|
||||
requested_window_length = None
|
||||
|
||||
# name of the features this function can compute
|
||||
base_intraday_features_names = hrv_features
|
||||
# the subset of requested features this function can compute
|
||||
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
|
||||
|
||||
# extract features from intraday data
|
||||
ibi_intraday_features = extract_ibi_features_from_intraday_data(ibi_intraday_data, intraday_features_to_compute,
|
||||
requested_window_length, time_segment, filter_data_by_segment)
|
||||
|
||||
if calc_windows:
|
||||
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
|
||||
ibi_second_order_features = extract_second_order_features(ibi_intraday_features, so_features_names)
|
||||
|
||||
return ibi_intraday_features, ibi_second_order_features
|
||||
|
||||
return ibi_intraday_features
|
|
@ -1,68 +0,0 @@
|
|||
import pandas as pd
|
||||
from scipy.stats import entropy
|
||||
|
||||
from cr_features.helper_functions import convert_to2d, generic_features
|
||||
from cr_features.calculate_features_old import calculateFeatures
|
||||
from cr_features.calculate_features import calculate_features
|
||||
from cr_features_helper_methods import extract_second_order_features
|
||||
|
||||
import sys
|
||||
|
||||
def extract_temp_features_from_intraday_data(temperature_intraday_data, features, window_length, time_segment, filter_data_by_segment):
|
||||
temperature_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
|
||||
|
||||
if not temperature_intraday_data.empty:
|
||||
sample_rate = 4
|
||||
|
||||
temperature_intraday_data = filter_data_by_segment(temperature_intraday_data, time_segment)
|
||||
|
||||
if not temperature_intraday_data.empty:
|
||||
|
||||
temperature_intraday_features = pd.DataFrame()
|
||||
|
||||
# apply methods from calculate features module
|
||||
if window_length is None:
|
||||
temperature_intraday_features = \
|
||||
temperature_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x: calculate_features(convert_to2d(x['temperature'], x.shape[0]), fs=sample_rate, feature_names=features, show_progress=False))
|
||||
else:
|
||||
temperature_intraday_features = \
|
||||
temperature_intraday_data.groupby('local_segment').apply(\
|
||||
lambda x: calculate_features(convert_to2d(x['temperature'], window_length*sample_rate), fs=sample_rate, feature_names=features, show_progress=False))
|
||||
|
||||
|
||||
temperature_intraday_features.reset_index(inplace=True)
|
||||
|
||||
return temperature_intraday_features
|
||||
|
||||
|
||||
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
|
||||
data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'temperature': 'float64', 'local_date_time': 'str',
|
||||
'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
|
||||
|
||||
temperature_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
|
||||
|
||||
requested_intraday_features = provider["FEATURES"]
|
||||
|
||||
calc_windows = kwargs.get('calc_windows', False)
|
||||
|
||||
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
|
||||
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
|
||||
else:
|
||||
requested_window_length = None
|
||||
|
||||
# name of the features this function can compute
|
||||
base_intraday_features_names = generic_features
|
||||
# the subset of requested features this function can compute
|
||||
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
|
||||
|
||||
# extract features from intraday data
|
||||
temperature_intraday_features = extract_temp_features_from_intraday_data(temperature_intraday_data, intraday_features_to_compute,
|
||||
requested_window_length, time_segment, filter_data_by_segment)
|
||||
|
||||
if calc_windows:
|
||||
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
|
||||
temperature_second_order_features = extract_second_order_features(temperature_intraday_features, so_features_names)
|
||||
return temperature_intraday_features, temperature_second_order_features
|
||||
|
||||
return temperature_intraday_features
|
|
@ -1,38 +1,19 @@
|
|||
import pandas as pd
|
||||
from utils.utils import fetch_provider_features, run_provider_cleaning_script
|
||||
|
||||
import sys
|
||||
|
||||
sensor_data_files = dict(snakemake.input)
|
||||
|
||||
provider = snakemake.params["provider"]
|
||||
provider_key = snakemake.params["provider_key"]
|
||||
sensor_key = snakemake.params["sensor_key"]
|
||||
|
||||
calc_windows = bool(provider.get("WINDOWS", False) and provider["WINDOWS"].get("COMPUTE", False))
|
||||
|
||||
if sensor_key == "all_cleaning_individual" or sensor_key == "all_cleaning_overall":
|
||||
# Data cleaning
|
||||
if "overall" in sensor_key:
|
||||
sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files, snakemake.params["target"])
|
||||
else:
|
||||
sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files)
|
||||
else:
|
||||
# Extract sensor features
|
||||
del sensor_data_files["time_segments_labels"]
|
||||
time_segments_file = snakemake.input["time_segments_labels"]
|
||||
sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)
|
||||
|
||||
if calc_windows:
|
||||
window_features, second_order_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file, calc_windows=True)
|
||||
|
||||
window_features.to_csv(snakemake.output[1], index=False)
|
||||
second_order_features.to_csv(snakemake.output[0], index=False)
|
||||
|
||||
elif "empatica" in sensor_key:
|
||||
pd.DataFrame().to_csv(snakemake.output[1], index=False)
|
||||
|
||||
if not calc_windows:
|
||||
sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file, calc_windows=False)
|
||||
|
||||
if not calc_windows:
|
||||
sensor_features.to_csv(snakemake.output[0], index=False)
|
||||
sensor_features.to_csv(snakemake.output[0], index=False)
|
|
@ -1,10 +1,9 @@
|
|||
import pandas as pd
|
||||
import numpy as np
|
||||
|
||||
def statsFeatures(steps_data, features_to_compute, features_type, steps_features, *args, **kwargs):
|
||||
def statsFeatures(steps_data, features_to_compute, features_type, steps_features):
|
||||
if features_type == "steps" or features_type == "sumsteps":
|
||||
col_name = "steps"
|
||||
reference_hour = kwargs["reference_hour"]
|
||||
elif features_type == "durationsedentarybout" or features_type == "durationactivebout":
|
||||
col_name = "duration"
|
||||
else:
|
||||
|
@ -24,10 +23,6 @@ def statsFeatures(steps_data, features_to_compute, features_type, steps_features
|
|||
steps_features["median" + features_type] = steps_data.groupby(["local_segment"])[col_name].median()
|
||||
if "std" + features_type in features_to_compute:
|
||||
steps_features["std" + features_type] = steps_data.groupby(["local_segment"])[col_name].std()
|
||||
if (col_name == "steps") and ("firststeptime" in features_to_compute):
|
||||
steps_features["firststeptime"] = steps_data[steps_data["steps"].ne(0)].groupby(["local_segment"])["local_time"].first().apply(lambda x: (int(x.split(":")[0]) - reference_hour) * 60 + int(x.split(":")[1]) + (int(x.split(":")[2]) / 60))
|
||||
if (col_name == "steps") and ("laststeptime" in features_to_compute):
|
||||
steps_features["laststeptime"] = steps_data[steps_data["steps"].ne(0)].groupby(["local_segment"])["local_time"].last().apply(lambda x: (int(x.split(":")[0]) - reference_hour) * 60 + int(x.split(":")[1]) + (int(x.split(":")[2]) / 60))
|
||||
|
||||
return steps_features
|
||||
|
||||
|
@ -43,11 +38,11 @@ def getBouts(steps_data):
|
|||
|
||||
return bouts
|
||||
|
||||
def extractStepsFeaturesFromIntradayData(steps_intraday_data, reference_hour, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features):
|
||||
def extractStepsFeaturesFromIntradayData(steps_intraday_data, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features):
|
||||
steps_intraday_features = pd.DataFrame()
|
||||
|
||||
# statistics features of steps count
|
||||
steps_intraday_features = statsFeatures(steps_intraday_data, intraday_features_to_compute_steps, "steps", steps_intraday_features, reference_hour=reference_hour)
|
||||
steps_intraday_features = statsFeatures(steps_intraday_data, intraday_features_to_compute_steps, "steps", steps_intraday_features)
|
||||
|
||||
# sedentary bout: less than THRESHOLD_ACTIVE_BOUT (default: 10) steps in a minute
|
||||
# active bout: greater or equal to THRESHOLD_ACTIVE_BOUT (default: 10) steps in a minute
|
||||
|
@ -71,7 +66,6 @@ def extractStepsFeaturesFromIntradayData(steps_intraday_data, reference_hour, th
|
|||
|
||||
def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
|
||||
|
||||
reference_hour = provider["REFERENCE_HOUR"]
|
||||
threshold_active_bout = provider["THRESHOLD_ACTIVE_BOUT"]
|
||||
include_zero_step_rows = provider["INCLUDE_ZERO_STEP_ROWS"]
|
||||
|
||||
|
@ -79,11 +73,11 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
|
|||
|
||||
requested_intraday_features = provider["FEATURES"]
|
||||
|
||||
requested_intraday_features_steps = [x + "steps" if x not in ["firststeptime", "laststeptime"] else x for x in requested_intraday_features["STEPS"]]
|
||||
requested_intraday_features_steps = [x + "steps" for x in requested_intraday_features["STEPS"]]
|
||||
requested_intraday_features_sedentarybout = [x + "sedentarybout" for x in requested_intraday_features["SEDENTARY_BOUT"]]
|
||||
requested_intraday_features_activebout = [x + "activebout" for x in requested_intraday_features["ACTIVE_BOUT"]]
|
||||
# name of the features this function can compute
|
||||
base_intraday_features_steps = ["sumsteps", "maxsteps", "minsteps", "avgsteps", "stdsteps", "firststeptime", "laststeptime"]
|
||||
base_intraday_features_steps = ["sumsteps", "maxsteps", "minsteps", "avgsteps", "stdsteps"]
|
||||
base_intraday_features_sedentarybout = ["countepisodesedentarybout", "sumdurationsedentarybout", "maxdurationsedentarybout", "mindurationsedentarybout", "avgdurationsedentarybout", "stddurationsedentarybout"]
|
||||
base_intraday_features_activebout = ["countepisodeactivebout", "sumdurationactivebout", "maxdurationactivebout", "mindurationactivebout", "avgdurationactivebout", "stddurationactivebout"]
|
||||
# the subset of requested features this function can compute
|
||||
|
@ -105,6 +99,6 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
|
|||
steps_intraday_data = filter_data_by_segment(steps_intraday_data, time_segment)
|
||||
|
||||
if not steps_intraday_data.empty:
|
||||
steps_intraday_features = extractStepsFeaturesFromIntradayData(steps_intraday_data, reference_hour, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features)
|
||||
steps_intraday_features = extractStepsFeaturesFromIntradayData(steps_intraday_data, threshold_active_bout, intraday_features_to_compute_steps, intraday_features_to_compute_sedentarybout, intraday_features_to_compute_activebout, steps_intraday_features)
|
||||
|
||||
return steps_intraday_features
|
||||
|
|
|
@ -37,6 +37,6 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
|
|||
ar_features.index.names = ["local_segment"]
|
||||
ar_features = ar_features.reset_index()
|
||||
|
||||
ar_features.fillna(value={"count": 0, "countuniqueactivities": 0, "durationstationary": 0, "durationmobile": 0, "durationvehicle": 0, "mostcommonactivity": 4}, inplace=True)
|
||||
ar_features.fillna(value={"count": 0, "countuniqueactivities": 0, "durationstationary": 0, "durationmobile": 0, "durationvehicle": 0}, inplace=True)
|
||||
|
||||
return ar_features
|
||||
|
|
|
@ -9,19 +9,19 @@ def compute_features(filtered_data, apps_type, requested_features, apps_features
|
|||
if "timeoffirstuse" in requested_features:
|
||||
time_first_event = filtered_data.sort_values(by="timestamp", ascending=True).drop_duplicates(subset="local_segment", keep="first").set_index("local_segment")
|
||||
if time_first_event.empty:
|
||||
apps_features["timeoffirstuse" + apps_type] = 1500 # np.nan
|
||||
apps_features["timeoffirstuse" + apps_type] = np.nan
|
||||
else:
|
||||
apps_features["timeoffirstuse" + apps_type] = time_first_event["local_hour"] * 60 + time_first_event["local_minute"]
|
||||
if "timeoflastuse" in requested_features:
|
||||
time_last_event = filtered_data.sort_values(by="timestamp", ascending=False).drop_duplicates(subset="local_segment", keep="first").set_index("local_segment")
|
||||
if time_last_event.empty:
|
||||
apps_features["timeoflastuse" + apps_type] = 1500 # np.nan
|
||||
apps_features["timeoflastuse" + apps_type] = np.nan
|
||||
else:
|
||||
apps_features["timeoflastuse" + apps_type] = time_last_event["local_hour"] * 60 + time_last_event["local_minute"]
|
||||
if "frequencyentropy" in requested_features:
|
||||
apps_with_count = filtered_data.groupby(["local_segment","application_name"]).count().sort_values(by="timestamp", ascending=False).reset_index()
|
||||
if (len(apps_with_count.index) < 2 ):
|
||||
apps_features["frequencyentropy" + apps_type] = 0 # np.nan
|
||||
apps_features["frequencyentropy" + apps_type] = np.nan
|
||||
else:
|
||||
apps_features["frequencyentropy" + apps_type] = apps_with_count.groupby("local_segment")["timestamp"].agg(entropy)
|
||||
if "countevent" in requested_features:
|
||||
|
@ -43,7 +43,6 @@ def compute_features(filtered_data, apps_type, requested_features, apps_features
|
|||
apps_features["sumduration" + apps_type] = filtered_data.groupby(by = ["local_segment"])["duration"].sum()
|
||||
|
||||
apps_features.index.names = ["local_segment"]
|
||||
|
||||
return apps_features
|
||||
|
||||
def process_app_features(data, requested_features, time_segment, provider, filter_data_by_segment):
|
||||
|
|
|
@ -15,7 +15,7 @@ def deviceFeatures(devices, ownership, common_devices, features_to_compute, feat
|
|||
if "meanscans" in features_to_compute:
|
||||
features = features.join(device_value_counts.groupby("local_segment")["scans"].mean().to_frame("meanscans" + ownership), how="outer")
|
||||
if "stdscans" in features_to_compute:
|
||||
features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership).fillna(0), how="outer")
|
||||
features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership), how="outer")
|
||||
# Most frequent device within segments, across segments, and across dataset
|
||||
if "countscansmostfrequentdevicewithinsegments" in features_to_compute:
|
||||
features = features.join(device_value_counts.groupby("local_segment")["scans"].max().to_frame("countscansmostfrequentdevicewithinsegments" + ownership), how="outer")
|
||||
|
|
|
@ -88,16 +88,6 @@ rapids_features <- function(sensor_data_files, time_segment, provider){
|
|||
features <- call_features_of_type(calls_of_type, features_type, call_type, time_segment, requested_features)
|
||||
call_features <- merge(call_features, features, all=TRUE)
|
||||
}
|
||||
|
||||
# Fill selected columns with a high number
|
||||
time_cols <- select(call_features, contains("timefirstcall") | contains("timelastcall")) %>%
|
||||
colnames(.)
|
||||
|
||||
call_features <- call_features %>%
|
||||
mutate_at(., time_cols, ~replace(., is.na(.), 1500))
|
||||
|
||||
# Fill NA values with 0
|
||||
call_features <- call_features %>% mutate_all(~replace(., is.na(.), 0))
|
||||
|
||||
call_features <- call_features %>% mutate_at(vars(contains("countmostfrequentcontact") | contains("distinctcontacts") | contains("count") | contains("sumduration") | contains("minduration") | contains("maxduration") | contains("meanduration") | contains("modeduration")), list( ~ replace_na(., 0)))
|
||||
return(call_features)
|
||||
}
|
|
@ -3,11 +3,9 @@ library(tidyr)
|
|||
library(readr)
|
||||
|
||||
compute_data_yield_features <- function(data, feature_name, time_segment, provider){
|
||||
|
||||
data <- data %>% filter_data_by_segment(time_segment)
|
||||
if(nrow(data) == 0){
|
||||
if(nrow(data) == 0)
|
||||
return(tibble(local_segment = character(), ratiovalidyieldedminutes = numeric(), ratiovalidyieldedhours = numeric()))
|
||||
}
|
||||
features <- data %>%
|
||||
separate(timestamps_segment, into = c("start_timestamp", "end_timestamp"), convert = T, sep = ",") %>%
|
||||
mutate(duration_minutes = (end_timestamp - start_timestamp) / 60000,
|
||||
|
|
|
@ -1,274 +0,0 @@
|
|||
from collections.abc import Collection
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from pytz import timezone
|
||||
import datetime, json
|
||||
|
||||
# from config.models import ESM, Participant
|
||||
# from features import helper
|
||||
|
||||
ESM_STATUS_ANSWERED = 2
|
||||
|
||||
GROUP_SESSIONS_BY = ["device_id", "esm_session"]  # optionally also "participant_id"
|
||||
|
||||
SESSION_STATUS_UNANSWERED = "ema_unanswered"
|
||||
SESSION_STATUS_DAY_FINISHED = "day_finished"
|
||||
SESSION_STATUS_COMPLETE = "ema_completed"
|
||||
|
||||
ANSWER_DAY_FINISHED = "DayFinished3421"
|
||||
ANSWER_DAY_OFF = "DayOff3421"
|
||||
ANSWER_SET_EVENING = "DayFinishedSetEvening"
|
||||
|
||||
MAX_MORNING_LENGTH = 3
|
||||
# When the participant was not yet at work at the time of the first (morning) EMA,
|
||||
# only three items were answered.
|
||||
# Two sleep related items and one indicating NOT starting work yet.
|
||||
# Daytime EMAs are all longer; in fact, they always consist of at least 6 items.
|
||||
|
||||
|
||||
TZ_LJ = timezone("Europe/Ljubljana")
|
||||
COLUMN_TIMESTAMP = "timestamp"
|
||||
COLUMN_TIMESTAMP_ESM = "double_esm_user_answer_timestamp"
|
||||
|
||||
|
||||
def get_date_from_timestamp(df_aware) -> pd.DataFrame:
|
||||
"""
|
||||
Transform a UNIX timestamp into a datetime (with Ljubljana timezone).
|
||||
Additionally, extract only the date part, where anything until 4 AM is considered the same day.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df_aware: pd.DataFrame
|
||||
Any AWARE-type data as defined in models.py.
|
||||
|
||||
Returns
|
||||
-------
|
||||
df_aware: pd.DataFrame
|
||||
The same dataframe with datetime_lj and date_lj columns added.
|
||||
|
||||
"""
|
||||
if COLUMN_TIMESTAMP_ESM in df_aware:
|
||||
column_timestamp = COLUMN_TIMESTAMP_ESM
|
||||
else:
|
||||
column_timestamp = COLUMN_TIMESTAMP
|
||||
|
||||
df_aware["datetime_lj"] = df_aware[column_timestamp].apply(
|
||||
lambda x: datetime.datetime.fromtimestamp(x / 1000.0, tz=TZ_LJ)
|
||||
)
|
||||
df_aware = df_aware.assign(
|
||||
date_lj=lambda x: (x.datetime_lj - datetime.timedelta(hours=4)).dt.date
|
||||
)
|
||||
# Since daytime EMAs could *theoretically* last beyond midnight, but never after 4 AM,
|
||||
# the datetime is first translated to 4 h earlier.
|
||||
|
||||
return df_aware
|
||||
|
||||
|
||||
def preprocess_esm(df_esm: pd.DataFrame) -> pd.DataFrame:
|
||||
"""
|
||||
Convert timestamps into human-readable datetimes and dates
|
||||
and expand the JSON column into several Pandas DF columns.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df_esm: pd.DataFrame
|
||||
A dataframe of esm data.
|
||||
|
||||
Returns
|
||||
-------
|
||||
df_esm_preprocessed: pd.DataFrame
|
||||
A dataframe with added columns: datetime in Ljubljana timezone and all fields from ESM_JSON column.
|
||||
"""
|
||||
df_esm = get_date_from_timestamp(df_esm)
|
||||
|
||||
df_esm_json = df_esm["esm_json"].apply(json.loads)
|
||||
df_esm_json = pd.json_normalize(df_esm_json).drop(
|
||||
columns=["esm_trigger"]
|
||||
) # The esm_trigger column is already present in the main df.
|
||||
return df_esm.join(df_esm_json)
|
||||
|
||||
|
||||
def classify_sessions_by_completion(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
|
||||
"""
|
||||
For each distinct EMA session, determine how the participant responded to it.
|
||||
Possible outcomes are: SESSION_STATUS_UNANSWERED, SESSION_STATUS_DAY_FINISHED, and SESSION_STATUS_COMPLETE
|
||||
|
||||
This is done in three steps.
|
||||
|
||||
First, the esm_status is considered.
|
||||
If any of the ESMs in a session has a status *other than* "answered", then this session is taken as unfinished.
|
||||
|
||||
Second, the sessions which do not represent full questionnaires are identified.
|
||||
These are sessions where participants only marked they are finished with the day or have not yet started working.
|
||||
|
||||
Third, the sessions with only one item are marked with their trigger.
|
||||
We never offered questionnaires with single items, so we can be sure these are unfinished.
|
||||
|
||||
Finally, all sessions that remain are marked as completed.
|
||||
By going through different possibilities in expl_esm_adherence.ipynb, this turned out to be a reasonable option.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df_esm_preprocessed: pd.DataFrame
|
||||
A preprocessed dataframe of esm data, which must include the session ID (esm_session).
|
||||
|
||||
Returns
|
||||
-------
|
||||
df_session_counts: pd.Dataframe
|
||||
A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with their statuses and the number of items.
|
||||
"""
|
||||
sessions_grouped = df_esm_preprocessed.groupby(GROUP_SESSIONS_BY)
|
||||
|
||||
# 0. First, assign all session statuses as NaN.
|
||||
df_session_counts = pd.DataFrame(sessions_grouped.count()["timestamp"]).rename(
|
||||
columns={"timestamp": "esm_session_count"}
|
||||
)
|
||||
df_session_counts["session_response"] = np.nan
|
||||
|
||||
# 1. Identify all ESMs with status other than answered.
|
||||
esm_not_answered = sessions_grouped.apply(
|
||||
lambda x: (x.esm_status != ESM_STATUS_ANSWERED).any()
|
||||
)
|
||||
df_session_counts.loc[
|
||||
esm_not_answered, "session_response"
|
||||
] = SESSION_STATUS_UNANSWERED
|
||||
|
||||
# 2. Identify non-sessions, i.e. answers about the end of the day.
|
||||
non_session = sessions_grouped.apply(
|
||||
lambda x: (
|
||||
(x.esm_user_answer == ANSWER_DAY_FINISHED) # I finished working for today.
|
||||
| (x.esm_user_answer == ANSWER_DAY_OFF) # I am not going to work today.
|
||||
| (
|
||||
x.esm_user_answer == ANSWER_SET_EVENING
|
||||
) # When would you like to answer the evening EMA?
|
||||
).any()
|
||||
)
|
||||
df_session_counts.loc[non_session, "session_response"] = SESSION_STATUS_DAY_FINISHED
|
||||
|
||||
# 3. Identify sessions appearing only once, as those were not true EMAs for sure.
|
||||
singleton_sessions = (df_session_counts.esm_session_count == 1) & (
|
||||
df_session_counts.session_response.isna()
|
||||
)
|
||||
df_session_1 = df_session_counts[singleton_sessions]
|
||||
df_esm_unique_session = df_session_1.join(
|
||||
df_esm_preprocessed.set_index(GROUP_SESSIONS_BY), how="left"
|
||||
)
|
||||
df_esm_unique_session = df_esm_unique_session.assign(
|
||||
session_response=lambda x: x.esm_trigger
|
||||
)["session_response"]
|
||||
df_session_counts.loc[
|
||||
df_esm_unique_session.index, "session_response"
|
||||
] = df_esm_unique_session
|
||||
|
||||
# 4. Mark the remaining sessions as completed.
|
||||
df_session_counts.loc[
|
||||
df_session_counts.session_response.isna(), "session_response"
|
||||
] = SESSION_STATUS_COMPLETE
|
||||
|
||||
return df_session_counts
|
||||
|
||||
|
||||
def classify_sessions_by_time(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
|
||||
"""
|
||||
For each EMA session, determine the time of the first user answer and its time type (morning, workday, or evening.)
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df_esm_preprocessed: pd.DataFrame
|
||||
A preprocessed dataframe of esm data, which must include the session ID (esm_session).
|
||||
|
||||
Returns
|
||||
-------
|
||||
df_session_time: pd.DataFrame
|
||||
A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with their time type and timestamp of first answer.
|
||||
"""
|
||||
df_session_time = (
|
||||
df_esm_preprocessed.sort_values(["datetime_lj"]) # "participant_id"
|
||||
.groupby(GROUP_SESSIONS_BY)
|
||||
.first()[["time", "datetime_lj"]]
|
||||
)
|
||||
return df_session_time
|
||||
|
||||
|
||||
def classify_sessions_by_completion_time(
|
||||
df_esm_preprocessed: pd.DataFrame,
|
||||
) -> pd.DataFrame:
|
||||
"""
|
||||
The point of this function is to not only classify sessions by using the previously defined functions.
|
||||
It also serves to "correct" the time type of some EMA sessions.
|
||||
|
||||
A morning questionnaire could seamlessly transition into a daytime questionnaire,
|
||||
if the participant was already at work.
|
||||
In this case, the "time" label changed mid-session.
|
||||
Because of the way classify_sessions_by_time works, this questionnaire was classified as "morning".
|
||||
But for all intents and purposes, it can be treated as a "daytime" EMA.
|
||||
|
||||
The way this scenario is differentiated from a true "morning" questionnaire,
|
||||
where the participants NOT yet at work, is by considering their length.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df_esm_preprocessed: pd.DataFrame
|
||||
A preprocessed dataframe of esm data, which must include the session ID (esm_session).
|
||||
|
||||
Returns
|
||||
-------
|
||||
df_session_counts_time: pd.DataFrame
|
||||
A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with statuses, the number of items,
|
||||
their time type (with some morning EMAs reclassified) and timestamp of first answer.
|
||||
|
||||
"""
|
||||
df_session_counts = classify_sessions_by_completion(df_esm_preprocessed)
|
||||
df_session_time = classify_sessions_by_time(df_esm_preprocessed)
|
||||
|
||||
df_session_counts_time = df_session_time.join(df_session_counts)
|
||||
|
||||
morning_transition_to_daytime = (df_session_counts_time.time == "morning") & (
|
||||
df_session_counts_time.esm_session_count > MAX_MORNING_LENGTH
|
||||
)
|
||||
|
||||
df_session_counts_time.loc[morning_transition_to_daytime, "time"] = "daytime"
|
||||
|
||||
return df_session_counts_time
|
||||
|
||||
|
||||
# def clean_up_esm(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
|
||||
# """
|
||||
# This function eliminates invalid ESM responses.
|
||||
# It removes unanswered ESMs and those that indicate end of work and similar.
|
||||
# It also extracts a numeric answer from strings such as "4 - I strongly agree".
|
||||
|
||||
# Parameters
|
||||
# ----------
|
||||
# df_esm_preprocessed: pd.DataFrame
|
||||
# A preprocessed dataframe of esm data.
|
||||
|
||||
# Returns
|
||||
# -------
|
||||
# df_esm_clean: pd.DataFrame
|
||||
# A subset of the original dataframe.
|
||||
|
||||
# """
|
||||
# df_esm_clean = df_esm_preprocessed[
|
||||
# df_esm_preprocessed["esm_status"] == ESM_STATUS_ANSWERED
|
||||
# ]
|
||||
# df_esm_clean = df_esm_clean[
|
||||
# ~df_esm_clean["esm_user_answer"].isin(
|
||||
# [ANSWER_DAY_FINISHED, ANSWER_DAY_OFF, ANSWER_SET_EVENING]
|
||||
# )
|
||||
# ]
|
||||
# df_esm_clean["esm_user_answer_numeric"] = np.nan
|
||||
# esm_type_numeric = [
|
||||
# ESM.ESM_TYPE.get("radio"),
|
||||
# ESM.ESM_TYPE.get("scale"),
|
||||
# ESM.ESM_TYPE.get("number"),
|
||||
# ]
|
||||
# df_esm_clean.loc[
|
||||
# df_esm_clean["esm_type"].isin(esm_type_numeric)
|
||||
# ] = df_esm_clean.loc[df_esm_clean["esm_type"].isin(esm_type_numeric)].assign(
|
||||
# esm_user_answer_numeric=lambda x: x.esm_user_answer.str.slice(stop=1).astype(
|
||||
# int
|
||||
# )
|
||||
# )
|
||||
# return df_esm_clean
|
|
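For orientation: the functions in this deleted module were meant to be chained as preprocess_esm() followed by classify_sessions_by_completion_time(). The short sketch below only illustrates the 4 AM day-boundary rule from get_date_from_timestamp(); the timestamp is invented for illustration.

```python
# Anything answered before 4 AM is attributed to the previous day (illustrative values).
import datetime
from pytz import timezone

TZ_LJ = timezone("Europe/Ljubljana")
answered_at = TZ_LJ.localize(datetime.datetime(2021, 1, 2, 1, 30))  # 01:30 on 2 Jan
date_lj = (answered_at - datetime.timedelta(hours=4)).date()
print(date_lj)  # 2021-01-01
```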
@ -1,108 +0,0 @@
import pandas as pd

JCQ_ORIGINAL_MAX = 4
JCQ_ORIGINAL_MIN = 1

dict_JCQ_demand_control_reverse = {
    75: (
        "I was NOT asked",
        "Men legde mij geen overdreven",
        "Men legde mij GEEN overdreven",  # Capitalized in some versions
        "Od mene se NI zahtevalo",
    ),
    76: (
        "I had enough time to do my work",
        "Ik had voldoende tijd om mijn werk",
        "Imela sem dovolj časa, da končam",
        "Imel sem dovolj časa, da končam",
    ),
    77: (
        "I was free of conflicting demands",
        "Er werden mij op het werk geen tegenstrijdige",
        "Er werden mij op het werk GEEN tegenstrijdige",  # Capitalized in some versions
        "Pri svojem delu se NISEM srečeval",
    ),
    79: (
        "My job involved a lot of repetitive work",
        "Mijn taak omvatte veel repetitief werk",
        "Moje delo je vključevalo veliko ponavljajočega",
    ),
    85: (
        "On my job, I had very little freedom",
        "In mijn taak had ik zeer weinig vrijheid",
        "Pri svojem delu sem imel zelo malo svobode",
        "Pri svojem delu sem imela zelo malo svobode",
    ),
}


def reverse_jcq_demand_control_scoring(
    df_esm_jcq_demand_control: pd.DataFrame,
) -> pd.DataFrame:
    """
    This function recodes answers in the Job Content Questionnaire by first incrementing them by 1,
    to be in line with the original (1-4) scoring.
    Then, some answers are reversed (i.e. 1 becomes 4 etc.), because the questions are negatively phrased.
    These answers are listed in dict_JCQ_demand_control_reverse and identified by their question ID.
    However, the existing data is checked against the literal phrasing of these questions
    to protect against wrong numbering of questions (differing question IDs).

    Parameters
    ----------
    df_esm_jcq_demand_control: pd.DataFrame
        A cleaned up dataframe, which must also include esm_user_answer_numeric.

    Returns
    -------
    df_esm_jcq_demand_control: pd.DataFrame
        The same dataframe with a column esm_user_score containing answers recoded and reversed.
    """
    df_esm_jcq_demand_control_unique_answers = (
        df_esm_jcq_demand_control.groupby("question_id")
        .esm_instructions.value_counts()
        .rename()
        .reset_index()
    )
    # Tabulate all possible answers to each question (group by question ID).
    for q_id in dict_JCQ_demand_control_reverse.keys():
        # Look through all answers that need to be reversed.
        possible_answers = df_esm_jcq_demand_control_unique_answers.loc[
            df_esm_jcq_demand_control_unique_answers["question_id"] == q_id,
            "esm_instructions",
        ]
        # These are all answers to a given question (by q_id).
        answers_matches = possible_answers.str.startswith(
            dict_JCQ_demand_control_reverse.get(q_id)
        )
        # See if they are expected, i.e. included in the dictionary.
        if ~answers_matches.all():
            print("One of the answers that occur in the data should not be reversed.")
            print("This was the answer found in the data: ")
            raise KeyError(possible_answers[~answers_matches])
            # In case there is an unexpected answer, raise an exception.

    try:
        df_esm_jcq_demand_control = df_esm_jcq_demand_control.assign(
            esm_user_score=lambda x: x.esm_user_answer_numeric + 1
        )
        # Increment the original answer by 1
        # to keep in line with traditional scoring (JCQ_ORIGINAL_MIN - JCQ_ORIGINAL_MAX).
        df_esm_jcq_demand_control[
            df_esm_jcq_demand_control["question_id"].isin(
                dict_JCQ_demand_control_reverse.keys()
            )
        ] = df_esm_jcq_demand_control[
            df_esm_jcq_demand_control["question_id"].isin(
                dict_JCQ_demand_control_reverse.keys()
            )
        ].assign(
            esm_user_score=lambda x: JCQ_ORIGINAL_MAX
            + JCQ_ORIGINAL_MIN
            - x.esm_user_score
        )
        # Reverse the items that require it.
    except AttributeError as e:
        print("Please, clean the dataframe first using features.esm.clean_up_esm.")
        print(e)

    return df_esm_jcq_demand_control
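A quick worked example of the recoding described in the docstring above (the raw answer value is hypothetical): a numeric answer on the 0-3 scale is shifted to the original 1-4 range, and negatively phrased items are then mirrored around the scale endpoints.

```python
# Recode one hypothetical JCQ answer, then reverse it (JCQ_ORIGINAL_MIN = 1, JCQ_ORIGINAL_MAX = 4).
raw_answer = 0                           # e.g. the first character of a "0 - ..." answer parsed earlier
esm_user_score = raw_answer + 1          # shift to the original 1-4 scoring -> 1
reversed_score = 4 + 1 - esm_user_score  # JCQ_ORIGINAL_MAX + JCQ_ORIGINAL_MIN - score -> 4
print(esm_user_score, reversed_score)
```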
@ -1,135 +0,0 @@
import json

import numpy as np
import pandas as pd

ESM_TYPE = {
    "text": 1,
    "radio": 2,
    "checkbox": 3,
    "likert": 4,
    "quick_answers": 5,
    "scale": 6,
    "datetime": 7,
    "pam": 8,
    "number": 9,
    "web": 10,
    "date": 11,
}

QUESTIONNAIRE_IDS = {
    "sleep_quality": 1,
    "PANAS_positive_affect": 8,
    "PANAS_negative_affect": 9,
    "JCQ_job_demand": 10,
    "JCQ_job_control": 11,
    "JCQ_supervisor_support": 12,
    "JCQ_coworker_support": 13,
    "PFITS_supervisor": 14,
    "PFITS_coworkers": 15,
    "UWES_vigor": 16,
    "UWES_dedication": 17,
    "UWES_absorption": 18,
    "COPE_active": 19,
    "COPE_support": 20,
    "COPE_emotions": 21,
    "balance_life_work": 22,
    "balance_work_life": 23,
    "recovery_experience_detachment": 24,
    "recovery_experience_relaxation": 25,
    "symptoms": 26,
    "appraisal_stressfulness_event": 87,
    "appraisal_threat": 88,
    "appraisal_challenge": 89,
    "appraisal_event_time": 90,
    "appraisal_event_duration": 91,
    "appraisal_event_work_related": 92,
    "appraisal_stressfulness_period": 93,
    "late_work": 94,
    "work_hours": 95,
    "left_work": 96,
    "activities": 97,
    "coffee_breaks": 98,
    "at_work_yet": 99,
}

ESM_STATUS_ANSWERED = 2

GROUP_SESSIONS_BY = ["participant_id", "device_id", "esm_session"]

SESSION_STATUS_UNANSWERED = "ema_unanswered"
SESSION_STATUS_DAY_FINISHED = "day_finished"
SESSION_STATUS_COMPLETE = "ema_completed"

ANSWER_DAY_FINISHED = "DayFinished3421"
ANSWER_DAY_OFF = "DayOff3421"
ANSWER_SET_EVENING = "DayFinishedSetEvening"

MAX_MORNING_LENGTH = 3
# When the participant was not yet at work at the time of the first (morning) EMA,
# only three items were answered.
# Two sleep related items and one indicating NOT starting work yet.
# Daytime EMAs are all longer, in fact they always consist of at least 6 items.


def preprocess_esm(df_esm: pd.DataFrame) -> pd.DataFrame:
    """
    Convert timestamps into human-readable datetimes and dates
    and expand the JSON column into several Pandas DF columns.

    Parameters
    ----------
    df_esm: pd.DataFrame
        A dataframe of esm data.

    Returns
    -------
    df_esm_preprocessed: pd.DataFrame
        A dataframe with added columns: datetime in Ljubljana timezone and all fields from the ESM_JSON column.
    """
    df_esm_json = df_esm["esm_json"].apply(json.loads)
    df_esm_json = pd.json_normalize(df_esm_json).drop(
        columns=["esm_trigger"]
    )  # The esm_trigger column is already present in the main df.
    return df_esm.join(df_esm_json)


def clean_up_esm(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame:
    """
    This function eliminates invalid ESM responses.
    It removes unanswered ESMs and those that indicate end of work and similar.
    It also extracts a numeric answer from strings such as "4 - I strongly agree".

    Parameters
    ----------
    df_esm_preprocessed: pd.DataFrame
        A preprocessed dataframe of esm data.

    Returns
    -------
    df_esm_clean: pd.DataFrame
        A subset of the original dataframe.

    """
    df_esm_clean = df_esm_preprocessed[
        df_esm_preprocessed["esm_status"] == ESM_STATUS_ANSWERED
    ]
    df_esm_clean = df_esm_clean[
        ~df_esm_clean["esm_user_answer"].isin(
            [ANSWER_DAY_FINISHED, ANSWER_DAY_OFF, ANSWER_SET_EVENING]
        )
    ]
    df_esm_clean["esm_user_answer_numeric"] = np.nan
    esm_type_numeric = [
        ESM_TYPE.get("radio"),
        ESM_TYPE.get("scale"),
        ESM_TYPE.get("number"),
    ]
    df_esm_clean.loc[
        df_esm_clean["esm_type"].isin(esm_type_numeric)
    ] = df_esm_clean.loc[df_esm_clean["esm_type"].isin(esm_type_numeric)].assign(
        esm_user_answer_numeric=lambda x: x.esm_user_answer.str.slice(stop=1).astype(
            int
        )
    )
    return df_esm_clean
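A minimal illustration of the numeric-answer extraction used in clean_up_esm() above; the answer strings are invented, and the approach assumes single-digit answer codes because only the first character of the string is kept.

```python
import pandas as pd

answers = pd.Series(["4 - I strongly agree", "1 - I strongly disagree"])
numeric = answers.str.slice(stop=1).astype(int)  # keep only the first character
print(numeric.tolist())  # [4, 1]
```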
@ -1,66 +0,0 @@
import pandas as pd

QUESTIONNAIRE_IDS = {
    "sleep_quality": 1,
    "PANAS_positive_affect": 8,
    "PANAS_negative_affect": 9,
    "JCQ_job_demand": 10,
    "JCQ_job_control": 11,
    "JCQ_supervisor_support": 12,
    "JCQ_coworker_support": 13,
    "PFITS_supervisor": 14,
    "PFITS_coworkers": 15,
    "UWES_vigor": 16,
    "UWES_dedication": 17,
    "UWES_absorption": 18,
    "COPE_active": 19,
    "COPE_support": 20,
    "COPE_emotions": 21,
    "balance_life_work": 22,
    "balance_work_life": 23,
    "recovery_experience_detachment": 24,
    "recovery_experience_relaxation": 25,
    "symptoms": 26,
    "appraisal_stressfulness_event": 87,
    "appraisal_threat": 88,
    "appraisal_challenge": 89,
    "appraisal_event_time": 90,
    "appraisal_event_duration": 91,
    "appraisal_event_work_related": 92,
    "appraisal_stressfulness_period": 93,
    "late_work": 94,
    "work_hours": 95,
    "left_work": 96,
    "activities": 97,
    "coffee_breaks": 98,
    "at_work_yet": 99,
}


def straw_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
    esm_data = pd.read_csv(sensor_data_files["sensor_data"])
    requested_features = provider["FEATURES"]
    # names of the features this function can compute
    requested_scales = provider["SCALES"]
    base_features_names = ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support",
                           "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
    # TODO Check valid questionnaire and feature names.
    # the subset of requested features this function can compute
    features_to_compute = list(set(requested_features) & set(base_features_names))
    esm_features = pd.DataFrame(columns=["local_segment"] + features_to_compute)
    if not esm_data.empty:
        esm_data = filter_data_by_segment(esm_data, time_segment)

        if not esm_data.empty:
            esm_features = pd.DataFrame()
            for scale in requested_scales:
                questionnaire_id = QUESTIONNAIRE_IDS[scale]
                mask = esm_data["questionnaire_id"] == questionnaire_id
                esm_features[scale + "_mean"] = esm_data.loc[mask].groupby(["local_segment"])["esm_user_score"].mean()
                # TODO Create the column esm_user_score in esm_clean. Currently, this is only done when reversing.

            esm_features = esm_features.reset_index()
            if 'index' in esm_features:  # In case of an empty esm_features df
                esm_features.rename(columns={'index': 'local_segment'}, inplace=True)

    return esm_features
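A hedged sketch of the provider dictionary this function reads: the FEATURES and SCALES keys come from the code above, while the concrete values below are only an example. For each requested scale, straw_features() returns one row per local_segment with a "<scale>_mean" column, i.e. the mean esm_user_score of that questionnaire within the segment.

```python
# Hypothetical provider configuration, expressed as the dict the function receives.
provider = {
    "FEATURES": ["PANAS_positive_affect", "JCQ_job_demand"],
    "SCALES": ["PANAS_positive_affect", "JCQ_job_demand"],
}
```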
@ -1,25 +0,0 @@
from esm_preprocess import *
from esm_JCQ import reverse_jcq_demand_control_scoring

requested_scales = snakemake.params["scales"]

df_esm = pd.read_csv(snakemake.input[0])
df_esm_preprocessed = preprocess_esm(df_esm)

if not all([scale in QUESTIONNAIRE_IDS for scale in requested_scales]):
    unknown_scales = set(requested_scales) - set(QUESTIONNAIRE_IDS.keys())
    print("The requested questionnaire name should be one of the following:")
    print(QUESTIONNAIRE_IDS.keys())
    raise ValueError("You requested scales not collected: ", unknown_scales)

df_esm_clean = clean_up_esm(df_esm_preprocessed)
df_esm_clean["esm_user_score"] = df_esm_clean["esm_user_answer_numeric"]

for scale in requested_scales:
    questionnaire_id = QUESTIONNAIRE_IDS[scale]
    mask = df_esm_clean["questionnaire_id"] == questionnaire_id
    if scale.startswith("JCQ"):
        df_esm_clean.loc[mask] = reverse_jcq_demand_control_scoring(df_esm_clean.loc[mask])
    # TODO Reverse other questionnaires if needed and/or adapt esm_user_score to original scoring.

df_esm_clean.to_csv(snakemake.output[0], index=False)
@ -1,260 +0,0 @@
import pandas as pd
import numpy as np
import datetime

import math, sys, yaml

from esm_preprocess import clean_up_esm
from esm import classify_sessions_by_completion_time, preprocess_esm

input_data_files = dict(snakemake.input)

def format_timestamp(x):
    """This method formats the input time (in seconds) using the "HH MM SS" syntax, including spaces.
    If there are no hours or minutes present, that part is omitted, e.g., "MM SS" or just "SS".

    Args:
        x (int): time in seconds

    Returns:
        str: formatted time using the "HH MM SS" syntax
    """
    tstring = ""
    space = False
    if x // 3600 > 0:
        tstring += f"{x // 3600}H"
        space = True
    if x % 3600 // 60 > 0:
        tstring += f" {x % 3600 // 60}M" if "H" in tstring else f"{x % 3600 // 60}M"
    if x % 60 > 0:
        tstring += f" {x % 60}S" if "M" in tstring or "H" in tstring else f"{x % 60}S"

    return tstring


def extract_ers(esm_df):
    """This method has two major functionalities:
        (1) It prepares the STRAW event-related segments file with the use of the esm file. The execution protocol depends on
            the segmenting method specified in the config.yaml file.
        (2) It prepares and writes a csv with targets and corresponding time segment labels. This is later used
            in the overall cleaning script (straw).

    Details about each segmenting method are listed below by each corresponding condition. Refer to the RAPIDS documentation for the
    ERS file format: https://www.rapids.science/1.9/setup/configuration/#time-segments -> event segments

    Args:
        esm_df (DataFrame): the esm file read for the current participant.

    Returns:
        extracted_ers (DataFrame): dataframe with all necessary information to write the event-related segments file
            in the correct format.
    """

    pd.set_option("display.max_rows", 100)
    pd.set_option("display.max_columns", None)

    with open('config.yaml', 'r') as stream:
        config = yaml.load(stream, Loader=yaml.FullLoader)

    pd.DataFrame(columns=["label"]).to_csv(snakemake.output[1])  # Create an empty stress_events_targets file

    esm_preprocessed = clean_up_esm(preprocess_esm(esm_df))

    # Take only ema_completed session responses
    classified = classify_sessions_by_completion_time(esm_preprocessed)
    esm_filtered_sessions = classified[classified["session_response"] == 'ema_completed'].reset_index()[['device_id', 'esm_session']]
    esm_df = esm_preprocessed.loc[(esm_preprocessed['device_id'].isin(esm_filtered_sessions['device_id'])) & (esm_preprocessed['esm_session'].isin(esm_filtered_sessions['esm_session']))]

    segmenting_method = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"]

    if segmenting_method in ["30_before", "90_before"]:  # takes a 30-minute period before the questionnaire + the duration of the questionnaire
        """ '30-minutes before' and '90-minutes before' share the same fundamental logic, with a couple of deviations that are explained below.
        Both take an x-minute period before the questionnaire and sum it with the questionnaire duration.
        All questionnaire durations over 15 minutes are excluded from the querying.
        """
        # Extract time-relevant information
        extracted_ers = esm_df.groupby(["device_id", "esm_session"])['timestamp'].apply(lambda x: math.ceil((x.max() - x.min()) / 1000)).reset_index()  # questionnaire length
        extracted_ers["label"] = f"straw_event_{segmenting_method}_" + snakemake.params["pid"] + "_" + extracted_ers.index.astype(str).str.zfill(3)
        extracted_ers[['event_timestamp', 'device_id']] = esm_df.groupby(["device_id", "esm_session"])['timestamp'].min().reset_index()[['timestamp', 'device_id']]
        extracted_ers = extracted_ers[extracted_ers["timestamp"] <= 15 * 60].reset_index(drop=True)  # ensure that the longest duration of questionnaire answering is 15 min
        extracted_ers["shift_direction"] = -1

        if segmenting_method == "30_before":
            """The 30-minutes-before method simply takes 30 minutes before the questionnaire and sums it with the questionnaire duration.
            The timestamps are formatted with the help of the format_timestamp() method.
            """
            time_before_questionnaire = 30 * 60  # in seconds (30 minutes)

            extracted_ers["length"] = (extracted_ers["timestamp"] + time_before_questionnaire).apply(lambda x: format_timestamp(x))
            extracted_ers["shift"] = time_before_questionnaire
            extracted_ers["shift"] = extracted_ers["shift"].apply(lambda x: format_timestamp(x))

        elif segmenting_method == "90_before":
            """The 90-minutes-before method has an important condition. If the time between the current and the previous questionnaire is
            longer than 90 minutes it takes 90 minutes, otherwise it takes the original time difference between the questionnaires.
            """
            time_before_questionnaire = 90 * 60  # in seconds (90 minutes)

            extracted_ers[['end_event_timestamp', 'device_id']] = esm_df.groupby(["device_id", "esm_session"])['timestamp'].max().reset_index()[['timestamp', 'device_id']]

            extracted_ers['diffs'] = extracted_ers['event_timestamp'].astype('int64') - extracted_ers['end_event_timestamp'].shift(1, fill_value=0).astype('int64')
            extracted_ers.loc[extracted_ers['diffs'] > time_before_questionnaire * 1000, 'diffs'] = time_before_questionnaire * 1000

            extracted_ers["diffs"] = (extracted_ers["diffs"] / 1000).apply(lambda x: math.ceil(x))

            extracted_ers["length"] = (extracted_ers["timestamp"] + extracted_ers["diffs"]).apply(lambda x: format_timestamp(x))
            extracted_ers["shift"] = extracted_ers["diffs"].apply(lambda x: format_timestamp(x))

    elif segmenting_method == "stress_event":
        """
        TODO: update documentation for this condition
        This is a special case of the method as it consists of two important parts:
            (1) Generating the ERS file (same as the methods above) and
            (2) Generating the targets file together with the correct time segment labels.

        This extracts event-related segments, dependent on the event time and duration specified by the participant in the next
        questionnaire. Additionally, 5 minutes before the specified start time of this event are taken to account for the
        possibility of the participant not remembering the start time precisely => this parameter can be manipulated with the variable
        "time_before_event" which is defined below.

        In case the participant marked that no stressful event happened, the default of 30 minutes before the event is chosen.
        In this case, se_threat and se_challenge are NaN.

        By default, this method also excludes all events that are longer than 2.5 hours so that the segments are easily comparable.
        """

        ioi = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["INTERVAL_OF_INTEREST"] * 60  # interval of interest in seconds
        ioi_error_tolerance = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["IOI_ERROR_TOLERANCE"] * 60  # interval of interest error tolerance in seconds

        # Get and join required data
        extracted_ers = esm_df.groupby(["device_id", "esm_session"])['timestamp'].apply(lambda x: math.ceil((x.max() - x.min()) / 1000)).reset_index().rename(columns={'timestamp': 'session_length'})  # questionnaire length
        extracted_ers = extracted_ers[extracted_ers["session_length"] <= 15 * 60].reset_index(drop=True)  # ensure that the longest duration of questionnaire answering is 15 min
        session_start_timestamp = esm_df.groupby(['device_id', 'esm_session'])['timestamp'].min().to_frame().rename(columns={'timestamp': 'session_start_timestamp'})  # questionnaire start timestamp
        session_end_timestamp = esm_df.groupby(['device_id', 'esm_session'])['timestamp'].max().to_frame().rename(columns={'timestamp': 'session_end_timestamp'})  # questionnaire end timestamp

        # Users' answers for the stressfulness event (se) start times and durations
        se_time = esm_df[esm_df.questionnaire_id == 90.].set_index(['device_id', 'esm_session'])['esm_user_answer'].to_frame().rename(columns={'esm_user_answer': 'se_time'})
        se_duration = esm_df[esm_df.questionnaire_id == 91.].set_index(['device_id', 'esm_session'])['esm_user_answer'].to_frame().rename(columns={'esm_user_answer': 'se_duration'})

        # Make se_durations to the appropriate lengths

        # Extracted 3 targets that will be transferred in the csv file to the cleaning script.
        se_stressfulness_event_tg = esm_df[esm_df.questionnaire_id == 87.].set_index(['device_id', 'esm_session'])['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_stressfulness_event'})
        se_threat_tg = esm_df[esm_df.questionnaire_id == 88.].groupby(["device_id", "esm_session"]).mean(numeric_only=True)['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_threat'})
        se_challenge_tg = esm_df[esm_df.questionnaire_id == 89.].groupby(["device_id", "esm_session"]).mean(numeric_only=True)['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_challenge'})

        # All relevant features are joined by inner join to remove standalone columns (e.g., stressfulness event target has larger count)
        extracted_ers = extracted_ers.join(session_start_timestamp, on=['device_id', 'esm_session'], how='inner') \
            .join(session_end_timestamp, on=['device_id', 'esm_session'], how='inner') \
            .join(se_stressfulness_event_tg, on=['device_id', 'esm_session'], how='inner') \
            .join(se_time, on=['device_id', 'esm_session'], how='left') \
            .join(se_duration, on=['device_id', 'esm_session'], how='left') \
            .join(se_threat_tg, on=['device_id', 'esm_session'], how='left') \
            .join(se_challenge_tg, on=['device_id', 'esm_session'], how='left')

        # Filter out the sessions that are not useful. Because of the ambiguity this excludes:
        # (1) straw event times that are marked as "0 - I don't remember"
        extracted_ers = extracted_ers[~extracted_ers.se_time.astype(str).str.startswith("0 - ")]
        extracted_ers.reset_index(drop=True, inplace=True)

        extracted_ers.loc[extracted_ers.se_duration.astype(str).str.startswith("0 - "), 'se_duration'] = 0

        # Add a default duration in case the participant answered that no stressful event occurred
        extracted_ers["se_duration"] = extracted_ers["se_duration"].fillna(int((ioi + 2 * ioi_error_tolerance) * 1000))

        # Prepare data to fit the data structure in the CSV file ...
        # Add the event time as the end of the questionnaire if no stress event occurred
        extracted_ers['se_time'] = extracted_ers['se_time'].fillna(extracted_ers['session_start_timestamp'])
        # The type could be an int (timestamp [ms]) which stays the same, or a datetime str which is converted to a timestamp in milliseconds
        extracted_ers['event_timestamp'] = extracted_ers['se_time'].apply(lambda x: x if isinstance(x, int) else pd.to_datetime(x).timestamp() * 1000).astype('int64')
        extracted_ers['shift_direction'] = -1

        """>>>>> begin section (could be optimized) <<<<<"""

        # Checks whether the duration is marked with "1 - It's still ongoing", which means that the end of the current questionnaire
        # is taken as the end time of the segment. Otherwise the user-input duration is taken.
        extracted_ers['se_duration'] = \
            np.where(
                extracted_ers['se_duration'].astype(str).str.startswith("1 - "),
                extracted_ers['session_end_timestamp'] - extracted_ers['event_timestamp'],
                extracted_ers['se_duration']
            )

        # This converts the rows of timestamps in milliseconds and the rows with datetimes to timestamps in seconds.
        extracted_ers['se_duration'] = \
            extracted_ers['se_duration'].apply(lambda x: math.ceil(x / 1000) if isinstance(x, int) else (pd.to_datetime(x).hour * 60 + pd.to_datetime(x).minute) * 60)

        # Check explicitly whether the min duration is at least 0. This will eliminate rows that would be investigated after the end of the questionnaire.
        extracted_ers = extracted_ers[extracted_ers['session_end_timestamp'] - extracted_ers['event_timestamp'] >= 0]
        # Double check whether the min se_duration is at least 0. Filter out the rest. Negative values are considered invalid.
        extracted_ers = extracted_ers[extracted_ers["se_duration"] >= 0].reset_index(drop=True)

        """>>>>> end section <<<<<"""

        # Simply override all durations to be of an equal amount
        extracted_ers['se_duration'] = ioi + 2 * ioi_error_tolerance

        # If the target is 0 then shift by the total stress event duration, otherwise shift it by ioi_error_tolerance
        extracted_ers['shift'] = \
            np.where(
                extracted_ers['appraisal_stressfulness_event'] == 0,
                extracted_ers['se_duration'],
                ioi_error_tolerance
            )

        extracted_ers['shift'] = extracted_ers['shift'].apply(lambda x: format_timestamp(int(x)))
        extracted_ers['length'] = extracted_ers['se_duration'].apply(lambda x: format_timestamp(int(x)))

        # Drop event_timestamp duplicates in case the user is referencing the same event over multiple questionnaires
        extracted_ers.drop_duplicates(subset=["event_timestamp"], keep='first', inplace=True)
        extracted_ers.reset_index(drop=True, inplace=True)

        extracted_ers["label"] = f"straw_event_{segmenting_method}_" + snakemake.params["pid"] + "_" + extracted_ers.index.astype(str).str.zfill(3)

        # Write the csv of extracted ERS labels with targets related to the stressfulness event
        extracted_ers[["label", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]].to_csv(snakemake.output[1], index=False)

    else:
        raise Exception("Please select a correct target method for the event-related segments.")
        extracted_ers = pd.DataFrame(columns=["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"])

    return extracted_ers[["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"]]


"""
Here the code is executed - this .py file is used both for extraction of the STRAW time_segments file for the individual
participant, and also for merging all participants' files into one combined file which is later used for the assignment of time segments
to all sensors.

There are two files involved (see the rules extract_event_information_from_esm and merge_event_related_segments_files in preprocessing.smk):
(1) the ERS file, which contains all the information about the time segment timings, and
(2) the targets file, which has the corresponding target value for each segment label and is later merged with other features in the cleaning script.
For more information, see the comment in the method above.
"""
if snakemake.params["stage"] == "extract":
    esm_df = pd.read_csv(input_data_files['esm_raw_input'])

    extracted_ers = extract_ers(esm_df)

    extracted_ers.to_csv(snakemake.output[0], index=False)

elif snakemake.params["stage"] == "merge":

    input_data_files = dict(snakemake.input)
    straw_events = pd.DataFrame(columns=["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"])
    stress_events_targets = pd.DataFrame(columns=["label", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"])

    for input_file in input_data_files["ers_files"]:
        ers_df = pd.read_csv(input_file)
        straw_events = pd.concat([straw_events, ers_df], axis=0, ignore_index=True)

    straw_events.to_csv(snakemake.output[0], index=False)

    for input_file in input_data_files["se_files"]:
        se_df = pd.read_csv(input_file)
        stress_events_targets = pd.concat([stress_events_targets, se_df], axis=0, ignore_index=True)

    stress_events_targets.to_csv(snakemake.output[1], index=False)
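A quick sanity check of format_timestamp() from the script above (assuming the function defined there is in scope): the "length" and "shift" columns of the ERS file use this "HH MM SS"-style duration syntax, so a few illustrative durations and their hand-verified outputs are shown.

```python
print(format_timestamp(45))    # "45S"
print(format_timestamp(1800))  # "30M"
print(format_timestamp(3723))  # "1H 2M 3S"
```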
@ -29,7 +29,7 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se
        if "medianlux" in features_to_compute:
            light_features["medianlux"] = light_data.groupby(["local_segment"])["double_light_lux"].median()
        if "stdlux" in features_to_compute:
            light_features["stdlux"] = light_data.groupby(["local_segment"])["double_light_lux"].std().fillna(0)
            light_features["stdlux"] = light_data.groupby(["local_segment"])["double_light_lux"].std()

        light_features = light_features.reset_index()
@ -26,9 +26,7 @@ barnett_daily_features <- function(snakemake){
    location <- location %>%
      mutate(is_daily = str_detect(assigned_segments, paste0(".*#", datetime_start_regex, ",", datetime_end_regex, ".*")))

    does_not_span = nrow(segment_labels) == 0 || nrow(location) == 0 || all(location$is_daily == FALSE) || (max(location$timestamp) - min(location$timestamp) < 86400000)

    if(is.na(does_not_span) || does_not_span){
    if(nrow(segment_labels) == 0 || nrow(location) == 0 || all(location$is_daily == FALSE) || (max(location$timestamp) - min(location$timestamp) < 86400000)){
      warning("Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:",
              "\nLocation data rows within a daily time segment: ", nrow(filter(location, is_daily)),
              "\nLocation data time span in days: ", round((max(location$timestamp) - min(location$timestamp)) / 86400000, 2)
@ -115,7 +115,7 @@ cluster_on = provider["CLUSTER_ON"]
strategy = provider["INFER_HOME_LOCATION_STRATEGY"]
days_threshold = provider["MINIMUM_DAYS_TO_DETECT_HOME_CHANGES"]

if not location_data.timestamp.is_monotonic_increasing:
if not location_data.timestamp.is_monotonic:
    location_data.sort_values(by=["timestamp"], inplace=True)

location_data["duration_in_seconds"] = -1 * location_data.timestamp.diff(-1) / 1000
@ -37,8 +37,7 @@ def variance_and_logvariance_features(location_data, location_features):
    location_data["longitude_for_wvar"] = (location_data["double_longitude"] - location_data["longitude_wavg"]) ** 2 * location_data["duration"] * 60

    location_features["locationvariance"] = ((location_data_grouped["latitude_for_wvar"].sum() + location_data_grouped["longitude_for_wvar"].sum()) / (location_data_grouped["duration"].sum() * 60 - 1)).fillna(0)

    location_features["loglocationvariance"] = np.log10(location_features["locationvariance"]).replace(-np.inf, -1000000)
    location_features["loglocationvariance"] = np.log10(location_features["locationvariance"]).replace(-np.inf, np.nan)

    return location_features
@ -181,9 +180,6 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
    location_features = location_features.merge(location_entropy(stationary_data_without_outliers), how="outer", left_index=True, right_index=True)

    # time at home
    if stationary_data.empty:
        location_features["timeathome"] = 0
    else:
        stationary_data["time_at_home"] = stationary_data.apply(lambda row: row["duration"] if row["distance_from_home"] <= radius_from_home else 0, axis=1)
        location_features["timeathome"] = stationary_data[["local_segment", "time_at_home"]].groupby(["local_segment"])["time_at_home"].sum()
@ -65,15 +65,6 @@ rapids_features <- function(sensor_data_files, time_segment, provider){
      features <- message_features_of_type(messages_of_type, message_type, time_segment, requested_features)
      messages_features <- merge(messages_features, features, all=TRUE)
    }
    # Fill selected columns with a high number
    time_cols <- select(messages_features, contains("timefirstmessages") | contains("timelastmessages")) %>%
      colnames(.)

    messages_features <- messages_features %>%
      mutate_at(., time_cols, ~replace(., is.na(.), 1500))

    # Fill NA values with 0
    messages_features <- messages_features %>% mutate_all(~replace(., is.na(.), 0))

    messages_features <- messages_features %>% mutate_at(vars(contains("countmostfrequentcontact") | contains("distinctcontacts") | contains("count")), list( ~ replace_na(., 0)))
  return(messages_features)
}
@ -15,7 +15,7 @@ def getEpisodeDurationFeatures(screen_data, time_segment, episode, features, ref
    if "avgduration" in features:
        duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].mean().rename(columns = {"duration":"avgduration" + episode})], axis = 1)
    if "stdduration" in features:
        duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].std().fillna(0).rename(columns = {"duration":"stdduration" + episode})], axis = 1)
        duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].std().rename(columns = {"duration":"stdduration" + episode})], axis = 1)
    if "firstuseafter" + "{0:0=2d}".format(reference_hour_first_use) in features:
        screen_data_episode_after_hour = screen_data_episode.copy()
        screen_data_episode_after_hour["hour"] = pd.to_datetime(screen_data_episode["local_start_date_time"]).dt.hour
@ -1,30 +0,0 @@
import pandas as pd


def straw_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
    speech_data = pd.read_csv(sensor_data_files["sensor_data"])
    requested_features = provider["FEATURES"]
    # names of the features this function can compute
    base_features_names = ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
    features_to_compute = list(set(requested_features) & set(base_features_names))
    speech_features = pd.DataFrame(columns=["local_segment"] + features_to_compute)

    if not speech_data.empty:
        speech_data = filter_data_by_segment(speech_data, time_segment)

        if not speech_data.empty:
            speech_features = pd.DataFrame()
            if "meanspeech" in features_to_compute:
                speech_features["meanspeech"] = speech_data.groupby(["local_segment"])['speech_proportion'].mean()
            if "stdspeech" in features_to_compute:
                speech_features["stdspeech"] = speech_data.groupby(["local_segment"])['speech_proportion'].std()
            if "nlargest" in features_to_compute:
                speech_features["nlargest"] = speech_data.groupby(["local_segment"])['speech_proportion'].apply(lambda x: x.nlargest(5).mean())
            if "nsmallest" in features_to_compute:
                speech_features["nsmallest"] = speech_data.groupby(["local_segment"])['speech_proportion'].apply(lambda x: x.nsmallest(5).mean())
            if "medianspeech" in features_to_compute:
                speech_features["medianspeech"] = speech_data.groupby(["local_segment"])['speech_proportion'].median()

            speech_features = speech_features.reset_index()

    return speech_features
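The "nlargest"/"nsmallest" features above are the means of the five most extreme speech_proportion values in a segment; a toy series with invented numbers makes the semantics concrete.

```python
import pandas as pd

speech_proportion = pd.Series([0.1, 0.9, 0.4, 0.8, 0.7, 0.2, 0.6])
print(speech_proportion.nlargest(5).mean())   # mean of 0.9, 0.8, 0.7, 0.6, 0.4 -> 0.68
print(speech_proportion.nsmallest(5).mean())  # mean of 0.1, 0.2, 0.4, 0.6, 0.7 -> 0.40
```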
@ -9,26 +9,21 @@ compute_wifi_feature <- function(data, feature, time_segment){
            "countscans" = data %>% summarise(!!feature := n()),
            "uniquedevices" = data %>% summarise(!!feature := n_distinct(bssid)))
    return(data)

  } else if(feature == "countscansmostuniquedevice"){
    # Get the most scanned device
    mostuniquedevice <- data %>%
      filter(bssid != "") %>%
      group_by(bssid) %>%
      mutate(N=n()) %>%
      ungroup() %>%
      filter(N == max(N)) %>%
      head(1) %>% # if there are multiple devices with the same amount of scans pick the first one only
      pull(bssid)

    data <- data %>% filter_data_by_segment(time_segment)

    return(data %>%
             filter(bssid == mostuniquedevice) %>%
             group_by(local_segment) %>%
             summarise(!!feature := n())
    )

             summarise(!!feature := n()) %>%
             replace(is.na(.), 0))
  }
}


@ -48,6 +43,6 @@ rapids_features <- function(sensor_data_files, time_segment, provider){
    feature <- compute_wifi_feature(wifi_data, feature_name, time_segment)
    features <- merge(features, feature, by="local_segment", all = TRUE)
  }
  features <- features %>% mutate_all(~replace(., is.na(.), 0))

  return(features)
}
@ -1,17 +0,0 @@
source("renv/activate.R")

library(tidyr)
library(purrr)
library("dplyr", warn.conflicts = F)
library(stringr)

feature_files <- snakemake@input[["feature_files"]]


features_of_all_participants <- tibble(filename = feature_files) %>% # create a data frame
  mutate(file_contents = map(filename, ~ read.csv(., stringsAsFactors = F, colClasses = c(local_segment = "character", local_segment_label = "character", local_segment_start_datetime="character", local_segment_end_datetime="character"))),
         pid = str_match(filename, ".*/(.*)/z_all_sensor_features.csv")[,2]) %>%
  unnest(cols = c(file_contents)) %>%
  select(-filename)

write.csv(features_of_all_participants, snakemake@output[[1]], row.names = FALSE)
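For readers more at home in pandas, here is a hedged analogue of the R merge above: read each participant's z_all_sensor_features.csv, tag its rows with the pid extracted from the path, and concatenate. The function name is hypothetical; the path pattern mirrors the str_match() call in the R code.

```python
import re
import pandas as pd

SEGMENT_COLS = ["local_segment", "local_segment_label",
                "local_segment_start_datetime", "local_segment_end_datetime"]

def merge_participant_features(feature_files):
    """Concatenate per-participant feature CSVs, adding a pid column taken from each file path."""
    frames = []
    for filename in feature_files:
        df = pd.read_csv(filename, dtype={col: str for col in SEGMENT_COLS})
        df["pid"] = re.search(r".*/(.*)/z_all_sensor_features.csv", filename).group(1)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```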
@ -88,13 +88,11 @@ def chunk_episodes(sensor_episodes):

    return merged_sensor_episodes

def fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file, calc_windows=False):
def fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file):
    import pandas as pd
    from importlib import import_module, util

    sensor_features = pd.DataFrame(columns=["local_segment"])
    sensor_fo_features = pd.DataFrame(columns=["local_segment"])
    sensor_so_features = pd.DataFrame(columns=["local_segment"])
    time_segments_labels = pd.read_csv(time_segments_file, header=0)
    if "FEATURES" not in provider:
        raise ValueError("Provider config[{}][PROVIDERS][{}] is missing a FEATURES attribute in config.yaml".format(sensor_key.upper(), provider_key.upper()))

@ -108,20 +106,7 @@ def fetch_provider_features(provider, provider_key, sensor_key, sensor_data_file
            time_segments_labels["label"] = [""]
        for time_segment in time_segments_labels["label"]:
            print("{} Processing {} {} {}".format(rapids_log_tag, sensor_key, provider_key, time_segment))

            features = feature_function(sensor_data_files, time_segment, provider, filter_data_by_segment=filter_data_by_segment, chunk_episodes=chunk_episodes, calc_windows=calc_windows)

            # In case of calc_windows = True
            if isinstance(features, tuple):
                if not "local_segment" in features[0].columns or not "local_segment" in features[1].columns:
                    raise ValueError("The dataframe returned by the " + sensor_key + " provider '" + provider_key + "' is missing the 'local_segment' column added by the 'filter_data_by_segment()' function. Check the provider script is using such function and is not removing 'local_segment' by accident (" + provider["SRC_SCRIPT"] + ")\n The 'local_segment' column is used to index a provider's features (each row corresponds to a different time segment instance (e.g. 2020-01-01, 2020-01-02, 2020-01-03, etc.)")
                features[0].columns = ["{}{}".format("" if col.startswith("local_segment") else (sensor_key + "_"+ provider_key + "_"), col) for col in features[0].columns]
                features[1].columns = ["{}{}".format("" if col.startswith("local_segment") else (sensor_key + "_"+ provider_key + "_"), col) for col in features[1].columns]
                if not features[0].empty:
                    sensor_fo_features = pd.concat([sensor_fo_features, features[0]], axis=0, sort=False)
                if not features[1].empty:
                    sensor_so_features = pd.concat([sensor_so_features, features[1]], axis=0, sort=False)
            else:
                features = feature_function(sensor_data_files, time_segment, provider, filter_data_by_segment=filter_data_by_segment, chunk_episodes=chunk_episodes)
                if not "local_segment" in features.columns:
                    raise ValueError("The dataframe returned by the " + sensor_key + " provider '" + provider_key + "' is missing the 'local_segment' column added by the 'filter_data_by_segment()' function. Check the provider script is using such function and is not removing 'local_segment' by accident (" + provider["SRC_SCRIPT"] + ")\n The 'local_segment' column is used to index a provider's features (each row corresponds to a different time segment instance (e.g. 2020-01-01, 2020-01-02, 2020-01-03, etc.)")
                features.columns = ["{}{}".format("" if col.startswith("local_segment") else (sensor_key + "_"+ provider_key + "_"), col) for col in features.columns]

@ -129,27 +114,6 @@ def fetch_provider_features(provider, provider_key, sensor_key, sensor_data_file
        else:
            for feature in provider["FEATURES"]:
                sensor_features[feature] = None

    if calc_windows:
        segment_colums = pd.DataFrame()
        sensor_fo_features['local_segment'] = sensor_fo_features['local_segment'].str.replace(r'_RR\d+SS', '')
        split_segemnt_columns = sensor_fo_features["local_segment"].str.split(pat="(.*)#(.*),(.*)", expand=True)
        new_segment_columns = split_segemnt_columns.iloc[:,1:4] if split_segemnt_columns.shape[1] == 5 else pd.DataFrame(columns=["local_segment_label", "local_segment_start_datetime","local_segment_end_datetime"])
        segment_colums[["local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]] = new_segment_columns
        for i in range(segment_colums.shape[1]):
            sensor_fo_features.insert(1 + i, segment_colums.columns[i], segment_colums[segment_colums.columns[i]])

        segment_colums = pd.DataFrame()
        sensor_so_features['local_segment'] = sensor_so_features['local_segment'].str.replace(r'_RR\d+SS', '')
        split_segemnt_columns = sensor_so_features["local_segment"].str.split(pat="(.*)#(.*),(.*)", expand=True)
        new_segment_columns = split_segemnt_columns.iloc[:,1:4] if split_segemnt_columns.shape[1] == 5 else pd.DataFrame(columns=["local_segment_label", "local_segment_start_datetime","local_segment_end_datetime"])
        segment_colums[["local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]] = new_segment_columns
        for i in range(segment_colums.shape[1]):
            sensor_so_features.insert(1 + i, segment_colums.columns[i], segment_colums[segment_colums.columns[i]])

        return sensor_fo_features, sensor_so_features

    else:
        segment_colums = pd.DataFrame()
        sensor_features['local_segment'] = sensor_features['local_segment'].str.replace(r'_RR\d+SS', '')
        split_segemnt_columns = sensor_features["local_segment"].str.split(pat="(.*)#(.*),(.*)", expand=True)

@ -160,16 +124,12 @@ def fetch_provider_features(provider, provider_key, sensor_key, sensor_data_file

    return sensor_features

def run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files, target=False):
def run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files):
    from importlib import import_module, util
    print("{} Processing {} {}".format(rapids_log_tag, sensor_key, provider_key))

    cleaning_module = import_path(provider["SRC_SCRIPT"])
    cleaning_function = getattr(cleaning_module, provider_key.lower() + "_cleaning")

    if target:
        sensor_features = cleaning_function(sensor_data_files, provider, target)
    else:
        sensor_features = cleaning_function(sensor_data_files, provider)

    return sensor_features
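The local_segment splitting above relies on pandas treating a multi-character pat with capture groups as a regular expression, which yields five columns (hence the shape check). A small sketch with an invented segment string, following the "label#start,end" convention used by RAPIDS, and assuming a pandas version with that split behaviour:

```python
import pandas as pd

segments = pd.Series(["daily#2020-01-01 00:00:00,2020-01-01 23:59:59"])
parts = segments.str.split(pat="(.*)#(.*),(.*)", expand=True)  # 5 columns: "", label, start, end, ""
print(parts.iloc[:, 1:4].values.tolist())
# [['daily', '2020-01-01 00:00:00', '2020-01-01 23:59:59']]
```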
@ -1,19 +0,0 @@
import pandas as pd
import sys
import warnings

def retain_target_column(df_input: pd.DataFrame, target_variable_name: str):
    column_names = df_input.columns
    esm_names_index = column_names.str.startswith("phone_esm_straw")
    # Find all columns coming from phone_esm, since these are not features for our purposes and we will drop them.
    esm_names = column_names[esm_names_index]
    target_variable_index = esm_names.str.contains(target_variable_name)
    if all(~target_variable_index):
        warnings.warn(f"The requested target ({target_variable_name}) cannot be found in the dataset. Please check the names of the phone_esm_ columns in the cleaned Python file.")
        return None

    sensor_features_plus_target = df_input.drop(esm_names, axis=1)
    sensor_features_plus_target["target"] = df_input[esm_names[target_variable_index]]
    # We will only keep one column related to phone_esm and that will be our target variable.
    # Add it back to the very end of the data frame and rename it to target.
    return sensor_features_plus_target
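A hypothetical toy input for retain_target_column(): the column names follow the phone_esm_straw_* convention used above, while the values are invented. All phone_esm_straw_* columns are dropped and the one matching the requested target is re-attached as "target".

```python
import pandas as pd

df = pd.DataFrame({
    "phone_light_rapids_medianlux": [12.0, 30.5],
    "phone_esm_straw_PANAS_positive_affect_mean": [3.2, 2.8],
    "phone_esm_straw_PANAS_negative_affect_mean": [1.5, 2.0],
})
model_input = retain_target_column(df, "PANAS_positive_affect")
print(model_input.columns.tolist())  # ['phone_light_rapids_medianlux', 'target']
```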
@ -1,24 +0,0 @@
import pandas as pd

from helper import retain_target_column

sensor_features = pd.read_csv(snakemake.input["cleaned_sensor_features"])

all_baseline_features = pd.DataFrame()
for baseline_features_path in snakemake.input["demographic_features"]:
    pid = baseline_features_path.split("/")[3]
    baseline_features = pd.read_csv(baseline_features_path)
    baseline_features = baseline_features.assign(pid=pid)
    all_baseline_features = pd.concat([all_baseline_features, baseline_features], axis=0)

# merge sensor features and baseline features
if not sensor_features.empty:
    features = sensor_features.merge(all_baseline_features, on="pid", how="left")

    target_variable_name = snakemake.params["target_variable"]
    model_input = retain_target_column(features, target_variable_name)

    model_input.to_csv(snakemake.output[0], index=False)

else:
    sensor_features.to_csv(snakemake.output[0], index=False)
@ -1,13 +0,0 @@
import pandas as pd

from helper import retain_target_column

cleaned_sensor_features = pd.read_csv(snakemake.input["cleaned_sensor_features"])
target_variable_name = snakemake.params["target_variable"]

model_input = retain_target_column(cleaned_sensor_features, target_variable_name)

if model_input is None:
    pd.DataFrame().to_csv(snakemake.output[0])
else:
    model_input.to_csv(snakemake.output[0], index=False)
@ -24,12 +24,12 @@ def colors2colorscale(colors):
def getDataForPlot(phone_data_yield_per_segment):
    # calculate the length (in minutes) of each segment instance
    phone_data_yield_per_segment["length"] = phone_data_yield_per_segment["timestamps_segment"].str.split(",").apply(lambda x: int((int(x[1]) - int(x[0])) / (1000 * 60)))
    # calculate the number of sensors that logged at least one row of data per minute
    phone_data_yield_per_segment = phone_data_yield_per_segment.groupby(["local_segment", "length", "local_date", "local_hour", "local_minute"])[["sensor", "local_date_time"]].max().reset_index()
    # extract the local start datetime of the segment from the "local_segment" column
    phone_data_yield_per_segment["local_segment_start_datetimes"] = pd.to_datetime(phone_data_yield_per_segment["local_segment"].apply(lambda x: x.split("#")[1].split(",")[0]))
    # calculate the number of minutes after the local start datetime of the segment
    phone_data_yield_per_segment["minutes_after_segment_start"] = ((phone_data_yield_per_segment["local_date_time"] - phone_data_yield_per_segment["local_segment_start_datetimes"]) / pd.Timedelta(minutes=1)).astype("int")
    # keep, per minute after the segment start, the number of sensors that logged at least one row of data
    phone_data_yield_per_segment = phone_data_yield_per_segment.groupby(["local_segment", "length", "local_segment_start_datetimes", "minutes_after_segment_start"])[["sensor"]].max().reset_index()

    # impute missing rows with 0
    columns_for_full_index = phone_data_yield_per_segment[["local_segment_start_datetimes", "length"]].drop_duplicates(keep="first")
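As a quick check on the length computation above: `timestamps_segment` is assumed to hold the segment's start and end as comma-separated epoch milliseconds (the concrete values below are invented).

```
# Worked example of the "length" arithmetic (values invented for illustration).
timestamps_segment = "1619856000000,1619859600000"  # 2021-05-01 08:00 to 09:00 UTC
start_ms, end_ms = timestamps_segment.split(",")
length_minutes = int((int(end_ms) - int(start_ms)) / (1000 * 60))
print(length_minutes)  # -> 60
```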
@ -38,7 +38,6 @@ def getDataForPlot(phone_data_yield_per_segment):
    for columns in columns_for_full_index:
        full_index = full_index + columns
    full_index = pd.MultiIndex.from_tuples(full_index, names=("local_segment_start_datetimes", "minutes_after_segment_start"))
    phone_data_yield_per_segment = phone_data_yield_per_segment.drop_duplicates(subset=["local_segment_start_datetimes", "minutes_after_segment_start"], keep="first")
    phone_data_yield_per_segment = phone_data_yield_per_segment.set_index(["local_segment_start_datetimes", "minutes_after_segment_start"]).reindex(full_index).reset_index().fillna(0)

    # transpose the dataframe per local start datetime of the segment and discard the redundant index level
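A minimal, self-contained sketch of what the reindex-and-fill step in this hunk achieves: every expected minute of a segment gets a row, and minutes with no logged data are imputed with 0. The timestamps and sensor counts are made up; only the `set_index` / `reindex` / `fillna(0)` chain mirrors the code above.

```
import pandas as pd

# One segment starting at 08:00 and lasting 3 minutes; minutes 1 and 3 were not observed.
observed = pd.DataFrame({
    "local_segment_start_datetimes": pd.to_datetime(["2021-05-01 08:00"] * 2),
    "minutes_after_segment_start": [0, 2],
    "sensor": [5, 3],
})

# Full (segment start, minute offset) index covering minutes 0..3.
full_index = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2021-05-01 08:00"), m) for m in range(3 + 1)],
    names=("local_segment_start_datetimes", "minutes_after_segment_start"),
)

# Reindex against the full index so every expected minute is present, filling gaps with 0.
imputed = (
    observed.set_index(["local_segment_start_datetimes", "minutes_after_segment_start"])
    .reindex(full_index)
    .reset_index()
    .fillna(0)
)
print(imputed)  # minutes 1 and 3 appear with sensor == 0.0
```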
@ -22,7 +22,7 @@ output:
</style>

```{r include=FALSE}
source("/mnt/c/Users/junos/Documents/FWO-ARRS/Analysis/straw2analysis/rapids/renv/activate.R")
source("renv/activate.R")
```
@ -1 +1 @@
"local_segment","local_segment_label","local_segment_start_datetime","local_segment_end_datetime","fitbit_steps_intraday_rapids_sumsteps","fitbit_steps_intraday_rapids_stdsteps","fitbit_steps_intraday_rapids_minsteps","fitbit_steps_intraday_rapids_maxsteps","fitbit_steps_intraday_rapids_firststeptime","fitbit_steps_intraday_rapids_avgsteps","fitbit_steps_intraday_rapids_laststeptime","fitbit_steps_intraday_rapids_stddurationsedentarybout","fitbit_steps_intraday_rapids_sumdurationsedentarybout","fitbit_steps_intraday_rapids_countepisodesedentarybout","fitbit_steps_intraday_rapids_mindurationsedentarybout","fitbit_steps_intraday_rapids_maxdurationsedentarybout","fitbit_steps_intraday_rapids_avgdurationsedentarybout","fitbit_steps_intraday_rapids_countepisodeactivebout","fitbit_steps_intraday_rapids_maxdurationactivebout","fitbit_steps_intraday_rapids_avgdurationactivebout","fitbit_steps_intraday_rapids_mindurationactivebout","fitbit_steps_intraday_rapids_sumdurationactivebout","fitbit_steps_intraday_rapids_stddurationactivebout"
"local_segment","local_segment_label","local_segment_start_datetime","local_segment_end_datetime","fitbit_steps_intraday_rapids_sumsteps","fitbit_steps_intraday_rapids_minsteps","fitbit_steps_intraday_rapids_maxsteps","fitbit_steps_intraday_rapids_stdsteps","fitbit_steps_intraday_rapids_avgsteps","fitbit_steps_intraday_rapids_stddurationsedentarybout","fitbit_steps_intraday_rapids_avgdurationsedentarybout","fitbit_steps_intraday_rapids_sumdurationsedentarybout","fitbit_steps_intraday_rapids_mindurationsedentarybout","fitbit_steps_intraday_rapids_countepisodesedentarybout","fitbit_steps_intraday_rapids_maxdurationsedentarybout","fitbit_steps_intraday_rapids_countepisodeactivebout","fitbit_steps_intraday_rapids_maxdurationactivebout","fitbit_steps_intraday_rapids_mindurationactivebout","fitbit_steps_intraday_rapids_stddurationactivebout","fitbit_steps_intraday_rapids_avgdurationactivebout","fitbit_steps_intraday_rapids_sumdurationactivebout"
@ -1 +1 @@
"local_segment","local_segment_label","local_segment_start_datetime","local_segment_end_datetime","fitbit_steps_intraday_rapids_sumsteps","fitbit_steps_intraday_rapids_stdsteps","fitbit_steps_intraday_rapids_minsteps","fitbit_steps_intraday_rapids_maxsteps","fitbit_steps_intraday_rapids_firststeptime","fitbit_steps_intraday_rapids_avgsteps","fitbit_steps_intraday_rapids_laststeptime","fitbit_steps_intraday_rapids_stddurationsedentarybout","fitbit_steps_intraday_rapids_sumdurationsedentarybout","fitbit_steps_intraday_rapids_countepisodesedentarybout","fitbit_steps_intraday_rapids_mindurationsedentarybout","fitbit_steps_intraday_rapids_maxdurationsedentarybout","fitbit_steps_intraday_rapids_avgdurationsedentarybout","fitbit_steps_intraday_rapids_countepisodeactivebout","fitbit_steps_intraday_rapids_maxdurationactivebout","fitbit_steps_intraday_rapids_avgdurationactivebout","fitbit_steps_intraday_rapids_mindurationactivebout","fitbit_steps_intraday_rapids_sumdurationactivebout","fitbit_steps_intraday_rapids_stddurationactivebout"
"local_segment","local_segment_label","local_segment_start_datetime","local_segment_end_datetime","fitbit_steps_intraday_rapids_sumsteps","fitbit_steps_intraday_rapids_minsteps","fitbit_steps_intraday_rapids_maxsteps","fitbit_steps_intraday_rapids_stdsteps","fitbit_steps_intraday_rapids_avgsteps","fitbit_steps_intraday_rapids_stddurationsedentarybout","fitbit_steps_intraday_rapids_avgdurationsedentarybout","fitbit_steps_intraday_rapids_sumdurationsedentarybout","fitbit_steps_intraday_rapids_mindurationsedentarybout","fitbit_steps_intraday_rapids_countepisodesedentarybout","fitbit_steps_intraday_rapids_maxdurationsedentarybout","fitbit_steps_intraday_rapids_countepisodeactivebout","fitbit_steps_intraday_rapids_maxdurationactivebout","fitbit_steps_intraday_rapids_mindurationactivebout","fitbit_steps_intraday_rapids_stddurationactivebout","fitbit_steps_intraday_rapids_avgdurationactivebout","fitbit_steps_intraday_rapids_sumdurationactivebout"
Some files were not shown because too many files have changed in this diff.