Compare commits

325 Commits

Author SHA1 Message Date
junos 63f5a526fc Bring back requested fields in config.yaml.
Update coding files based on 7e565c34db98265afcda922a337493781fdd8ed5 in supermodule.
2023-04-19 11:07:58 +02:00
junos 1cc7339fc8 Completely remove PACKAGE_NAMES_HASHED and instead provide a differently structured file. 2023-04-18 22:58:42 +02:00
junos 5307c71df0 Missing comma. 2023-04-18 22:45:12 +02:00
junos f261286542 Add package_names_hashed param for rule phone_application_categories. 2023-04-18 22:40:11 +02:00
junos a6bc0a90d1 Do not ignore application categories. 2023-04-18 21:34:59 +02:00
junos f161da41f4 Merge branch 'master' into runner 2023-04-18 21:23:26 +02:00
junos 8ffd934fd3 Categorize applications in config.yaml. 2023-04-18 20:39:57 +02:00
junos cf6af7c9a4 Add a TODO. 2023-04-18 16:11:30 +02:00
junos 4dacb7129d Change targets for 30 before.
Further increase resources for acc.
2023-04-18 10:47:54 +02:00
junos f542a97eab Change targets for 90 before. 2023-04-15 16:29:06 +02:00
junos 5cb2dcfb00 Run 90 before event. 2023-04-15 16:18:55 +02:00
junos 8cef60ba87 Limit memory usage by readable_datetime.
Especially important for accelerometer data.
2023-04-14 16:01:44 +02:00
junos 0d634f3622 Remove deprecated numpy dtype. 2023-04-14 13:43:20 +02:00
junos 00e4f8deae More numeric_only arguments.
See 1d903f3629 for explanation.
2023-04-13 13:04:53 +02:00
junos 03687a1ac2 Fix deprecated attribute. 2023-04-12 18:21:43 +02:00
junos a36da99ccb Catch another possible exception. 2023-04-12 16:37:25 +02:00
junos 1d903f3629 Specify numeric_only for pandas.core.groupby.DataFrameGroupBy.mean.
This parameter used to be None by default, but this usage is deprecated since pandas 2.0.
See [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html):

> Changed in version 2.0.0: numeric_only no longer accepts None and defaults to False.
2023-04-12 16:05:58 +02:00
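The pandas change described in the commit above can be illustrated with a minimal sketch. The column names and values below are hypothetical and do not come from the repository; the only point is that passing `numeric_only=True` explicitly keeps the aggregation restricted to numeric columns, whereas under the pandas 2.0 default (`False`) non-numeric columns are no longer silently dropped.

```python
import pandas as pd

# Hypothetical toy data (not from the repository): one numeric column
# and one string column that cannot be averaged.
df = pd.DataFrame(
    {
        "participant": ["p01", "p01", "p02", "p02"],
        "screen_minutes": [10.0, 20.0, 30.0, 50.0],
        "device": ["phone", "phone", "phone", "tablet"],
    }
)

# Explicitly restrict the aggregation to numeric columns, so the string
# column "device" is excluded instead of causing an error.
means = df.groupby("participant").mean(numeric_only=True)
print(means)
```

Stating the argument explicitly also keeps the behaviour the same on older pandas versions, where `numeric_only=True` was already accepted.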
junos d678be0641 Extract definition of function. 2023-04-12 16:00:13 +02:00
junos 27b90421bf Add a missing pip dependency. 2023-04-11 19:06:29 +02:00
junos cb006ed0cf Completely overhaul environment.yml. 2023-04-11 19:04:28 +02:00
junos 9ca58ed204 Fix Python libraries. 2023-04-11 17:21:03 +02:00
junos 982fa982f7 Remove Python libraries versions. 2023-04-11 17:16:42 +02:00
junos f8088172e9 Update more R packages. 2023-04-11 15:31:44 +02:00
junos 801fbe1c10 Update R packages. 2023-04-11 15:26:21 +02:00
Primoz 8721b944ca Revert "Set config to testing mode."
This reverts commit e825aa7c89.
2023-02-16 09:56:23 +00:00
Primoz 36651a11c8 Drop all window count related features in cleaning script. 2023-02-15 14:15:56 +00:00
Primoz 8ae5ad0e88 Add missing rapids_columns for speech sensor. 2023-02-15 13:42:44 +00:00
Primoz e825aa7c89 Set config to testing mode. 2023-02-15 13:36:00 +00:00
Primoz 5958948af2 Merge branch 'speech_sensor' 2023-02-15 13:30:43 +00:00
Primoz 7e37eb9067 Change SPEECH sensor place in config 2023-01-23 15:32:52 +00:00
Primoz 4d0497a5e0 Set appropriate calculations for speech sensor. 2023-01-17 14:00:42 +00:00
Primoz 75b054d358 Integrate phone_speech into rapids pipeline. 2023-01-17 14:00:14 +00:00
Primoz e27ec0269f Revert "Add Speech sensor - preparation."
This reverts commit 74fd4dfbd7.
2023-01-17 12:06:53 +00:00
Primoz 9b45188a61 Revert "PHONE SPEECH - continuation ..."
This reverts commit 3e6b34babc.
2023-01-17 12:06:44 +00:00
Primoz 3e6b34babc PHONE SPEECH - continuation ... 2023-01-11 13:05:42 +00:00
Primoz 74fd4dfbd7 Add Speech sensor - preparation. 2023-01-11 12:48:38 +00:00
Primoz 7b8538ce51 Fix a bug and remove sys.exit line from cleaning script. 2022-12-21 10:40:07 +00:00
Primoz 41a17d35f1 Update ERS stress_event logic. 2022-12-19 15:40:40 +00:00
Primoz 7f5a4e6744 Make stress events equal in duration. 2022-12-14 14:52:20 +00:00
Primoz 3ce7f2c2a5 Separate target standardization from rest of the features. 2022-12-13 15:31:39 +00:00
Primoz e40f0fd8dc Bug fixed (mixed dtype warning). 2022-12-12 11:29:59 +00:00
Primoz 8af3bdf768 Reset PIDS 2022-12-09 16:07:13 +00:00
Primoz 01931b8873 Update README 2022-12-09 16:04:11 +00:00
Primoz 569854ddf5 Merge branch 'master' of https://repo.ijs.si/junoslukan/rapids 2022-12-09 16:01:52 +00:00
Primoz 3b2001f570 Modify the stress_event logic so that it includes where stressfulness is 0. 2022-12-09 16:01:46 +00:00
junos 44a87c53eb Clarify runtime vs installation export of TZ. 2022-12-09 15:34:06 +01:00
junos 8da7bd71b2 Merge branch 'master' of https://repo.ijs.si/junoslukan/rapids 2022-12-09 15:26:23 +01:00
junos 788a81d96f Update readme with info from supermodule. 2022-12-09 15:26:12 +01:00
Primoz 87e5209a9f Squashed commit of the following:
commit 8a6b52a97c
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 29 11:35:49 2022 +0000

    Switch to 30_before ERS with corresponding targets.

commit 244a053730
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 29 11:19:43 2022 +0000

    Change output files settings to nonstandardized.

commit be0324fd01
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Nov 28 12:44:25 2022 +0000

    Fix some bugs and set categorical columns as categories dtypes.

commit 99c2fab8f9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Nov 16 09:50:18 2022 +0000

    Fix a bug in the making of the individual model (when there is no target in the participants columns).

commit 286de93bfd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 15 11:21:51 2022 +0000

    Fix some bugs and extend ERS and cleaning scripts with multiple stress event targets logic.

commit ab803ee49c
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 15 10:14:07 2022 +0000

    Add additional appraisal targets.

commit 621f11b2d9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 15 09:53:31 2022 +0000

    Fix a bug related to wrong user input (duplicated events).

commit bd41f42a5d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Nov 14 15:07:36 2022 +0000

    Rename target_ to segmenting_ method.

commit a543ce372f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Nov 14 15:04:16 2022 +0000

    Add comments for event_related_script understanding.

commit 74b454b07b
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Nov 11 09:15:12 2022 +0000

    Apply changes to string answers to make them language-generic.

commit 6ebe83e47e
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 10 12:42:52 2022 +0000

    Improve the ERS extract method with a couple of validations.

commit 00350ef8ca
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 10 10:32:58 2022 +0000

    Change config for stressfulness event target method.

commit e4985c9121
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 10 10:29:11 2022 +0000

    Override stressfulness event target with extracted values from csv.

commit a668b6e8da
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 10 09:37:27 2022 +0000

    Extract ERS and stress event targets to csv files (completed).

commit 9199b53ded
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Nov 9 15:11:51 2022 +0000

    Get, join and start processing required ERS stress event data.

commit f3c6a66da9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 8 15:53:43 2022 +0000

    Begin with stress events in the ERS script.

commit 0b3e9226b3
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 8 14:44:24 2022 +0000

    Make small corrections in ERS file.

commit 2d83f7ddec
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 8 11:32:05 2022 +0000

    Begin the ERS logic for 90-minutes events.

commit 1da72a7cbe
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Nov 8 09:45:37 2022 +0000

    Rename targets method in config.

commit 9f441afc16
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Nov 4 15:09:04 2022 +0000

    Begin ERS logic for 90-minutes events.

commit c1c9f4d05a
Merge: 62f46ea3 7ab0280d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Nov 4 09:11:58 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit 62f46ea376
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Nov 4 09:11:53 2022 +0000

    Prepare method-based logic for ERS generating.

commit 7ab0280d7e
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Nov 4 08:58:08 2022 +0000

    Correctly rename stressful event target variable.

commit eefa9f3f4d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 3 14:49:54 2022 +0000

    Add new target: stressfulness_event.

commit 5e8174dd41
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 3 13:52:45 2022 +0000

    Add new target: stressfulness_period.

commit 35c1a762e7
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 3 13:51:18 2022 +0000

    Improve filtering by esm_session and device_id.

commit 02264b21fd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Nov 3 09:30:12 2022 +0000

    Add logic for target selection in ERS processing.

commit 0ce8723bdb
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Nov 2 14:01:21 2022 +0000

    Extend imputation logic within the cleaning script.

commit 30b38bfc02
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Oct 28 09:00:13 2022 +0000

    Fix the generating procedure of ERS file for participants with multiple devices.

commit cd137af15a
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Oct 27 14:20:15 2022 +0000

    Config for 30 minute EMA segments.

commit 3c0585a566
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Oct 27 14:12:56 2022 +0000

    Remove obsolete comments.

commit 6b487fcf7b
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Oct 27 14:11:42 2022 +0000

    Set E4 data yield to 1 if it is over 1. Optimize E4 data_yield script.

commit 5d17c92e54
Merge: a31fdd14 0d143e6a
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 14:18:20 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit a31fdd1479
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 14:18:08 2022 +0000

    Start to test empatica_data_yield perceived error.

commit 936324d234
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 14:17:27 2022 +0000

    Switch config for 30 minutes event related segments.

commit da0a4596f8
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 14:16:25 2022 +0000

    Add additional ESM processing logic for ERS csv extraction.

commit d4d74818e6
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 14:14:32 2022 +0000

    Fix a bug - missing time_segment column when df is empty

commit 14ff59914b
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 09:59:46 2022 +0000

    Fix to correct dtypes.

commit 6ab0ac5329
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 26 09:57:26 2022 +0000

    Optimize memory consumption with dtype definition while reading csv file.

commit 0d143e6aad
Merge: 8acac501 b92a3aa3
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 25 15:28:27 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit 8acac50125
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 25 15:26:43 2022 +0000

    Add safety net when features dataframe is empty.

commit b92a3aa37a
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 25 15:25:22 2022 +0000

    Remove unwanted output or other error producing code.

commit bfd637eb9c
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 25 08:53:44 2022 +0000

    Improve strings formatting in straw_events file.

commit 0d81ad5756
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 19 13:35:04 2022 +0000

    Debug assignment of segments to rows

commit cea451d344
Merge: e88bbd54 cf38d9f1
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 18 09:15:06 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit e88bbd548f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 18 09:15:00 2022 +0000

    Add new daily segment and filter by segment in the cleaning script.

commit cf38d9f175
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Oct 17 15:07:33 2022 +0000

    Implement ERS generating logic.

commit f3ca56cdbf
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Oct 14 14:46:28 2022 +0000

    Start with ERS logic integration within Snakemake.

commit 797aa98f4f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 12 15:51:50 2022 +0000

    Config for ERS testing.

commit 9baff159cd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 12 15:51:23 2022 +0000

    Changes needed for testing and starting of the Event-Related Segments.

commit 0f21273508
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 12 12:32:51 2022 +0000

    Bugs fix

commit 55517eb737
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 12 12:23:11 2022 +0000

    Necessary commit before proceeding.

commit de15a52dba
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 11 08:36:23 2022 +0000

    Bug fix

commit 1ad25bb572
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Oct 11 08:26:17 2022 +0000

    Few modifications of some imputation values in cleaning script and feature extraction.

commit 9884b383cf
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Oct 10 16:45:38 2022 +0000

    Testing new data with AutoML.

commit 2dc89c083c
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Oct 7 08:52:12 2022 +0000

    Small changes in cleaning overall

commit 001d400729
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Oct 6 14:28:12 2022 +0000

    Clean features and create input files based on all possible targets.

commit 1e38d9bf1e
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Oct 6 13:27:38 2022 +0000

    Standardization and correlation visualization in overall cleaning script.

commit a34412a18d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 5 14:16:55 2022 +0000

    E4 data yield corrections. Changes in overall cs - standardization.

commit 437459648f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Oct 5 13:35:05 2022 +0000

    Errors fix: individual script - treat participants missing data.

commit 53f6cc60d5
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Oct 3 13:06:39 2022 +0000

    Config and cleaning script necessary changes ...

commit bbeabeee6f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Oct 3 12:53:31 2022 +0000

    Last changes before processing on the server.

commit 44531c6d94
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Sep 30 10:04:07 2022 +0000

    Code cleaning, reworking cleaning individual based on changes in overall script. Changes in thresholds.

commit 7ac7cd5a37
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 29 14:33:21 2022 +0000

    Preparation of the overall cleaning script.

commit 68fd69dada
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 29 11:55:25 2022 +0000

    Cleaning script for individuals: corrections and comments.

commit a4f0d056a0
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 29 11:44:27 2022 +0000

    Fillna for app foreground and activity recognition

commit 6286e7a44c
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 28 12:47:08 2022 +0000

    firstuseafter column removed from contextual imputation

commit 9b3447febd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 28 12:40:05 2022 +0000

    Contextual imputation correction

commit d6adda30cf
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 28 12:37:51 2022 +0000

    Contextual imputation on time(first/last) features.

commit 8af4ef11dc
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 28 10:02:47 2022 +0000

    Contextual imputation by feature type.

commit 536b9494cd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 27 14:12:08 2022 +0000

    Cleaning script corrections

commit f0b87c9dd0
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 27 09:54:15 2022 +0000

    Debugging of the empatica data yield integration.

commit 7fcdb873fe
Merge: 5c7bb0f4 bd53dc16
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 27 07:50:29 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit 5c7bb0f4c1
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 27 07:48:32 2022 +0000

    Config changes

commit bd53dc1684
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Sep 26 15:54:00 2022 +0000

    Empatica data yield usage in the cleaning script.

commit d9a574c550
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Sep 23 13:24:50 2022 +0000

    Changes in the cleaning script and preparation of empatica data yield method.

commit 19aa8707c0
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 22 13:45:51 2022 +0000

    Redefined cleaning steps after revision

commit 247d758cb7
Merge: 90ee99e4 7493aaa6
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 21 07:18:01 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit 90ee99e4b9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 21 07:16:00 2022 +0000

    Remove TODO comments

commit 7493aaa643
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 20 12:57:55 2022 +0000

    Small changes in cleaning script and missing vals testing.

commit eaf4340afd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 20 08:03:48 2022 +0000

    Small imputation and cleaning corrections.

commit a96ea508c6
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Sep 19 07:34:02 2022 +0000

    Fill NaN of Empatica's SD second order feature (must be tested).

commit 52e11cdcab
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Sep 19 07:25:54 2022 +0000

    Configurations for new standardization path.

commit 92aff93e65
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Sep 19 07:25:16 2022 +0000

    Remove standardization script.

commit 18b63127de
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Sep 19 06:16:26 2022 +0000

    Removed all standardization rules and configurations.

commit 62982866cd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Sep 16 13:24:21 2022 +0000

    Phone wifi visible inspection (WIP)

commit 0ce6da5444
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Sep 16 11:30:08 2022 +0000

    kNN imputation relocation and execution only on specific columns.

commit e3b78c8a85
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Sep 16 10:58:57 2022 +0000

    Impute selected phone features with 0.
    Wifi visible, screen, and light.

commit 7d85f75d21
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Sep 16 09:03:30 2022 +0000

    Changes in phone features NaN values script.

commit 385e21409d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 15 14:16:58 2022 +0000

    Changes in NaN values testing script.

commit 18002f59e1
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 15 10:48:59 2022 +0000

    Doryab bluetooth and locations features fill in NaN values.

commit 3cf7ca41aa
Merge: d27a4a71 d5ab5a03
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 14 15:38:32 2022 +0000

    Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning

commit d5ab5a0394
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Sep 14 14:13:03 2022 +0000

    Writing testing scripts to determine the point of manual imputation.

commit dfbb758902
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 13 13:54:06 2022 +0000

    Changes in AutoML params and environment.yml

commit 4ec371ed96
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Sep 13 09:51:03 2022 +0000

    Testing auto-sklearn

commit d27a4a71c8
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Sep 12 13:44:17 2022 +0000

    Reorganisation and reordering of the cleaning script.

commit 15d792089d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 1 10:33:36 2022 +0000

    Changes in cleaning script:
    - target extracted from config to remove rows where target is nan
    - prepared sns.heatmap for further missing values analysis
    - necessary changes in config and participant p01
    - picture of heatmap which shows the values state after cleaning

commit cb351e0ff6
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 1 10:06:57 2022 +0000

    Unnecessary line (rows with no target value will be removed in cleaning script).

commit 86299d346b
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Sep 1 09:57:21 2022 +0000

    Impute phone and sms NAs with 0

commit 3f7ec80c18
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Aug 31 10:18:50 2022 +0000

    Preparation a) phone_calls 0 imputation b) remove rows with NaN target
2022-12-08 16:04:39 +00:00
Primoz f78aa3e7b3 Preparation for cleaning & imputation 2022-08-26 10:56:14 +00:00
Primoz a620def209 Generate standardized model input files (NOTE: commented unstandardized sections!) 2022-08-24 13:42:39 +00:00
Primoz c498ecb742 Include baseline models (+corrections), disable columns drop in cleaning function. 2022-08-23 14:12:14 +00:00
Primoz f088e9586f Handle empty ACC.csv 2022-08-22 14:20:47 +00:00
Primoz 0aa0e82673 Handle empty Empatica csv files. 2022-08-22 14:18:12 +00:00
Primoz 4cfe5a3a98 Disable discarding rows if DATA_YIELD_RATIO_THRESHOLD==0. 2022-08-19 13:10:56 +00:00
Primoz 607da820f2 Configuration and cleaning changes 2022-08-18 14:21:05 +00:00
Primoz fb577bc9ad Squashed commit of the following:
commit 43ecc243cb62bb31eed85cb477ca4131555c7fe7
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 22 15:26:09 2022 +0000

    Adding TODO comments

commit 2df1ebf90c3a93812b112b8ed0ee4e23cd74533f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Jul 21 13:59:23 2022 +0000

    README update

commit 5182c2b16dff3537aad42984b8ea5214743cdb32
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Jul 21 11:03:01 2022 +0000

    Few corrections for all_cleaning

commit 3d9254c1b3bed6e95e631d4e0402548830a19534
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Jul 21 10:28:05 2022 +0000

    Adding the min overlap for corr threshold and preservation of esm cols.

commit e27c49cc8fa4c51f9fe8e593a8d25e9a032ab393
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Jul 21 09:02:00 2022 +0000

    Commenting and cleaning.

commit 31a47a5ee4569264e39d7c445525a6e64bb7700a
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Jul 20 13:49:22 2022 +0000

    Environment version change.

commit 5b274ed8993f58e783bda6d82fce936764209c28
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 19 16:10:07 2022 +0000

    Enabled cleaning for all participants + standardization files.

commit 203fdb31e0f3c647ef8c8a60cb9531831b7ab924
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 19 14:14:51 2022 +0000

    Features cleaning fixes after testing. Visualization script for phone features values.

commit 176178d73b154c30b9eb9eb4a67514f00d6a924e
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 19 09:05:14 2022 +0000

    Revert "Necessary config changes."

    This reverts commit 6ec1ef50430d2e1f5ce4670d505d5e84ac47f0a0.

commit 26ea6512c9d512f95837e7b047fe510c1d196403
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 18 13:19:47 2022 +0000

    Adding cleaning function condition and cleaning functionality.

commit 575c29eef9c21e6f2d7832871e73bc0941643734
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 18 12:51:56 2022 +0000

    Translation of the cleaning individual RAPIDS function from R to py.

commit 6ec1ef50430d2e1f5ce4670d505d5e84ac47f0a0
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 18 12:02:18 2022 +0000

    Necessary config changes.

commit b5669f51612fbd8378848615d639677851ab032f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 15:26:00 2022 +0000

    Modified snakemake rule to dynamically choose script extension.

commit 66636be1e8ae4828228b37c59b9df1faf3fc3d3d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 14:43:08 2022 +0000

    Trying to modify the snakefile rule to execute scripts in two languages depending on the provider.

commit 574778b00f3cbb368ef4bc74de15cf5070c65ea9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 09:49:41 2022 +0000

    gitignore: adding required files so that RAPIDS can be run successfully.

commit 71018ab178256970535e78961602ab8c7f0ebb14
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 08:34:19 2022 +0000

    Standardization bug fixes

commit 6253c470a624e6bfbb02e0c453b652452eb2dbbc
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Jul 14 15:28:02 2022 +0000

    Separate rules for empatica vs. nonempatica standardization.
    Parameter in config that controls the creation of standardized merged files for individual and all participants.

commit 90f902778565e0896d3bae22ae8551be8b487e67
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 12 14:23:03 2022 +0000

    Preparing for final csvs' standardization.

commit d25dde3998786a9a582f5cda544ee104386778f9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 12:08:47 2022 +0000

    Revert "Changes in config to be reverted."

    This reverts commit bea7608e7095021fb7c53a9afa07074448fe4313.

commit 6b23e70857e63deda98eb98d190af9090626c84b
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 12:08:26 2022 +0000

    Enabled standardization for the rest of the (previously active) phone features.
    Testing still needed.

commit 8ec58a6f34ba3d42e5cc71d26e6d91837472ca5f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 09:07:55 2022 +0000

    Enabled standardization for phone calls.
    All steps completed and tested.

commit bea7608e7095021fb7c53a9afa07074448fe4313
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 07:47:51 2022 +0000

    Changes in config to be reverted.

commit 4e84ca0e51bf709bff56fd09437b95310ec6bedd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 8 14:11:24 2022 +0000

    Standardization for the rest of the features.

commit cc581aa788e3d5c17131af8f3d5dd6b0c3b5aff7
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 8 14:11:08 2022 +0000

    README update again
2022-07-22 15:31:30 +00:00
Primoz 6ba4a66deb Squashed commit of the following:
commit 31a47a5ee4569264e39d7c445525a6e64bb7700a
Author: Primoz <sisko.primoz@gmail.com>
Date:   Wed Jul 20 13:49:22 2022 +0000

    Environment version change.

commit 5b274ed8993f58e783bda6d82fce936764209c28
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 19 16:10:07 2022 +0000

    Enabled cleaning for all participants + standardization files.

commit 203fdb31e0f3c647ef8c8a60cb9531831b7ab924
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 19 14:14:51 2022 +0000

    Features cleaning fixes after testing. Visualization script for phone features values.

commit 176178d73b154c30b9eb9eb4a67514f00d6a924e
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 19 09:05:14 2022 +0000

    Revert "Necessary config changes."

    This reverts commit 6ec1ef50430d2e1f5ce4670d505d5e84ac47f0a0.

commit 26ea6512c9d512f95837e7b047fe510c1d196403
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 18 13:19:47 2022 +0000

    Adding cleaning function condition and cleaning functionality.

commit 575c29eef9c21e6f2d7832871e73bc0941643734
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 18 12:51:56 2022 +0000

    Translation of the cleaning individual RAPIDS function from R to py.

commit 6ec1ef50430d2e1f5ce4670d505d5e84ac47f0a0
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 18 12:02:18 2022 +0000

    Necessary config changes.

commit b5669f51612fbd8378848615d639677851ab032f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 15:26:00 2022 +0000

    Modified snakemake rule to dynamically choose script extension.

commit 66636be1e8ae4828228b37c59b9df1faf3fc3d3d
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 14:43:08 2022 +0000

    Trying to modify the snakefile rule to execute scripts in two languages depending on the provider.

commit 574778b00f3cbb368ef4bc74de15cf5070c65ea9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 09:49:41 2022 +0000

    gitignore: adding required files so that RAPIDS can be run successfully.

commit 71018ab178256970535e78961602ab8c7f0ebb14
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 15 08:34:19 2022 +0000

    Standardization bug fixes

commit 6253c470a624e6bfbb02e0c453b652452eb2dbbc
Author: Primoz <sisko.primoz@gmail.com>
Date:   Thu Jul 14 15:28:02 2022 +0000

    Separate rules for empatica vs. nonempatica standardization.
    Parameter in config that controls the creation of standardized merged files for individual and all participants.

commit 90f902778565e0896d3bae22ae8551be8b487e67
Author: Primoz <sisko.primoz@gmail.com>
Date:   Tue Jul 12 14:23:03 2022 +0000

    Preparing for final csvs' standardization.

commit d25dde3998786a9a582f5cda544ee104386778f9
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 12:08:47 2022 +0000

    Revert "Changes in config to be reverted."

    This reverts commit bea7608e7095021fb7c53a9afa07074448fe4313.

commit 6b23e70857e63deda98eb98d190af9090626c84b
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 12:08:26 2022 +0000

    Enabled standardization for the rest of the (previously active) phone features.
    Testing still needed.

commit 8ec58a6f34ba3d42e5cc71d26e6d91837472ca5f
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 09:07:55 2022 +0000

    Enabled standardization for phone calls.
    All steps completed and tested.

commit bea7608e7095021fb7c53a9afa07074448fe4313
Author: Primoz <sisko.primoz@gmail.com>
Date:   Mon Jul 11 07:47:51 2022 +0000

    Changes in config to be reverted.

commit 4e84ca0e51bf709bff56fd09437b95310ec6bedd
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 8 14:11:24 2022 +0000

    Standardization for the rest of the features.

commit cc581aa788e3d5c17131af8f3d5dd6b0c3b5aff7
Author: Primoz <sisko.primoz@gmail.com>
Date:   Fri Jul 8 14:11:08 2022 +0000

    README update again
2022-07-20 13:51:22 +00:00
Primoz 788ac31190 Bug fix: if df has no rows write an empty zscore file. 2022-07-08 10:40:45 +00:00
Primoz 21eb2665d7 README: few changes. 2022-07-08 10:40:08 +00:00
Primoz a65a85cce9 Merge branch 'empatica_calculating_features' 2022-07-07 15:35:47 +00:00
Primoz fa961fe2f5 gitignore 2022-07-07 15:34:31 +00:00
Primoz 6c8014ba8e Updated R and Python package files. Updated README. 2022-07-07 15:30:07 +00:00
Primoz 5a777ac79f Working version that integrates both phone and empatica feature calculations. 2022-07-07 15:00:47 +00:00
Primoz 0425403951 Merge branch 'master' of https://repo.ijs.si/junoslukan/rapids 2022-07-06 11:53:31 +00:00
Primoz 887fd7dc72 Merge branch 'empatica_calculating_features' 2022-07-06 11:53:21 +00:00
Primoz 5a4696c548 Misc. changes 2022-07-06 11:30:18 +00:00
Primoz d2758eef46 Set not NaN sum instead of 0 sum for HRV features windows. 2022-07-06 07:36:35 +00:00
Primoz 2d5d23b615 Testing files change and remove standardization from hrv sensors main files. 2022-07-06 07:35:39 +00:00
Primoz a5480f1369 Few changes during addition to file structure. 2022-07-04 13:00:47 +00:00
Primoz 505c3a86b9 Testing different EDA findPeaks parameters. 2022-06-30 15:15:37 +00:00
junos ce04394679 Merge commit 'c05b047c2d9452151553961928c846c01d7395bc' 2022-06-25 20:06:24 +02:00
Primoz c851ab0763 Fill EDA NaN values where numPeak is zero. Other small changes. 2022-06-21 14:09:49 +00:00
Primoz a8cd16f88c Debugging eda_explorer 2022-06-16 11:32:59 +00:00
Primoz dda4554d46 Various small changes. 2022-06-15 13:57:46 +00:00
Primoz 212cf300f8 Debugging EDA signal - preliminary step for imputation. 2022-06-14 15:09:14 +00:00
Primoz 9ea39dc557 Standardization as a Snakefile's rule enabled for all E4 sensors. 2022-06-13 18:17:30 +00:00
Primoz 402059871f Making standardization as a rule. WIP: done only for BVP. 2022-06-13 14:12:03 +00:00
Primoz 094743244d Added SO feature for sum all rows that are non zero for BVP and IBI sensors. 2022-06-13 10:51:22 +00:00
primoz e1d7607de4 Extraction of additional SO features. Min/max has been changed to nsmallest/nlargest means. 2022-06-10 12:34:48 +00:00
primoz f371249b99 First order features standardization WIP 2022-06-09 13:35:15 +00:00
primoz 64e41cfa35 Second order features standardization in config.yaml. 2022-06-07 10:39:48 +00:00
primoz 2c7ac21465 Added standardization on SO features. 2022-06-06 13:51:15 +00:00
primoz 2acf6ff9fb Exception handling in case of empty ibi. Changes of the method EDA uses in main.py. Other small corrections. 2022-06-03 12:34:36 +00:00
primoz d300f0f8f0 Fixed RAPIDS bug: error when IBI.csv is empty. 2022-06-02 11:43:49 +00:00
Primoz fbf6a77dfc Small misc changes 2022-06-02 06:41:53 +00:00
Primoz 5532043b1f Patching IBI with BVP - completed. 2022-05-25 19:39:47 +00:00
Primoz bb62497ba6 Patching IBI with BVP - selecting appropriate pipeline entry point. WIP 2022-05-24 11:07:18 +00:00
Primoz 2a8f58f5c8 Patching IBI with BVP. WIP 2022-05-20 13:18:45 +00:00
Primoz 1471c86c62 Cr-features version update in rapids venv. 2022-05-13 13:37:12 +00:00
Primoz 6864cfe775 Changes after thorough testing with available data. 2022-05-13 13:35:34 +00:00
Primoz c1564f0cae Changed wrapper method calculate_feature to its newest version (for TEMP and ACC). 2022-05-11 14:21:21 +00:00
Primoz 31e36e7400 Alternating Second order and full segment features corresponding to config settings. 2022-05-11 08:50:15 +00:00
Primoz 9cf9e1fe14 Testing and modifying the code with different E4 data. 2022-05-10 11:36:49 +00:00
Primoz f62a1302dd Cr-features corrections for ACC and TEMP sensors 2022-05-09 11:01:52 +00:00
Primoz 5638367999 Implementation of the second order features. 2022-04-25 13:07:03 +00:00
Primoz 66451160e9 Calculating HRV features with IBI.csv. 2022-04-20 10:44:51 +00:00
= 8c8fe1fec7 Modifications, mostly imports, after changes in cr-features module. 2022-04-19 13:24:46 +00:00
= 075c64d1e5 HRV: changed wrapper calcFeat method with specialized one. 2022-04-14 11:51:53 +00:00
junos c05b047c2d Correct outstanding baseline feature mistake. 2022-04-13 17:05:16 +02:00
junos 53ec52a954 Disable (SOME) feature cleaning for ESM data. 2022-04-13 16:01:31 +02:00
= 3c058e4463 Add option to calculate features within windows and store it in CSV (all sensors). 2022-04-13 13:18:23 +00:00
junos 144f0d0dcf Account for missing baseline data. 2022-04-13 14:56:28 +02:00
junos ed5314aa98 Merge remote-tracking branch 'origin/master' 2022-04-12 17:27:25 +02:00
junos 11c64cfc1a Include all participants again. 2022-04-12 17:20:19 +02:00
junos a6a37c7bd9 Drop NaN targets.
This mirrors INNER join in merge_features_and_targets_for_individual_model.py:

data = pd.concat([sensor_features, targets[["target"]]], axis=1, join="inner")
2022-04-12 17:01:49 +02:00
junos 9f5edf1c2b Revert "Add a rule for model baselines."
The example was for a classification rather than regression problem.

This reverts commit 9ab0c8f289.

# Conflicts:
#	rules/models.smk
2022-04-12 16:59:42 +02:00
junos 4ad261fae5 Rename baseline features AGAIN.
Correct other mistakes.
2022-04-12 16:55:01 +02:00
= 74cf4ada1c Cr-feat window length for all empaticas sensors. 2022-04-12 14:00:44 +00:00
junos 9ab0c8f289 Add a rule for model baselines.
Add baselines and helper functions to main models dir.
2022-04-12 14:23:58 +02:00
junos 570d2eb656 Add the file for population model to Snakefile. 2022-04-12 14:11:40 +02:00
junos f5688f6154 Add a rule to merge sensor and baseline features.
And select target as before.
2022-04-08 15:42:04 +02:00
junos b1f356c3f7 Extract a function to be used elsewhere. 2022-04-08 15:36:32 +02:00
junos 7ff3dcf5fc Move and rename target variable. 2022-04-06 18:21:09 +02:00
junos 50c0defca7 Select target columns (no parsing necessary). 2022-04-06 18:16:49 +02:00
junos ac86221662 [WIP] Add a rule to parse targets.
Does nothing for now.
2022-04-06 17:47:03 +02:00
junos baa94c4c4e Correct additional error in feature file naming.
Add the final feature file to the list in Snakefile.
2022-04-06 17:29:17 +02:00
junos d2fbef5234 Merge branch 'labels' of https://repo.ijs.si/junoslukan/rapids into labels
# Conflicts:
#	src/features/phone_esm/straw/preprocess.py
2022-04-05 19:28:37 +02:00
junos d326a1b09d Include the constant directly in main.py. 2022-04-05 19:08:43 +02:00
junos 2e545e81f0 Include feature calculations for different scales. 2022-04-05 19:05:34 +02:00
junos cbc8ae4e03 Add necessary checks for empty data frames. 2022-04-05 18:58:09 +02:00
junos f50a13167e Add feature files back to Snakefile. 2022-04-05 18:37:58 +02:00
junos e84c35a36a Remove unnecessary parameters from preprocess_esm.
And correct the newly named interim file.
2022-04-05 18:36:09 +02:00
junos e2ce68f591 Defer creation of feature files to esm_features rule. 2022-04-05 18:30:04 +02:00
junos 751b04f3f4 Pass scale names to Snakemake correctly. 2022-04-05 18:14:37 +02:00
junos 99245afca3 Try a different approach for preprocessing ESMs.
It is important that this follows generic RAPIDS pattern.
In the subsequent step of calculating features,
there is an expected file and folder structure of data/interim.
See rules/common.smk/find_features_files()
2022-04-05 18:02:31 +02:00
junos ed298a9479 Implement the basic feature extraction steps. 2022-04-05 15:46:02 +02:00
= 1c42347b9b Small changes. 2022-04-04 12:19:33 +00:00
Primoz c050174ca3 Various minimal changes. 2022-03-31 09:16:00 +00:00
Primoz f9e40711e7 Modified README for RAPIDS-CalculatingFeatures integration. 2022-03-30 16:17:07 +00:00
Primoz a357138f6e Added CF for HRV and shortened test data 2022-03-30 15:01:24 +00:00
Primoz 470993eeb0 Modification of getSampleRate method for all CF scripts. 2022-03-30 15:00:11 +00:00
junos 798ec973b4 [WIP] Add a rule for ESM features. 2022-03-30 10:43:30 +02:00
junos 3af8de6235 Create feature provider script. 2022-03-30 10:40:53 +02:00
junos 7173ca13e3 Rename a parameter. 2022-03-30 10:40:53 +02:00
junos 9478dc94f2 Add an else.
This is to make sure that in case the reversing fails, we do not get any output items.
Snakemake will inform us of an error in this event.
2022-03-30 10:40:53 +02:00
= ab0b9227d7 Added ACC calculated features and shorter version of ACC data. 2022-03-29 09:41:51 +00:00
= a9244a60fc Corrections for TEMP cf src script. 2022-03-28 14:26:37 +00:00
= 8b76c96e47 Cleaning existing CF mains' and preparing src script for ACC. 2022-03-28 14:18:29 +00:00
= ca59a54d8f Get a sample rate from two sequential timestamps. 2022-03-28 13:50:08 +00:00
= 393dab72f5 Added components for the temperature features extraction. 2022-03-28 12:37:02 +00:00
Primoz 1902d02a86 Updating conda env. 2022-03-25 16:27:28 +00:00
Primoz f389ac9d89 Delete CF features folder 2022-03-25 16:24:52 +00:00
Primoz 191e53e543 Added cf provider for EDA feature processing. 2022-03-23 15:13:53 +00:00
Primoz d3a3f01f29 Preparation for the EDA features integration from CF. 2022-03-22 15:36:52 +00:00
Primoz 2da0911d4c Skeleton file main.py for EDA CalcFt. integration. 2022-03-22 12:48:43 +00:00
Primoz bd5a811256 Shortening input CSVs' and added test for ACC. 2022-03-21 12:00:57 +00:00
Primoz d1c59de2e9 Add folder structure for CF testing and EDA test. 2022-03-21 10:40:18 +00:00
Primoz a80f7c0cc4 Change of the relative import statements. 2022-03-21 10:38:15 +00:00
Primoz d63158c199 Build calc features lib and related packages. 2022-03-21 08:28:28 +00:00
junos b18dba366e Add an else.
This is to make sure that in case the reversing fails, we do not get any output items.
Snakemake will inform us of an error in this event.
2022-03-16 18:59:29 +01:00
junos 916bb21a53 Merge branch 'labels' into run_test_participant 2022-03-16 18:56:00 +01:00
junos c6144f8403 Reverse JCQ items. 2022-03-16 18:55:46 +01:00
junos fec7cc9550 Merge branch 'labels' into run_test_participant 2022-03-16 18:30:03 +01:00
junos 23f0aaba3a Get the name of the questionnaire from Snakefile. 2022-03-16 18:28:57 +01:00
junos 8ed7d23348 Merge branch 'labels' into run_test_participant 2022-03-16 17:56:07 +01:00
junos 679f00dc19 Enable selecting any questionnaire as target. 2022-03-16 17:55:44 +01:00
junos 1374eda171 Flatten questionnaire ID dict. 2022-03-16 17:38:09 +01:00
junos 3e9cdde66e Merge branch 'master' into run_test_participant 2022-03-16 17:27:50 +01:00
junos 155395512c Merge branch 'labels' into run_test_participant 2022-03-16 17:09:53 +01:00
junos cb116100dd Move preprocessing to features. 2022-03-16 17:06:42 +01:00
junos 19b9da0ba3 Separate function definitions from main. 2022-03-16 16:49:28 +01:00
Primoz 3f8e1cc252 Empatica features calculations with an example ZIP 2022-03-16 15:03:32 +00:00
junos 83a8bb6689 Add an option to disable calculation of baseline features. 2022-03-16 15:51:12 +01:00
Primoz dc2b462145 Reseting files to defaults - for Minimal Working Example 2022-03-16 13:30:19 +00:00
Primoz 50358978cc Testing Git integration on PC 2022-03-15 12:49:51 +00:00
junos ef57103bac Add questionnaire ID key. 2022-03-15 13:41:33 +01:00
junos 5f293211a7 Reformat. 2022-03-15 13:28:51 +01:00
Primoz 86c6312574 Added .devcontainer to gitignore 2022-03-14 18:36:10 +00:00
junos d470eef27e Add a rule to preprocess and clean ESM. 2022-03-09 18:38:46 +01:00
junos b09522a8af Merge branch 'labels' into run_test_participant 2022-03-09 17:58:44 +01:00
junos d4a4bbbff0 Remove unused columns. 2022-03-09 17:58:36 +01:00
junos 085a6d144b Add files to compute and create an empty script. 2022-03-09 17:32:02 +01:00
junos 42d62f16d0 Add RAPIDS mandatory columns for ESM. 2022-03-09 17:31:37 +01:00
junos a159ca3d3a Merge branch 'labels' into run_test_participant 2022-03-08 15:43:42 +01:00
junos 2bef86b1da Add a format for ESM and add to config. 2022-03-08 15:43:25 +01:00
junos d8e9a309f7 Rename features and write baseline_interim. 2022-03-08 15:10:36 +01:00
junos ba7c3e620b Merge branch 'master' into run_test_participant 2022-03-01 12:03:14 +01:00
junos a3a4f04ffe Setting with : produces NaNs. 2022-03-01 12:02:57 +01:00
junos aedb8b6785 Write questionnaire data to data/interim. 2022-03-01 12:02:36 +01:00
junos 631581cc8a Merge branch 'master' into run_test_participant 2022-03-01 11:42:19 +01:00
junos d3ebfeeabd Write questionnaire data to data/interim. 2022-03-01 11:42:08 +01:00
junos 70e077f6ab Merge branch 'master' into run_test_participant 2022-03-01 11:40:17 +01:00
junos f13a91044d Write questionnaire data to data/interim. 2022-03-01 11:39:58 +01:00
junos b5a6317f4b Calculate JCQ control and demand control ratio.
Include norms and corresponding quartile.
2022-02-28 18:51:47 +01:00
junos 2fed962644 Calculate JCQ demand score.
Hardcode question IDs to be reversed.
2022-02-28 18:30:41 +01:00
junos 30ac8b1cd5 Start calculating demand control features. 2022-02-23 19:08:10 +01:00
junos 9a74e74d08 Add the baseline features rule to snakefile.
Correct age calculation for a single value instead of dataframe.
2022-02-23 18:15:26 +01:00
junos 43e5ac7918 Merge branch 'master' into run_test_participant 2022-02-23 18:06:07 +01:00
junos 07da6be398 Add age, gender, and language as features.
Move calculation of age from merge_baseline_data.py to baseline_features.py.
2022-02-23 18:05:23 +01:00
junos c801f66533 Retain a single participant ID.
Do not plot heatmaps as this is bugged.
2022-02-23 11:10:54 +01:00
junos 176367631b Prepare baseline feature rule. 2022-02-23 11:09:33 +01:00
Meng Li 28e580e597 Update change-log for v1.8.0 2022-02-10 15:05:55 -05:00
junos bf9c764c97 Split baseline data to participants.
And some csv I/O settings.
2022-02-04 18:37:57 +01:00
junos 16e608db74 First merge baseline datasets. 2022-02-04 18:21:42 +01:00
junos 204f6f50b0 Read the relevant files. 2022-02-04 18:06:02 +01:00
junos 685ed6a546 Set up demographic data download. 2022-02-04 17:37:00 +01:00
junos ffa7a30575 Make place for STRAW models. 2022-02-04 17:25:24 +01:00
Meng Li 463ac0a2aa
Fix bug#169 (#174) 2022-01-27 11:27:32 -05:00
Sam 10e896ca1d
Add data stream for AWARE Micro server (#173)
* Add data stream for AWARE Micro server

* Fix one documentation typo and one omission
2022-01-27 10:47:50 -05:00
junos afa3b8546f Mutate data in an R script.
The Python script did not read the timestamp correctly for some reason. All timestamps were 0.
2022-01-26 16:34:19 +01:00
junos 1efb8e3112 Clean features across participants.
Explore the best linear regression feature.
2022-01-19 13:41:09 +01:00
Sam e5dbbfce44
Avoid NA problem in barnett location evaluation (#172)
* Avoid occasional issue where does_not_span evaluates to NA, which breaks the if()

* Restored original warning
2022-01-18 10:16:37 -05:00
Sam 8ae26fb845
Fixes issue where 'duration' in the 'ios_calls' dataframe is seen as a character type. (#171) 2022-01-18 10:15:53 -05:00
junos b17a7eff1a Deal with inexplicable snakemake failure. 2022-01-07 18:11:38 +01:00
junos 2fb068cb8b Do not calculate accelerometer features.
Add data cleaning.
2022-01-07 12:20:51 +01:00
junos e1499a5ae2 Account for missing device_ids. 2021-12-15 20:41:28 +01:00
junos b29f902915 Look into ESM table for device_id. 2021-12-15 20:18:12 +01:00
junos c03ee788f6 Add missing dependencies for caret and corrr. 2021-12-15 19:26:16 +01:00
junos 5a9252e46e Merge remote-tracking branch 'origin/master' 2021-12-15 18:32:36 +01:00
junos e5cc02501f Set the timezone.csv path in config.
Take into account that TZCODES_FILE can be created with a rule.
2021-12-15 18:09:30 +01:00
junos 352598f3da Use absolute path to avoid RuleException. 2021-12-15 17:27:13 +01:00
junos 15653b6e70 Add forgotten line for hashed app names in config. 2021-12-15 17:26:54 +01:00
junos a66a7d0cc3 Keep track of warning messages.
These are not runtime errors, but might still indicate a problem.
2021-12-15 16:19:29 +01:00
junos 70cada8bb8 Consider a subset of columns when dropping. 2021-12-15 16:14:33 +01:00
junos d2ed73dccf Debug ValueError for index.
See exploration/debug_heatmap.py for illustration.
2021-12-15 16:03:04 +01:00
junos 6f451e05ac Bring back application_name.
This column still needs to be in the data, so add it in app_add_name.py.
Later, join categories by package hash.
2021-12-15 12:58:27 +01:00
junos 4485c4c95e Delete columns we don't have.
Rename light table.
Correct timesegments.
2021-12-08 20:02:47 +01:00
junos 633384c6a9 Use all available sensors for PHONE_YIELD. 2021-12-08 19:04:19 +01:00
junos 8e2222f307 Bring back deleted lines which are required. 2021-12-08 18:59:10 +01:00
junos 712ff74898 Set table names and calculate all relevant features. 2021-12-08 18:37:34 +01:00
junos 1f54195437 Configure timezone file to be created automatically. 2021-12-08 18:21:29 +01:00
junos 2b52f686b3 Define daily segments. 2021-12-08 18:20:22 +01:00
junos 22513415e9 Do not ask for specific patch numbers of libraries! 2021-12-03 14:38:25 +01:00
junos 0b8a493ff2 Incorporate multiple timezones into RAPIDS. 2021-12-01 18:20:27 +01:00
junos f0d29d0d1a Incorporate DB query for usernames into snakemake workflow. 2021-12-01 18:14:27 +01:00
junos 37b3460b76 Use Empatica wristband numbers as provided in CSV. 2021-12-01 17:20:57 +01:00
junos 22f9e0722d Start preparing the true usernames CSV file. 2021-12-01 11:29:22 +01:00
junos 0be4cd5a8f Remove unnecessary library. 2021-11-30 17:08:07 +01:00
junos b99a3c19ed Update dbplyr to the latest version.
distinct changed its behaviour from 2.0.0 to 2.1.0.
2021-11-29 18:34:26 +01:00
junos 04ad2d0b81 Source specific container script.
It is probably not worth the effort of making this general.
2021-11-29 18:19:47 +01:00
junos da5ff0f36e Correct small errors in settings. 2021-11-29 18:04:06 +01:00
junos 35d9779026 Prepare the tibble in requested format.
Write it to a CSV file.
2021-11-29 17:54:16 +01:00
junos 32025cbd8c Start with a tibble from CSV. 2021-11-29 17:51:07 +01:00
junos 181e4f0118 Add parameters to yaml file.
And use these in the prepare_participants_file function.
2021-11-29 16:57:50 +01:00
junos 39bd244511 [WIP] Prepare yaml files.
These will be used to create participants files.
2021-11-24 19:11:19 +01:00
junos ab84109d55 Prepare a function to compile participants data.
It combines functions from container.R
2021-11-24 19:07:56 +01:00
junos f9863ec622 Fix small mistakes. 2021-11-24 19:01:30 +01:00
junos c1f56c61e8 Add a function to pull start and end datetimes. 2021-11-24 18:33:06 +01:00
junos 3acf6ece14 Add a function to pull device IDs. 2021-11-24 18:23:53 +01:00
junos 8b2717122d Add a function to get participants' IDs. 2021-11-24 18:05:17 +01:00
Meng Li 9338f77ae6 Update docs for Git Flow section & RAPIDS paper info 2021-11-19 13:57:10 -05:00
Meng Li 5bad3eb8b5
Data cleaning (#166)
* Refactor data cleaning module: move it from example workflow to main directory

* Replace NAs with 0 in selected event-based features

* Add one step to drop highly correlated features

Co-authored-by: Weiyu <weiyuhuang7@gmail.com>
2021-11-19 10:34:36 -05:00
Meng Li 296960f425 Fix the bug of location doryab features when a participant is moving during the whole time segment 2021-11-18 18:42:19 -05:00
Meng Li 3d34036eae
Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider (#168)
* Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider

* Update test config files
2021-11-18 18:35:27 -05:00
junos ed193d2290 Revert "Correct the name of a field."
This reverts commit b335561a55.

It was actually correct.
2021-11-17 19:16:35 +01:00
junos 24b11ea101 Force Unix style end of line. 2021-11-17 19:12:40 +01:00
junos 4829b155d5 Make config changes for minimal workflow. 2021-11-17 18:53:44 +01:00
junos b335561a55 Correct the name of a field. 2021-11-17 18:50:06 +01:00
junos fcec3e2f93 Implement the necessary functions for PSQL. 2021-11-17 18:49:25 +01:00
junos 7a1e4f7139 Add the format file copied from MySQL. 2021-11-17 18:46:16 +01:00
junos ae8ed3999f Add RPostgres to renv.
And update its dependencies.
2021-11-17 12:56:14 +01:00
JulioV 399dbc7d75
Update team.md (#167) 2021-11-13 16:47:12 -05:00
Meng Li dfa11acf87
Updated phone battery test data and results (#165)
Co-authored-by: Weiyu <weiyuhuang7@gmail.com>
2021-10-21 22:54:19 -04:00
Meng Li da633c5d08 Update change-log for v1.6.0 2021-10-18 17:03:21 -04:00
Meng Li 7f4683e0fe
Feature/screen default maxlength (#164)
Updated the default IGNORE_EPISODES_LONGER_THAN to be 6 hours for screen RAPIDS provider
2021-10-14 09:26:06 -04:00
Meng Li 3744367aa9 Updated docs and workflow example for location features with DORYAB provider 2021-10-13 17:06:53 -04:00
Meng Li c7e8777a6e Merge branch 'feature/phone_locations_refactor' into develop 2021-09-23 18:22:11 -04:00
Meng Li f340b89c58 Temporary revert PHONE_LOCATIONS BARNETT provider to use R script 2021-09-23 18:16:13 -04:00
Meng Li a3fb718aea Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes 2021-09-23 17:40:06 -04:00
Meng Li 80522e6b7f Merge branch 'feature/phone_calls_refactor' into develop 2021-09-15 11:47:41 -04:00
Weiyu ff38e36809 Tested phone call episodes 2021-09-15 10:29:56 -04:00
Meng Li a8a178486b Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events 2021-09-15 10:28:37 -04:00
JulioV 2e553dc9e7 Add tqdm package to environment.yaml 2021-08-16 11:04:03 -04:00
Meng Li 3ac12e7dad Fix the bug of step intraday features when INCLUDE_ZERO_STEP_ROWS is False 2021-08-11 12:40:40 -04:00
JulioV 1520a1e755 v1.5.0 Update changelog and team pages 2021-08-09 18:21:58 -04:00
Weiyu 46e4425323 Updated test data for data yield feature 2021-08-09 17:56:29 -04:00
Weiyu 35eebe8a51 Bug fixed: set ratiovalidyielded mins/hours value to the range 0 to 1 2021-08-09 17:56:29 -04:00
Shirley 3c46a1c878 Adding Ian and Shirley as community contributors 2021-08-05 18:25:37 -04:00
JulioV 3e69966c91 Update error message 2021-08-04 15:33:02 -04:00
Shirley 4ddb2845a6 Update initialize_params 2021-08-04 15:33:02 -04:00
JulioV 834bd3b93d Refactor in Python of Barnett provider
Co-authored-by: Shirley Hayati <sahayati@ucdavis.edu>
Co-authored-by: JulioV <JulioV@users.noreply.github.com>
2021-08-04 15:33:02 -04:00
Weiyu c41e1f08cc Tested data yield feature 2021-08-04 11:06:31 -04:00
Weiyu 64976919e4 Tested phone accelerometer feature 2021-08-04 11:06:31 -04:00
Weiyu 7f1c502ea0 Fixed bug: Added local_segment column if no data left after filtered 2021-08-04 11:06:31 -04:00
Weiyu 2e3e433d2b Tested fitbit step summary feature 2021-07-28 10:32:16 -04:00
Weiyu 6f5a143191 Tested fitbit steps intraday feature 2021-07-28 10:32:16 -04:00
JulioV 872125fbb2
Update link to paper that used RAPIDS 2021-07-23 11:57:04 -04:00
Hannah Roberts b52059b027 Ensure date/time format is maintained
Within the 'determine which is home' for loop, 'xx' is the midpoint of two datetime objects. When the midpoint is calculated to be midnight, only the date is returned. This can be replicated with:

mydates <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
mydates
[1] "2018-01-01 UTC"

This results in 'hourofday' being NA as an hour cannot be found. By adding the suggested format wrapper, the time is maintained and 'hourofday' can be determined. It can then successfully be applied to the embedded if-statement within the loop.

mydates <- format(as.POSIXct("2018-01-01 00:00:00", tz = "UTC"), "%Y-%m-%d %H:%M:%S")
mydates
[1] "2018-01-01 00:00:00"
2021-07-23 10:12:11 -04:00
Weiyu 5a465873c4 Tested fitbit heartrate intraday feature 2021-07-21 10:24:02 -04:00
Weiyu e9c07924fd Tested fitbit heartrate summary feature 2021-07-20 16:54:08 -04:00
Kennedy Opoku Asare b1e3360e2b Update index.md 2021-07-19 09:54:57 -04:00
JulioV 18463f5e8e Fix bug with internal test script
Date times with a 00:00:00 would not be saved correctly for Fitbits
2021-07-16 16:56:02 -04:00
JulioV ad5796ed5e Update changelog for v1.4.1 2021-07-12 18:22:52 -04:00
JulioV 96d7b6e170 Fix links in home page 2021-07-12 18:15:27 -04:00
JulioV a323f6c390
Merge pull request #150 from carissalow/feature/phone_messages_test
Phone message tests
2021-07-07 10:51:42 -04:00
Weiyu 2e147fb89c Finished phone message test 2021-07-07 01:22:49 -04:00
JulioV 07eb2e7917 Update v1.4.0 changelog 2021-07-01 18:32:14 -04:00
JulioV c8dbd5c5ac Update app foreground docs 2021-07-01 18:27:46 -04:00
Meng Li e1cfcd46e4 Update example workflow for app episode features 2021-07-01 18:08:33 -04:00
JulioV d7fbee5914
Merge pull request #149 from carissalow/feature/own_categories
Feature/own categories
2021-07-01 17:02:28 -04:00
Weiyu 013425e36c Finished phone applications foreground feature test 2021-07-01 16:25:38 -04:00
JulioV 6fa1875bf3 Add app foreground episode count 2021-07-01 16:20:16 -04:00
JulioV bc5c0c9a4f Fix app episode length bug 2021-07-01 16:20:16 -04:00
JulioV 065a926a87 Change own to custom categories name 2021-07-01 16:20:16 -04:00
JulioV e74c745f86 Add own categories to app foreground features 2021-07-01 16:20:16 -04:00
JulioV 5892b6d838 Fix create_participants_files.R to handle numeric PIDs 2021-07-01 16:20:16 -04:00
Weiyu 9593667f38 Finished phone light test 2021-07-01 16:17:08 -04:00
Meng Li b7eafc8d5c Update changelog for v1.4.0 2021-06-29 11:54:42 -04:00
JulioV 0d7b7f3dad
Merge pull request #147 from carissalow/visualization_fix
Fix bugs of visualization module and analysis workflow example
2021-06-29 11:50:55 -04:00
Meng Li 2e12a061c7 Update docs of visualization module 2021-06-29 10:51:22 -04:00
Meng Li 97ef8a8368 Set color range and avoid SettingWithCopyWarning 2021-06-29 09:50:19 -04:00
Meng Li bb3c614135 Update analysis workflow example 2021-06-29 09:50:19 -04:00
Meng Li 1c57320ab3 Update segment labels and fix the bug when we do not have any labels for event segments 2021-06-29 09:49:24 -04:00
Meng Li cefcb0635b Update heatmap of recorded phone sensors 2021-06-29 09:49:24 -04:00
Meng Li bc06477d89 Update heatmap of sensor row count 2021-06-29 09:49:24 -04:00
Meng Li e98a8ff7ca Update histogram of phone data yield 2021-06-29 09:49:24 -04:00
Meng Li f436f1f530 Update heatmap of correlation matrix 2021-06-29 09:49:23 -04:00
Meng Li 4d37696158 Update heatmaps of overall data yield 2021-06-29 09:48:30 -04:00
JulioV 654f6f3c3d
Merge pull request #145 from carissalow/feature/phone_activity_recognition_test
Feature/phone activity recognition test
2021-06-23 19:11:26 -04:00
Weiyu 4efc247575 Tested phone activity recognition feature 2021-06-23 19:08:14 -04:00
Weiyu f374c67bd5 Bug fixed: Added unknown activity case 2021-06-23 19:04:55 -04:00
JulioV e4af893d25
Merge pull request #144 from carissalow/feature/phone_bluetooth_test
Feature/phone bluetooth test
2021-06-23 19:00:56 -04:00
Weiyu c07358df70 Tested phone bluetooth feature 2021-06-23 18:56:24 -04:00
Weiyu 3e4d167adc Bug fixed: sort bt_address alphabetically before picking the most frequent bt_address 2021-06-22 17:40:00 -04:00
Kirtiraj Khandekar 5924f251d9 Update phone-applications-foreground.md 2021-06-21 14:59:06 -04:00
Meng Li 339781252a Merge branch 'data_stream_fix' into develop 2021-06-11 18:37:30 -04:00
Meng Li f248b6c97d Fix bugs of Fitbit mutation scripts 2021-06-11 18:18:33 -04:00
kirtirajk 4b8698a4c6 adding app_episode with the changes as mentioned in the comments 2021-06-10 14:17:56 -04:00
Weiyu a7e720e1a8 Validated phone conversation feature results 2021-06-10 10:49:22 -04:00
Weiyu ae20d22d1e validated phone wifi feature results 2021-06-10 10:49:22 -04:00
Weiyu 65d5cb7bd4 Bug fixed: countscansmostuniquedevice stays the same for all time segments 2021-06-10 10:49:22 -04:00
Weiyu 56b344f9ce Updated phone call and screen test description
Updated phone screen description
2021-06-10 10:49:22 -04:00
Weiyu 93622b6781 Completed phone calls test 2021-06-10 10:49:22 -04:00
JulioV f1960f00b1 Update Docker Vs Code setup 2021-06-08 17:50:11 -04:00
728 changed files with 33066 additions and 1967 deletions

7
.gitattributes vendored 100644

@ -0,0 +1,7 @@
# We'll let Git's auto-detection algorithm infer if a file is text. If it is,
# enforce LF line endings regardless of OS or git configurations.
* text=auto eol=lf
# Isolate binary files in case the auto-detection algorithm fails and
# marks them as text files (which could brick them).
*.{png,jpg,jpeg,gif,webp,woff,woff2} binary

16
.gitignore vendored

@ -93,10 +93,17 @@ packrat/*
# exclude data from source control by default
data/external/*
!/data/external/empatica/empatica1/E4 Data.zip
!/data/external/.gitkeep
!/data/external/stachl_application_genre_catalogue.csv
!/data/external/timesegments*.csv
!/data/external/wiki_tz.csv
!/data/external/main_study_usernames.csv
!/data/external/timezone.csv
!/data/external/play_store_application_genre_catalogue.csv
!/data/external/play_store_categories_count.csv
data/raw/*
!/data/raw/.gitkeep
data/interim/*
@ -114,3 +121,12 @@ settings.dcf
tests/fakedata_generation/
site/
credentials.yaml
# Docker container and other files
.devcontainer
# Calculating features module
calculatingfeatures/
# Temp folder for rapids data/external
rapids_temp_data/

188
README.md

@ -11,3 +11,191 @@
For more information refer to our [documentation](http://www.rapids.science)
By [MoSHI](https://www.moshi.pitt.edu/), [University of Pittsburgh](https://www.pitt.edu/)
## Installation
For RAPIDS installation, refer to the [documentation](https://www.rapids.science/1.8/setup/installation/)
### For the installation of the Docker version
1. Follow the [instructions](https://www.rapids.science/1.8/setup/installation/) to set up RAPIDS via Docker (from scratch).
2. While in a container session, delete the current contents of the /rapids/ folder.
```
cd ..
rm -rf rapids/{*,.*}
cd rapids
```
3. Clone the RAPIDS workspace from Git and check out a specific branch.
```
git clone "https://repo.ijs.si/junoslukan/rapids.git" .
git checkout <branch_name>
```
4. Install missing “libpq-dev” dependency with bash.
```
apt-get update -y
apt-get install -y libpq-dev
```
5. Restore R venv.
Type R to go to the interactive R session and then:
```
renv::restore()
```
6. Install cr-features module
From: https://repo.ijs.si/matjazbostic/calculatingfeatures.git -> branch master.
Then follow the "cr-features module" section below.
7. Install all required packages from environment.yml. The `--prune` option also removes conda packages that are not listed in the environment file.
```
conda env update --file environment.yml --prune
```
8. To update your R or Python virtual environments:
```
# R, in an interactive session:
renv::snapshot()
# Python, from the shell:
conda env export --no-builds | sed 's/^.*libgfortran.*$/ - libgfortran/' | sed 's/^.*mkl=.*$/ - mkl/' > environment.yml
```
### cr-features module
This RAPIDS extension uses the cr-features library, available [here](https://repo.ijs.si/matjazbostic/calculatingfeatures).
To use the cr-features library:
- Follow the installation instructions in the [README.md](https://repo.ijs.si/matjazbostic/calculatingfeatures/-/blob/master/README.md).
- Copy the built calculatingfeatures folder into the RAPIDS workspace.
- Install the cr-features package with:
```
pip install path/to/the/calculatingfeatures/folder
# e.g. pip install ./calculatingfeatures if the folder is copied to the main parent directory
# The cr-features package has to be rebuilt and reinstalled every time to get the newest version.
# Alternatively, use the newest version of the Docker image.
```
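After installation, it can be useful to confirm that the package is visible to the active Python environment. The following is a minimal sketch, not part of RAPIDS or cr-features: `installed_version` is a hypothetical helper, and the distribution name `cr-features` is an assumption that should be checked against the package's own setup files.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "cr-features" is an assumed distribution name, not confirmed by this README.
    found = installed_version("cr-features")
    print(found if found else "cr-features is not installed in this environment")
```

Running this inside the conda environment used by RAPIDS shows whether the last `pip install` took effect there rather than in another interpreter.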
## Updating RAPIDS
To update RAPIDS, first pull and merge [origin](https://github.com/carissalow/rapids), for example with:
```commandline
git fetch --progress "origin" refs/heads/master
git merge --no-ff origin/master
```
Next, update the conda and R virtual environments. The R environment can be restored with:
```bash
R -e 'renv::restore(repos = c(CRAN = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))'
```
## Custom configuration
### Credentials
As mentioned under [Database in RAPIDS documentation](https://www.rapids.science/1.6/snippets/database/), a `credentials.yaml` file is needed to connect to a database.
It should contain:
```yaml
PSQL_STRAW:
  database: staw
  host: 212.235.208.113
  password: password
  port: 5432
  user: staw_db
```
where the actual `password` value needs to be specified as well.
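A missing or empty field in this file tends to surface only later as a connection error, so a quick check of the parsed content can help. The sketch below assumes the file has already been loaded into a dictionary with a YAML parser of your choice; `missing_credentials` and `REQUIRED_KEYS` are hypothetical names used for illustration, not part of RAPIDS.

```python
from typing import Any, Dict, List

# Keys expected under the credentials group, taken from the example above.
REQUIRED_KEYS = ("database", "host", "password", "port", "user")


def missing_credentials(config: Dict[str, Any], group: str = "PSQL_STRAW") -> List[str]:
    """List required keys that are absent or empty in the given credentials group."""
    section = config.get(group) or {}
    return [key for key in REQUIRED_KEYS if section.get(key) in (None, "")]


if __name__ == "__main__":
    # Stand-in for the parsed credentials.yaml; all values are placeholders.
    example = {
        "PSQL_STRAW": {
            "database": "staw",
            "host": "localhost",
            "password": "",
            "port": 5432,
            "user": "staw_db",
        }
    }
    print(missing_credentials(example))
```

An empty list means every expected field is present and non-empty; it does not verify that the values are correct.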
## Possible installation issues
### Missing dependencies for RPostgres
When installing the `RPostgres` R package (used to connect to the PostgreSQL database), the following error might occur:
```text
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libpq was not found. Try installing:
* deb: libpq-dev (Debian, Ubuntu, etc)
* rpm: postgresql-devel (Fedora, EPEL)
* rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
* csw: postgresql_dev (Solaris)
* brew: libpq (OSX)
If libpq is already installed, check that either:
(i) 'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains a libpq.pc file; or
(ii) 'pg_config' is in your PATH.
If neither can detect , you can set INCLUDE_DIR
and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------[ ERROR MESSAGE ]----------------------------
<stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
compilation terminated.
```
The library requires `libpq` for compiling from source, so install accordingly.
### Timezone environment variable for tidyverse (relevant for WSL2)
One of the R packages, `tidyverse`, might need access to the `TZ` environment variable during the installation.
On Ubuntu 20.04 on WSL2 this triggers the following error:
```text
> install.packages('tidyverse')
ERROR: configuration failed for package xml2
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Warning in system("timedatectl", intern = TRUE) :
running command 'timedatectl' had status 1
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
namespace xml2 1.3.1 is already loaded, but >= 1.3.2 is required
Calls: <Anonymous> ... namespaceImportFrom -> asNamespace -> loadNamespace
Execution halted
ERROR: lazy loading failed for package tidyverse
```
This happens because WSL2 does not use the `timedatectl` service, which provides this variable.
```bash
~$ timedatectl
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
```
and later
```bash
Warning message:
In system("timedatectl", intern = TRUE) :
running command 'timedatectl' had status 1
Execution halted
```
This can be amended by setting the environment variable manually before attempting to install `tidyverse`:
```bash
export TZ='Europe/Ljubljana'
```
Note: if the variable is also needed to avoid runtime issues, either define it in each new terminal window or (better) add the `export` line to your `~/.bashrc` or `~/.bash_profile`.
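When the installation is launched from a Python script rather than from an interactive shell, the same idea can be applied to the child process environment. This is a minimal sketch, not part of the repository: `env_with_tz` is a hypothetical helper that adds `TZ` only when it is not already defined.

```python
import os


def env_with_tz(default_tz, base_env=None):
    """Return a copy of the environment with TZ set if it was missing.

    An existing TZ value is left untouched; `base_env` defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("TZ", default_tz)
    return env


# Possible usage (assumes Rscript is installed and on PATH):
# import subprocess
# subprocess.run(
#     ["Rscript", "-e", "install.packages('tidyverse')"],
#     env=env_with_tz("Europe/Ljubljana"),
#     check=True,
# )
```

Building a copy keeps the parent process environment unchanged, which mirrors the per-terminal `export` above rather than the permanent `~/.bashrc` change.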
## Possible runtime issues
### Unix end of line characters
Upon running rapids, an error might occur:
```bash
/usr/bin/env: python3\r: No such file or directory
```
This is caused by Windows-style end-of-line characters (CRLF) in the checked-out files.
To amend this, I added a `.gitattributes` file that forces `git` to check out `rapids` with Unix EOL characters.
If this still fails, `dos2unix` can be used to convert the affected files.
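If `dos2unix` is not available, the conversion itself is simple enough to sketch in Python. The helpers below are illustrative only (the names `crlf_to_lf` and `convert_file` are not part of the repository); they replace every CRLF pair with LF and rewrite a file only when something changed.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix ones (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path) -> bool:
    """Convert one file in place; return True if the file was modified."""
    file_path = Path(path)
    original = file_path.read_bytes()
    converted = crlf_to_lf(original)
    if converted == original:
        return False
    file_path.write_bytes(converted)
    return True
```

Working on bytes avoids any assumption about the text encoding of the scripts. Binary files should not be passed to `convert_file`.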
### System has not been booted with systemd as init system (PID 1)
See [the installation issue above](#Timezone-environment-variable-for-tidyverse-(relevant-for-WSL2)).

@ -5,6 +5,7 @@ include: "rules/common.smk"
include: "rules/renv.smk"
include: "rules/preprocessing.smk"
include: "rules/features.smk"
include: "rules/models.smk"
include: "rules/reports.smk"
import itertools
@ -45,7 +46,12 @@ for provider in config["PHONE_MESSAGES"]["PROVIDERS"].keys():
for provider in config["PHONE_CALLS"]["PROVIDERS"].keys():
if config["PHONE_CALLS"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/{pid}/phone_calls_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_calls_with_datetime.csv", pid=config["PIDS"]))
if (provider == "RAPIDS") and (config["PHONE_CALLS"]["PROVIDERS"][provider]["FEATURES_TYPE"] == "EPISODES"):
files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes_resampled.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
else:
files_to_compute.extend(expand("data/raw/{pid}/phone_calls_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_calls_features/phone_calls_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_CALLS"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
files_to_compute.extend(expand("data/processed/features/{pid}/phone_calls.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
@ -122,6 +128,10 @@ for provider in config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"].keys():
files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv", pid=config["PIDS"]))
if config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["INCLUDE_EPISODE_FEATURES"]:
files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes_resampled.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_applications_foreground_features/phone_applications_foreground_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
files_to_compute.extend(expand("data/processed/features/{pid}/phone_applications_foreground.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
@ -154,6 +164,25 @@ for provider in config["PHONE_CONVERSATION"]["PROVIDERS"].keys():
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
for provider in config["PHONE_ESM"]["PROVIDERS"].keys():
if config["PHONE_ESM"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/{pid}/phone_esm_raw.csv",pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_esm_with_datetime.csv",pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_esm_clean.csv",pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_esm_features/phone_esm_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_ESM"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
files_to_compute.extend(expand("data/processed/features/{pid}/phone_esm.csv", pid=config["PIDS"]))
# files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv",pid=config["PIDS"]))
# files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
for provider in config["PHONE_SPEECH"]["PROVIDERS"].keys():
if config["PHONE_SPEECH"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/{pid}/phone_speech_raw.csv",pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_speech_with_datetime.csv",pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_speech_features/phone_speech_{language}_{provider_key}.csv",pid=config["PIDS"],language=get_script_language(config["PHONE_SPEECH"]["PROVIDERS"][provider]["SRC_SCRIPT"]),provider_key=provider.lower()))
files_to_compute.extend(expand("data/processed/features/{pid}/phone_speech.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
# We can delete these if's as soon as we add feature PROVIDERS to any of these sensors
if isinstance(config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"], dict):
for provider in config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"].keys():
@ -208,7 +237,8 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys():
if provider == "BARNETT":
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_barnett_daily.csv", pid=config["PIDS"]))
if provider == "DORYAB":
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_locations_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed.csv", pid=config["PIDS"]))
@ -307,7 +337,7 @@ for provider in config["EMPATICA_ACCELEROMETER"]["PROVIDERS"].keys():
files_to_compute.extend(expand("data/processed/features/{pid}/empatica_accelerometer.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
for provider in config["EMPATICA_HEARTRATE"]["PROVIDERS"].keys():
if config["EMPATICA_HEARTRATE"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/{pid}/empatica_heartrate_raw.csv", pid=config["PIDS"]))
@ -353,7 +383,7 @@ for provider in config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"].keys():
files_to_compute.extend(expand("data/processed/features/{pid}/empatica_inter_beat_interval.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
if isinstance(config["EMPATICA_TAGS"]["PROVIDERS"], dict):
for provider in config["EMPATICA_TAGS"]["PROVIDERS"].keys():
if config["EMPATICA_TAGS"]["PROVIDERS"][provider]["COMPUTE"]:
@ -377,11 +407,41 @@ if config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["PLOT"]:
files_to_compute.append("reports/data_exploration/heatmap_sensor_row_count_per_time_segment.html")
if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
if not config["PHONE_DATA_YIELD"]["PROVIDERS"]["RAPIDS"]["COMPUTE"]:
raise ValueError("Error: [PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] must be True in config.yaml to get heatmaps of overall data yield.")
files_to_compute.append("reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html")
if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")
# Data Cleaning
for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
if provider == "STRAW":
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"]))
else:
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"]))
for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
if provider == "STRAW":
for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_py_(" + target + ").csv"))
else:
files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv"))
# Baseline features
if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/baseline_merged.csv"))
files_to_compute.extend(expand("data/raw/{pid}/participant_baseline_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/baseline_questionnaires.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]))
# Targets (labels)
if config["PARAMS_FOR_ANALYSIS"]["TARGET"]["COMPUTE"]:
files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]:
files_to_compute.extend(expand("data/processed/models/population_model/input_" + target + ".csv"))
rule all:
input:
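The hunks above all follow one pattern: for each enabled provider, the list of target files is extended with one path per participant. The sketch below restates the new `PHONE_CALLS` branch in plain Python, as I read the hunk. `expand_pids` is a hypothetical stand-in for Snakemake's `expand()` limited to a single `{pid}` wildcard, and only the raw and interim targets are reproduced.

```python
def expand_pids(pattern, pids):
    """Hypothetical stand-in for Snakemake's expand() with one {pid} wildcard."""
    return [pattern.format(pid=pid) for pid in pids]


def phone_calls_targets(provider, provider_config, pids):
    """Collect the raw/interim phone_calls targets for one provider."""
    targets = []
    if not provider_config["COMPUTE"]:
        return targets
    targets.extend(expand_pids("data/raw/{pid}/phone_calls_raw.csv", pids))
    # Only the RAPIDS provider is asked for FEATURES_TYPE (short-circuit `and`).
    if provider == "RAPIDS" and provider_config["FEATURES_TYPE"] == "EPISODES":
        targets.extend(expand_pids("data/interim/{pid}/phone_calls_episodes.csv", pids))
        targets.extend(expand_pids("data/interim/{pid}/phone_calls_episodes_resampled.csv", pids))
        targets.extend(expand_pids("data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv", pids))
    else:
        targets.extend(expand_pids("data/raw/{pid}/phone_calls_with_datetime.csv", pids))
    return targets
```

With `FEATURES_TYPE: EPISODES`, the episode files take the place of `phone_calls_with_datetime.csv`; any other value falls back to the event-based target.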

0
__init__.py 100644
57
automl_test.py 100644
@ -0,0 +1,57 @@
from pprint import pprint
import sklearn.metrics
import autosklearn.regression
import datetime
import importlib
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import yaml
from sklearn import linear_model, svm, kernel_ridge, gaussian_process
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
model_input = pd.read_csv("data/processed/models/population_model/input_PANAS_negative_affect_mean.csv") # Standardized data
model_input.dropna(axis=1, how="all", inplace=True)
model_input.dropna(axis=0, how="any", subset=["target"], inplace=True)
categorical_feature_colnames = ["gender", "startlanguage"]
categorical_feature_colnames += [col for col in model_input.columns if "mostcommonactivity" in col or "homelabel" in col]
categorical_features = model_input[categorical_feature_colnames].copy()
mode_categorical_features = categorical_features.mode().iloc[0]
categorical_features = categorical_features.fillna(mode_categorical_features)
categorical_features = categorical_features.apply(lambda col: col.astype("category"))
if not categorical_features.empty:
    categorical_features = pd.get_dummies(categorical_features)
numerical_features = model_input.drop(categorical_feature_colnames, axis=1)
model_in = pd.concat([numerical_features, categorical_features], axis=1)
index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]
model_in.set_index(index_columns, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(model_in.drop(["target", "pid"], axis=1), model_in["target"], test_size=0.30)
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=7200,
    per_run_time_limit=120
)
automl.fit(X_train, y_train, dataset_name='straw')
print(automl.leaderboard())
pprint(automl.show_models(), indent=4)
train_predictions = automl.predict(X_train)
print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
test_predictions = automl.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
sys.exit()
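The categorical preprocessing in this script (fill missing values with the column mode, then one-hot encode) can be tried in isolation. The sketch below repeats those steps on a small invented table; the values are made up for illustration and only the column names `gender`, `startlanguage`, and `target` come from the script.

```python
import pandas as pd

# Toy stand-in for the model input; the real file has many more columns.
toy_input = pd.DataFrame(
    {
        "gender": ["F", None, "M", "F"],
        "startlanguage": ["sl", "sl", None, "en"],
        "target": [1.0, 2.0, 3.0, 4.0],
    }
)

categorical_colnames = ["gender", "startlanguage"]
categorical = toy_input[categorical_colnames].copy()

# Impute each categorical column with its most frequent value.
mode_values = categorical.mode().iloc[0]
categorical = categorical.fillna(mode_values)
categorical = categorical.apply(lambda col: col.astype("category"))

# One-hot encode, guarding against an empty frame as the script does.
if not categorical.empty:
    categorical = pd.get_dummies(categorical)

numerical = toy_input.drop(categorical_colnames, axis=1)
model_in = pd.concat([numerical, categorical], axis=1)
print(model_in.columns.tolist())
```

Note that the mode here is computed on the full table before any train/test split, exactly as in the script, so the imputed values depend on rows that later end up in the test set.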

@ -3,16 +3,17 @@
########################################################################################################################
# See https://www.rapids.science/latest/setup/configuration/#participant-files
PIDS: [test01]
PIDS: ['p031', 'p032', 'p033', 'p034', 'p035', 'p036', 'p037', 'p038', 'p039', 'p040', 'p042', 'p043', 'p044', 'p045', 'p046', 'p049', 'p050', 'p052', 'p053', 'p054', 'p055', 'p057', 'p058', 'p059', 'p060', 'p061', 'p062', 'p064', 'p067', 'p068', 'p069', 'p070', 'p071', 'p072', 'p073', 'p074', 'p075', 'p076', 'p077', 'p078', 'p079', 'p080', 'p081', 'p082', 'p083', 'p084', 'p085', 'p086', 'p088', 'p089', 'p090', 'p091', 'p092', 'p093', 'p106', 'p107']
# See https://www.rapids.science/latest/setup/configuration/#automatic-creation-of-participant-files
CREATE_PARTICIPANT_FILES:
CSV_FILE_PATH: "data/external/example_participants.csv" # see docs for required format
USERNAMES_CSV: "data/external/main_study_usernames.csv"
CSV_FILE_PATH: "data/external/main_study_participants.csv" # see docs for required format
PHONE_SECTION:
ADD: True
IGNORED_DEVICE_IDS: []
FITBIT_SECTION:
ADD: True
ADD: False
IGNORED_DEVICE_IDS: []
EMPATICA_SECTION:
ADD: True
@ -20,19 +21,25 @@ CREATE_PARTICIPANT_FILES:
# See https://www.rapids.science/latest/setup/configuration/#time-segments
TIME_SEGMENTS: &time_segments
TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT
FILE: "data/external/timesegments_periodic.csv"
INCLUDE_PAST_PERIODIC_SEGMENTS: FALSE # Only relevant if TYPE=PERIODIC, see docs
TYPE: EVENT # FREQUENCY, PERIODIC, EVENT
FILE: "data/external/straw_events.csv"
INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs
TAILORED_EVENTS: # Only relevant if TYPE=EVENT
COMPUTE: True
SEGMENTING_METHOD: "30_before" # 30_before, 90_before, stress_event
INTERVAL_OF_INTEREST: 10 # duration of event of interest [minutes]
IOI_ERROR_TOLERANCE: 5 # interval of interest error tolerance (before and after IOI) [minutes]
# See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study
TIMEZONE:
TYPE: SINGLE
TYPE: MULTIPLE
SINGLE:
TZCODE: America/New_York
TZCODE: Europe/Ljubljana
MULTIPLE:
TZCODES_FILE: data/external/multiple_timezones_example.csv
IF_MISSING_TZCODE: STOP
DEFAULT_TZCODE: America/New_York
TZ_FILE: data/external/timezone.csv
TZCODES_FILE: data/external/multiple_timezones.csv
IF_MISSING_TZCODE: USE_DEFAULT
DEFAULT_TZCODE: Europe/Ljubljana
FITBIT:
ALLOW_MULTIPLE_TZ_PER_DEVICE: False
INFER_FROM_SMARTPHONE_TZ: False
@ -43,12 +50,15 @@ TIMEZONE:
# See https://www.rapids.science/latest/setup/configuration/#data-stream-configuration
PHONE_DATA_STREAMS:
USE: aware_mysql
USE: aware_postgresql
# AVAILABLE:
aware_mysql:
DATABASE_GROUP: MY_GROUP
aware_postgresql:
DATABASE_GROUP: PSQL_STRAW
aware_csv:
FOLDER: data/external/aware_csv
@ -65,7 +75,6 @@ PHONE_ACCELEROMETER:
COMPUTE: False
FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
SRC_SCRIPT: src/features/phone_accelerometer/rapids/main.py
PANDA:
COMPUTE: False
VALID_SENSED_MINUTES: False
@ -77,12 +86,12 @@ PHONE_ACCELEROMETER:
# See https://www.rapids.science/latest/features/phone-activity-recognition/
PHONE_ACTIVITY_RECOGNITION:
CONTAINER:
ANDROID: plugin_google_activity_recognition
ANDROID: google_ar
IOS: plugin_ios_activity_recognition
EPISODE_THRESHOLD_BETWEEN_ROWS: 5 # minutes. Max time difference for two consecutive rows to be considered within the same AR episode.
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
FEATURES: ["count", "mostcommonactivity", "countuniqueactivities", "durationstationary", "durationmobile", "durationvehicle"]
ACTIVITY_CLASSES:
STATIONARY: ["still", "tilting"]
@ -95,35 +104,52 @@ PHONE_APPLICATIONS_CRASHES:
CONTAINER: applications_crashes
APPLICATION_CATEGORIES:
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
# See https://www.rapids.science/latest/features/phone-applications-foreground/
PHONE_APPLICATIONS_FOREGROUND:
CONTAINER: applications_foreground
CONTAINER: applications
APPLICATION_CATEGORIES:
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
CATALOGUE_FILE: "data/external/play_store_application_genre_catalogue.csv"
# Refer to data/external/play_store_categories_count.csv for a list of categories (genres) and their frequency.
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
SCRAPE_MISSING_CATEGORIES: False # whether to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
PROVIDERS:
RAPIDS:
COMPUTE: False
SINGLE_CATEGORIES: ["all", "email"]
COMPUTE: True
INCLUDE_EPISODE_FEATURES: True
SINGLE_CATEGORIES: ["Productivity", "Tools", "Communication", "Education", "Social"]
MULTIPLE_CATEGORIES:
social: ["socialnetworks", "socialmediatools"]
entertainment: ["entertainment", "gamingknowledge", "gamingcasual", "gamingadventure", "gamingstrategy", "gamingtoolscommunity", "gamingroleplaying", "gamingaction", "gaminglogic", "gamingsports", "gamingsimulation"]
SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
EXCLUDED_CATEGORIES: []
EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
FEATURES: ["count", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
games: ["Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing"]
social: ["Communication", "Social", "Dating"]
productivity: ["Tools", "Productivity", "Finance", "Education", "News & Magazines", "Business", "Books & Reference"]
health: ["Health & Fitness", "Lifestyle", "Food & Drink", "Sports", "Medical", "Parenting"]
entertainment: ["Shopping", "Music & Audio", "Entertainment", "Travel & Local", "Photography", "Video Players & Editors", "Personalization", "House & Home", "Art & Design", "Auto & Vehicles", "Entertainment,Music & Video",
"Puzzle", "Card", "Casual", "Board", "Strategy", "Trivia", "Word", "Adventure", "Role Playing", "Simulation", "Board, Brain Games", "Racing" # Add all games.
]
maps_weather: ["Maps & Navigation", "Weather"]
CUSTOM_CATEGORIES:
SINGLE_APPS: []
EXCLUDED_CATEGORIES: ["System", "STRAW"]
# Note: A special option here is "is_system_app".
# This excludes applications that have is_system_app = TRUE, which is a separate column in the table.
# However, all of these applications have been assigned System category.
# I will therefore filter by that category, which is a superset and is more complete. JL
EXCLUDED_APPS: []
FEATURES:
APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable
SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py
# See https://www.rapids.science/latest/features/phone-applications-notifications/
PHONE_APPLICATIONS_NOTIFICATIONS:
CONTAINER: applications_notifications
CONTAINER: notifications
APPLICATION_CATEGORIES:
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
@ -137,7 +163,7 @@ PHONE_BATTERY:
EPISODE_THRESHOLD_BETWEEN_ROWS: 30 # minutes. Max time difference for two consecutive rows to be considered within the same battery episode.
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
FEATURES: ["countdischarge", "sumdurationdischarge", "countcharge", "sumdurationcharge", "avgconsumptionrate", "maxconsumptionrate"]
SRC_SCRIPT: src/features/phone_battery/rapids/main.py
@ -151,7 +177,7 @@ PHONE_BLUETOOTH:
SRC_SCRIPT: src/features/phone_bluetooth/rapids/main.R
DORYAB:
COMPUTE: False
COMPUTE: True
FEATURES:
ALL:
DEVICES: ["countscans", "uniquedevices", "meanscans", "stdscans"]
@ -169,10 +195,11 @@ PHONE_BLUETOOTH:
# See https://www.rapids.science/latest/features/phone-calls/
PHONE_CALLS:
CONTAINER: calls
CONTAINER: call
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
FEATURES_TYPE: EPISODES # EVENTS or EPISODES
CALL_TYPES: [missed, incoming, outgoing]
FEATURES:
missed: [count, distinctcontacts, timefirstcall, timelastcall, countmostfrequentcontact]
@ -181,7 +208,7 @@ PHONE_CALLS:
SRC_SCRIPT: src/features/phone_calls/rapids/main.R
# See https://www.rapids.science/latest/features/phone-conversation/
PHONE_CONVERSATION:
PHONE_CONVERSATION: # TODO Adapt for speech
CONTAINER:
ANDROID: plugin_studentlife_audio_android
IOS: plugin_studentlife_audio
@ -200,14 +227,35 @@ PHONE_CONVERSATION:
# See https://www.rapids.science/latest/features/phone-data-yield/
PHONE_DATA_YIELD:
SENSORS: []
SENSORS: [#PHONE_ACCELEROMETER,
PHONE_ACTIVITY_RECOGNITION,
PHONE_APPLICATIONS_FOREGROUND,
PHONE_APPLICATIONS_NOTIFICATIONS,
PHONE_BATTERY,
PHONE_BLUETOOTH,
PHONE_CALLS,
PHONE_LIGHT,
PHONE_LOCATIONS,
PHONE_MESSAGES,
PHONE_SCREEN,
PHONE_WIFI_VISIBLE]
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
FEATURES: [ratiovalidyieldedminutes, ratiovalidyieldedhours]
MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: 0.5 # 0 to 1, minimum percentage of valid minutes in an hour to be considered valid.
SRC_SCRIPT: src/features/phone_data_yield/rapids/main.R
PHONE_ESM:
CONTAINER: esm
PROVIDERS:
STRAW:
COMPUTE: True
SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support",
"appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]
FEATURES: [mean]
SRC_SCRIPT: src/features/phone_esm/straw/main.py
# See https://www.rapids.science/latest/features/phone-keyboard/
PHONE_KEYBOARD:
CONTAINER: keyboard
@ -219,10 +267,10 @@ PHONE_KEYBOARD:
# See https://www.rapids.science/latest/features/phone-light/
PHONE_LIGHT:
CONTAINER: light
CONTAINER: light_sensor
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
FEATURES: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"]
SRC_SCRIPT: src/features/phone_light/rapids/main.py
@ -232,12 +280,12 @@ PHONE_LOCATIONS:
LOCATIONS_TO_USE: ALL_RESAMPLED # ALL, GPS, ALL_RESAMPLED, OR FUSED_RESAMPLED
FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD: 30 # minutes, only replicate location samples to the next sensed bin if the phone did not stop collecting data for more than this threshold
FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION: 720 # minutes, only replicate location samples to consecutive sensed bins if they were logged within this threshold after a valid location row
ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy equal or higher than this. This number means there's a 68% probability the true location is within this radius
PROVIDERS:
DORYAB:
COMPUTE: False
COMPUTE: True
FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome", "homelabel"]
ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
DBSCAN_EPS: 100 # meters
DBSCAN_MINSAMPLES: 5
THRESHOLD_STATIC : 1 # km/h
@ -251,9 +299,8 @@ PHONE_LOCATIONS:
SRC_SCRIPT: src/features/phone_locations/doryab/main.py
BARNETT:
COMPUTE: False
COMPUTE: True
FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
ACCURACY_LIMIT: 100 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates grouped by minute) were used to compute features
SRC_SCRIPT: src/features/phone_locations/barnett/main.R
@ -267,10 +314,10 @@ PHONE_LOG:
# See https://www.rapids.science/latest/features/phone-messages/
PHONE_MESSAGES:
CONTAINER: messages
CONTAINER: sms
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
MESSAGES_TYPES : [received, sent]
FEATURES:
received: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact]
@ -282,14 +329,23 @@ PHONE_SCREEN:
CONTAINER: screen
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
REFERENCE_HOUR_FIRST_USE: 0
IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
IGNORE_EPISODES_LONGER_THAN: 0 # in minutes, set to 0 to disable
IGNORE_EPISODES_LONGER_THAN: 360 # in minutes, set to 0 to disable
FEATURES: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration", "firstuseafter"] # "episodepersensedminutes" needs to be added later
EPISODE_TYPES: ["unlock"]
SRC_SCRIPT: src/features/phone_screen/rapids/main.py
# Custom added sensor
PHONE_SPEECH:
CONTAINER: speech
PROVIDERS:
STRAW:
COMPUTE: True
FEATURES: ["meanspeech", "stdspeech", "nlargest", "nsmallest", "medianspeech"]
SRC_SCRIPT: src/features/phone_speech/straw/main.py
# See https://www.rapids.science/latest/features/phone-wifi-connected/
PHONE_WIFI_CONNECTED:
CONTAINER: sensor_wifi
@ -304,7 +360,7 @@ PHONE_WIFI_VISIBLE:
CONTAINER: wifi
PROVIDERS:
RAPIDS:
COMPUTE: False
COMPUTE: True
FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
SRC_SCRIPT: src/features/phone_wifi_visible/rapids/main.R
@ -407,7 +463,6 @@ FITBIT_SLEEP_INTRADAY:
UNIFIED: [awake, asleep]
SLEEP_TYPES: [main, nap, all]
SRC_SCRIPT: src/features/fitbit_sleep_intraday/rapids/main.py
PRICE:
COMPUTE: False
FEATURES: [avgduration, avgratioduration, avgstarttimeofepisodemain, avgendtimeofepisodemain, avgmidpointofepisodemain, stdstarttimeofepisodemain, stdendtimeofepisodemain, stdmidpointofepisodemain, socialjetlag, rmssdmeanstarttimeofepisodemain, rmssdmeanendtimeofepisodemain, rmssdmeanmidpointofepisodemain, rmssdmedianstarttimeofepisodemain, rmssdmedianendtimeofepisodemain, rmssdmedianmidpointofepisodemain]
@ -443,13 +498,15 @@ FITBIT_STEPS_INTRADAY:
RAPIDS:
COMPUTE: False
FEATURES:
STEPS: ["sum", "max", "min", "avg", "std"]
STEPS: ["sum", "max", "min", "avg", "std", "firststeptime", "laststeptime"]
SEDENTARY_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
ACTIVE_BOUT: ["countepisode", "sumduration", "maxduration", "minduration", "avgduration", "stdduration"]
REFERENCE_HOUR: 0
THRESHOLD_ACTIVE_BOUT: 10 # steps
INCLUDE_ZERO_STEP_ROWS: False
SRC_SCRIPT: src/features/fitbit_steps_intraday/rapids/main.py
########################################################################################################################
# EMPATICA #
########################################################################################################################
@ -471,6 +528,15 @@ EMPATICA_ACCELEROMETER:
COMPUTE: False
FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"]
SRC_SCRIPT: src/features/empatica_accelerometer/dbdp/main.py
CR:
COMPUTE: True
FEATURES: ["totalMagnitudeBand", "absoluteMeanBand", "varianceBand"] # Acc features
WINDOWS:
COMPUTE: True
WINDOW_LENGTH: 15 # specify window length in seconds
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
SRC_SCRIPT: src/features/empatica_accelerometer/cr/main.py
# See https://www.rapids.science/latest/features/empatica-heartrate/
EMPATICA_HEARTRATE:
@ -489,6 +555,15 @@ EMPATICA_TEMPERATURE:
COMPUTE: False
FEATURES: ["maxtemp", "mintemp", "avgtemp", "mediantemp", "modetemp", "stdtemp", "diffmaxmodetemp", "diffminmodetemp", "entropytemp"]
SRC_SCRIPT: src/features/empatica_temperature/dbdp/main.py
CR:
COMPUTE: True
FEATURES: ["maximum", "minimum", "meanAbsChange", "longestStrikeAboveMean", "longestStrikeBelowMean",
"stdDev", "median", "meanChange", "sumSquared", "squareSumOfComponent", "sumOfSquareComponents"]
WINDOWS:
COMPUTE: True
WINDOW_LENGTH: 300 # specify window length in seconds
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows']
SRC_SCRIPT: src/features/empatica_temperature/cr/main.py
# See https://www.rapids.science/latest/features/empatica-electrodermal-activity/
EMPATICA_ELECTRODERMAL_ACTIVITY:
@ -498,6 +573,19 @@ EMPATICA_ELECTRODERMAL_ACTIVITY:
COMPUTE: False
FEATURES: ["maxeda", "mineda", "avgeda", "medianeda", "modeeda", "stdeda", "diffmaxmodeeda", "diffminmodeeda", "entropyeda"]
SRC_SCRIPT: src/features/empatica_electrodermal_activity/dbdp/main.py
CR:
COMPUTE: True
FEATURES: ['mean', 'std', 'q25', 'q75', 'qd', 'deriv', 'power', 'numPeaks', 'ratePeaks', 'powerPeaks', 'sumPosDeriv', 'propPosDeriv', 'derivTonic',
'sigTonicDifference', 'freqFeats','maxPeakAmplitudeChangeBefore', 'maxPeakAmplitudeChangeAfter', 'avgPeakAmplitudeChangeBefore',
'avgPeakAmplitudeChangeAfter', 'avgPeakChangeRatio', 'maxPeakIncreaseTime', 'maxPeakDecreaseTime', 'maxPeakDuration', 'maxPeakChangeRatio',
'avgPeakIncreaseTime', 'avgPeakDecreaseTime', 'avgPeakDuration', 'signalOverallChange', 'changeDuration', 'changeRate', 'significantIncrease',
'significantDecrease']
WINDOWS:
COMPUTE: True
WINDOW_LENGTH: 60 # specify window length in seconds
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', count_windows, eda_num_peaks_non_zero]
IMPUTE_NANS: True
SRC_SCRIPT: src/features/empatica_electrodermal_activity/cr/main.py
# See https://www.rapids.science/latest/features/empatica-blood-volume-pulse/
EMPATICA_BLOOD_VOLUME_PULSE:
@ -507,6 +595,15 @@ EMPATICA_BLOOD_VOLUME_PULSE:
COMPUTE: False
FEATURES: ["maxbvp", "minbvp", "avgbvp", "medianbvp", "modebvp", "stdbvp", "diffmaxmodebvp", "diffminmodebvp", "entropybvp"]
SRC_SCRIPT: src/features/empatica_blood_volume_pulse/dbdp/main.py
CR:
COMPUTE: False
FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
WINDOWS:
COMPUTE: True
WINDOW_LENGTH: 300 # specify window length in seconds
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
SRC_SCRIPT: src/features/empatica_blood_volume_pulse/cr/main.py
# See https://www.rapids.science/latest/features/empatica-inter-beat-interval/
EMPATICA_INTER_BEAT_INTERVAL:
@ -516,6 +613,16 @@ EMPATICA_INTER_BEAT_INTERVAL:
COMPUTE: False
FEATURES: ["maxibi", "minibi", "avgibi", "medianibi", "modeibi", "stdibi", "diffmaxmodeibi", "diffminmodeibi", "entropyibi"]
SRC_SCRIPT: src/features/empatica_inter_beat_interval/dbdp/main.py
CR:
COMPUTE: True
FEATURES: ['meanHr', 'ibi', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'sd', 'sd2', 'sd1/sd2', 'numRR', # Time features
'VLF', 'LF', 'LFnorm', 'HF', 'HFnorm', 'LF/HF', 'fullIntegral'] # Freq features
PATCH_WITH_BVP: True
WINDOWS:
COMPUTE: True
WINDOW_LENGTH: 300 # specify window length in seconds
SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan']
SRC_SCRIPT: src/features/empatica_inter_beat_interval/cr/main.py
# See https://www.rapids.science/latest/features/empatica-tags/
EMPATICA_TAGS:
@ -556,3 +663,96 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
CORR_THRESHOLD: 0.1
CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}
########################################################################################################################
# Data Cleaning #
########################################################################################################################
ALL_CLEANING_INDIVIDUAL:
PROVIDERS:
RAPIDS:
COMPUTE: False
IMPUTE_SELECTED_EVENT_FEATURES:
COMPUTE: False
MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
COLS_NAN_THRESHOLD: 1 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: True
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
CORR_THRESHOLD: 0.95
SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
STRAW:
COMPUTE: True
PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
COLS_NAN_THRESHOLD: 0.9 # set to 1 to remove only columns that contains all (100% of) NaN
COLS_VAR_THRESHOLD: True
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: True
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
CORR_THRESHOLD: 0.95
STANDARDIZATION: True
SRC_SCRIPT: src/features/all_cleaning_individual/straw/main.py
ALL_CLEANING_OVERALL:
PROVIDERS:
RAPIDS:
COMPUTE: False
IMPUTE_SELECTED_EVENT_FEATURES:
COMPUTE: False
MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
COLS_NAN_THRESHOLD: 1 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 1 # set to 1 to disable
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: True
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
CORR_THRESHOLD: 0.95
SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
STRAW:
COMPUTE: True
PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable
COLS_NAN_THRESHOLD: 0.8 # set to 1 to remove only columns that contains all (100% of) NaN
COLS_VAR_THRESHOLD: True
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: True
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
CORR_THRESHOLD: 0.95
STANDARDIZATION: True
TARGET_STANDARDIZATION: False
SRC_SCRIPT: src/features/all_cleaning_overall/straw/main.py
########################################################################################################################
# Baseline #
########################################################################################################################
PARAMS_FOR_ANALYSIS:
BASELINE:
COMPUTE: True
FOLDER: data/external/baseline
CONTAINER: [results-survey637813_final.csv, # Slovenia
results-survey358134_final.csv, # Belgium 1
results-survey413767_final.csv # Belgium 2
]
QUESTION_LIST: survey637813+question_text.csv
FEATURES: [age, gender, startlanguage, limesurvey_demand, limesurvey_control, limesurvey_demand_control_ratio, limesurvey_demand_control_ratio_quartile]
CATEGORICAL_FEATURES: [gender]
TARGET:
COMPUTE: True
LABEL: appraisal_stressfulness_event_mean
ALL_LABELS: [PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, JCQ_coworker_support_mean, appraisal_stressfulness_period_mean]
# PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean,
# JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean

View File

@ -0,0 +1,9 @@
"_id","timestamp","device_id","call_type","call_duration","trace"
1,1587663260695,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,14,"d5e84f8af01b2728021d4f43f53a163c0c90000c"
2,1587739118007,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"47c125dc7bd163b8612cdea13724a814917b6e93"
5,1587746544891,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,95,"9cc793ffd6e88b1d850ce540b5d7e000ef5650d4"
6,1587911379859,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,63,"51fb9344e988049a3fec774c7ca622358bf80264"
7,1587992647361,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"2a862a7730cfdfaf103a9487afe3e02935fd6e02"
8,1588020039448,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",1,11,"a2c53f6a086d98622c06107780980cf1bb4e37bd"
11,1588176189024,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",2,65,"56589df8c830c70e330b644921ed38e08d8fd1f3"
12,1588197745079,"a748ee1a-1d0b-4ae9-9074-279a2b6ba524",3,0,"cab458018a8ed3b626515e794c70b6f415318adc"

Binary file not shown.

View File

@ -0,0 +1,57 @@
label,empatica_id
uploader_79170,A0245B
uploader_89788,A02731
uploader_68294,A02705
uploader_92856,A024AF
uploader_23726,A0231C
uploader_66620,A02305
uploader_58435,A026B5
uploader_87801,A022A8
uploader_96055,A027BA
uploader_69549,A0226C
uploader_26363,A0263D
uploader_72010,A023FA
uploader_13997,A024AF
uploader_31156,A02305
uploader_63187,A027BA
uploader_94821,A022A8
uploader_65413,A023F1;A023FA
uploader_36488,A02713
uploader_91087,A0231C
uploader_35174,A025D1
uploader_73880,A02705
uploader_78650,A02731
uploader_70578,A0245B
uploader_88313,A02736
uploader_58482,A0261A
uploader_80601,A027BA
uploader_93729,A0226C
uploader_61663,A0245B
uploader_80848,A025D1
uploader_57312,A023F9;A02361;A027A0
uploader_52087,A02666
uploader_98770,A02953
uploader_51327,A0245F
uploader_11737,A02732
uploader_77440,A0264E
uploader_57277,A02422
uploader_13098,A026E5
uploader_80719,A023C8
uploader_54698,A02953
uploader_95571,A02853
uploader_21880,A024DC
uploader_92905,A02920
uploader_12108,A023F4
uploader_17436,A026E5
uploader_58440,A0273F
uploader_22172,A0245F
uploader_39250,A02422
uploader_15311,A023F9
uploader_45766,A02920
uploader_23096,A02361
uploader_78243,A02422
uploader_58777,A0245F
uploader_82941,A02666
uploader_89606,A023F4
uploader_82969,A023C8
uploader_53573,A024DC;A02361

View File

@ -0,0 +1,11 @@
PHONE:
DEVICE_IDS: [4b62a655-cbf0-4ac0-a448-06726f45b56a]
PLATFORMS: [android]
LABEL: uploader_53573
START_DATE: 2021-05-21 09:21:24
END_DATE: 2021-07-12 17:32:07
EMPATICA:
DEVICE_IDS: [uploader_53573]
LABEL: uploader_53573
START_DATE: 2021-05-21 09:21:24
END_DATE: 2021-07-12 17:32:07

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,45 @@
genre,n
System,261
Tools,96
Productivity,71
Health & Fitness,60
Finance,54
Communication,39
Music & Audio,39
Shopping,38
Lifestyle,33
Education,28
News & Magazines,24
Maps & Navigation,23
Entertainment,21
Business,18
Travel & Local,18
Books & Reference,16
Social,16
Weather,16
Food & Drink,14
Sports,14
Other,13
Photography,13
Puzzle,13
Video Players & Editors,12
Card,9
Casual,9
Personalization,8
Medical,7
Board,5
Strategy,4
House & Home,3
Trivia,3
Word,3
Adventure,2
Art & Design,2
Auto & Vehicles,2
Dating,2
Role Playing,2
STRAW,2
Simulation,2
"Board,Brain Games",1
"Entertainment,Music & Video",1
Parenting,1
Racing,1

View File

@ -0,0 +1,3 @@
label,start_time,length,repeats_on,repeats_value
daily,04:00:00,23H 59M 59S,every_day,0
working_day,04:00:00,18H 00M 00S,every_day,0

View File

@ -1,2 +1,2 @@
label,length
thirtyminutes,30
fiveminutes,5

View File

@ -1,9 +1,2 @@
label,start_time,length,repeats_on,repeats_value
threeday,00:00:00,2D 23H 59M 59S,every_day,0
daily, 00:00:00,23H 59M 59S, every_day, 0
morning,06:00:00,5H 59M 59S,every_day,0
afternoon,12:00:00,5H 59M 59S,every_day,0
evening,18:00:00,5H 59M 59S,every_day,0
night,00:00:00,5H 59M 59S,every_day,0
two_weeks_overlapping,00:00:00,13D 23H 59M 59S,every_day,0
weekends,00:00:00,2D 23H 59M 59S,wday,5
daily,00:00:00,23H 59M 59S,every_day,0


4109
data/external/timezone.csv vendored 100644

File diff suppressed because it is too large Load Diff

View File

@ -1,8 +1,8 @@
# Analysis Workflow Example
!!! info "TL;DR"
- In addition to using RAPIDS to extract behavioral features and create plots, you can structure your data analysis within RAPIDS (i.e. cleaning your features and creating ML/statistical models)
- We include an analysis example in RAPIDS that covers raw data processing, cleaning, feature extraction, machine learning modeling, and evaluation
- In addition to using RAPIDS to extract behavioral features, create plots, and clean sensor features, you can structure your data analysis within RAPIDS (i.e. creating ML/statistical models and evaluating your models)
- We include an analysis example in RAPIDS that covers raw data processing, feature extraction, cleaning, machine learning modeling, and evaluation
- Use this example as a guide to structure your own analysis within RAPIDS
- RAPIDS analysis workflows are compatible with your favorite data science tools and libraries
- RAPIDS analysis workflows are reproducible and we encourage you to publish them along with your research papers
@ -52,7 +52,7 @@ Note you will see a lot of warning messages, you can ignore them since they happ
## Modules of our analysis workflow example
??? info "1. Feature extraction"
We extract daily behavioral features for data yield, received and sent messages, missed, incoming and outgoing calls, resample fused location data using Doryab provider, activity recognition, battery, Bluetooth, screen, light, applications foreground, conversations, Wi-Fi connected, Wi-Fi visible, Fitbit heart rate summary and intraday data, Fitbit sleep summary data, and Fitbit step summary and intraday data without excluding sleep periods with an active bout threshold of 10 steps. In total, we obtained 237 daily sensor features over 12 days per participant.
We extract daily behavioral features for data yield, received and sent messages, missed, incoming and outgoing calls, resample fused location data using Doryab provider, activity recognition, battery, Bluetooth, screen, light, applications foreground, conversations, Wi-Fi connected, Wi-Fi visible, Fitbit heart rate summary and intraday data, Fitbit sleep summary data, and Fitbit step summary and intraday data without excluding sleep periods with an active bout threshold of 10 steps. In total, we obtained 245 daily sensor features over 12 days per participant.
??? info "2. Extract demographic data."
It is common to have demographic data in addition to mobile and target (ground truth) data. In this example we include participants age, gender and the number of days they spent in hospital after their surgery as features in our model. We extract these three columns from the `data/external/example_workflow/participant_info.csv` file. As these three features remain the same within participants, they are used only on the population model. Refer to the `demographic_features` rule in `rules/models.smk`.
@ -69,12 +69,12 @@ Note you will see a lot of warning messages, you can ignore them since they happ
??? info "6. Feature cleaning."
In this stage we perform four steps to clean our sensor feature file. First, we discard days with a data yield hour ratio less than or equal to 0.75, i.e. we include days with at least 18 hours of data. Second, we drop columns (features) with more than 30% of missing rows. Third, we drop columns with zero variance. Fourth, we drop rows (days) with more than 30% of missing columns (features). In this cleaning stage several parameters are created and exposed in `example_profile/example_config.yaml`.
After this step, we kept 158 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 106 features over 20 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stops researchers from collecting the same number of sensors than in Android phones.
After this step, we kept 173 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 117 features over 22 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stops researchers from collecting the same number of sensors than in Android phones.
Feature cleaning for the individual models is done in the `clean_sensor_features_for_individual_participants` rule and for the population model in the `clean_sensor_features_for_all_participants` rule in `rules/models.smk`.
??? info "7. Merge features and targets."
In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/models.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/models.smk`. These two merged files are the input for our individual and population models.
In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/features.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/features.smk`. These two merged files are the input for our individual and population models.
??? info "8. Modelling."
This stage has three phases: model building, training and evaluation.

View File

@ -0,0 +1,92 @@
Data Cleaning
=============
The goal of this module is to perform basic cleaning tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. If you are interested in creating individual models (using each participant's features independently of the others), use [`ALL_CLEANING_INDIVIDUAL`]. If you are interested in creating population models (using everyone's data in the same model), use [`ALL_CLEANING_OVERALL`].
## Clean sensor features for individual participants
!!! info "File Sequence"
```bash
- data/processed/features/{pid}/all_sensor_features.csv
- data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}.csv
```
### RAPIDS provider
Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Description |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]` | Set to `True` to execute the cleaning tasks described below. You can use the parameters of each task to tweak them or deactivate them|
|`[IMPUTE_SELECTED_EVENT_FEATURES]` | Fill NAs with 0 only for event-based features, see table below
|`[COLS_NAN_THRESHOLD]` | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable
|`[COLS_VAR_THRESHOLD]` | Set to `True` to discard columns with zero variance
|`[ROWS_NAN_THRESHOLD]` | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable
|`[DATA_YIELD_FEATURE]` | `RATIO_VALID_YIELDED_HOURS` or `RATIO_VALID_YIELDED_MINUTES`
|`[DATA_YIELD_RATIO_THRESHOLD]` | Discard rows with `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature name is determined by `[DATA_YIELD_FEATURE]` parameter. Set to 0 to disable
|`[DROP_HIGHLY_CORRELATED_FEATURES]` | Discard highly correlated features, see table below
Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
|Parameters | Description |
|-------------------------------------- |----------------------------------------------------------------|
|`[COMPUTE]` | Set to `True` to fill NAs with 0 for phone event-based features
|`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Any missing value of a selected event-based feature in a time segment instance with phone data yield > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` will be replaced with a zero. See below for an explanation. |
Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][DROP_HIGHLY_CORRELATED_FEATURES]`:
|Parameters | Description |
|-------------------------------------- |----------------------------------------------------------------|
|`[COMPUTE]` | Set to `True` to drop highly correlated features
|`[MIN_OVERLAP_FOR_CORR_THRESHOLD]` | Minimum ratio of observations required per pair of columns (features) to be considered as a valid correlation.
|`[CORR_THRESHOLD]` | The absolute values of pair-wise correlations are calculated. If two variables have a valid correlation higher than `[CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation.
The steps to clean sensor features for individual participants are listed below. Currently, only the **phone sensors** are considered.
??? info "1. Fill NA with 0 for the selected event features."
Some event features should be zero instead of NA. In this step, we fill those missing features with 0 when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column is higher than the `[IMPUTE_SELECTED_EVENT_FEATURES][MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` parameter. Plugin sensors such as Activity Recognition are not considered. You can skip this step by setting `[IMPUTE_SELECTED_EVENT_FEATURES][COMPUTE]` to `False`.
Take the phone calls sensor as an example. If there are no call records during a time segment for a participant, then either (1) the calls sensor was not working during that time segment, or (2) the calls sensor was working and the participant did not have any calls during that time segment. To differentiate these two situations, we assume the selected sensors are working when `phone_data_yield_rapids_ratiovalidyieldedminutes > [MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`.
The following phone event-based features are considered currently:
- Application foreground: countevent, countepisode, minduration, maxduration, meanduration, sumduration.
- Battery: all features.
- Calls: count, distinctcontacts, sumduration, minduration, maxduration, meanduration, modeduration.
- Keyboard: sessioncount, averagesessionlength, changeintextlengthlessthanminusone, changeintextlengthequaltominusone, changeintextlengthequaltoone, changeintextlengthmorethanone, maxtextlength, totalkeyboardtouches.
- Messages: count, distinctcontacts.
- Screen: sumduration, maxduration, minduration, avgduration, countepisode.
- WiFi: all connected and visible features.
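The imputation rule above can be sketched in pandas. This is a minimal illustration, not the provider's actual script: the function name `impute_event_features`, the toy data, and the feature column names other than the data yield column are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: one row per time segment instance.
features = pd.DataFrame({
    "phone_data_yield_rapids_ratiovalidyieldedminutes": [0.90, 0.10, 0.50],
    "phone_calls_rapids_incoming_count": [np.nan, np.nan, 3.0],  # event-based
    "phone_light_rapids_avglux": [np.nan, 12.0, np.nan],         # not event-based
})

def impute_event_features(df, event_columns, min_yield):
    """Fill NA with 0 in event columns only where phone data yield > min_yield."""
    df = df.copy()
    # We assume the sensor was working when the data yield exceeds the threshold.
    sensor_was_working = df["phone_data_yield_rapids_ratiovalidyieldedminutes"] > min_yield
    df.loc[sensor_was_working, event_columns] = (
        df.loc[sensor_was_working, event_columns].fillna(0)
    )
    return df

imputed = impute_event_features(
    features, ["phone_calls_rapids_incoming_count"], min_yield=0.33
)
print(imputed)
```

In this sketch, the missing call count in the first row becomes 0 because the data yield is above the threshold, while the second row stays NA because the sensor may not have been working.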
??? info "2. Discard unreliable rows."
Extracted features might not be reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using the `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_HOURS`) for longer segments. We do not recommend skipping this step, but you can do so by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
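As a rough sketch of this filter, assuming the data yield column name can be derived from `[DATA_YIELD_FEATURE]` by lowercasing it and removing underscores (the helper `discard_unreliable_rows` and the toy data are hypothetical):

```python
import pandas as pd

# Hypothetical daily features for one participant.
features = pd.DataFrame({
    "phone_data_yield_rapids_ratiovalidyieldedhours": [1.0, 0.25, 0.75],
    "phone_screen_rapids_countepisode": [10, 2, 7],
})

def discard_unreliable_rows(df, data_yield_feature, threshold):
    """Keep rows whose data yield ratio is at least `threshold`."""
    # e.g. RATIO_VALID_YIELDED_HOURS -> ratiovalidyieldedhours
    suffix = data_yield_feature.lower().replace("_", "")
    column = "phone_data_yield_rapids_" + suffix
    return df[df[column] >= threshold]

reliable = discard_unreliable_rows(features, "RATIO_VALID_YIELDED_HOURS", 0.75)
print(reliable)
```

With a threshold of 0, no row has a ratio below the threshold, so every row is kept and the step is effectively disabled.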
??? info "3. Discard columns (features) with too many missing values."
In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[COLS_NAN_THRESHOLD]` to 1.
??? info "4. Discard columns (features) with zero variance."
In this step, we discard columns with zero variance. We do not recommend skipping this step, but you can do so by setting `[COLS_VAR_THRESHOLD]` to `False`.
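Steps 3 and 4 can be illustrated together. The following is a sketch under the assumption that all feature columns are numeric; the helper names and the toy data are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with one sparse and one constant column.
features = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "informative": [1.0, 2.0, np.nan, 4.0],
})

def drop_sparse_columns(df, cols_nan_threshold):
    """Drop columns whose ratio of missing values exceeds the threshold."""
    nan_ratio = df.isna().mean()
    return df.loc[:, nan_ratio <= cols_nan_threshold]

def drop_zero_variance_columns(df):
    """Drop numeric columns whose variance is exactly zero."""
    variance = df.var(numeric_only=True)
    return df.drop(columns=variance[variance == 0].index)

cleaned = drop_zero_variance_columns(drop_sparse_columns(features, 0.3))
print(list(cleaned.columns))
```

A threshold of 1 keeps every column in this sketch, because a missing value ratio can never be higher than 1.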
??? info "5. Drop highly correlated features."
As highly correlated features might not bring additional information and will increase the complexity of a model, we drop them in this step. The absolute values of pair-wise correlations are calculated. Each correlation between two variables is regarded as valid only if the ratio of valid value pairs (i.e. pairs without NA values) is greater than or equal to `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a correlation coefficient higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. This step can be skipped by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
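One way to picture this rule is the greedy sketch below. It is an approximation written for illustration, not the provider's implementation: the function `drop_highly_correlated`, the pair ordering, and the toy data are assumptions, and the real script may resolve ties or order pairs differently.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, min_overlap, corr_threshold):
    """For each highly correlated pair, drop the variable with the larger
    mean absolute correlation (greedy, in column order)."""
    # A correlation is valid only if enough rows have both values present.
    min_periods = max(1, int(np.ceil(min_overlap * len(df))))
    corr = df.corr(min_periods=min_periods).abs()
    # Ignore self-correlations on the diagonal.
    corr = corr.mask(np.eye(len(corr), dtype=bool))
    mean_abs_corr = corr.mean()

    dropped = []
    columns = list(corr.columns)
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if first in dropped or second in dropped:
                continue
            value = corr.loc[first, second]
            # Invalid correlations are NaN and are never treated as "high".
            if pd.notna(value) and value > corr_threshold:
                worse = first if mean_abs_corr[first] >= mean_abs_corr[second] else second
                dropped.append(worse)
    return df.drop(columns=dropped)

# Hypothetical features: `b` is almost a linear function of `a`.
features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})
reduced = drop_highly_correlated(features, min_overlap=0.5, corr_threshold=0.95)
print(list(reduced.columns))
```

Here only the pair (`a`, `b`) exceeds the threshold, and `b` is removed because its mean absolute correlation with the other columns is slightly larger than that of `a`.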
??? info "6. Discard rows with too many missing values."
In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. In other words, we discard time segments (e.g. days) that did not have enough data to be considered reliable. This step is similar to step 2, except that the ratio is computed based on NA values instead of a phone data yield threshold. We do not recommend skipping this step, but you can do so by setting `[ROWS_NAN_THRESHOLD]` to 1.
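The row filter mirrors the column filter from step 3, applied along the other axis. A minimal sketch, with a hypothetical helper name and made-up feature columns:

```python
import numpy as np
import pandas as pd

# Hypothetical daily features: the second day is mostly missing.
features = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0],
    "feature_b": [4.0, np.nan, np.nan],
    "feature_c": [7.0, 8.0, 9.0],
})

def drop_sparse_rows(df, rows_nan_threshold):
    """Drop rows whose ratio of missing values exceeds the threshold."""
    nan_ratio = df.isna().mean(axis=1)
    return df[nan_ratio <= rows_nan_threshold]

kept = drop_sparse_rows(features, 0.5)
print(kept)
```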
## Clean sensor features for all participants
!!! info "File Sequence"
```bash
- data/processed/features/all_participants/all_sensor_features.csv
- data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}.csv
```
### RAPIDS provider
Parameters description and the steps are the same as the above [RAPIDS provider](#rapids-provider) section for individual participants.

View File

@ -1,4 +1,42 @@
# Change Log
## v1.8.0
- Add data stream for AWARE Micro server
- Fix the NA bug in PHONE_LOCATIONS BARNETT provider
- Fix the bug of data type for call_duration field
- Fix the index bug of heatmap_sensors_per_minute_per_time_segment
## v1.7.1
- Update docs for Git Flow section
- Update RAPIDS paper information
## v1.7.0
- Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider
- Update tests for Fitbit steps intraday features
- Add tests for phone battery features
- Add a data cleaning module to replace NAs with 0 in selected event-based features, discard unreliable rows and columns, discard columns with zero variance, and discard highly correlated columns
## v1.6.0
- Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events
- Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes
- Temporarily revert PHONE_LOCATIONS BARNETT provider to use R script
- Update the default IGNORE_EPISODES_LONGER_THAN to be 6 hours for screen RAPIDS provider
- Fix the bug of step intraday features when INCLUDE_ZERO_STEP_ROWS is False
## v1.5.0
- Update Barnett location features with faster Python implementation
- Fix rounding bug in data yield features
- Add tests for data yield, Fitbit and accelerometer features
- Small fixes of documentation
## v1.4.1
- Update home page
- Add PHONE_MESSAGES tests
## v1.4.0
- Add new Application Foreground episode features and tests
- Update VSCode setup instructions for our Docker container
- Add tests for phone calls features
- Add tests for Wi-Fi features and fix a bug that incorrectly counted the most scanned device within the current time segment instances instead of globally
- Add tests for phone conversation features
- Add tests for Bluetooth features and choose the most scanned device alphabetically when ties exist
- Add tests for Activity Recognition features and fix iOS unknown activity parsing
- Fix Fitbit bug that parsed date-times with the current time zone in rare cases
- Update the visualizations to be more precise and robust with different time segments.
- Fix regression crash of the example analysis workflow
## v1.3.0
- Refactor PHONE_LOCATIONS DORYAB provider. Fix bugs and faster execution up to 30x
- New PHONE_KEYBOARD features
@ -124,4 +162,4 @@
- Update [virtual environment](../developers/virtual-environments) guide
- Update analysis workflow [example](../workflow-examples/analysis)
- Add a [Code of Conduct](../code_of_conduct)
- Update [Team](../team) page
- Update [Team](../team) page

View File

@ -5,14 +5,10 @@
## RAPIDS
If you used RAPIDS, please cite [this paper](https://preprints.jmir.org/preprint/23246).
If you used RAPIDS, please cite [this paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823).
!!! cite "RAPIDS et al. citation"
Vega J, Li M, Aguillera K, Goel N, Joshi E, Durica KC, Kunta AR, Low CA
RAPIDS: Reproducible Analysis Pipeline for Data Streams Collected with Mobile Devices
JMIR Preprints. 18/08/2020:23246
DOI: 10.2196/preprints.23246
URL: https://preprints.jmir.org/preprint/23246
Vega, J., Li, M., Aguillera, K., Goel, N., Joshi, E., Khandekar, K., ... & Low, C. A. (2021). Reproducible Analysis Pipeline for Data Streams (RAPIDS): Open-Source Software to Process Data Collected with Mobile Devices. Frontiers in Digital Health, 168.
## DBDP (all Empatica sensors)

View File

@ -0,0 +1,15 @@
# `aware_micro_mysql`
This [data stream](../../datastreams/data-streams-introduction) handles iOS and Android sensor data collected with the [AWARE Framework's](https://awareframework.com/) [AWARE Micro](https://github.com/denzilferreira/aware-micro) server and stored in a MySQL database.
## Container
A MySQL database with a table per sensor, each containing the data for all participants. Sensor data is stored in a JSON field called `data` within each table.
The script to connect and download data from this container is at:
```bash
src/data/streams/aware_micro_mysql/container.R
```
## Format
--8<---- "docs/snippets/aware_format.md"

@ -16,6 +16,7 @@ For reference, these are the data streams we currently support:
| Data Stream | Device | Format | Container | Docs
|--|--|--|--|--|
| `aware_mysql`| Phone | AWARE app | MySQL | [link](../aware-mysql)
| `aware_micro_mysql`| Phone | AWARE Micro server | MySQL | [link](../aware-micro-mysql)
| `aware_csv`| Phone | AWARE app | CSV files | [link](../aware-csv)
| `aware_influxdb` (beta)| Phone | AWARE app | InfluxDB | [link](../aware-influxdb)
| `fitbitjson_mysql`| Fitbit | JSON (per [Fitbit's API](https://dev.fitbit.com/build/reference/web-api/)) | MySQL | [link](../fitbitjson-mysql)

@ -127,9 +127,9 @@ git branch -d release/v[NEW_RELEASE]
```
git checkout master
git merge --ff-only develop
git push
git push # Unlock the master branch before merging
```
1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_RELEASE]` (remember to add the change log)
1. Release happens automatically after passing the tests
## Release a Hotfix
1. Pull the latest master
@ -156,6 +156,6 @@ git branch -d hotfix/v[NEW_HOTFIX]
```
git checkout master
git merge --ff-only v[NEW_HOTFIX]
git push
git push # Unlock the master branch before merging
```
1. Go to [GitHub](https://github.com/carissalow/rapids/tags) and create a new release based on the newest tag `v[NEW_HOTFIX]` (remember to add the change log)
1. Release happens automatically after passing the tests

@ -7,158 +7,260 @@ The following is a list of the sensors that testing is currently available.
| Sensor | Provider | Periodic | Frequency | Event |
|-------------------------------|----------|----------|-----------|-------|
| Phone Accelerometer | Panda | N | N | N |
| Phone Accelerometer | RAPIDS | N | N | N |
| Phone Activity Recognition | RAPIDS | N | N | N |
| Phone Applications Foreground | RAPIDS | N | N | N |
| Phone Battery | RAPIDS | Y | Y | N |
| Phone Bluetooth | Doryab | N | N | N |
| Phone Accelerometer | Panda | Y | Y | Y |
| Phone Accelerometer | RAPIDS | Y | Y | Y |
| Phone Activity Recognition | RAPIDS | Y | Y | Y |
| Phone Applications Foreground | RAPIDS | Y | Y | Y |
| Phone Battery | RAPIDS | Y | Y | Y |
| Phone Bluetooth | Doryab | Y | Y | Y |
| Phone Bluetooth | RAPIDS | Y | Y | Y |
| Phone Calls | RAPIDS | Y | Y | N |
| Phone Conversation | RAPIDS | Y | Y | N |
| Phone Data Yield | RAPIDS | N | N | N |
| Phone Light | RAPIDS | Y | Y | N |
| Phone Locations | Doryab | N | N | N |
| Phone Calls | RAPIDS | Y | Y | Y |
| Phone Conversation | RAPIDS | Y | Y | Y |
| Phone Data Yield | RAPIDS | Y | Y | Y |
| Phone Light | RAPIDS | Y | Y | Y |
| Phone Locations | Doryab | Y | Y | Y |
| Phone Locations | Barnett | N | N | N |
| Phone Messages | RAPIDS | Y | Y | N |
| Phone Screen | RAPIDS | Y | N | N |
| Phone WiFi Connected | RAPIDS | Y | Y | N |
| Phone WiFi Visible | RAPIDS | Y | Y | N |
| Phone Messages | RAPIDS | Y | Y | Y |
| Phone Screen | RAPIDS | Y | Y | Y |
| Phone WiFi Connected | RAPIDS | Y | Y | Y |
| Phone WiFi Visible | RAPIDS | Y | Y | Y |
| Fitbit Calories Intraday | RAPIDS | Y | Y | Y |
| Fitbit Data Yield | RAPIDS | N | N | N |
| Fitbit Heart Rate Summary | RAPIDS | N | N | N |
| Fitbit Heart Rate Intraday | RAPIDS | N | N | N |
| Fitbit Sleep Summary | RAPIDS | N | N | N |
| Fitbit Data Yield | RAPIDS | Y | Y | Y |
| Fitbit Heart Rate Summary | RAPIDS | Y | Y | Y |
| Fitbit Heart Rate Intraday | RAPIDS | Y | Y | Y |
| Fitbit Sleep Summary | RAPIDS | Y | Y | Y |
| Fitbit Sleep Intraday | RAPIDS | Y | Y | Y |
| Fitbit Sleep Intraday | PRICE | Y | Y | Y |
| Fitbit Steps Summary | RAPIDS | N | N | N |
| Fitbit Steps Intraday | RAPIDS | N | N | N |
| Fitbit Steps Summary | RAPIDS | Y | Y | Y |
| Fitbit Steps Intraday | RAPIDS | Y | Y | Y |
## Accelerometer
Description
- The raw accelerometer data file, `phone_accelerometer_raw.csv`, contains data for 4 separate days
- One episode for each daily segment (night, morning, afternoon and evening)
- Two episodes fall in the same 30-min segment (`Fri 00:15:00` and `Fri 00:21:21`)
- Two episodes fall in the same daily segment (`Fri 00:15:00` and `Fri 18:12:00`)
- One episode before the time switch (`Sun 00:02:00`) and one episode after the time switch (`Sun 04:18:00`)
- Multiple episodes within one minute, which cause variance in magnitude (`Fri 00:10:25`, `Fri 00:10:27` and `Fri 00:10:46`)
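
The 30-min and daily cases above depend on which bucket a timestamp falls into. The following is a minimal sketch of that check, not the pipeline's own segmentation code: the helper `segment_start` is hypothetical, and the date is an arbitrary Friday chosen only for illustration.

```python
from datetime import datetime, timedelta

def segment_start(ts: datetime, minutes: int = 30) -> datetime:
    """Return the start of the fixed-length segment that contains ts (hypothetical helper)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    length = timedelta(minutes=minutes)
    # Integer division of two timedeltas gives the index of the segment within the day.
    return midnight + ((ts - midnight) // length) * length

# Arbitrary Friday used for illustration; the real test data may use other dates.
first = datetime(2020, 3, 6, 0, 15, 0)
second = datetime(2020, 3, 6, 0, 21, 21)
third = datetime(2020, 3, 6, 18, 12, 0)

same_30min = segment_start(first) == segment_start(second)  # both start at 00:00
same_day = first.date() == third.date()                      # same daily segment
```

With these inputs the first two timestamps share the 00:00 to 00:30 bucket, while the first and third only share the day.
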
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android, ios|
|morning|OK|OK|android, ios|
|daily|OK|OK|android, ios|
|threeday|OK|OK|android, ios|
|weekend|OK|OK|android, ios|
|beforeMarchEvent|OK|OK|android, ios|
|beforeNovemberEvent|OK|OK|android, ios|
## Messages (SMS)
- The raw message data file contains data for 2 separate days.
- The data for the first day contains 5 records for every
`epoch`.
- The second day\'s data contains 6 records for each of only 2
`epoch` (currently `morning` and `evening`)
- The raw message data contains records for both `message_types`
(i.e. `recieved` and `sent`) in both days in all epochs. The
number of records with each `message_types` per epoch is randomly
distributed. There is at least one record with each
`message_types` per epoch.
- There is one raw message data file each, as described above, for
testing both iOS and Android data.
- There is also an additional empty data file for both android and
iOS for testing empty data files
Description
- The raw message data file, `phone_messages_raw.csv`, contains data for 4 separate days
- One episode for each daily segment (night, morning, afternoon and evening)
- Two `sent` episodes fall in the same 30-min segment (`Fri 16:08:03.000` and `Fri 16:19:35.000`)
- Two `received` episodes fall in the same 30-min segment (`Sat 06:45:05.000` and `Fri 06:45:05.000`)
- Two episodes fall in the same daily segment (`Fri 11:57:56.385` and `Sat 10:54:10.000`)
- One episode before the time switch (`Sun 00:48:01.000`) and one episode after the time switch (`Sun 06:21:01.000`)
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## Calls
Due to the difference in the format of the raw call data for iOS and Android, the following describes the expected results for `calls_with_datetime_unified.csv`. This gives a better idea of the use cases being tested, since `calls_with_datetime_unified.csv` makes the iOS and Android data comparable.
Due to the difference in the format of the raw data for iOS and Android, the following describes the expected results for
`phone_calls.csv`.
- The call data would contain data for 2 days.
- The data for the first day contains 6 records for every `epoch`.
- The second day\'s data contains 6 records for each of only 2
`epoch` (currently `morning` and `evening`)
- The call data contains records for all `call_types` (i.e.
`incoming`, `outgoing` and `missed`) in both days in all epochs.
The number of records with each of the `call_types` per epoch is
randomly distributed. There is at least one record with each
`call_types` per epoch.
- There is one call data file each, as described above, for testing
both iOS and Android data.
- There is also an additional empty data file for both android and
iOS for testing empty data files
Description
- One missed episode, one outgoing episode and one incoming episode on Friday night, morning, afternoon and evening
- There is at least one episode of each type of phone calls on each day
- One incoming episode crossing two 30-min segments
- One outgoing episode crossing two 30-min segments
- One missed episode before, during and after the `event`
- There is one incoming episode before, during or after the `event`
- There is one outgoing episode before, during or after the `event`
- There is one missed episode before, during or after the `event`
Data format
| Device | Missed | Outgoing | Incoming |
|-|-|-|-|
|android| 3 | 2 | 1 |
|ios| 1,4 or 3,4 | 3,2,4 | 1,2,4 |
Note
When generating test data, all traces for iOS devices need to be unique; otherwise, episodes with a duplicate trace will be dropped.
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android, iOS|
|morning|OK|OK|android, iOS|
|daily|OK|OK|android, iOS|
|threeday|OK|OK|android, iOS|
|weekend|OK|OK|android, iOS|
|beforeMarchEvent|OK|OK|android, iOS|
|beforeNovemberEvent|OK|OK|android, iOS|
## Screen
Due to the difference in the format of the raw screen data for iOS and Android, the following describes the expected results for `screen_deltas.csv`. This gives a better idea of the use cases being tested, since `screen_deltas.csv` makes the iOS and Android data comparable. These files are used to calculate the features for the screen sensor.
Due to the difference in the format of the raw screen data for iOS and Android, the following describes the expected results for `phone_screen.csv`.
- The screen delta data file contains data for 1 day.
- The screen delta data contains 1 record to represent an `unlock`
Description
- The screen data file contains data for 4 days.
- The screen data contains 1 record to represent an `unlock`
episode that falls within an `epoch` for every `epoch`.
- The screen delta data contains 1 record to represent an `unlock`
- The screen data contains 1 record to represent an `unlock`
episode that falls across the boundary of 2 epochs. Namely the
`unlock` episode starts in one epoch and ends in the next, thus
there is a record for `unlock` episodes that fall across `night`
to `morning`, `morning` to `afternoon` and finally `afternoon` to
`night`
- The testing is done for `unlock` episode\_type.
- There is one screen data file each for testing both iOS and
Android data formats.
- There is also an additional empty data file for both android and
iOS for testing empty data files
- One episode that crosses two `30-min` segments
Data format
| Device | unlock |
|-|-|
| Android | 3, 0|
| iOS | 3, 2|
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android, iOS|
|morning|OK|OK|android, iOS|
|daily|OK|OK|android, iOS|
|threeday|OK|OK|android, iOS|
|weekend|OK|OK|android, iOS|
|beforeMarchEvent|OK|OK|android, iOS|
|beforeNovemberEvent|OK|OK|android, iOS|
## Battery
Due to the difference in the format of the raw battery data for iOS and Android, as well as between versions of iOS, the following describes the expected results for `battery_deltas.csv`. This gives a better idea of the use cases being tested, since `battery_deltas.csv` makes the iOS and Android data comparable. These files are used to calculate the features for the battery sensor.
Description
- The battery delta data file contains data for 1 day.
- The battery delta data contains 1 record each for a `charging` and
`discharging` episode that falls within an `epoch` for every
`epoch`. Thus, for the `daily` epoch there would be multiple
`charging` and `discharging` episodes
- Since either a `charging` episode or a `discharging` episode and
not both can occur across epochs, in order to test episodes that
occur across epochs alternating episodes of `charging` and
`discharging` episodes that fall across `night` to `morning`,
`morning` to `afternoon` and finally `afternoon` to `night` are
present in the battery delta data. This starts with a
`discharging` episode that begins in `night` and ends in `morning`.
- There is one battery data file each, for testing both iOS and
Android data formats.
- There is also an additional empty data file for both android and
iOS for testing empty data files
- The 4-day raw data is contained in `phone_battery_raw.csv`
- One discharge episode crossing two 30-min time segments (`Fri 05:57:30.123` to `Fri 06:04:32.456`)
- One charging episode crossing two 30-min time segments (`Fri 11:55:58.416` to `Fri 12:08:07.876`)
- One discharge episode and one charging episode within the same 30-min time segment (`Fri 21:30:00` to `Fri 22:00:00`)
- One episode before the time switch (`Sun 00:24:00.000`) and one episode after the time switch (`Sun 21:58:00`)
- Two episodes fall in the same daily segment
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## Bluetooth
- The raw Bluetooth data file contains data for 1 day.
- The raw Bluetooth data contains at least 2 records for each
`epoch`. Each `epoch` has a record with a `timestamp` for the
beginning boundary for that `epoch` and a record with a
`timestamp` for the ending boundary for that `epoch`. (e.g. For
the `morning` epoch there is a record with a `timestamp` for
`6:00AM` and another record with a `timestamp` for `11:59:59AM`.
These are to test edge cases)
- An option of 5 Bluetooth devices are randomly distributed
throughout the data records.
- There is one raw Bluetooth data file each, for testing both iOS
and Android data formats.
- There is also an additional empty data file for both android and
iOS for testing empty data files.
Description
- The 4-day raw data is contained in `phone_bluetooth_raw.csv`
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
- Two episodes fall in the same 30-min segment (`Fri 23:38:45.789` and `Fri 23:59:59.465`)
- Two episodes fall in the same daily segment (`Fri 00:00:00.798` and `Fri 00:49:04.132`)
- One episode before the time switch (`Sun 00:24:00.000`) and one episode after the time switch (`Sun 17:32:00.000`)
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## WIFI
- There are 2 data files (`wifi_raw.csv` and `sensor_wifi_raw.csv`)
for each fake participant for each phone platform.
- The raw WIFI data files contain data for 1 day.
- The `sensor_wifi_raw.csv` data contains at least 2 records for
each `epoch`. Each `epoch` has a record with a `timestamp` for the
beginning boundary for that `epoch` and a record with a
`timestamp` for the ending boundary for that `epoch`. (e.g. For
the `morning` epoch there is a record with a `timestamp` for
`6:00AM` and another record with a `timestamp` for `11:59:59AM`.
These are to test edge cases)
- The `wifi_raw.csv` data contains 3 records with random timestamps
for each `epoch` to represent visible broadcasting WIFI network.
This file is empty for the iOS phone testing data.
- An option of 10 access point devices is randomly distributed
throughout the data records. 5 each for `sensor_wifi_raw.csv` and
`wifi_raw.csv`.
- There are data files for testing both iOS and Android data formats.
- There are also additional empty data files for both android and
iOS for testing empty data files.
There are two wifi features (`phone wifi connected` and `phone wifi visible`). The raw test data are stored separately in `phone_wifi_connected_raw.csv` and `phone_wifi_visible_raw.csv`.
Description
- One episode for each `epoch` (`night`, `morning`, `afternoon` and `evening`)
- Two episodes in the same time segment (`daily` and `30-min`)
- Two episodes around the transition of `epochs` (e.g. one at the end of `night` and one at the beginning of `morning`)
- One episode before and after the time switch on Sunday
phone wifi connected
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android, iOS|
|morning|OK|OK|android, iOS|
|daily|OK|OK|android, iOS|
|threeday|OK|OK|android, iOS|
|weekend|OK|OK|android, iOS|
|beforeMarchEvent|OK|OK|android, iOS|
|beforeNovemberEvent|OK|OK|android, iOS|
phone wifi visible
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## Light
- The raw light data file contains data for 1 day.
- The raw light data contains 3 or 4 rows of data for each `epoch`
except `night`. The single row of data for `night` is for testing
features for single values inputs. (Example testing the standard
deviation of one input value)
- Since light is only available for Android there is only one file
that contains data for Android. All other files (i.e. for iPhone)
are empty data files.
Description
- The 4-day raw light data is contained in `phone_light_raw.csv`
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
- Two episodes fall in the same 30-min segment (`Fri 00:07:27.000` and `Fri 00:12:00.000`)
- Two episodes fall in the same daily segment (`Fri 01:00:00` and `Fri 03:59:59.654`)
- One episode before the time switch (`Sun 00:08:00.000`) and one episode after the time switch (`Sun 05:36:00.000`)
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## Locations
@ -171,58 +273,81 @@ Description
## Application Foreground
- The raw application foreground data file contains data for 1 day.
- The raw application foreground data contains 7 - 9 rows of data
for each `epoch`. The records for each `epoch` contains apps that
are randomly selected from a list of apps that are from the
`MULTIPLE_CATEGORIES` and `SINGLE_CATEGORIES` (See
[testing\_config.yaml]()). There are also records in each epoch
that have apps randomly selected from a list of apps that are from
the `EXCLUDED_CATEGORIES` and `EXCLUDED_APPS`. This is to test
that these apps are actually being excluded from the calculations
of features. There are also records to test `SINGLE_APPS`
calculations.
- Since application foreground is only available for Android there
is only one file that contains data for Android. All other files
(i.e. for iPhone) are empty data files.
- The 4-day raw application data is contained in `phone_applications_foreground_raw.csv`
- One episode for each daily segment (night, morning, afternoon and evening)
- Two episodes fall in the same 30-min segment (`Fri 10:12:56.385` and `Fri 10:18:48.895`)
- Two episodes fall in the same daily segment (`Fri 11:57:56.385` and `Fri 12:02:56.385`)
- One episode before the time switch (`Sun 00:07:48.001`) and one episode after the time switch (`Sun 05:10:30.001`)
- Two custom category (`Dating`) episodes, one at `Fri 06:05:10.385` and another at `Fri 11:53:00.385`
Checklist:
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## Activity Recognition
- The raw Activity Recognition data file contains data for 1 day.
- In the raw Activity Recognition data, each `epoch` period contains
rows that record 2 - 5 different `activity_types`. This is so
that durations of activities can be tested. Additionally, there
are records that mimic the duration of an activity over the time
boundary of neighboring epochs. (For example, there is a set of
records that mimic the participant `in_vehicle` from `afternoon`
into `evening`)
- There is one file each with raw Activity Recognition data for
testing both iOS and Android data formats.
(plugin\_google\_activity\_recognition\_raw.csv for android and
plugin\_ios\_activity\_recognition\_raw.csv for iOS)
- There is also an additional empty data file for both android and
iOS for testing empty data files.
Description
- The 4-day raw activity data is contained in `plugin_google_activity_recognition_raw.csv` and `plugin_ios_activity_recognition_raw.csv`.
- Two episodes fall in the same 30-min segment (`Fri 04:01:54` and `Fri 04:13:52`)
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
- Two episodes fall in the same daily segment (`Fri 05:03:09` and `Fri 05:50:36`)
- Two episodes with a time difference below the `5 mins` threshold (`Fri 07:14:21` and `Fri 07:18:50`)
- One episode before the time switch (`Sun 00:46:00`) and one episode after the time switch (`Sun 03:42:00`)
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android, iOS|
|morning|OK|OK|android, iOS|
|daily|OK|OK|android, iOS|
|threeday|OK|OK|android, iOS|
|weekend|OK|OK|android, iOS|
|beforeMarchEvent|OK|OK|android, iOS|
|beforeNovemberEvent|OK|OK|android, iOS|
## Conversation
- The raw conversation data file contains data for 2 days.
- The raw conversation data contains records with a sample of both
`datatypes` (i.e. `voice/noise` = `0`, and `conversation` = `2` )
as well as rows with samples of each of the `inference` values
(i.e. `silence` = `0`, `noise` = `1`, `voice` = `2`, and `unknown`
= `3`) for each `epoch`. The different `datatype` and `inference`
records are randomly distributed throughout the `epoch`.
- Additionally there are 2 - 5 records for conversations (`datatype`
= 2, and `inference` = -1) in each `epoch` and for each `epoch`
except night, there is a conversation record that has a
`double_convo_start` `timestamp` that is from the previous
`epoch`. This is to test the calculations of features across
`epochs`.
- There is a raw conversation data file for both android and iOS
platforms (`plugin_studentlife_audio_android_raw.csv` and
`plugin_studentlife_audio_raw.csv` respectively).
- Finally, there are also additional empty data files for both
android and iOS for testing empty data files
The 4-day raw conversation data is contained in `phone_conversation_raw.csv`. The different `inference` records are
randomly distributed throughout the `epoch`.
Description
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`) on each day
- Two episodes near the transition of the daily segment, one starts at the end of the afternoon, `Fri 17:10:00` and another one starts at the beginning of the evening, `Fri 18:01:00`
- One episode across two segments, `daily` and `30-min`, (from `Fri 05:55:00` to `Fri 06:00:41`)
- Two episodes fall in the same daily segment (`Sat 12:45:36` and `Sat 16:48:22`)
- One episode before the time switch, `Sun 00:15:06`, and one episode after the time switch, `Sun 06:01:00`
Data format
| inference | type |
| - | - |
| 0 | silence |
| 1 | noise |
| 2 | voice |
| 3 | unknown |
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android|
|morning|OK|OK|android|
|daily|OK|OK|android|
|threeday|OK|OK|android|
|weekend|OK|OK|android|
|beforeMarchEvent|OK|OK|android|
|beforeNovemberEvent|OK|OK|android|
## Keyboard
@ -243,6 +368,40 @@ Description
- One three-minute episode with a 1-minute row on Sun 08:59:54.65 and 09:00:00, and another on Sun 12:01:02, that are considered a single episode in multi-timezone event segments. This showcases how
inferring time zone data for Keyboard from phone data can produce inaccurate results around the tz change. This happens because the device was on LA time until 11:59 and switched to NY time at 12 pm. In terms of actual time, 9 am LA and 12 pm NY represent the same moment, so 09:00 LA and 12:01 NY are consecutive minutes.
## Application Episodes
- The feature requires raw application foreground data file and raw phone screen data file
- The raw data files contain data for 4 days.
- The raw conversation data contains records with difference in `timestamp` ranging from milliseconds to minutes.
- An app episode starts when an app is launched and ends when another app is launched, marking the episode end of the first one,
or when the screen locks. Thus, we are taking into account the screen unlock episodes.
- Multiple apps are used within each screen unlock episode to verify the creation of different app episodes in each
screen unlock session. The screen unlock episodes starting from Fri 05:56:51, Fri 10:00:24, Sat 17:48:01, Sun 22:02:00, and Mon 21:05:00 contain multiple apps, both system and non-system apps, to check this.
- The 22 minute chunk starting from Fri 10:03:56 checks app episodes for system apps only.
- The screen unlock episodes starting from Mon 21:05:00 and Sat 17:48:01 check whether the screen lock marks the end of the episode for the app that was launched a few milliseconds to 8 mins before the screen lock.
- Finally, since application foreground is only for Android devices, this feature is also for Android devices only. All other files are empty data files.
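
The episode rule described above (an app episode ends at the next app launch or at the screen lock, whichever comes first) can be sketched as follows. This is an illustrative assumption about the logic, not the feature's actual implementation: the function `app_episodes`, the package names, and the timestamps are all hypothetical.

```python
from datetime import datetime

def app_episodes(launches, unlock_start, unlock_end):
    """Split one screen unlock episode into app episodes (hypothetical helper).

    launches: list of (timestamp, package_name) tuples sorted by timestamp.
    Returns a list of (package_name, start, end) tuples.
    """
    # Keep only the launches that happened while the screen was unlocked.
    inside = [(ts, app) for ts, app in launches if unlock_start <= ts < unlock_end]
    episodes = []
    for i, (ts, app) in enumerate(inside):
        # The next launch closes this episode; the last one is closed by the screen lock.
        end = inside[i + 1][0] if i + 1 < len(inside) else unlock_end
        episodes.append((app, ts, end))
    return episodes

# Made-up data for illustration only.
unlock_start = datetime(2020, 3, 6, 5, 56, 51)
unlock_end = datetime(2020, 3, 6, 6, 10, 0)
launches = [
    (datetime(2020, 3, 6, 5, 57, 0), "com.example.mail"),
    (datetime(2020, 3, 6, 6, 2, 30), "com.example.maps"),
]
episodes = app_episodes(launches, unlock_start, unlock_end)
```

Here the first app's episode ends when the second app is launched, and the second app's episode ends at the screen lock.
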
## Data Yield
Description
- Two sensors were picked for testing, `phone_screen` and `phone_light`. `phone_screen` is event based and `phone_light` is sampled at a regular frequency
- A 31-min episode (from `Fri 01:00:00` to `Fri 01:30:00`) in the `phone_light` data, which is considered a `validyieldedhours`
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|android, ios|
|morning|OK|OK|android, ios|
|daily|OK|OK|android, ios|
|threeday|OK|OK|android, ios|
|weekend|OK|OK|android, ios|
|beforeMarchEvent|OK|OK|android, ios|
|beforeNovemberEvent|OK|OK|android, ios|
## Fitbit Calories Intraday
@ -263,6 +422,31 @@ Description
- A four-minute sedentary episode on Sun 10:01 that will be ignored for November's multi-timezone event segments since the test segment ends at 10am on that weekend.
- A three-minute very active episode on Sat 16:03. This episode and the one at 16:00 are counted as one for lowmet episodes
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|fitbit|
|morning|OK|OK|fitbit|
|daily|OK|OK|fitbit|
|threeday|OK|OK|fitbit|
|weekend|OK|OK|fitbit|
|beforeMarchEvent|OK|OK|fitbit|
|beforeNovemberEvent|OK|OK|fitbit|
## Fitbit Heartrate intraday
Description:
- The 4-day raw heartrate data is contained in `fitbit_heartrate_intraday_raw.csv`
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`)
- Two episodes fall in the same 30-min segment (`Fri 00:49:00` and `Fri 00:52:00`)
- Two different types of heartrate zone episodes fall in the same 30-min segment (`Fri 05:49:00 outofrange` and `Fri 05:57:00 fatburn`)
- Two episodes fall in the same daily segment (`Fri 12:02:00` and `Fri 19:38:00`)
- One episode before the time switch, `Sun 00:08:00`, and one episode after the time switch, `Sun 07:28:00`
Checklist
|time segment| single tz | multi tz|platform|
@ -322,3 +506,82 @@ Checklist
|weekend|OK|OK|fitbit|
|beforeMarchEvent|OK|OK|fitbit|
|beforeNovemberEvent|OK|OK|fitbit|
## Fitbit Heartrate Summary
Description
- The 4-day raw heartrate summary data is contained in `fitbit_heartrate_summary_raw.csv`.
- Because heartrate summary data is periodic, it only generates results for periodic segments; there are no results for frequency and event segments.
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|fitbit|
|morning|OK|OK|fitbit|
|daily|OK|OK|fitbit|
|threeday|OK|OK|fitbit|
|weekend|OK|OK|fitbit|
|beforeMarchEvent|OK|OK|fitbit|
|beforeNovemberEvent|OK|OK|fitbit|
## Fitbit Step Intraday
Description
- The 4-day raw steps intraday data is contained in `fitbit_steps_intraday_raw.csv`
- One episode for each daily segment (`night`, `morning`, `afternoon` and `evening`) on each day
- Two episodes within the same 30-min segment (`Fri 05:58:00` and `Fri 05:59:00`)
- A one-min episode at `2020-03-07 09:00:00` that will be converted to New York time `2020-03-07 12:00:00`
- One episode before the time switch, `Sun 00:19:00`, and one episode after the time switch, `Sun 09:01:00`
- Episodes cross two 30-min segments (`Fri 11:59:00` and `Fri 12:00:00`)
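
The conversion of the `2020-03-07 09:00:00` episode to New York time can be checked with a short sketch. It assumes the original timestamp is in Los Angeles time (the source zone is not stated in this section; it is inferred from the three-hour offset) and uses the IANA names `America/Los_Angeles` and `America/New_York`.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Assumption: the raw timestamp is expressed in Los Angeles local time.
la_time = datetime(2020, 3, 7, 9, 0, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

# Same instant, expressed in New York local time.
ny_time = la_time.astimezone(ZoneInfo("America/New_York"))
```

Both values represent the same instant; only the wall-clock reading differs, which is why a segment boundary at noon in New York corresponds to 09:00 in Los Angeles on that date.
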
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|fitbit|
|morning|OK|OK|fitbit|
|daily|OK|OK|fitbit|
|threeday|OK|OK|fitbit|
|weekend|OK|OK|fitbit|
|beforeMarchEvent|OK|OK|fitbit|
|beforeNovemberEvent|OK|OK|fitbit|
## Fitbit Step Summary
Description
- The 4-day raw steps summary data is contained in `fitbit_steps_summary_raw.csv`.
- Because steps summary data is periodic, it only generates results for periodic segments; there are no results for frequency and event segments.
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|fitbit|
|morning|OK|OK|fitbit|
|daily|OK|OK|fitbit|
|threeday|OK|OK|fitbit|
|weekend|OK|OK|fitbit|
|beforeMarchEvent|OK|OK|fitbit|
|beforeNovemberEvent|OK|OK|fitbit|
## Fitbit Data Yield
Checklist
|time segment| single tz | multi tz|platform|
|-|-|-|-|
|30min|OK|OK|fitbit|
|morning|OK|OK|fitbit|
|daily|OK|OK|fitbit|
|threeday|OK|OK|fitbit|
|weekend|OK|OK|fitbit|
|beforeMarchEvent|OK|OK|fitbit|
|beforeNovemberEvent|OK|OK|fitbit|

@ -29,6 +29,7 @@ Parameters description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]` | Set to `True` to extract `FITBIT_STEPS_INTRADAY` features from the `RAPIDS` provider|
|`[FEATURES]` | Features to be computed from steps intraday data, see table below |
|`[REFERENCE_HOUR]` | The reference point from which `firststeptime` or `laststeptime` is to be computed, default is midnight |
|`[THRESHOLD_ACTIVE_BOUT]` | Every minute with Fitbit steps data will be labelled as `sedentary` if its step count is below this threshold, otherwise, `active`. |
|`[INCLUDE_ZERO_STEP_ROWS]` | Whether or not to include time segments with a 0 step count during the whole day. |
@ -42,6 +43,8 @@ Features description for `[FITBIT_STEPS_INTRADAY][PROVIDERS][RAPIDS]`:
|minsteps |steps |The minimum step count during a time segment.
|avgsteps |steps |The average step count during a time segment.
|stdsteps |steps |The standard deviation of step count during a time segment.
|firststeptime |minutes |Minutes until the first non-zero step count.
|laststeptime |minutes |Minutes until the last non-zero step count.
|countepisodesedentarybout |bouts |Number of sedentary bouts during a time segment.
|sumdurationsedentarybout |minutes |Total duration of all sedentary bouts during a time segment.
|maxdurationsedentarybout |minutes |The maximum duration of any sedentary bout during a time segment.
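
To make `firststeptime`, `laststeptime`, and the `[THRESHOLD_ACTIVE_BOUT]` labelling concrete, here is a minimal sketch under stated assumptions. It is not the provider's code: the helpers `step_times` and `label_minutes`, the sample rows, and the threshold of 10 steps are hypothetical, and it assumes both times are measured in minutes from `[REFERENCE_HOUR]` on the same day.

```python
from datetime import datetime

def step_times(rows, reference_hour=0):
    """Return (firststeptime, laststeptime) in minutes from the reference hour.

    rows: list of (timestamp, steps) for one day, sorted by time.
    Returns (None, None) when every row has a zero step count.
    """
    nonzero = [ts for ts, steps in rows if steps > 0]
    if not nonzero:
        return None, None
    # Assumption: the reference point is reference_hour:00 on the day of the first non-zero row.
    reference = nonzero[0].replace(hour=reference_hour, minute=0, second=0, microsecond=0)

    def minutes_from_reference(ts):
        return (ts - reference).total_seconds() / 60

    return minutes_from_reference(nonzero[0]), minutes_from_reference(nonzero[-1])

def label_minutes(rows, threshold_active_bout):
    """Label each minute as sedentary when its step count is below the threshold."""
    return ["sedentary" if steps < threshold_active_bout else "active" for _, steps in rows]

# Made-up minute-level rows for illustration.
rows = [
    (datetime(2020, 3, 6, 5, 58), 0),
    (datetime(2020, 3, 6, 5, 59), 12),
    (datetime(2020, 3, 6, 6, 0), 3),
    (datetime(2020, 3, 6, 6, 1), 40),
]
first, last = step_times(rows)                        # midnight reference
labels = label_minutes(rows, threshold_active_bout=10)
```

With a midnight reference, the first non-zero minute at 05:59 is 359 minutes in and the last at 06:01 is 361 minutes in; consecutive minutes with the same label would then be grouped into sedentary or active bouts.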

@ -44,7 +44,7 @@ Features description for `[PHONE_ACTIVITY_RECOGNITION][PROVIDERS][RAPIDS]`:
|count |rows | Number of episodes.
|mostcommonactivity |activity type | The most common activity type (e.g. `still`, `on_foot`, etc.). If there is a tie, the first one is chosen.
|countuniqueactivities |activity type | Number of unique activities.
|durationstationary |minutes | The total duration of `[ACTIVITY_CLASSES][STATIONARY]` episodes
|durationstationary |minutes | The total duration of `[ACTIVITY_CLASSES][STATIONARY]` episodes of still and tilting activities
|durationmobile |minutes | The total duration of `[ACTIVITY_CLASSES][MOBILE]` episodes of on foot, running, and on bicycle activities
|durationvehicle |minutes | The total duration of `[ACTIVITY_CLASSES][VEHICLE]` episodes of on vehicle activity

@ -33,25 +33,36 @@ Parameters description for `[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]`:
|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Description |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]`| Set to `True` to extract `PHONE_APPLICATIONS_FOREGROUND` features from the `RAPIDS` provider|
|`[INCLUDE_EPISODE_FEATURES]`| Set to `True` to extract features from application usage episodes using Screen data |
|`[FEATURES]` | Features to be computed, see table below
|`[SINGLE_CATEGORIES]` | An array of app categories to be *included* in the feature extraction computation. The special keyword `all` represents a category with all the apps from each participant. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
|`[MULTIPLE_CATEGORIES]` | An array of collections representing meta-categories (a group of categories). They key of each element is the name of the `meta-category` and the value is an array of member app categories. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
|`[SINGLE_CATEGORIES]` | An array of app categories to be *included* in the feature extraction computation. The special keyword `all` represents a category with all the apps from each participant. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
|`[CUSTOM_CATEGORIES]` | An array of collections representing your own app categories. The key of each element is the name of the custom category, and the value is an array of the package names (apps) included in that category.
|`[MULTIPLE_CATEGORIES]` | An array of collections representing meta-categories (a group of categories). The key of each element is the name of the `meta-category` and the value is an array of member app categories. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
|`[SINGLE_APPS]` | An array of apps to be *included* in the feature extraction computation. Use their package name (e.g. `com.google.android.youtube`) or the reserved keyword `top1global` (the most used app by a participant over the whole monitoring study)
|`[EXCLUDED_CATEGORIES]` | An array of app categories to be *excluded* from the feature extraction computation. By default we use the category catalogue pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
|`[EXCLUDED_CATEGORIES]` | An array of app categories to be *excluded* from the feature extraction computation. By default, we use the category catalog pointed by `[APPLICATION_CATEGORIES][CATALOGUE_FILE]` (see the Sensor parameters description table above)
|`[EXCLUDED_APPS]` | An array of apps to be excluded from the feature extraction computation. Use their package name, for example: `com.google.android.youtube`
Features description for `[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]`:
|Feature |Units |Description|
|-------------------------- |---------- |---------------------------|
|count |apps | Number of times a single app or apps within a category were used (i.e. they were brought to the foreground either by tapping their icon or switching to it from another app)
|countevent |apps | Number of times a single app or apps within a category were used (i.e. they were brought to the foreground either by tapping their icon or switching to it from another app)
|timeoffirstuse |minutes | The time in minutes between 12:00am (midnight) and the first use of a single app or apps within a category during a `time_segment`
|timeoflastuse |minutes | The time in minutes between 12:00am (midnight) and the last use of a single app or apps within a category during a `time_segment`
|frequencyentropy |nats | The entropy of the used apps within a category during a `time_segment` (each app is seen as a unique event, the more apps were used, the higher the entropy). This is especially relevant when computed over all apps. Entropy cannot be obtained for a single app
|countepisode |apps | Number of times a usage episode of a single app or apps within a category was logged. In contrast to `countevent`, if an app was used across more than one time segment (for example, across more than one 30-minute segment), the `countepisode` will be one on each time segment instance.
|minduration |minutes | For a `time_segment`, the minimum duration an application was used in minutes
|maxduration |minutes | For a `time_segment`, the maximum duration an application was used in minutes
|meanduration |minutes | For a `time_segment`, the mean duration of all the applications used in minutes
|sumduration |minutes | For a `time_segment`, the sum duration of all the applications used in minutes
!!! note "Assumptions/Observations"
Features can be computed by app, by apps grouped under a single category (genre) and by multiple categories grouped together (meta-categories). For example, we can get features for `Facebook` (single app), for `Social Network` apps (a category including Facebook and other social media apps) or for `Social` (a meta-category formed by `Social Network` and `Social Media Tools` categories).
1. Features can be computed by app, by apps grouped under a single category (genre), by your own categories, or by multiple categories grouped together (meta-categories). For example, we can get features for `Facebook` (single app), for `Social Network` apps (a category including Facebook and other social media apps), for `Traditional Social Media` (a custom category that includes Twitter and Facebook), or for `Social` (a meta-category formed by `Social Network` and `Social Media Tools` categories).
Apps installed by default like YouTube are considered systems apps on some phones. We do an exact match to exclude apps where "genre" == `EXCLUDED_CATEGORIES` or "package_name" == `EXCLUDED_APPS`.
2. Apps installed by default like YouTube are considered system apps on some phones. We do an exact match to exclude apps where "genre" == `EXCLUDED_CATEGORIES` or "package_name" == `EXCLUDED_APPS`.
We provide three ways of classifying and app within a category (genre): a) by automatically scraping its official category from the Google Play Store, b) by using the catalogue created by Stachl et al. which we provide in RAPIDS (`data/external/stachl_application_genre_catalogue.csv`), or c) by manually creating a personalized catalogue. You can choose a, b or c by modifying `[APPLICATION_GENRES]` keys and values (see the Sensor parameters description table above).
3. We provide four ways of classifying an app within a category (genre): a) by automatically scraping its official category from the Google Play Store, b) by using the catalog created by Stachl et al., which we provide in RAPIDS (`data/external/stachl_application_genre_catalogue.csv`), c) by manually creating a personalized catalog, or d) by defining a custom category in `config.yaml`. You can choose a, b, or c by modifying `[APPLICATION_GENRES]` keys and values (see the first table of this page).
4. We count `episodes` and `events` separately. Events are single app logs (when an app was opened), but episodes span from the time an app was opened until a new app is in the foreground or the screen is locked. Episodes will be chunked across any overlapping time segments. The `top1global` of `episodes` might not be the same as the `top1global` of `events`.
5. The application episodes are calculated using the application foreground and screen unlock episode data. An application episode starts when the application is launched and ends when a new application is launched or the screen is locked.
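
The difference between events and episodes described in points 4 and 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the RAPIDS implementation: the function name `build_episodes`, the tuple layout, the example package names, and the use of plain minute offsets instead of timestamps are all hypothetical.

```python
from bisect import bisect_right


def build_episodes(events, screen_locks):
    """Turn foreground events into usage episodes.

    events: list of (minute, package_name) sorted by minute (hypothetical layout).
    screen_locks: sorted list of minutes at which the screen was locked.
    An episode ends at the next app event or the next screen lock,
    whichever comes first. An event that nothing closes is dropped.
    """
    episodes = []
    for i, (start, app) in enumerate(events):
        candidates = []
        # The next foreground event closes the current episode.
        if i + 1 < len(events):
            candidates.append(events[i + 1][0])
        # So does the first screen lock strictly after the start.
        j = bisect_right(screen_locks, start)
        if j < len(screen_locks):
            candidates.append(screen_locks[j])
        if candidates:
            episodes.append((app, start, min(candidates)))
    return episodes


events = [(0, "com.example.mail"), (5, "com.example.chat"), (30, "com.example.mail")]
locks = [12, 45]
episodes = build_episodes(events, locks)
for app, start, end in episodes:
    print(app, start, end)
```

Here there are three events and three episodes, but the two counts can diverge once episodes are chunked across time segments, which this sketch does not attempt.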

View File

@ -86,6 +86,7 @@ Features description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:
!!! note "Assumptions/Observations"
- Devices are classified as belonging to the participant (`own`) or to other people (`others`) using k-means based on the number of times and the number of days each device was detected across each participant's dataset. See [Doryab et al](../../citation#doryab-bluetooth) for more details.
- If ownership cannot be computed because all devices were detected on only one day, they are all considered as `other`. Thus `all` and `other` features will be equal. The likelihood of this scenario decreases the more days of data you have.
- When searching for the most frequent device across 30-minute segments, the search range is equivalent to the sum of all segments of the same time period. For instance, the `countscansmostfrequentdeviceacrosssegments` for the time segment (`Fri 00:00:00, Fri 00:29:59`) will get the count in that segment of the most frequent device found within all (`00:00:00, 00:29:59`) time segments. To find `countscansmostfrequentdeviceacrosssegments` for `other` devices, the search range needs to filter out all `own` devices. This filtering is not needed for `countscansmostfrequentdeviceacrossdataset`: the most frequent device across the dataset stays the same for `countscansmostfrequentdeviceacrossdatasetall`, `countscansmostfrequentdeviceacrossdatasetown` and `countscansmostfrequentdeviceacrossdatasetother`. The same rule applies to the least frequent device across the dataset.
- The most and least frequent devices will be the same across time segment instances and across the entire dataset when every time segment instance covers every hour of a dataset. For example, daily segments (00:00 to 23:59) fall in this category but morning segments (06:00am to 11:59am) or periodic 30-minute segments don't.
??? info "Example"

View File

@ -26,6 +26,7 @@ Parameters description for `[PHONE_CALLS][PROVIDERS][RAPIDS]`:
| Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Description |
|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|`[COMPUTE]`| Set to `True` to extract `PHONE_CALLS` features from the `RAPIDS` provider|
|`[FEATURES_TYPE]`| Set to `EPISODES` to extract features based on call episodes or `EVENTS` to extract features based on events.|
| `[CALL_TYPES]` | The particular call_type that will be analyzed. The options for this parameter are incoming, outgoing or missed. |
| `[FEATURES]` | Features to be computed for `outgoing`, `incoming`, and `missed` calls. Note that the same features are available for both incoming and outgoing calls, while missed calls have their own set of features. See the tables below. |
@ -60,4 +61,4 @@ Features description for `[PHONE_CALLS][PROVIDERS][RAPIDS]` missed calls:
!!! note "Assumptions/Observations"
1. Traces for iOS calls are unique even for the same contact calling a participant more than once which renders `countmostfrequentcontact` meaningless and `distinctcontacts` equal to the total number of traces.
2. `[CALL_TYPES]` and `[FEATURES]` keys in `config.yaml` need to match. For example, `[CALL_TYPES]` `outgoing` matches the `[FEATURES]` key `outgoing`
3. iOS calls data is transformed to match Android calls data format. See our [algorithm](algorithms/phone-algorithms.md#phone-calls)
3. iOS calls data is transformed to match Android calls data format.

View File

@ -6,6 +6,12 @@ Sensor parameters description for `[PHONE_KEYBOARD]`:
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the keyboard data is stored
## RAPIDS provider
!!! info "Available time segments and platforms"
- Available for all time segments
- Available for Android only
!!! info "File Sequence"
```bash
- data/raw/{pid}/phone_keyboard_raw.csv

View File

@ -6,8 +6,9 @@ Sensor parameters description for `[PHONE_LOCATIONS]`:
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the location data is stored
|`[LOCATIONS_TO_USE]`| Type of location data to use, one of `ALL`, `GPS`, `ALL_RESAMPLED` or `FUSED_RESAMPLED`. This filter is based on the `provider` column of the locations table, `ALL` includes every row, `GPS` only includes rows where the provider is gps, `ALL_RESAMPLED` includes all rows after being resampled, and `FUSED_RESAMPLED` only includes rows where the provider is fused after being resampled.
|`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]`| if `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled, a location row is resampled to the next valid timestamp (see the Assumptions/Observations below) only if the time difference between them is less or equal than this threshold (in minutes).
|`[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`| if `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled, a location row is resampled at most for this long (in minutes)
|`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]`| If `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled. A location row is resampled to the next valid timestamp (see the Assumptions/Observations below) only if the time difference between them is less than or equal to this threshold (in minutes).
|`[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`| If `ALL_RESAMPLED` or `FUSED_RESAMPLED` is used, the original fused data has to be resampled. A location row is resampled at most for this long (in minutes).
|`[ACCURACY_LIMIT]` | An integer in meters; any location row with an accuracy value higher than or equal to this is dropped. This number means there's a 68% probability the actual location is within this radius.
!!! note "Assumptions/Observations"
**Types of location data to use**
@ -16,9 +17,9 @@ Sensor parameters description for `[PHONE_LOCATIONS]`:
- If you want to use only the GPS provider, set `[LOCATIONS_TO_USE]` to `GPS`
- If you want to use all providers, set `[LOCATIONS_TO_USE]` to `ALL`
- If you collected location data from different providers, including the fused API, use `ALL_RESAMPLED`
- If your mobile client was configured to use fused location only or want to focus only on this provider, set `[LOCATIONS_TO_USE]` to `RESAMPLE_FUSED`.
- If your mobile client was configured to use fused location only or you want to focus only on this provider, set `[LOCATIONS_TO_USE]` to `FUSED_RESAMPLED`.
`ALL_RESAMPLED` and `RESAMPLE_FUSED` take the original location coordinates and replicate each pair forward in time as long as the phone was sensing data as indicated by the joined timestamps of [`[PHONE_DATA_YIELD][SENSORS]`](../phone-data-yield/). This is done because Google's API only logs a new location coordinate pair when it is sufficiently different in time or space from the previous one and because GPS and network providers can log data at variable rates.
`ALL_RESAMPLED` and `FUSED_RESAMPLED` take the original location coordinates and replicate each pair forward in time as long as the phone was sensing data as indicated by the joined timestamps of [`[PHONE_DATA_YIELD][SENSORS]`](../phone-data-yield/). This is done because Google's API only logs a new location coordinate pair when it is sufficiently different in time or space from the previous one and because GPS and network providers can log data at variable rates.
There are two parameters associated with resampling fused location.
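
A minimal sketch of how the two thresholds from the parameter table (`[FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD]` and `[FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION]`) could bound the forward replication of a coordinate pair. This is one reading of the description, not the RAPIDS code: the function name `resample_locations`, the tuple layout, and the use of integer minutes are assumptions made for illustration.

```python
def resample_locations(locations, sensed_minutes, consecutive_threshold, time_since_valid):
    """Replicate each location forward over the minutes the phone was sensing.

    locations: list of (minute, lat, lon) sorted by minute (hypothetical layout).
    sensed_minutes: sorted minutes with sensed data (e.g. from data yield).
    A copy is emitted only while the gap to the previous timestamp is at most
    `consecutive_threshold` and the time elapsed since the original row is at
    most `time_since_valid` (both in minutes).
    """
    resampled = []
    for i, (t, lat, lon) in enumerate(locations):
        next_t = locations[i + 1][0] if i + 1 < len(locations) else float("inf")
        resampled.append((t, lat, lon))
        previous = t
        for m in sensed_minutes:
            if m <= t:
                continue  # before or at the original row
            if m >= next_t:
                break  # a newer location takes over from here
            if m - previous > consecutive_threshold or m - t > time_since_valid:
                break  # sensing gap too long, or the location is too old
            resampled.append((m, lat, lon))
            previous = m
    return resampled


locations = [(0, 40.0, -80.0), (10, 40.1, -80.1)]
sensed = [0, 1, 2, 3, 8, 9, 10, 11, 12, 30]
resampled = resample_locations(locations, sensed, consecutive_threshold=3, time_since_valid=12)
print([row[0] for row in resampled])
```

In this example, replication of the first location stops at minute 3 because the next sensed minute (8) is more than 3 minutes away.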
@ -41,6 +42,7 @@ These features are based on the original open-source implementation by [Barnett
- data/raw/{pid}/phone_locations_raw.csv
- data/interim/{pid}/phone_locations_processed.csv
- data/interim/{pid}/phone_locations_processed_with_datetime.csv
- data/interim/{pid}/phone_locations_barnett_daily.csv
- data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
- data/processed/features/{pid}/phone_locations.csv
```
@ -52,7 +54,6 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][BARNETT]`:
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `BARNETT` provider|
|`[FEATURES]` | Features to be computed, see table below
|`[ACCURACY_LIMIT]` | An integer in meters, any location rows with an accuracy higher than this is dropped. This number means there's a 68% probability the actual location is within this radius
|`[IF_MULTIPLE_TIMEZONES]` | Currently, `USE_MOST_COMMON` is the only value supported. If the location data for a participant belongs to multiple time zones, we select the most common because Barnett's algorithm can only handle one time zone
|`[MINUTES_DATA_USED]` | Set to `True` to include an extra column in the final location feature file containing the number of minutes used to compute the features on each time segment. Use this for quality control purposes; the more data minutes exist for a period, the more reliable its features should be. For fused location, a single minute can contain more than one coordinate pair if the participant is moving fast enough.
@ -111,7 +112,9 @@ These features are based on the original implementation by [Doryab et al.](../..
- data/raw/{pid}/phone_locations_raw.csv
- data/interim/{pid}/phone_locations_processed.csv
- data/interim/{pid}/phone_locations_processed_with_datetime.csv
- data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv
- data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv
- data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled.csv
- data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv
- data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
- data/processed/features/{pid}/phone_locations.csv
```
@ -121,9 +124,8 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Description |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `BARNETT` provider|
|`[COMPUTE]`| Set to `True` to extract `PHONE_LOCATIONS` features from the `DORYAB` provider|
|`[FEATURES]` | Features to be computed, see table below
|`[ACCURACY_LIMIT]` | An integer in meters, any location rows with an accuracy higher than this will be dropped. This number means there's a 68% probability the true location is within this radius
| `[DBSCAN_EPS]` | The maximum distance in meters between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
| `[DBSCAN_MINSAMPLES]` | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point of a cluster. This includes the point itself.
| `[THRESHOLD_STATIC]` | It is the threshold value in km/hr which labels a row as Static or Moving.
@ -143,8 +145,8 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
|locationvariance |$meters^2$ |The sum of the variances of the latitude and longitude columns.
|loglocationvariance | - | Log of the sum of the variances of the latitude and longitude columns.
|totaldistance |meters |Total distance traveled in a time segment using the haversine formula.
|avgspeed |km/hr |Average speed in a time segment considering only the instances labeled as Moving.
|varspeed |km/hr |Speed variance in a time segment considering only the instances labeled as Moving.
|avgspeed |km/hr |Average speed in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment.
|varspeed |km/hr |Speed variance in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment.
|{--circadianmovement--} |- | Deprecated, see Observations below. \ "It encodes the extent to which a person's location patterns follow a 24-hour circadian cycle.\" [Doryab et al.](../../citation#doryab-locations).
|numberofsignificantplaces |places |Number of significant locations visited. It is calculated using the DBSCAN/OPTICS clustering algorithm which takes in EPS and MIN_SAMPLES as parameters to identify clusters. Each cluster is a significant place.
|numberlocationtransitions |transitions |Number of movements between any two clusters in a time segment.
@ -165,7 +167,7 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
!!! note "Assumptions/Observations"
**Significant Locations Identified**
Significant locations are determined using DBSCAN clustering on locations that a patient visit over the course of the period of data collection.
Significant locations are determined using `DBSCAN` or `OPTICS` clustering on locations that a participant visited over the course of the period of data collection. The most significant location is the place where the participant stayed for the longest time.
**Circadian Movement Calculation**
Note Feb 3 2021. It seems the implementation of this feature is not correct; we suggest not to use this feature until a fix is in place. For a detailed description of how this should be calculated, see [Saeb et al](https://pubmed.ncbi.nlm.nih.gov/28344895/).
@ -195,4 +197,5 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
the candidate will be regarded as the home cluster; otherwise, the home cluster will be the last valid day's cluster.
If there are no valid clusters before that day, the first home location in the days after is used.
**Clustering algorithms**
[`DBSCAN`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) and [`OPTICS`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html#r2c55e37003fe-1) algorithms are available currently. Duplicated locations are discarded while clustering. The `DBSCAN` algorithm takes the time spent at each location into consideration. However, the `OPTICS` algorithm ignores it as it is not supported in the current [scikit-learn](https://github.com/scikit-learn/scikit-learn/issues/12394) implementation.
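
One way to discard duplicated locations while still accounting for time spent is to collapse repeated coordinates into a single point with a weight. The sketch below shows only that collapsing step in plain Python. The helper name `collapse_duplicates` and the sample coordinates are hypothetical, and whether RAPIDS computes its weights exactly this way is an assumption.

```python
from collections import Counter


def collapse_duplicates(coordinates):
    """Return unique (lat, lon) pairs and how many times each one occurred.

    With evenly resampled data, the count of a coordinate pair is a proxy
    for the time spent at that location.
    """
    counts = Counter(coordinates)
    unique_points = list(counts)  # insertion order is preserved
    weights = [counts[point] for point in unique_points]
    return unique_points, weights


coords = [(40.44, -79.94), (40.44, -79.94), (40.44, -79.94), (40.45, -79.95)]
points, weights = collapse_duplicates(coords)
print(points, weights)
```

The resulting weights could then be passed as `sample_weight` to scikit-learn's `DBSCAN.fit`. `OPTICS.fit` has no equivalent argument, which matches the limitation noted above.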

View File

@ -32,7 +32,7 @@ Parameters description for `[PHONE_SCREEN][PROVIDERS][RAPIDS]`:
|`[FEATURES]` | Features to be computed, see table below
|`[REFERENCE_HOUR_FIRST_USE]` | The reference point from which `firstuseafter` is to be computed, default is midnight
|`[IGNORE_EPISODES_SHORTER_THAN]` | Ignore episodes that are shorter than this threshold (minutes). Set to 0 to disable this filter.
|`[IGNORE_EPISODES_LONGER_THAN]` | Ignore episodes that are longer than this threshold (minutes). Set to 0 to disable this filter.
|`[IGNORE_EPISODES_LONGER_THAN]` | Ignore episodes that are longer than this threshold (minutes), default is 6 hours. Set to 0 to disable this filter.
|`[EPISODE_TYPES]` | Currently we only support `unlock` episodes (from when the phone is unlocked until the screen is off)
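
The two duration filters above can be illustrated with a short sketch in which a threshold of 0 disables the corresponding filter. The function name `filter_episodes` and the list-of-durations input are assumptions for the example; the actual RAPIDS code may differ.

```python
def filter_episodes(durations, shorter_than=0, longer_than=0):
    """Keep episode durations (minutes) that pass both thresholds.

    A threshold of 0 disables the corresponding filter.
    """
    kept = []
    for duration in durations:
        if shorter_than > 0 and duration < shorter_than:
            continue  # too short
        if longer_than > 0 and duration > longer_than:
            continue  # too long
        kept.append(duration)
    return kept


durations = [0.5, 3, 45, 400]
# 360 minutes corresponds to the 6-hour default mentioned above.
kept = filter_episodes(durations, shorter_than=1, longer_than=360)
print(kept)
```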

View File

@ -1,12 +1,12 @@
# Welcome to RAPIDS documentation
Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](workflow-examples/analysis.md) your analysis into reproducible workflows.
Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows. Check out our [paper](https://www.frontiersin.org/article/10.3389/fdgth.2021.769823)!
RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables in collaboration with the [DBDP](https://dbdp.org/).
RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables (the latter in collaboration with the [DBDP](https://dbdp.org/)).
!!! tip "Where do I start?"
:material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](workflow-examples/minimal)
:material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](analysis/minimal)
:material-play-speed: [Install](setup/installation), [configure](setup/configuration), and [execute](setup/execution) RAPIDS to [extract](features/feature-introduction.md) and [plot](visualizations/data-quality-visualizations.md) behavioral features
@ -25,7 +25,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
1. **Consistent analysis**. Every participant sensor dataset is analyzed in the same way and isolated from each other.
2. **Efficient analysis**. Every analysis step is executed only once. Whenever your data or configuration changes, only the affected files are updated.
5. **Parallel execution**. Thanks to Snakemake, your analysis can be executed over multiple cores without changing your code.
5. **Parallel execution**. Thanks to [Snakemake](https://snakemake.github.io/), your analysis can be executed over multiple cores without changing your code.
6. **Code-free features**. Extract any of the behavioral features offered by RAPIDS without writing any code.
7. **Extensible code**. You can easily add your own data streams or behavioral features in R or Python, share them with the community, and keep authorship and citations.
8. **Time zone aware**. Your data is adjusted to one or more time zones per participant.
@ -37,7 +37,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
## Users and Contributors
??? quote "Community Contributors"
Many thanks to our community contributions and the [whole team](../team):
Many thanks to the [whole team](./team) and our community contributions:
- Agam Kumar (CMU)
- Yasaman S. Sefidgar (University of Washington)
@ -46,7 +46,7 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
- Stephen Price (CMU)
- Neil Singh (University of Virginia)
Many thanks to the researchers that made [their work](../citation) open source:
Many thanks to the researchers that made [their work](./citation) open source:
- Panda et al. [paper](https://pubmed.ncbi.nlm.nih.gov/31657854/)
- Stachl et al. [paper](https://www.pnas.org/content/117/30/17680)
@ -57,9 +57,10 @@ RAPIDS is open source, documented, multi-platform, modular, tested, and reproduc
??? quote "Publications using RAPIDS"
- Predicting Symptoms of Depression and Anxiety Using Smartphone and Wearable Data [link](https://www.frontiersin.org/articles/10.3389/fpsyt.2021.625247/full)
- Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyper-parameter Optimization, and Feature Importance Analysis: An Exploratory Study [link](https://preprints.jmir.org/preprint/26540)
- Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyperparameter Optimization, and Feature Importance Analysis: Exploratory Study [link](https://mhealth.jmir.org/2021/7/e26540)
- Digital Biomarkers of Symptom Burden Self-Reported by Perioperative Patients Undergoing Pancreatic Surgery: Prospective Longitudinal Study [link](https://cancer.jmir.org/2021/2/e27975/)
- An Automated Machine Learning Pipeline for Monitoring and Forecasting Mobile Health Data [link](https://edas.info/showManuscript.php?m=1570708269&random=750318666&type=final&ext=pdf&title=PDF+file)
- An Automated Machine Learning Pipeline for Monitoring and Forecasting Mobile Health Data [link](https://ieeexplore.ieee.org/abstract/document/9483755/)
- Mobile Footprinting: Linking Individual Distinctiveness in Mobility Patterns to Mood, Sleep, and Brain Functional Connectivity [link](https://www.biorxiv.org/content/10.1101/2021.05.17.444568v1.abstract)
<div class="users">
<div><img alt="carnegie mellon university" loading="lazy" src="./img/logos/cmu.png" /></div>

View File

@ -35,14 +35,21 @@ You can install RAPIDS using Docker (the fastest), or native instructions for Ma
```
7. *Optional*. You can edit RAPIDS files with `vim` but we recommend using `Visual Studio Code` and its `Remote Containers` extension
??? info "How to configure Remote Containers extension"
??? info "How to configure the Remote Containers extension"
- Make sure RAPIDS Docker container is running
- Make sure RAPIDS container is running
- Install the [Remote - Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
- Go to the `Remote Explorer` panel on the left hand sidebar
- On the top right dropdown menu choose `Containers`
- Double click on the `moshiresearch/rapids` container in the`CONTAINERS` tree
- A new VS Code session should open on RAPIDS main folder inside the container.
- Install VS Code and its [Remote - Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
- Click the `Remote Explorer` icon on the left-hand sidebar (the icon is a computer monitor)
- On the top right dropdown menu, choose `Containers`
- Right-click on the `moshiresearch/rapids` container in the `CONTAINERS` tree and select `Attach to Container`. A new VS Code window should open
- In the new window, open the `/rapids/` folder via the `File/Open...` menu
- Run RAPIDS inside a terminal in VS Code. Open one with the `Terminal/New Terminal` menu
!!! warning
If you installed RAPIDS using Docker for Windows on Windows 10, the container will have [limits](https://stackoverflow.com/questions/43460770/docker-windows-container-memory-limit) on the amount of RAM it can use. If you find that RAPIDS crashes due to running out of memory, [increase](https://stackoverflow.com/a/56583203/6030343) this limit.

View File

@ -23,10 +23,10 @@ Let's review some key concepts we use throughout these docs:
- [Add your own behavioral features](../../features/add-new-features/) (we can include them in RAPIDS if you want to share them with the community)
- [Add support for new data streams](../../datastreams/add-new-data-streams/) if yours cannot be processed by RAPIDS yet
- Create visualizations for [data quality control](../../visualizations/data-quality-visualizations/) and [feature inspection](../../visualizations/feature-visualizations/)
- [Extending RAPIDS to organize your analysis](../../workflow-examples/analysis/) and publish a code repository along with your code
- [Extending RAPIDS to organize your analysis](../../analysis/complete-workflow-example/) and publish a code repository along with your code
!!! hint
- We recommend you follow the [Minimal Example](../../workflow-examples/minimal/) tutorial to get familiar with RAPIDS
- We recommend you follow the [Minimal Example](../../analysis/minimal/) tutorial to get familiar with RAPIDS
- In order to follow any of the previous tutorials, you will have to [Install](../installation/), [Configure](../configuration/), and learn how to [Execute](../execution/) RAPIDS.


@ -43,9 +43,9 @@ All columns are mandatory; however, all except `device_id` and `local_date_time`
|device_id |local_date_time |heartrate_daily_restinghr |heartrate_daily_caloriesoutofrange |heartrate_daily_caloriesfatburn |heartrate_daily_caloriescardio |heartrate_daily_caloriespeak |
|-------------------------------------- |----------------- |------- |-------------- |------------- |------------ |-------|
|a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-07 |72 |1200.6102 |760.3020 |15.2048 |0 |
|a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-08 |70 |1100.1120 |660.0012 |23.7088 |0 |
|a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-09 |69 |750.3615 |734.1516 |131.8579 |0 |
|a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-07 00:00:00 |72 |1200.6102 |760.3020 |15.2048 |0 |
|a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-08 00:00:00 |70 |1100.1120 |660.0012 |23.7088 |0 |
|a748ee1a-1d0b-4ae9-9074-279a2b6ba524 |2020-10-09 00:00:00 |69 |750.3615 |734.1516 |131.8579 |0 |
??? info "FITBIT_HEARTRATE_INTRADAY"


@ -9,16 +9,12 @@ If you are interested in contributing feel free to submit a pull request or cont
??? abstract "About"
Julio Vega is a postdoctoral associate at the Mobile Sensing + Health Institute. He is interested in personalized methodologies to monitor chronic conditions that affect daily human behavior using mobile and wearable data.
- *vegaju* at *upmc* . *edu*
- [Personal Website](https://juliovega.info/)
### Meng Li
??? abstract "About"
Meng Li received her Master of Science degree in Information Science from the University of Pittsburgh. She is interested in applying machine learning algorithms to the medical field.
- *lim11* at *upmc* . *edu*
- [Linkedin Profile](https://www.linkedin.com/in/meng-li-57238414a)
- [Github Profile](https://github.com/Meng6)
### Abhineeth Reddy Kunta
@ -49,10 +45,22 @@ If you are interested in contributing feel free to submit a pull request or cont
### Nikunj Goel
??? abstract "About"
Nik is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science. He earned his Bachelor of Technology degree in Information Technology from India. He is a Data Enthusiasts and passionate about finding the meaning out of raw data. In a long term, his goal is to create a breakthrough in Data Science and Deep Learning.
Nik is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science. He earned his Bachelor of Technology degree in Information Technology from India. He is a Data Enthusiast and passionate about finding the meaning out of raw data. In a long term, his goal is to create a breakthrough in Data Science and Deep Learning.
- [Linkedin Profile](https://www.linkedin.com/in/nikunjgoel95/)
### Kirtiraj Khandekar
??? abstract "About"
Raj is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science.
### Weiyu Huang
??? abstract "About"
Weiyu is a graduate student at the University of Pittsburgh pursuing Master of Science in Information Science.
- [Github Profile](https://github.com/ChinW97)
## Community Contributors
### Agam Kumar
@ -88,6 +96,19 @@ If you are interested in contributing feel free to submit a pull request or cont
??? abstract "About"
University of Virginia
### Ian Barnett
??? abstract "About"
University of Pennsylvania
- [Profile](https://www.dbei.med.upenn.edu/bio/ian-j-barnett-phd)
### Shirley Anugrah Hayati
??? abstract "About"
University of Pennsylvania
- [Personal Website](https://www.shirley.id/)
## Advisors
### Afsaneh Doryab
@ -98,4 +119,4 @@ If you are interested in contributing feel free to submit a pull request or cont
### Carissa Low
??? abstract "About"
- [Profile](https://www.moshi.pitt.edu/people/carissa-low-phd)
- [Profile](https://www.moshi.pitt.edu/people/carissa-low-phd)


@ -20,7 +20,7 @@ These plots can be used as a rough indication of the smartphone monitoring cover
## 2. Heatmaps of overall data yield
These heatmaps are a breakdown per time segment and per participant of [Visualization 1](#1-histograms-of-phone-data-yield). The heatmap's rows represent participants, columns represent time segment instances, and the cell color represents the valid yielded minute or hour ratio for a participant during a time segment instance.
As different participants might join a study on different dates and time segments can be of any length and start on any day, the x-axis can be labelled with the absolute time of the start of each time segment instance or the time delta between the start of each time segment instance minus the start of the first instance. These plots provide a quick study overview of the monitoring coverage per person and per time segment.
As different participants might join a study on different dates and time segments can be of any length and start on any day, the x-axis can be labelled with the absolute time of each time segment instance or the time delta between each time segment instance and the start of the first instance for each participant. These plots provide a quick study overview of the monitoring coverage per person and per time segment.
The figure below shows the heatmap of the valid yielded minute ratio for participants example01 and example02 on daily segments. As we inferred from the previous histogram, the lighter (yellow) color on most time segment instances (cells) indicates that both phones sensed data without interruptions for most days (except for the first and last ones).
@ -63,7 +63,7 @@ The figure below shows this heatmap for phone sensors collected by participant e
## 4. Heatmap of sensor row count
These heatmaps are a per-sensor breakdown of [Visualization 1](#1-histograms-of-phone-data-yield) and [Visualization 2](#2-heatmaps-of-overall-data-yield). Note that the second row (ratio of valid yielded minutes) of this heatmap matches the respective participant (bottom) row in the screenshot in Visualization 2.
In these heatmaps rows represent phone or Fitbit sensors, columns represent time segment instances and cells color shows the normalized (0 to 1) row count of each sensor within a time segment instance. RAPIDS creates one heatmap per participant and they can be used to judge missing data on a per participant and per sensor basis.
In these heatmaps rows represent phone or Fitbit sensors, columns represent time segment instances and cells color shows the normalized (0 to 1) row count of each sensor within a time segment instance. A grey cell represents missing data in that time segment instance. RAPIDS creates one heatmap per participant and they can be used to judge missing data on a per participant and per sensor basis.
The figure below shows data for 14 phone sensors (including data yield) of example01's daily segments. From the top two rows, we can see that the phone was sensing data for most of the monitoring period (as suggested by Figure 3 and Figure 4). We can also infer how phone usage influenced the different sensor streams; there are peaks of screen events during the first day (Apr 23rd), peaks of location coordinates on Apr 26th and Apr 30th, and no sent or received SMS except for Apr 23rd, Apr 29th and Apr 30th (unlabeled row between screen and locations).
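As a rough illustration of the "normalized (0 to 1) row count" described above, the sketch below scales each sensor's row counts across time segment instances with pandas. The min-max scaling, the sensor names, the dates, and the counts are all assumptions made for this example; the RAPIDS plotting script may normalize differently.

```python
import pandas as pd

# Hypothetical row counts: one row per sensor, one column per daily segment instance.
row_counts = pd.DataFrame(
    {"2020-04-23": [120, 4, 30], "2020-04-24": [60, 0, 90], "2020-04-25": [0, 0, 60]},
    index=["phone_screen", "phone_messages", "phone_locations"],
)


def normalize_rows(counts: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale each sensor (row) to the [0, 1] range."""
    row_min = counts.min(axis=1)
    span = counts.max(axis=1) - row_min
    # A sensor with a constant count has no spread; divide by 1 so it maps to 0
    # instead of triggering a division by zero.
    return counts.sub(row_min, axis=0).div(span.where(span > 0, 1), axis=0)


normalized = normalize_rows(row_counts)
print(normalized)
```

Missing values (NaN) in the input stay NaN in the output, so a plotting layer could still draw them as grey cells.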


@ -0,0 +1,39 @@
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !
# !
"""
Please do not make any changes, as RAPIDS is running on tmux server ...
"""
# !


@ -1,109 +1,30 @@
name: rapids202012
name: rapids
channels:
- conda-forge
- defaults
dependencies:
- _py-xgboost-mutex=2.0
- appdirs=1.4.4
- arrow=0.16.0
- asn1crypto=1.4.0
- astropy=4.2
- attrs=20.3.0
- binaryornot=0.4.4
- blas=1.0
- brotlipy=0.7.0
- bzip2=1.0.8
- ca-certificates=2020.12.8
- certifi=2020.12.5
- cffi=1.14.4
- chardet=3.0.4
- click=7.1.2
- cookiecutter=1.6.0
- cryptography=3.3.1
- datrie=0.8.2
- docutils=0.16
- future=0.18.2
- gitdb=4.0.5
- gitdb2=4.0.2
- gitpython=3.1.11
- idna=2.10
- imbalanced-learn=0.6.2
- importlib-metadata=2.0.0
- importlib_metadata=2.0.0
- intel-openmp=2019.4
- jinja2=2.11.2
- jinja2-time=0.2.0
- joblib=1.0.0
- jsonschema=3.2.0
- libblas=3.8.0
- libcblas=3.8.0
- libcxx=10.0.0
- libedit=3.1.20191231
- libffi=3.3
- libgfortran
- liblapack=3.8.0
- libxgboost=0.90
- lightgbm=3.1.1
- llvm-openmp=10.0.0
- markupsafe=1.1.1
- mkl
- mkl-service=2.3.0
- mkl_fft=1.2.0
- mkl_random=1.1.1
- more-itertools=8.6.0
- ncurses=6.2
- numpy=1.19.2
- numpy-base=1.19.2
- openssl=1.1.1i
- pandas=1.1.5
- pbr=5.5.1
- pip=20.3.3
- plotly=4.14.1
- poyo=0.5.0
- psutil=5.7.2
- py-xgboost=0.90
- pycparser=2.20
- pyerfa=1.7.1.1
- pyopenssl=20.0.1
- pysocks=1.7.1
- python=3.7.9
- python-dateutil=2.8.1
- python_abi=3.7
- pytz=2020.4
- pyyaml=5.3.1
- readline=8.0
- requests=2.25.0
- retrying=1.3.3
- scikit-learn=0.23.2
- scipy=1.5.2
- setuptools=51.0.0
- six=1.15.0
- smmap=3.0.4
- smmap2=3.0.1
- sqlite=3.33.0
- threadpoolctl=2.1.0
- tk=8.6.10
- urllib3=1.25.11
- wheel=0.36.2
- whichcraft=0.6.1
- wrapt=1.12.1
- xgboost=0.90
- xz=5.2.5
- yaml=0.2.5
- zipp=3.4.0
- zlib=1.2.11
- pip:
- amply==0.1.4
- configargparse==0.15.1
- decorator==4.4.2
- ipython-genutils==0.2.0
- jupyter-core==4.6.3
- nbformat==5.0.7
- pulp==2.4
- pyparsing==2.4.7
- pyrsistent==0.15.5
- ratelimiter==1.2.0.post0
- snakemake==5.30.2
- toposort==1.5
- traitlets==4.3.3
prefix: /usr/local/Caskroom/miniconda/base/envs/rapids202012
- auto-sklearn
- hmmlearn
- imbalanced-learn
- jsonschema
- lightgbm
- matplotlib
- numpy
- pandas
- peakutils
- pip
- plotly
- python-dateutil
- pytz
- pywavelets
- pyyaml
- scikit-learn
- scipy
- seaborn
- setuptools
- bioconda::snakemake
- bioconda::snakemake-minimal
- tqdm
- xgboost
- pip:
- biosppy
- cr_features>=0.2


@ -3,7 +3,7 @@ include: "../rules/common.smk"
include: "../rules/renv.smk"
include: "../rules/preprocessing.smk"
include: "../rules/features.smk"
include: "../rules/models.smk"
include: "../rules/models_example.smk"
include: "../rules/reports.smk"
import itertools
@ -204,15 +204,29 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys():
else:
raise ValueError("Error: Add PHONE_LOCATIONS (and as many PHONE_SENSORS as you have) to [PHONE_DATA_YIELD][SENSORS] in config.yaml. This is necessary to compute phone_yielded_timestamps (time when the smartphone was sensing data) which is used to resample fused location data (ALL_RESAMPLED and RESAMPLED_FUSED)")
if provider == "BARNETT":
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_barnett_daily.csv", pid=config["PIDS"]))
if provider == "DORYAB":
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/phone_locations_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_processed_with_datetime_with_home.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["PHONE_LOCATIONS"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
files_to_compute.extend(expand("data/processed/features/{pid}/phone_locations.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
for provider in config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"].keys():
if config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/{pid}/fitbit_calories_intraday_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/fitbit_calories_intraday_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/fitbit_calories_intraday_features/fitbit_calories_intraday_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
files_to_compute.extend(expand("data/processed/features/{pid}/fitbit_calories_intraday.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
for provider in config["FITBIT_DATA_YIELD"]["PROVIDERS"].keys():
if config["FITBIT_DATA_YIELD"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/raw/{pid}/fitbit_heartrate_intraday_raw.csv", pid=config["PIDS"]))
@ -271,6 +285,12 @@ for provider in config["FITBIT_STEPS_SUMMARY"]["PROVIDERS"].keys():
for provider in config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"].keys():
if config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]:
if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["TIME_BASED"]["EXCLUDE"] or config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
files_to_compute.extend(expand("data/raw/{pid}/fitbit_sleep_summary_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/fitbit_steps_intraday_with_datetime_exclude_sleep.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/fitbit_steps_intraday_raw.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/raw/{pid}/fitbit_steps_intraday_with_datetime.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/interim/{pid}/fitbit_steps_intraday_features/fitbit_steps_intraday_{language}_{provider_key}.csv", pid=config["PIDS"], language=get_script_language(config["FITBIT_STEPS_INTRADAY"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower()))
@ -357,11 +377,21 @@ if config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["PLOT"]:
files_to_compute.append("reports/data_exploration/heatmap_sensor_row_count_per_time_segment.html")
if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
if not config["PHONE_DATA_YIELD"]["PROVIDERS"]["RAPIDS"]["COMPUTE"]:
raise ValueError("Error: [PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] must be True in config.yaml to get heatmaps of overall data yield.")
files_to_compute.append("reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html")
if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")
# Data Cleaning
for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() +".csv", pid=config["PIDS"]))
for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +".csv"))
# Analysis Workflow Example
models, scalers = [], []
for model_name in config["PARAMS_FOR_ANALYSIS"]["MODEL_NAMES"]:
@ -379,7 +409,6 @@ files_to_compute.extend(expand("data/raw/{pid}/participant_target_with_datetime.
files_to_compute.extend(expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]))
# Individual model
files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"]))
files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv", pid=config["PIDS"], cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
files_to_compute.extend(expand(
@ -392,7 +421,6 @@ files_to_compute.extend(expand(
scaler=scalers))
# Population model
files_to_compute.append("data/processed/features/all_participants/all_sensor_features_cleaned.csv")
files_to_compute.append("data/processed/models/population_model/input.csv")
files_to_compute.extend(expand("data/processed/models/population_model/output_{cv_method}/baselines.csv", cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"]))
files_to_compute.extend(expand(


@ -84,6 +84,7 @@ PHONE_APPLICATIONS_CRASHES:
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
PACKAGE_NAMES_HASHED: False
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
@ -93,11 +94,13 @@ PHONE_APPLICATIONS_FOREGROUND:
APPLICATION_CATEGORIES:
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
PACKAGE_NAMES_HASHED: False
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
PROVIDERS:
RAPIDS:
COMPUTE: True
INCLUDE_EPISODE_FEATURES: False
SINGLE_CATEGORIES: ["all", "email"]
MULTIPLE_CATEGORIES:
social: ["socialnetworks", "socialmediatools"]
@ -105,7 +108,11 @@ PHONE_APPLICATIONS_FOREGROUND:
SINGLE_APPS: ["top1global", "com.facebook.moments", "com.google.android.youtube", "com.twitter.android"] # There's no entropy for single apps
EXCLUDED_CATEGORIES: ["system_apps"]
EXCLUDED_APPS: ["com.fitbit.FitbitMobile", "com.aware.plugin.upmc.cancer"]
FEATURES: ["count", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
FEATURES:
APP_EVENTS: ["countevent", "timeoffirstuse", "timeoflastuse", "frequencyentropy"]
APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"]
IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable
IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable
SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py
# See https://www.rapids.science/latest/features/phone-applications-notifications/
@ -115,6 +122,7 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
CATALOGUE_SOURCE: FILE # FILE (genres are read from CATALOGUE_FILE) or GOOGLE (genres are scraped from the Play Store)
CATALOGUE_FILE: "data/external/stachl_application_genre_catalogue.csv"
UPDATE_CATALOGUE_FILE: False # if CATALOGUE_SOURCE is equal to FILE, whether or not to update CATALOGUE_FILE, if CATALOGUE_SOURCE is equal to GOOGLE all scraped genres will be saved to CATALOGUE_FILE
PACKAGE_NAMES_HASHED: False
SCRAPE_MISSING_CATEGORIES: False # whether or not to scrape missing genres, only effective if CATALOGUE_SOURCE is equal to FILE. If CATALOGUE_SOURCE is equal to GOOGLE, all genres are scraped anyway
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
@ -198,7 +206,11 @@ PHONE_DATA_YIELD:
# See https://www.rapids.science/latest/features/phone-keyboard/
PHONE_KEYBOARD:
CONTAINER: keyboard
PROVIDERS: # None implemented yet but this sensor can be used in PHONE_DATA_YIELD
PROVIDERS:
RAPIDS:
COMPUTE: False
FEATURES: ["sessioncount","averageinterkeydelay","averagesessionlength","changeintextlengthlessthanminusone","changeintextlengthequaltominusone","changeintextlengthequaltoone","changeintextlengthmorethanone","maxtextlength","lastmessagelength","totalkeyboardtouches"]
SRC_SCRIPT: src/features/phone_keyboard/rapids/main.py
# See https://www.rapids.science/latest/features/phone-light/
PHONE_LIGHT:
@ -215,12 +227,12 @@ PHONE_LOCATIONS:
LOCATIONS_TO_USE: FUSED_RESAMPLED # ALL, GPS, ALL_RESAMPLED, OR FUSED_RESAMPLED
FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD: 30 # minutes, only replicate location samples to the next sensed bin if the phone did not stop collecting data for more than this threshold
FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION: 720 # minutes, only replicate location samples to consecutive sensed bins if they were logged within this threshold after a valid location row
ACCURACY_LIMIT: 51 # meters
PROVIDERS:
DORYAB:
COMPUTE: True
FEATURES: ["locationvariance","loglocationvariance","totaldistance","avgspeed","varspeed", "numberofsignificantplaces","numberlocationtransitions","radiusgyration","timeattop1location","timeattop2location","timeattop3location","movingtostaticratio","outlierstimepercent","maxlengthstayatclusters","minlengthstayatclusters","avglengthstayatclusters","stdlengthstayatclusters","locationentropy","normalizedlocationentropy","timeathome"]
ACCURACY_LIMIT: 51 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
DBSCAN_EPS: 10 # meters
DBSCAN_MINSAMPLES: 5
THRESHOLD_STATIC : 1 # km/h
@ -236,7 +248,6 @@ PHONE_LOCATIONS:
BARNETT:
COMPUTE: False
FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"]
ACCURACY_LIMIT: 51 # meters, drops location coordinates with an accuracy higher than this. This number means there's a 68% probability the true location is within this radius
IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON
MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates grouped by minute) were used to compute features
SRC_SCRIPT: src/features/phone_locations/barnett/main.R
@ -513,7 +524,7 @@ HEATMAP_SENSORS_PER_MINUTE_PER_TIME_SEGMENT:
# See https://www.rapids.science/latest/visualizations/data-quality-visualizations/#4-heatmap-of-sensor-row-count
HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT:
PLOT: True
PLOT: False
SENSORS: [PHONE_ACTIVITY_RECOGNITION, PHONE_APPLICATIONS_FOREGROUND, PHONE_BATTERY, PHONE_BLUETOOTH, PHONE_CALLS, PHONE_CONVERSATION, PHONE_LIGHT, PHONE_LOCATIONS, PHONE_MESSAGES, PHONE_SCREEN, PHONE_WIFI_CONNECTED, PHONE_WIFI_VISIBLE]
# Features ------
@ -526,6 +537,46 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}
########################################################################################################################
# Data Cleaning #
########################################################################################################################
ALL_CLEANING_INDIVIDUAL:
PROVIDERS:
RAPIDS:
COMPUTE: True
IMPUTE_SELECTED_EVENT_FEATURES:
COMPUTE: False
MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: False
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
CORR_THRESHOLD: 0.95
SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
ALL_CLEANING_OVERALL:
PROVIDERS:
RAPIDS:
COMPUTE: True
IMPUTE_SELECTED_EVENT_FEATURES:
COMPUTE: False
MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: False
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
CORR_THRESHOLD: 0.95
SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
########################################################################################################################
# Analysis Workflow Example #
@ -543,12 +594,6 @@ PARAMS_FOR_ANALYSIS:
TARGET:
FOLDER: data/external/example_workflow
CONTAINER: participant_target.csv
# Cleaning Parameters
COLS_NAN_THRESHOLD: 0.3
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3
DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.75
MODEL_NAMES: [LogReg, kNN , SVM, DT, RF, GB, XGBoost, LightGBM]
CV_METHODS: [LeaveOneOut]
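To make the new cleaning parameters more concrete, here is a minimal pandas sketch of how thresholds like these could be applied to a feature table. The order of the steps, the exact comparison operators, the `clean_features` helper, and the `yield_ratio` column name are assumptions for illustration only; the actual logic lives in the R scripts referenced by `SRC_SCRIPT` and may differ.

```python
import numpy as np
import pandas as pd

# Values copied from the ALL_CLEANING_INDIVIDUAL block above.
params = {
    "COLS_NAN_THRESHOLD": 0.3,          # set to 1 to disable
    "COLS_VAR_THRESHOLD": True,
    "ROWS_NAN_THRESHOLD": 0.3,          # set to 1 to disable
    "DATA_YIELD_RATIO_THRESHOLD": 0.75,  # set to 0 to disable
}


def clean_features(features: pd.DataFrame, params: dict, yield_col: str) -> pd.DataFrame:
    """Hypothetical cleaning pass; not the RAPIDS implementation."""
    cleaned = features.copy()
    # Keep rows whose data yield ratio reaches the threshold (0 disables the filter).
    if params["DATA_YIELD_RATIO_THRESHOLD"] > 0:
        cleaned = cleaned[cleaned[yield_col] >= params["DATA_YIELD_RATIO_THRESHOLD"]]
    # Drop columns whose share of NaN values exceeds the threshold (1 keeps every column).
    cleaned = cleaned.loc[:, cleaned.isna().mean(axis=0) <= params["COLS_NAN_THRESHOLD"]]
    # Optionally drop numeric columns with zero variance.
    if params["COLS_VAR_THRESHOLD"]:
        variances = cleaned.var(numeric_only=True)
        cleaned = cleaned.drop(columns=variances[variances == 0].index)
    # Drop rows whose share of NaN values exceeds the threshold (1 keeps every row).
    cleaned = cleaned[cleaned.isna().mean(axis=1) <= params["ROWS_NAN_THRESHOLD"]]
    return cleaned


# Toy feature table; column names are made up for this example.
features = pd.DataFrame({
    "yield_ratio": [0.9, 0.8, 0.5, 1.0],
    "steps": [100.0, 250.0, 300.0, 80.0],
    "mostly_missing": [np.nan, np.nan, 1.0, 2.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
})
cleaned = clean_features(features, params, yield_col="yield_ratio")
print(cleaned)
```

In this toy run the low-yield row, the mostly empty column, and the constant column are removed, leaving three rows with the `yield_ratio` and `steps` columns.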


@ -74,7 +74,7 @@ extra_css:
nav:
- Home: 'index.md'
- Overview: setup/overview.md
- Minimal Example: workflow-examples/minimal.md
- Minimal Example: analysis/minimal.md
- Citation: citation.md
- Contributing: contributing.md
- Setup:
@ -85,6 +85,7 @@ nav:
- Introduction: datastreams/data-streams-introduction.md
- Phone:
- aware_mysql: datastreams/aware-mysql.md
- aware_micro_mysql: datastreams/aware-micro-mysql.md
- aware_csv: datastreams/aware-csv.md
- aware_influxdb (beta): datastreams/aware-influxdb.md
- Mandatory Phone Format: datastreams/mandatory-phone-format.md
@ -140,8 +141,9 @@ nav:
- Visualizations:
- Data Quality: visualizations/data-quality-visualizations.md
- Features: visualizations/feature-visualizations.md
- Analysis Workflows:
- Complete Example: workflow-examples/analysis.md
- Analysis:
- Data Cleaning: analysis/data-cleaning.md
- Complete Workflow Example: analysis/complete-workflow-example.md
- Developers:
- Git Flow: developers/git-flow.md
- Remote Support: developers/remote-support.md

View File

@ -0,0 +1,33 @@
Warning: 1241 parsing failures.
row col expected actual file
1 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
2 is_system_app an integer FALSE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
3 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
4 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
5 is_system_app an integer TRUE 'data/raw/p011/phone_applications_foreground_with_datetime_with_categories.csv'
... ............. .......... ...... ...............................................................................
See problems(...) for more details.
Warning message:
The following named parsers don't match the column names: application_name
Error: Problem with `filter()` input `..1`.
✖ object 'application_name' not found
Input `..1` is `!is.na(application_name)`.
Backtrace:
1. ├─`%>%`(...)
2. ├─dplyr::mutate(...)
3. ├─utils::head(., -1)
4. ├─dplyr::select(., -c("timestamp"))
5. ├─dplyr::filter(., !is.na(application_name))
6. ├─dplyr:::filter.data.frame(., !is.na(application_name))
7. │ └─dplyr:::filter_rows(.data, ...)
8. │ ├─base::withCallingHandlers(...)
9. │ └─mask$eval_all_filter(dots, env_filter)
10. └─base::.handleSimpleError(...)
11. └─dplyr:::h(simpleError(msg, call))
Execution halted
[Mon Dec 13 17:19:06 2021]
Error in rule app_episodes:
jobid: 54
output: data/interim/p011/phone_app_episodes.csv


@ -0,0 +1,5 @@
Warning message:
In barnett_daily_features(snakemake) :
Barnett's location features cannot be computed for data or time segments that do not span one or more entire days (00:00:00 to 23:59:59). Values below point to the problem:
Location data rows within a daily time segment: 0
Location data time span in days: 398.6

725
renv.lock

File diff suppressed because it is too large


@ -14,9 +14,6 @@ local({
# signal that we're loading renv during R startup
Sys.setenv("RENV_R_INITIALIZING" = "true")
on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE)
if(grepl("Darwin", Sys.info()["sysname"], fixed = TRUE) & grepl("ARM64", Sys.info()["version"], fixed = TRUE)) # M1 Macs
Sys.setenv("TZDIR" = file.path(R.home(), "share", "zoneinfo"))
# signal that we've consented to use renv
options(renv.consent = TRUE)


@ -23,10 +23,16 @@ def get_barnett_daily(wildcards):
def get_locations_python_input(wildcards):
if wildcards.provider_key.upper() == "DORYAB":
return "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv"
return "data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes_resampled_with_datetime.csv"
else:
return "data/interim/{pid}/phone_locations_processed_with_datetime.csv"
def get_calls_input(wildcards):
if (wildcards.provider_key.upper() == "RAPIDS") and (config["PHONE_CALLS"]["PROVIDERS"]["RAPIDS"]["FEATURES_TYPE"] == "EPISODES"):
return "data/interim/{pid}/phone_calls_episodes_resampled_with_datetime.csv"
else:
return "data/raw/{pid}/phone_calls_with_datetime.csv"
def find_features_files(wildcards):
feature_files = []
for provider_key, provider in config[(wildcards.sensor_key).upper()]["PROVIDERS"].items():
@ -34,6 +40,17 @@ def find_features_files(wildcards):
feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower()))
return(feature_files)
def find_joint_non_empatica_sensor_files(wildcards):
joined_files = []
for config_key in config.keys():
if config_key.startswith(("PHONE", "FITBIT")) and "PROVIDERS" in config[config_key] and isinstance(config[config_key]["PROVIDERS"], dict):
for provider_key, provider in config[config_key]["PROVIDERS"].items():
if "COMPUTE" in provider.keys() and provider["COMPUTE"]:
joined_files.append("data/processed/features/{pid}/" + config_key.lower() + ".csv")
break
return joined_files
def optional_steps_sleep_input(wildcards):
if config["FITBIT_STEPS_INTRADAY"]["EXCLUDE_SLEEP"]["FITBIT_BASED"]["EXCLUDE"]:
return "data/raw/{pid}/fitbit_sleep_summary_raw.csv"
@ -108,7 +125,16 @@ def input_tzcodes_file(wilcards):
if not config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"].lower().endswith(".csv"):
raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, instead you typed: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
if not Path(config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]).exists():
raise ValueError("[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, the file in the path you typed does not exist: " + config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
try:
config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
except KeyError:
raise ValueError("To create TZCODES_FILE, a list of timezones should be created " +
"with the rule preprocessing.smk/prepare_tzcodes_file " +
"which will create a file specified as config['TIMEZONE']['MULTIPLE']['TZ_FILE']." +
"\n An alternative is to provide the file manually: " +
"[TIMEZONE][MULTIPLE][TZCODES_FILE] should point to a CSV file, " +
"but the file in the path you typed does not exist: " +
config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"])
return [config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]]
return []


@ -32,7 +32,7 @@ rule phone_data_yield_r_features:
output:
"data/interim/{pid}/phone_data_yield_features/phone_data_yield_r_{provider_key}.csv"
script:
"../src/features/entry.R"
"../src/features/entry.R"
rule phone_accelerometer_python_features:
input:
@ -125,6 +125,7 @@ rule phone_applications_crashes_r_features:
rule phone_applications_foreground_python_features:
input:
sensor_data = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv",
episode_data = "data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
provider = lambda wildcards: config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -138,6 +139,7 @@ rule phone_applications_foreground_python_features:
rule phone_applications_foreground_r_features:
input:
sensor_data = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv",
episode_data = "data/interim/{pid}/phone_app_episodes_resampled_with_datetime.csv",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
provider = lambda wildcards: config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -262,9 +264,17 @@ rule phone_bluetooth_r_features:
script:
"../src/features/entry.R"
rule calls_python_features:
rule calls_episodes:
input:
sensor_data = "data/raw/{pid}/phone_calls_with_datetime.csv",
calls = "data/raw/{pid}/phone_calls_raw.csv"
output:
"data/interim/{pid}/phone_calls_episodes.csv"
script:
"../src/features/phone_calls/episodes/calls_episodes.py"
rule phone_calls_python_features:
input:
sensor_data = get_calls_input,
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
provider = lambda wildcards: config["PHONE_CALLS"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -275,9 +285,9 @@ rule calls_python_features:
script:
"../src/features/entry.py"
rule calls_r_features:
rule phone_calls_r_features:
input:
sensor_data = "data/raw/{pid}/phone_calls_with_datetime.csv",
sensor_data = get_calls_input,
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
provider = lambda wildcards: config["PHONE_CALLS"]["PROVIDERS"][wildcards.provider_key.upper()],
@ -314,6 +324,40 @@ rule conversation_r_features:
script:
"../src/features/entry.R"
rule preprocess_esm:
input: "data/raw/{pid}/phone_esm_with_datetime.csv"
params:
scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"]["STRAW"]["SCALES"]
output: "data/interim/{pid}/phone_esm_clean.csv"
script:
"../src/features/phone_esm/straw/preprocess.py"
rule esm_features:
input:
sensor_data = "data/interim/{pid}/phone_esm_clean.csv",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
provider = lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()],
provider_key = "{provider_key}",
sensor_key = "phone_esm",
scales=lambda wildcards: config["PHONE_ESM"]["PROVIDERS"][wildcards.provider_key.upper()]["SCALES"]
output: "data/interim/{pid}/phone_esm_features/phone_esm_python_{provider_key}.csv"
script:
"../src/features/entry.py"
rule phone_speech_python_features:
input:
sensor_data = "data/raw/{pid}/phone_speech_with_datetime.csv",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
provider = lambda wildcards: config["PHONE_SPEECH"]["PROVIDERS"][wildcards.provider_key.upper()],
provider_key = "{provider_key}",
sensor_key = "phone_speech"
output:
"data/interim/{pid}/phone_speech_features/phone_speech_python_{provider_key}.csv"
script:
"../src/features/entry.py"
rule phone_keyboard_python_features:
input:
sensor_data = "data/raw/{pid}/phone_keyboard_with_datetime.csv",
@ -372,7 +416,7 @@ rule phone_locations_add_doryab_extra_columns:
params:
provider = config["PHONE_LOCATIONS"]["PROVIDERS"]["DORYAB"]
output:
"data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns.csv"
"data/interim/{pid}/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv"
script:
"../src/features/phone_locations/doryab/add_doryab_extra_columns.py"
@ -448,6 +492,15 @@ rule screen_episodes:
script:
"../src/features/phone_screen/episodes/screen_episodes.R"
rule app_episodes:
input:
screen = "data/interim/{pid}/phone_screen_episodes.csv",
app = "data/raw/{pid}/phone_applications_foreground_with_datetime_with_categories.csv"
output:
"data/interim/{pid}/phone_app_episodes.csv"
script:
"../src/features/phone_applications_foreground/episodes/app_episodes.R"
rule phone_screen_python_features:
input:
sensor_episodes = "data/interim/{pid}/phone_screen_episodes_resampled_with_datetime.csv",
@ -742,22 +795,6 @@ rule fitbit_sleep_intraday_r_features:
script:
"../src/features/entry.R"
rule merge_sensor_features_for_individual_participants:
input:
feature_files = input_merge_sensor_features_for_individual_participants
output:
"data/processed/features/{pid}/all_sensor_features.csv"
script:
"../src/features/utils/merge_sensor_features_for_individual_participants.R"
rule merge_sensor_features_for_all_participants:
input:
feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
output:
"data/processed/features/all_participants/all_sensor_features.csv"
script:
"../src/features/utils/merge_sensor_features_for_all_participants.R"
rule empatica_accelerometer_python_features:
input:
sensor_data = "data/raw/{pid}/empatica_accelerometer_with_datetime.csv",
@ -767,7 +804,8 @@ rule empatica_accelerometer_python_features:
provider_key = "{provider_key}",
sensor_key = "empatica_accelerometer"
output:
"data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv"
"data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}.csv",
"data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}_windows.csv"
script:
"../src/features/entry.py"
@ -793,7 +831,8 @@ rule empatica_heartrate_python_features:
provider_key = "{provider_key}",
sensor_key = "empatica_heartrate"
output:
"data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv"
"data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}.csv",
"data/interim/{pid}/empatica_heartrate_features/empatica_heartrate_python_{provider_key}_windows.csv"
script:
"../src/features/entry.py"
@ -819,7 +858,8 @@ rule empatica_temperature_python_features:
provider_key = "{provider_key}",
sensor_key = "empatica_temperature"
output:
"data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv"
"data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}.csv",
"data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}_windows.csv"
script:
"../src/features/entry.py"
@ -845,7 +885,8 @@ rule empatica_electrodermal_activity_python_features:
provider_key = "{provider_key}",
sensor_key = "empatica_electrodermal_activity"
output:
"data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv"
"data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}.csv",
"data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}_windows.csv"
script:
"../src/features/entry.py"
@ -871,7 +912,8 @@ rule empatica_blood_volume_pulse_python_features:
provider_key = "{provider_key}",
sensor_key = "empatica_blood_volume_pulse"
output:
"data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv"
"data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}.csv",
"data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}_windows.csv"
script:
"../src/features/entry.py"
@ -897,7 +939,8 @@ rule empatica_inter_beat_interval_python_features:
provider_key = "{provider_key}",
sensor_key = "empatica_inter_beat_interval"
output:
"data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv"
"data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}.csv",
"data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}_windows.csv"
script:
"../src/features/entry.py"
@ -939,3 +982,48 @@ rule empatica_tags_r_features:
"data/interim/{pid}/empatica_tags_features/empatica_tags_r_{provider_key}.csv"
script:
"../src/features/entry.R"
rule merge_sensor_features_for_individual_participants:
input:
feature_files = input_merge_sensor_features_for_individual_participants
output:
"data/processed/features/{pid}/all_sensor_features.csv"
script:
"../src/features/utils/merge_sensor_features_for_individual_participants.R"
rule merge_sensor_features_for_all_participants:
input:
feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])
output:
"data/processed/features/all_participants/all_sensor_features.csv"
script:
"../src/features/utils/merge_sensor_features_for_all_participants.R"
rule clean_sensor_features_for_individual_participants:
input:
sensor_data = rules.merge_sensor_features_for_individual_participants.output
wildcard_constraints:
pid = "("+"|".join(config["PIDS"])+")"
params:
provider = lambda wildcards: config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][wildcards.provider_key.upper()],
provider_key = "{provider_key}",
script_extension = "{script_extension}",
sensor_key = "all_cleaning_individual"
output:
"data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv"
script:
"../src/features/entry.{params.script_extension}"
rule clean_sensor_features_for_all_participants:
input:
sensor_data = rules.merge_sensor_features_for_all_participants.output
params:
provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()],
provider_key = "{provider_key}",
script_extension = "{script_extension}",
sensor_key = "all_cleaning_overall",
target = "{target}"
output:
"data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}_({target}).csv"
script:
"../src/features/entry.{params.script_extension}"

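The `wildcard_constraints` entry in `clean_sensor_features_for_individual_participants` restricts `{pid}` to the configured participant ids. Presumably this keeps the rule from also matching paths under `all_participants/`, which the second cleaning rule produces. The sketch below illustrates the effect with a simplified regular expression; the participant ids are made up, and the pattern is only an approximation of what Snakemake builds internally.

```python
import re

pids = ["p01", "p02"]  # hypothetical ids standing in for config["PIDS"]
pid_constraint = "(" + "|".join(pids) + ")"

# Simplified regex equivalent of the rule's output pattern (illustration only).
pattern = re.compile(
    r"data/processed/features/(?P<pid>" + pid_constraint + r")/"
    r"all_sensor_features_cleaned_(?P<provider_key>.+)_(?P<script_extension>.+)\.csv"
)

individual = "data/processed/features/p01/all_sensor_features_cleaned_straw_py.csv"
overall = (
    "data/processed/features/all_participants/"
    "all_sensor_features_cleaned_straw_py_(PANAS_negative_affect_mean).csv"
)

match_individual = pattern.fullmatch(individual)
match_overall = pattern.fullmatch(overall)

print(match_individual.groupdict() if match_individual else None)
print(match_overall)
```

The target name in the second path is invented for the example. Note that the ids are joined without escaping, so an id containing regex metacharacters would need `re.escape` first.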

@ -1,165 +1,52 @@
rule download_demographic_data:
rule merge_baseline_data:
input:
data = expand(config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/{container}", container=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["CONTAINER"])
output:
"data/raw/baseline_merged.csv"
script:
"../src/data/merge_baseline_data.py"
rule download_baseline_data:
input:
participant_file = "data/external/participant_files/{pid}.yaml",
data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
data = "data/raw/baseline_merged.csv"
output:
"data/raw/{pid}/participant_info_raw.csv"
"data/raw/{pid}/participant_baseline_raw.csv"
script:
"../src/data/workflow_example/download_demographic_data.R"
"../src/data/download_baseline_data.py"
rule demographic_features:
rule baseline_features:
input:
participant_info = "data/raw/{pid}/participant_info_raw.csv"
"data/raw/{pid}/participant_baseline_raw.csv"
params:
pid = "{pid}",
features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
pid="{pid}",
features=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FEATURES"],
question_filename=config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["QUESTION_LIST"]
output:
"data/processed/features/{pid}/demographic_features.csv"
interim="data/interim/{pid}/baseline_questionnaires.csv",
features="data/processed/features/{pid}/baseline_features.csv"
script:
"../src/features/workflow_example/demographic_features.py"
"../src/data/baseline_features.py"
rule download_target_data:
rule select_target:
input:
participant_file = "data/external/participant_files/{pid}.yaml",
data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
output:
"data/raw/{pid}/participant_target_raw.csv"
script:
"../src/data/workflow_example/download_target_data.R"
rule target_readable_datetime:
input:
sensor_input = "data/raw/{pid}/participant_target_raw.csv",
time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
pid_file = "data/external/participant_files/{pid}.yaml",
tzcodes_file = input_tzcodes_file,
cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv"
params:
device_type = "fitbit",
timezone_parameters = config["TIMEZONE"],
pid = "{pid}",
time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
output:
"data/raw/{pid}/participant_target_with_datetime.csv"
script:
"../src/data/datetime/readable_datetime.R"
rule parse_targets:
input:
targets = "data/raw/{pid}/participant_target_with_datetime.csv",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
output:
"data/processed/targets/{pid}/parsed_targets.csv"
script:
"../src/models/workflow_example/parse_targets.py"
rule clean_sensor_features_for_individual_participants:
input:
rules.merge_sensor_features_for_individual_participants.output
params:
cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
output:
"data/processed/features/{pid}/all_sensor_features_cleaned.csv"
script:
"../src/models/workflow_example/clean_sensor_features.R"
rule clean_sensor_features_for_all_participants:
input:
rules.merge_sensor_features_for_all_participants.output
params:
cols_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_NAN_THRESHOLD"],
cols_var_threshold = config["PARAMS_FOR_ANALYSIS"]["COLS_VAR_THRESHOLD"],
rows_nan_threshold = config["PARAMS_FOR_ANALYSIS"]["ROWS_NAN_THRESHOLD"],
data_yielded_hours_ratio_threshold = config["PARAMS_FOR_ANALYSIS"]["DATA_YIELDED_HOURS_RATIO_THRESHOLD"],
output:
"data/processed/features/all_participants/all_sensor_features_cleaned.csv"
script:
"../src/models/workflow_example/clean_sensor_features.R"
rule merge_features_and_targets_for_individual_model:
input:
cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned.csv",
targets = "data/processed/targets/{pid}/parsed_targets.csv",
target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"]
output:
"data/processed/models/individual_model/{pid}/input.csv"
script:
"../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
"../src/models/select_targets.py"
rule merge_features_and_targets_for_population_model:
input:
cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned.csv",
demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
output:
"data/processed/models/population_model/input.csv"
script:
"../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
rule baselines_for_individual_model:
input:
"data/processed/models/individual_model/{pid}/input.csv"
cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_({target}).csv",
demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]),
params:
cv_method = "{cv_method}",
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
target_variable="{target}"
output:
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
log:
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
"data/processed/models/population_model/input_{target}.csv"
script:
"../src/models/workflow_example/baselines.py"
"../src/models/merge_features_and_targets_for_population_model.py"
rule baselines_for_population_model:
input:
"data/processed/models/population_model/input.csv"
params:
cv_method = "{cv_method}",
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
output:
"data/processed/models/population_model/output_{cv_method}/baselines.csv"
log:
"data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
script:
"../src/models/workflow_example/baselines.py"
rule modelling_for_individual_participants:
input:
data = "data/processed/models/individual_model/{pid}/input.csv"
params:
model = "{model}",
cv_method = "{cv_method}",
scaler = "{scaler}",
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
output:
fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
log:
"data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
script:
"../src/models/workflow_example/modelling.py"
rule modelling_for_all_participants:
input:
data = "data/processed/models/population_model/input.csv"
params:
model = "{model}",
cv_method = "{cv_method}",
scaler = "{scaler}",
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
output:
fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
log:
"data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
script:
"../src/models/workflow_example/modelling.py"


@ -0,0 +1,139 @@
rule download_demographic_data:
input:
participant_file = "data/external/participant_files/{pid}.yaml",
data = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CONTAINER"]
output:
"data/raw/{pid}/participant_info_raw.csv"
script:
"../src/data/workflow_example/download_demographic_data.R"
rule demographic_features:
input:
participant_info = "data/raw/{pid}/participant_info_raw.csv"
params:
pid = "{pid}",
features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"]
output:
"data/processed/features/{pid}/demographic_features.csv"
script:
"../src/features/workflow_example/demographic_features.py"
rule download_target_data:
input:
participant_file = "data/external/participant_files/{pid}.yaml",
data = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["FOLDER"] + "/" + config["PARAMS_FOR_ANALYSIS"]["TARGET"]["CONTAINER"]
output:
"data/raw/{pid}/participant_target_raw.csv"
script:
"../src/data/workflow_example/download_target_data.R"
rule target_readable_datetime:
input:
sensor_input = "data/raw/{pid}/participant_target_raw.csv",
time_segments = "data/interim/time_segments/{pid}_time_segments.csv",
pid_file = "data/external/participant_files/{pid}.yaml",
tzcodes_file = input_tzcodes_file,
params:
device_type = "fitbit",
timezone_parameters = config["TIMEZONE"],
pid = "{pid}",
time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
output:
"data/raw/{pid}/participant_target_with_datetime.csv"
script:
"../src/data/datetime/readable_datetime.R"
rule parse_targets:
input:
targets = "data/raw/{pid}/participant_target_with_datetime.csv",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
output:
"data/processed/targets/{pid}/parsed_targets.csv"
script:
"../src/models/workflow_example/parse_targets.py"
rule merge_features_and_targets_for_individual_model:
input:
cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_rapids.csv",
targets = "data/processed/targets/{pid}/parsed_targets.csv",
output:
"data/processed/models/individual_model/{pid}/input.csv"
script:
"../src/models/workflow_example/merge_features_and_targets_for_individual_model.py"
rule merge_features_and_targets_for_population_model:
input:
cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_rapids.csv",
demographic_features = expand("data/processed/features/{pid}/demographic_features.csv", pid=config["PIDS"]),
targets = expand("data/processed/targets/{pid}/parsed_targets.csv", pid=config["PIDS"]),
output:
"data/processed/models/population_model/input.csv"
script:
"../src/models/workflow_example/merge_features_and_targets_for_population_model.py"
rule baselines_for_individual_model:
input:
"data/processed/models/individual_model/{pid}/input.csv"
params:
cv_method = "{cv_method}",
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
output:
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv"
log:
"data/processed/models/individual_model/{pid}/output_{cv_method}/baselines_notes.log"
script:
"../src/models/workflow_example/baselines.py"
rule baselines_for_population_model:
input:
"data/processed/models/population_model/input.csv"
params:
cv_method = "{cv_method}",
colnames_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["FEATURES"],
output:
"data/processed/models/population_model/output_{cv_method}/baselines.csv"
log:
"data/processed/models/population_model/output_{cv_method}/baselines_notes.log"
script:
"../src/models/workflow_example/baselines.py"
rule modelling_for_individual_participants:
input:
data = "data/processed/models/individual_model/{pid}/input.csv"
params:
model = "{model}",
cv_method = "{cv_method}",
scaler = "{scaler}",
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
output:
fold_predictions = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
fold_metrics = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
overall_results = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/overall_results.csv",
fold_feature_importances = "data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
log:
"data/processed/models/individual_model/{pid}/output_{cv_method}/{model}/{scaler}/notes.log"
script:
"../src/models/workflow_example/modelling.py"
rule modelling_for_all_participants:
input:
data = "data/processed/models/population_model/input.csv"
params:
model = "{model}",
cv_method = "{cv_method}",
scaler = "{scaler}",
categorical_operators = config["PARAMS_FOR_ANALYSIS"]["CATEGORICAL_OPERATORS"],
categorical_demographic_features = config["PARAMS_FOR_ANALYSIS"]["DEMOGRAPHIC"]["CATEGORICAL_FEATURES"],
model_hyperparams = config["PARAMS_FOR_ANALYSIS"]["MODEL_HYPERPARAMS"],
output:
fold_predictions = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_predictions.csv",
fold_metrics = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_metrics.csv",
overall_results = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/overall_results.csv",
fold_feature_importances = "data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/fold_feature_importances.csv"
log:
"data/processed/models/population_model/output_{cv_method}/{model}/{scaler}/notes.log"
script:
"../src/models/workflow_example/modelling.py"


@ -4,6 +4,36 @@ rule create_example_participant_files:
shell:
"echo 'PHONE:\n DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n PLATFORMS: [android]\n LABEL: test-01\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\nFITBIT:\n DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524]\n LABEL: test-01\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example01.yaml && echo 'PHONE:\n DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n PLATFORMS: [ios]\n LABEL: test-02\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\nFITBIT:\n DEVICE_IDS: [13dbc8a3-dae3-4834-823a-4bc96a7d459d]\n LABEL: test-02\n START_DATE: 2020-04-23 00:00:00\n END_DATE: 2020-05-04 23:59:59\n' >> ./data/external/participant_files/example02.yaml"
# rule query_usernames_device_empatica_ids:
# params:
# baseline_folder = "/mnt/e/STRAWbaseline/"
# output:
# usernames_file = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"],
# timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
# script:
# "../../participants/prepare_usernames_file.py"
rule prepare_tzcodes_file:
input:
timezone_file = config["TIMEZONE"]["MULTIPLE"]["TZ_FILE"]
output:
tzcodes_file = config["TIMEZONE"]["MULTIPLE"]["TZCODES_FILE"]
script:
"../tools/create_multi_timezones_file.py"
rule prepare_participants_csv:
input:
username_list = config["CREATE_PARTICIPANT_FILES"]["USERNAMES_CSV"]
params:
data_configuration = config["PHONE_DATA_STREAMS"][config["PHONE_DATA_STREAMS"]["USE"]],
participants_table = "participants",
device_id_table = "esm",
start_end_date_table = "esm"
output:
participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
script:
"../src/data/translate_usernames_into_participants_data.R"
rule create_participants_files:
input:
participants_file = config["CREATE_PARTICIPANT_FILES"]["CSV_FILE_PATH"]
@ -98,7 +128,8 @@ rule process_phone_locations_types:
params:
consecutive_threshold = config["PHONE_LOCATIONS"]["FUSED_RESAMPLED_CONSECUTIVE_THRESHOLD"],
time_since_valid_location = config["PHONE_LOCATIONS"]["FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION"],
locations_to_use = config["PHONE_LOCATIONS"]["LOCATIONS_TO_USE"]
locations_to_use = config["PHONE_LOCATIONS"]["LOCATIONS_TO_USE"],
accuracy_limit = config["PHONE_LOCATIONS"]["ACCURACY_LIMIT"]
output:
"data/interim/{pid}/phone_locations_processed.csv"
script:
@ -146,7 +177,6 @@ rule resample_episodes_with_datetime:
script:
"../src/data/datetime/readable_datetime.R"
rule phone_application_categories:
input:
"data/raw/{pid}/phone_applications_{type}_with_datetime.csv"
@ -217,5 +247,33 @@ rule empatica_readable_datetime:
include_past_periodic_segments = config["TIME_SEGMENTS"]["INCLUDE_PAST_PERIODIC_SEGMENTS"]
output:
"data/raw/{pid}/empatica_{sensor}_with_datetime.csv"
resources:
mem_mb=50000
script:
"../src/data/datetime/readable_datetime.R"
rule extract_event_information_from_esm:
input:
esm_raw_input = "data/raw/{pid}/phone_esm_raw.csv",
pid_file = "data/external/participant_files/{pid}.yaml"
params:
stage = "extract",
pid = "{pid}"
output:
"data/raw/ers/{pid}_ers.csv",
"data/raw/ers/{pid}_stress_event_targets.csv"
script:
"../src/features/phone_esm/straw/process_user_event_related_segments.py"
rule merge_event_related_segments_files:
input:
ers_files = expand("data/raw/ers/{pid}_ers.csv", pid=config["PIDS"]),
se_files = expand("data/raw/ers/{pid}_stress_event_targets.csv", pid=config["PIDS"])
params:
stage = "merge"
output:
"data/external/straw_events.csv",
"data/external/stress_event_targets.csv"
script:
"../src/features/phone_esm/straw/process_user_event_related_segments.py"

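Both inputs of `merge_event_related_segments_files` are built with `expand` over `config["PIDS"]`. With a single wildcard, this yields one path per participant, in the order the ids are listed. The plain-Python sketch below mimics that behaviour without importing Snakemake; `expand_over_pids` and the participant ids are hypothetical names used only for illustration.

```python
def expand_over_pids(pattern, pids):
    # One formatted path per participant id, preserving the order of `pids`.
    return [pattern.format(pid=pid) for pid in pids]


pids = ["p01", "p02"]  # hypothetical ids standing in for config["PIDS"]
ers_files = expand_over_pids("data/raw/ers/{pid}_ers.csv", pids)
se_files = expand_over_pids("data/raw/ers/{pid}_stress_event_targets.csv", pids)

# The two lists line up by participant, which a merging script can rely on.
for ers_file, se_file in zip(ers_files, se_files):
    print(ers_file, se_file)
```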

@ -1,6 +1,8 @@
rule histogram_phone_data_yield:
input:
"data/processed/features/all_participants/all_sensor_features.csv"
params:
time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
output:
"reports/data_exploration/histogram_phone_data_yield.html"
script:
@ -12,7 +14,8 @@ rule heatmap_sensors_per_minute_per_time_segment:
participant_file = "data/external/participant_files/{pid}.yaml",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
pid = "{pid}"
pid = "{pid}",
time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
output:
"reports/interim/{pid}/heatmap_sensors_per_minute_per_time_segment.html"
script:
@ -33,7 +36,9 @@ rule heatmap_sensor_row_count_per_time_segment:
participant_file = "data/external/participant_files/{pid}.yaml",
time_segments_labels = "data/interim/time_segments/{pid}_time_segments_labels.csv"
params:
pid = "{pid}"
pid = "{pid}",
sensor_names = config["HEATMAP_SENSOR_ROW_COUNT_PER_TIME_SEGMENT"]["SENSORS"],
time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
output:
"reports/interim/{pid}/heatmap_sensor_row_count_per_time_segment.html"
script:
@ -49,11 +54,13 @@ rule merge_heatmap_sensor_row_count_per_time_segment:
rule heatmap_phone_data_yield_per_participant_per_time_segment:
input:
phone_data_yield = expand("data/processed/features/{pid}/phone_data_yield.csv", pid=config["PIDS"]),
participant_file = expand("data/external/participant_files/{pid}.yaml", pid=config["PIDS"]),
time_segments_labels = expand("data/interim/time_segments/{pid}_time_segments_labels.csv", pid=config["PIDS"])
participant_files = expand("data/external/participant_files/{pid}.yaml", pid=config["PIDS"]),
time_segments_file = config["TIME_SEGMENTS"]["FILE"],
phone_data_yield = "data/processed/features/all_participants/all_sensor_features.csv",
params:
time = config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["TIME"]
pids = config["PIDS"],
time = config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["TIME"],
time_segments_type = config["TIME_SEGMENTS"]["TYPE"]
output:
"reports/data_exploration/heatmap_phone_data_yield_per_participant_per_time_segment.html"
script:
@ -63,6 +70,7 @@ rule heatmap_feature_correlation_matrix:
input:
all_sensor_features = "data/processed/features/all_participants/all_sensor_features.csv" # before data cleaning
params:
time_segments_type = config["TIME_SEGMENTS"]["TYPE"],
min_rows_ratio = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["MIN_ROWS_RATIO"],
corr_threshold = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["CORR_THRESHOLD"],
corr_method = config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["CORR_METHOD"]


@ -0,0 +1,182 @@
import numpy as np
import pandas as pd

pid = snakemake.params["pid"]
requested_features = snakemake.params["features"]
baseline_interim = pd.DataFrame(columns=["qid", "question", "score_original", "score"])
baseline_features = pd.DataFrame(columns=requested_features)
question_filename = snakemake.params["question_filename"]

JCQ_DEMAND = "JobEisen"
JCQ_CONTROL = "JobControle"

dict_JCQ_demand_control_reverse = {
    JCQ_DEMAND: {
        3: " [Od mene se ne zahteva,",
        4: " [Imam dovolj časa, da končam",
        5: " [Pri svojem delu se ne srečujem s konfliktnimi",
    },
    JCQ_CONTROL: {
        2: " |Moje delo vključuje veliko ponavljajočega",
        6: " [Pri svojem delu imam zelo malo svobode",
    },
}

LIMESURVEY_JCQ_MIN = 1
LIMESURVEY_JCQ_MAX = 4
DEMAND_CONTROL_RATIO_MIN = 5 / (9 * 4)
DEMAND_CONTROL_RATIO_MAX = (4 * 5) / 9

JCQ_NORMS = {
    "F": {
        0: DEMAND_CONTROL_RATIO_MIN,
        1: 0.45,
        2: 0.52,
        3: 0.62,
        4: DEMAND_CONTROL_RATIO_MAX,
    },
    "M": {
        0: DEMAND_CONTROL_RATIO_MIN,
        1: 0.41,
        2: 0.48,
        3: 0.56,
        4: DEMAND_CONTROL_RATIO_MAX,
    },
}

participant_info = pd.read_csv(snakemake.input[0], parse_dates=["date_of_birth"])

if not participant_info.empty:
    if "age" in requested_features:
        now = pd.Timestamp("now")
        baseline_features.loc[0, "age"] = (
            now - participant_info.loc[0, "date_of_birth"]
        ).days / 365.25245
    if "gender" in requested_features:
        baseline_features.loc[0, "gender"] = participant_info.loc[0, "gender"]
    if "startlanguage" in requested_features:
        baseline_features.loc[0, "startlanguage"] = participant_info.loc[
            0, "startlanguage"
        ]
    if (
        ("limesurvey_demand" in requested_features)
        or ("limesurvey_control" in requested_features)
        or ("limesurvey_demand_control_ratio" in requested_features)
    ):
        participant_info_t = participant_info.T
        rows_baseline = participant_info_t.index

        if ("limesurvey_demand" in requested_features) or (
            "limesurvey_demand_control_ratio" in requested_features
        ):
            # Find questions about demand, but disregard time (duration of filling in questionnaire)
            rows_demand = rows_baseline.str.startswith(
                JCQ_DEMAND
            ) & ~rows_baseline.str.endswith("Time")
            limesurvey_demand = (
                participant_info_t[rows_demand]
                .reset_index()
                .rename(columns={"index": "question", 0: "score_original"})
            )
            # Extract question IDs from names such as JobEisen[3]
            limesurvey_demand["qid"] = (
                limesurvey_demand["question"].str.extract(r"\[(\d+)\]").astype(int)
            )
            limesurvey_demand["score"] = limesurvey_demand["score_original"]
            # Identify rows that include questions to be reversed.
            rows_demand_reverse = limesurvey_demand["qid"].isin(
                dict_JCQ_demand_control_reverse[JCQ_DEMAND].keys()
            )
            # Reverse the score, so that the maximum value becomes the minimum etc.
            limesurvey_demand.loc[rows_demand_reverse, "score"] = (
                LIMESURVEY_JCQ_MAX
                + LIMESURVEY_JCQ_MIN
                - limesurvey_demand.loc[rows_demand_reverse, "score_original"]
            )
            baseline_interim = pd.concat([baseline_interim, limesurvey_demand], axis=0, ignore_index=True)
            if "limesurvey_demand" in requested_features:
                baseline_features.loc[0, "limesurvey_demand"] = limesurvey_demand[
                    "score"
                ].sum()

        if ("limesurvey_control" in requested_features) or (
            "limesurvey_demand_control_ratio" in requested_features
        ):
            # Find questions about control, but disregard time (duration of filling in questionnaire)
            rows_control = rows_baseline.str.startswith(
                JCQ_CONTROL
            ) & ~rows_baseline.str.endswith("Time")
            limesurvey_control = (
                participant_info_t[rows_control]
                .reset_index()
                .rename(columns={"index": "question", 0: "score_original"})
            )
            # Extract question IDs from names such as JobControle[3]
            limesurvey_control["qid"] = (
                limesurvey_control["question"].str.extract(r"\[(\d+)\]").astype(int)
            )
            limesurvey_control["score"] = limesurvey_control["score_original"]
            # Identify rows that include questions to be reversed.
            rows_control_reverse = limesurvey_control["qid"].isin(
                dict_JCQ_demand_control_reverse[JCQ_CONTROL].keys()
            )
            # Reverse the score, so that the maximum value becomes the minimum etc.
            limesurvey_control.loc[rows_control_reverse, "score"] = (
                LIMESURVEY_JCQ_MAX
                + LIMESURVEY_JCQ_MIN
                - limesurvey_control.loc[rows_control_reverse, "score_original"]
            )
            baseline_interim = pd.concat([baseline_interim, limesurvey_control], axis=0, ignore_index=True)
            if "limesurvey_control" in requested_features:
                baseline_features.loc[0, "limesurvey_control"] = limesurvey_control[
                    "score"
                ].sum()

        if "limesurvey_demand_control_ratio" in requested_features:
            if limesurvey_control["score"].sum():
                limesurvey_demand_control_ratio = (
                    limesurvey_demand["score"].sum() / limesurvey_control["score"].sum()
                )
            else:
                limesurvey_demand_control_ratio = 0
            if (
                JCQ_NORMS[participant_info.loc[0, "gender"]][0]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][1]
            ):
                limesurvey_quartile = 1
            elif (
                JCQ_NORMS[participant_info.loc[0, "gender"]][1]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][2]
            ):
                limesurvey_quartile = 2
            elif (
                JCQ_NORMS[participant_info.loc[0, "gender"]][2]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][3]
            ):
                limesurvey_quartile = 3
            elif (
                JCQ_NORMS[participant_info.loc[0, "gender"]][3]
                <= limesurvey_demand_control_ratio
                < JCQ_NORMS[participant_info.loc[0, "gender"]][4]
            ):
                limesurvey_quartile = 4
            else:
                limesurvey_quartile = np.nan

            baseline_features.loc[
                0, "limesurvey_demand_control_ratio"
            ] = limesurvey_demand_control_ratio
            baseline_features.loc[
                0, "limesurvey_demand_control_ratio_quartile"
            ] = limesurvey_quartile

if not baseline_interim.empty:
    baseline_interim.to_csv(snakemake.output["interim"], index=False, encoding="utf-8")
baseline_features.to_csv(snakemake.output["features"], index=False, encoding="utf-8")
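
The two core steps of the script above, reversing selected Likert items and binning the demand/control ratio into a quartile, can be sketched without pandas. This is a minimal illustration, not the code from the commit: the function names and the list form of the norms are assumptions made for the example, while the score bounds and the "F" boundaries are copied from the script.

```python
from bisect import bisect_right
import math

LIKERT_MIN, LIKERT_MAX = 1, 4  # same bounds as LIMESURVEY_JCQ_MIN / LIMESURVEY_JCQ_MAX


def reverse_score(score, low=LIKERT_MIN, high=LIKERT_MAX):
    """Mirror a Likert score so that the maximum becomes the minimum and vice versa."""
    return high + low - score


def quartile(ratio, norms):
    """Return 1..4 for the half-open bin [norms[k-1], norms[k]) that holds ratio.

    Values outside [norms[0], norms[-1]) yield NaN, mirroring the final
    else branch of the if/elif chain in the script.
    """
    if not (norms[0] <= ratio < norms[-1]):
        return math.nan
    return bisect_right(norms, ratio)


# Boundaries copied from the "F" entry of JCQ_NORMS (hypothetical list form).
norms_f = [5 / (9 * 4), 0.45, 0.52, 0.62, (4 * 5) / 9]

print(reverse_score(1), reverse_score(4))
print(quartile(0.45, norms_f), quartile(0.70, norms_f))
```

A ratio exactly on a boundary falls into the upper bin, as in the original comparisons (`<=` on the lower bound, `<` on the upper). A ratio of 0, which the script assigns when the control sum is zero, lies below the lowest boundary and therefore gets no quartile.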


@ -1,6 +1,6 @@
source("renv/activate.R")
library(RMariaDB)
#library(RMariaDB)
library(stringr)
library(purrr)
library(readr)
@ -58,7 +58,7 @@ participants %>%
lines <- append(lines, empty_fitbit)
if(add_empatica_section == TRUE && !is.na(row[empatica_device_id_column])){
lines <- append(lines, c("EMPATICA:", paste0(" DEVICE_IDS: [",row[empatica_device_id_column],"]"),
lines <- append(lines, c("EMPATICA:", paste0(" DEVICE_IDS: [",row$label,"]"),
paste(" LABEL:",row$label), paste(" START_DATE:", start_date), paste(" END_DATE:", end_date)))
} else
lines <- append(lines, empty_empatica)
@ -73,7 +73,7 @@ participants %>%
file_lines <-readLines("./config.yaml")
for (i in 1:length(file_lines)){
if(startsWith(file_lines[i], "PIDS:")){
file_lines[i] <- paste0("PIDS: [", paste(participants$pid, collapse = ", "), "]")
file_lines[i] <- paste0("PIDS: ['", paste(participants$pid, collapse = "', '"), "']")
}
}
writeLines(file_lines, con = "./config.yaml")
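
The last hunk switches the generated `PIDS:` line from bare to quoted entries. The commit does not state the motivation. One plausible reason, offered here as an assumption, is that an unquoted all-digit ID can be parsed by YAML as a number instead of a string. The formatting itself can be sketched as follows (`pids_line` is a hypothetical helper, not part of the repository):

```python
def pids_line(pids):
    """Build the PIDS line with every participant ID wrapped in single quotes."""
    return "PIDS: ['" + "', '".join(pids) + "']"


print(pids_line(["p01", "p02", "007"]))
```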


@ -5,13 +5,16 @@ options(scipen=999)
assign_rows_to_segments <- function(data, segments){
# This function is used by all segment types, we use data.tables because they are fast
data <- data.table::as.data.table(data)
data[, assigned_segments := ""]
for(i in seq_len(nrow(segments))) {
segment <- segments[i,]
data[segment$segment_start_ts<= timestamp & segment$segment_end_ts >= timestamp,
assigned_segments := stringi::stri_c(assigned_segments, segment$segment_id, sep = "|")]
}
data[,assigned_segments:=substring(assigned_segments, 2)]
data
}
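
The labelling logic of `assign_rows_to_segments` can be restated in a few lines of plain Python. This is only a sketch of the behaviour visible in the hunk: the tuple layout of `segments` is an assumption made for the example. The bounds are inclusive on both ends, as in the R comparison, and a row covered by several segments receives all their IDs joined with `|`.

```python
def assign_rows_to_segments(timestamps, segments):
    """Return one label string per timestamp.

    segments is a list of (segment_id, start_ts, end_ts) tuples; a timestamp
    that falls into no segment gets an empty string.
    """
    assigned = []
    for ts in timestamps:
        ids = [sid for sid, start, end in segments if start <= ts <= end]
        assigned.append("|".join(ids))
    return assigned


segments = [("morning", 0, 10), ("day", 5, 20)]
print(assign_rows_to_segments([3, 7, 25], segments))
```

The R version builds the same string by appending `|id` for each match and then stripping the leading separator with `substring(assigned_segments, 2)`.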


@ -0,0 +1,14 @@
import pandas as pd
import yaml

filename = snakemake.input["data"]
baseline = pd.read_csv(filename)

with open(snakemake.input["participant_file"], "r") as file:
    participant = yaml.safe_load(file)

username = participant["PHONE"]["LABEL"]
baseline[baseline["username"] == username].to_csv(snakemake.output[0],
                                                  index=False,
                                                  encoding="utf-8",)


@ -0,0 +1,30 @@
import pandas as pd

VARIABLES_TO_TRANSLATE = {
    "Gebruikersnaam": "username",
    "Geslacht": "gender",
    "Geboortedatum": "date_of_birth",
}

filenames = snakemake.input["data"]

baseline_dfs = []

for fn in filenames:
    baseline_dfs.append(pd.read_csv(fn,
                                    parse_dates=["Geboortedatum"],
                                    infer_datetime_format=True,
                                    cache_dates=True,
                                    ))

baseline = (
    pd.concat(baseline_dfs, join="inner")
    .reset_index()
    .drop(columns="index")
)

baseline.rename(columns=VARIABLES_TO_TRANSLATE, copy=False, inplace=True)

baseline.to_csv(snakemake.output[0],
                index=False,
                encoding="utf-8",)


@ -6,9 +6,10 @@ library(tidyr)
consecutive_threshold <- snakemake@params[["consecutive_threshold"]]
time_since_valid_location <- snakemake@params[["time_since_valid_location"]]
locations_to_use <- snakemake@params[["locations_to_use"]]
accuracy_limit <- snakemake@params[["accuracy_limit"]]
locations <- read.csv(snakemake@input[["locations"]]) %>%
filter(double_latitude != 0 & double_longitude != 0) %>%
filter(double_latitude != 0 & double_longitude != 0 & accuracy < accuracy_limit) %>%
drop_na(double_longitude, double_latitude) %>%
group_by(timestamp) %>% # keep only the row with the best accuracy if two or more have the same timestamp
filter(accuracy == min(accuracy, na.rm=TRUE)) %>%
@ -63,7 +64,7 @@ if(locations_to_use == "ALL"){
# you can think of consecutive_threshold as the period a location row is valid for
mutate(limit = pmin(lead(timestamp, default = 9999999999999) - 1, limit + (1000 * 60 * consecutive_threshold)),
n_resample = (limit - timestamp)%/%60001,
n_resample = if_else(n_resample == 0, 1, n_resample)) %>%
n_resample = n_resample + 1) %>%
drop_na(double_longitude, double_latitude) %>%
uncount(weights = n_resample, .id = "id") %>%
mutate(provider = if_else(id > 1, "resampled", provider),
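
The effect of the `n_resample` change in this hunk can be checked with a small sketch. It assumes, from the order of the two lines, that `if_else(n_resample == 0, 1, n_resample)` is the removed version and `n_resample + 1` the added one. The function names are hypothetical and the arithmetic is transcribed from the R code.

```python
def n_resample_old(timestamp, limit):
    """Previous rule: at least one row, otherwise the number of whole 60001 ms steps."""
    n = (limit - timestamp) // 60001
    return 1 if n == 0 else n


def n_resample_new(timestamp, limit):
    """New rule: the original row plus one row per whole 60001 ms step."""
    return (limit - timestamp) // 60001 + 1


for gap in (30_000, 60_001, 180_003):
    print(gap, n_resample_old(0, gap), n_resample_new(0, gap))
```

Under this reading, both rules agree for gaps shorter than one step, while the new rule yields one additional copy for every longer gap.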


@ -0,0 +1,85 @@
# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
library(RMariaDB)
library(yaml)
#' @description
#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
#' You can reuse most of this function if you are connecting to a DB or Web API.
#' It's OK to delete this function if you don't need credentials, e.g., if you are pulling data from a CSV file.
#' @param group the yaml key containing the credentials to connect to a database
#' @return dbEngine a database engine (connection) ready to perform queries
get_db_engine <- function(group){
# The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
credentials <- read_yaml("./credentials.yaml")
if(!group %in% names(credentials))
stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_mysql][DATABASE_GROUP] in config.yaml?"))
dbEngine <- dbConnect(MariaDB(), db = credentials[[group]][["database"]],
username = credentials[[group]][["user"]],
password = credentials[[group]][["password"]],
host = credentials[[group]][["host"]],
port = credentials[[group]][["port"]])
return(dbEngine)
}
# This file gets executed for each PHONE_SENSOR of each participant
# If you are connecting to a database the env file containing its credentials is available at "./.env"
# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
# You are not bound to databases or files, you can query a web API or whatever data source you need.
#' @description
#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
#'
#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed,
#' and they have to assign the OS manually in the participant file
#'
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
#' @param device A device ID string
#' @return The OS the device ran, "android" or "ios"
infer_device_os <- function(stream_parameters, device){
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
message(paste0("Executing the following query to infer phone OS: ", query))
os <- dbGetQuery(dbEngine, query)
dbDisconnect(dbEngine)
if(nrow(os) > 0)
return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
else
stop(paste("We cannot infer the OS of the following device id because it does not exist in the aware_device table:", device))
return(os)
}
#' @description
#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
#'
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
#' @param device A device ID string
#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
#' @param columns the columns needed from this sensor (we recommend to only return these columns instead of every column in sensor_container)
#' @return A dataframe with the sensor data for device
pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
select_items <- c()
for (column in columns) {
select_items <- append(select_items, paste0("data->>'$.", column, "' ", column))
}
query <- paste0("SELECT ", paste(select_items, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
# Letting the user know what we are doing
message(paste0("Executing the following query to download data: ", query))
sensor_data <- dbGetQuery(dbEngine, query)
dbDisconnect(dbEngine)
if(nrow(sensor_data) == 0)
warning(paste("The device '", device,"' did not have data in ", sensor_container))
return(sensor_data)
}
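
`pull_data` above builds a query that extracts each requested column from a JSON column named `data` using the `->>` operator. The string assembly can be sketched as follows. The shape of `columns` (a mapping from RAPIDS column names to stream column names) is inferred from `columns$DEVICE_ID` and from the column mappings in this diff, and the table name and device ID are made up for the example.

```python
def build_query(columns, sensor_container, device):
    """Assemble the SELECT statement the way pull_data does with paste0."""
    # One item per column: extract the JSON field and alias it to the same name.
    select_items = [f"data->>'$.{column}' {column}" for column in columns.values()]
    return (
        "SELECT " + ",".join(select_items)
        + " FROM " + sensor_container
        + " WHERE " + columns["DEVICE_ID"] + " = '" + device + "'"
    )


columns = {"TIMESTAMP": "timestamp", "DEVICE_ID": "device_id"}  # hypothetical mapping
print(build_query(columns, "battery", "abc-123"))
```

Because the device ID is interpolated directly into the SQL text, this only suits trusted input; a parameterized query would be the more defensive choice.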


@ -0,0 +1,337 @@
PHONE_ACCELEROMETER:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_VALUES_0: double_values_0
DOUBLE_VALUES_1: double_values_1
DOUBLE_VALUES_2: double_values_2
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_VALUES_0: double_values_0
DOUBLE_VALUES_1: double_values_1
DOUBLE_VALUES_2: double_values_2
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_ACTIVITY_RECOGNITION:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
ACTIVITY_NAME: activity_name
ACTIVITY_TYPE: activity_type
CONFIDENCE: confidence
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
ACTIVITY_NAME: FLAG_TO_MUTATE
ACTIVITY_TYPE: FLAG_TO_MUTATE
CONFIDENCE: FLAG_TO_MUTATE
MUTATION:
COLUMN_MAPPINGS:
ACTIVITIES: activities
CONFIDENCE: confidence
SCRIPTS: # List any python or r scripts that mutate your raw data
- "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
PHONE_APPLICATIONS_CRASHES:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_name
APPLICATION_NAME: application_name
APPLICATION_VERSION: application_version
ERROR_SHORT: error_short
ERROR_LONG: error_long
ERROR_CONDITION: error_condition
IS_SYSTEM_APP: is_system_app
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_APPLICATIONS_FOREGROUND:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_name
APPLICATION_NAME: application_name
IS_SYSTEM_APP: is_system_app
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_APPLICATIONS_NOTIFICATIONS:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_name
APPLICATION_NAME: application_name
TEXT: text
SOUND: sound
VIBRATE: vibrate
DEFAULTS: defaults
FLAGS: flags
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_BATTERY:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BATTERY_STATUS: battery_status
BATTERY_LEVEL: battery_level
BATTERY_SCALE: battery_scale
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BATTERY_STATUS: FLAG_TO_MUTATE
BATTERY_LEVEL: battery_level
BATTERY_SCALE: battery_scale
MUTATION:
COLUMN_MAPPINGS:
BATTERY_STATUS: battery_status
SCRIPTS:
- "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
PHONE_BLUETOOTH:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BT_ADDRESS: bt_address
BT_NAME: bt_name
BT_RSSI: bt_rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BT_ADDRESS: bt_address
BT_NAME: bt_name
BT_RSSI: bt_rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_CALLS:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
CALL_TYPE: call_type
CALL_DURATION: call_duration
TRACE: trace
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
CALL_TYPE: FLAG_TO_MUTATE
CALL_DURATION: call_duration
TRACE: trace
MUTATION:
COLUMN_MAPPINGS:
CALL_TYPE: call_type
SCRIPTS:
- "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
PHONE_CONVERSATION:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_ENERGY: double_energy
INFERENCE: inference
DOUBLE_CONVO_START: double_convo_start
DOUBLE_CONVO_END: double_convo_end
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_ENERGY: double_energy
INFERENCE: inference
DOUBLE_CONVO_START: double_convo_start
DOUBLE_CONVO_END: double_convo_end
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
- "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
PHONE_KEYBOARD:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_name
BEFORE_TEXT: before_text
CURRENT_TEXT: current_text
IS_PASSWORD: is_password
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_LIGHT:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_LIGHT_LUX: double_light_lux
ACCURACY: accuracy
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_LOCATIONS:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_LATITUDE: double_latitude
DOUBLE_LONGITUDE: double_longitude
DOUBLE_BEARING: double_bearing
DOUBLE_SPEED: double_speed
DOUBLE_ALTITUDE: double_altitude
PROVIDER: provider
ACCURACY: accuracy
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_LATITUDE: double_latitude
DOUBLE_LONGITUDE: double_longitude
DOUBLE_BEARING: double_bearing
DOUBLE_SPEED: double_speed
DOUBLE_ALTITUDE: double_altitude
PROVIDER: provider
ACCURACY: accuracy
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_LOG:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
LOG_MESSAGE: log_message
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
LOG_MESSAGE: log_message
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_MESSAGES:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
MESSAGE_TYPE: message_type
TRACE: trace
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_SCREEN:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SCREEN_STATUS: screen_status
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SCREEN_STATUS: FLAG_TO_MUTATE
MUTATION:
COLUMN_MAPPINGS:
SCREEN_STATUS: screen_status
SCRIPTS: # List any python or r scripts that mutate your raw data
- "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
PHONE_WIFI_CONNECTED:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
MAC_ADDRESS: mac_address
SSID: ssid
BSSID: bssid
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
MAC_ADDRESS: mac_address
SSID: ssid
BSSID: bssid
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_WIFI_VISIBLE:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SSID: ssid
BSSID: bssid
SECURITY: security
FREQUENCY: frequency
RSSI: rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SSID: ssid
BSSID: bssid
SECURITY: security
FREQUENCY: frequency
RSSI: rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data


@ -0,0 +1,212 @@
# if you need a new package, you should add it with renv::install(package) so your renv venv is updated
library(RPostgres)
# Needs libpq-dev for compiling from source.
# Error installing package 'RPostgres':
# =====================================
#
# * installing *source* package 'RPostgres' ...
# ** package 'RPostgres' successfully unpacked and MD5 sums checked
# ** using staged installation
# Using PKG_CFLAGS=
# Using PKG_LIBS=-lpq
# Using PKG_PLOGR=
# ------------------------- ANTICONF ERROR ---------------------------
# Configuration failed because libpq was not found. Try installing:
# * deb: libpq-dev (Debian, Ubuntu, etc)
# * rpm: postgresql-devel (Fedora, EPEL)
# * rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
# * csw: postgresql_dev (Solaris)
# * brew: libpq (OSX)
# If libpq is already installed, check that either:
# (i) 'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains
# a libpq.pc file; or
# (ii) 'pg_config' is in your PATH.
# If neither can detect , you can set INCLUDE_DIR
# and LIB_DIR manually via:
# R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
# --------------------------[ ERROR MESSAGE ]----------------------------
# <stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
# compilation terminated.
library(dbplyr)
library(yaml)
#' @description
#' Auxiliary function to parse the connection credentials from a specific group in ./credentials.yaml
#' You can reuse most of this function if you are connecting to a DB or Web API.
#' It's OK to delete this function if you don't need credentials, e.g., if you are pulling data from a CSV file.
#' @param group the yaml key containing the credentials to connect to a database
#' @return dbEngine a database engine (connection) ready to perform queries
get_db_engine <- function(group){
# The working dir is always the RAPIDS root folder, so your credentials file is always ./credentials.yaml
credentials <- read_yaml("./credentials.yaml")
if(!group %in% names(credentials))
stop(paste("The credentials group",group, "does not exist in ./credentials.yaml. The only groups that exist in that file are:", paste(names(credentials), collapse = ","), ". Did you forget to set the group in [PHONE_DATA_STREAMS][aware_mysql][DATABASE_GROUP] in config.yaml?"))
dbEngine <- dbConnect(Postgres(), db = credentials[[group]][["database"]],
user = credentials[[group]][["user"]],
password = credentials[[group]][["password"]],
host = credentials[[group]][["host"]],
port = credentials[[group]][["port"]])
return(dbEngine)
}
# This file gets executed for each PHONE_SENSOR of each participant
# If you are connecting to a database the env file containing its credentials is available at "./.env"
# If you are reading a CSV file instead of a DB table, the @param sensor_container will contain the file path as set in config.yaml
# You are not bound to databases or files, you can query a web API or whatever data source you need.
#' @description
#' RAPIDS allows users to use the keyword "infer" (previously "multiple") to automatically infer the mobile operating system a device was running.
#' If you have a way to infer the OS of a device ID, implement this function. For example, for AWARE data we use the "aware_device" table.
#'
#' If you don't have a way to infer the OS, call stop("Error Message") so other users know they can't use "infer" or the inference failed,
#' and they have to assign the OS manually in the participant file
#'
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
#' @param device A device ID string
#' @return The OS the device ran, "android" or "ios"
infer_device_os <- function(stream_parameters, device){
#dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
#query <- paste0("SELECT device_id,brand FROM aware_device WHERE device_id = '", device, "'")
#message(paste0("Executing the following query to infer phone OS: ", query))
#os <- dbGetQuery(dbEngine, query)
#dbDisconnect(dbEngine)
#if(nrow(os) > 0)
# return(os %>% mutate(os = ifelse(brand == "iPhone", "ios", "android")) %>% pull(os))
#else
stop(paste("We cannot infer the OS of the following device id because the aware_device table does not exist."))
#return(os)
}
#' @description
#' Gets the sensor data for a specific device id from a database table, file or whatever source you want to query
#'
#' @param stream_parameters The PHONE_STREAM_PARAMETERS key in config.yaml. If you need specific parameters add them there.
#' @param device A device ID string
#' @param sensor_container database table or file containing the sensor data for all participants. This is the PHONE_SENSOR[CONTAINER] key in config.yaml
#' @param columns the columns needed from this sensor (we recommend to only return these columns instead of every column in sensor_container)
#' @return A dataframe with the sensor data for device
pull_data <- function(stream_parameters, device, sensor, sensor_container, columns){
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
query <- paste0("SELECT ", paste(columns, collapse = ",")," FROM ", sensor_container, " WHERE ", columns$DEVICE_ID ," = '", device,"'")
# Letting the user know what we are doing
message(paste0("Executing the following query to download data: ", query))
sensor_data <- dbGetQuery(dbEngine, query)
dbDisconnect(dbEngine)
if(nrow(sensor_data) == 0)
warning(paste("The device '", device,"' did not have data in ", sensor_container))
return(sensor_data)
}
#' @description
#' Gets participants' IDs for specified usernames.
#'
#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
#' @param usernames A vector of usernames
#' @param participants_container The name of the database table containing participants data, such as their username.
#' @return A dataframe with participant IDs matching usernames
pull_participants_ids <- function(stream_parameters, usernames, participants_container) {
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
query_participant_id <- tbl(dbEngine, participants_container) %>%
filter(username %in% usernames) %>%
select(username, id)
message(paste0("Executing the following query to get participants' IDs: \n", sql_render(query_participant_id)))
participant_data <- query_participant_id %>% collect()
dbDisconnect(dbEngine)
if(nrow(participant_data) == 0)
warning(paste("We could not find requested usernames (", usernames, ") in ", participants_container))
return(participant_data)
}
#' @description
#' Gets participants' distinct device IDs for specified participant IDs
#'
#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
#' @param participants_ids A vector of numeric participant IDs
#' @param device_id_container The name of the database table which will be used to determine distinct device ID. Ideally, a table that reliably contains data, but not too much.
#' @return A dataframe with a row matching each distinct device ID with a participant ID
pull_participants_device_ids <- function(stream_parameters, participants_ids, device_id_container) {
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
query_device_id <- tbl(dbEngine, device_id_container) %>%
filter(participant_id %in% !!participants_ids) %>%
group_by(participant_id) %>%
distinct(device_id, .keep_all = FALSE)
message(
paste0(
"Executing the following query to get the distinct device IDs: \n",
sql_render(query_device_id),
"\n NOTE: This might take a long time."
)
)
device_ids <- query_device_id %>% collect()
dbDisconnect(dbEngine)
if(nrow(device_ids) == 0)
warning(paste("We could not find device IDs for requested participant IDs (", participants_ids, ") in ", device_id_container))
return(device_ids)
}
#' @description
#' Gets start and end datetimes for specified participant IDs.
#'
#' @param stream_parameters The PHONE_DATA_STREAMS key in config.yaml. If you need specific parameters add them there.
#' @param participants_ids A vector of numeric participant IDs
#' @param start_end_date_container The name of the database table which will be used to determine when a participant started and ended their participation. Briefing and debriefing EMAs can be meaningfully used here.
#' @return A dataframe relating participant IDs with their start and end datetimes.
pull_participants_start_end_dates <- function(stream_parameters, participants_ids, start_end_date_container) {
dbEngine <- get_db_engine(stream_parameters$DATABASE_GROUP)
query_timestamps <- tbl(dbEngine, start_end_date_container) %>%
filter(
participant_id %in% !!participants_ids,
double_esm_user_answer_timestamp > 0
) %>%
group_by(participant_id) %>%
summarise(
timestamp_min = min(double_esm_user_answer_timestamp, na.rm = TRUE),
timestamp_max = max(double_esm_user_answer_timestamp, na.rm = TRUE)
) %>%
select(participant_id, timestamp_min, timestamp_max)
message(paste0("Executing the following query to get the starting and ending datetimes: \n", sql_render(query_timestamps)))
start_end_timestamps <- query_timestamps %>% collect()
if(nrow(start_end_timestamps) == 0)
warning(paste("We could not find datetimes for requested participant IDs (", participants_ids, ") in ", start_end_date_container))
start_end_times <- start_end_timestamps %>%
mutate(
datetime_start = as_datetime(timestamp_min/1000, tz = "UTC"),
datetime_end = as_datetime(timestamp_max/1000, tz = "UTC")
) %>%
select(-c(timestamp_min, timestamp_max))
dbDisconnect(dbEngine)
return(start_end_times)
}
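
The logic of `pull_participants_start_end_dates` (keep only positive answer timestamps, take the minimum and maximum per participant, convert milliseconds to UTC datetimes) can be sketched in plain Python. The input layout, a list of `(participant_id, timestamp_ms)` pairs, and both function names are assumptions made for this example.

```python
from datetime import datetime, timezone


def ms_to_utc(timestamp_ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def start_end_times(answers):
    """Map each participant ID to (first, last) valid answer time in UTC."""
    valid = {}
    for participant_id, ts in answers:
        # Mirror the filter double_esm_user_answer_timestamp > 0 and na.rm = TRUE.
        if ts is not None and ts > 0:
            valid.setdefault(participant_id, []).append(ts)
    return {
        participant_id: (ms_to_utc(min(stamps)), ms_to_utc(max(stamps)))
        for participant_id, stamps in valid.items()
    }


answers = [(1, 0), (1, 86_400_000), (1, 172_800_000), (2, 0)]
print(start_end_times(answers))
```

A participant with no positive timestamp, such as ID 2 in the example, does not appear in the result, just as it would be absent from the grouped query.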


@ -0,0 +1,372 @@
PHONE_ACCELEROMETER:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_VALUES_0: double_values_0
DOUBLE_VALUES_1: double_values_1
DOUBLE_VALUES_2: double_values_2
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_VALUES_0: double_values_0
DOUBLE_VALUES_1: double_values_1
DOUBLE_VALUES_2: double_values_2
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_ACTIVITY_RECOGNITION:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
ACTIVITY_NAME: activity_name
ACTIVITY_TYPE: activity_type
CONFIDENCE: confidence
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
ACTIVITY_NAME: FLAG_TO_MUTATE
ACTIVITY_TYPE: FLAG_TO_MUTATE
CONFIDENCE: FLAG_TO_MUTATE
MUTATION:
COLUMN_MAPPINGS:
ACTIVITIES: activities
CONFIDENCE: confidence
SCRIPTS: # List any python or r scripts that mutate your raw data
- "src/data/streams/mutations/phone/aware/activity_recogniton_ios_unification.R"
PHONE_APPLICATIONS_CRASHES:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_name
APPLICATION_NAME: application_name
APPLICATION_VERSION: application_version
ERROR_SHORT: error_short
ERROR_LONG: error_long
ERROR_CONDITION: error_condition
IS_SYSTEM_APP: is_system_app
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_APPLICATIONS_FOREGROUND:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_hash
APPLICATION_NAME: FLAG_TO_MUTATE
IS_SYSTEM_APP: is_system_app
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS:
- src/data/streams/mutations/phone/straw/app_add_name.R
PHONE_APPLICATIONS_NOTIFICATIONS:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_hash
APPLICATION_NAME: FLAG_TO_MUTATE
SOUND: sound
VIBRATE: vibrate
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS:
- src/data/streams/mutations/phone/straw/app_add_name.R
PHONE_BATTERY:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BATTERY_STATUS: battery_status
BATTERY_LEVEL: battery_level
BATTERY_SCALE: battery_scale
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BATTERY_STATUS: FLAG_TO_MUTATE
BATTERY_LEVEL: battery_level
BATTERY_SCALE: battery_scale
MUTATION:
COLUMN_MAPPINGS:
BATTERY_STATUS: battery_status
SCRIPTS:
- "src/data/streams/mutations/phone/aware/battery_ios_unification.R"
PHONE_BLUETOOTH:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BT_ADDRESS: bt_address
BT_NAME: bt_name
BT_RSSI: bt_rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
BT_ADDRESS: bt_address
BT_NAME: bt_name
BT_RSSI: bt_rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_CALLS:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
CALL_TYPE: call_type
CALL_DURATION: call_duration
TRACE: trace
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
CALL_TYPE: FLAG_TO_MUTATE
CALL_DURATION: call_duration
TRACE: trace
MUTATION:
COLUMN_MAPPINGS:
CALL_TYPE: call_type
SCRIPTS:
- "src/data/streams/mutations/phone/aware/calls_ios_unification.R"
PHONE_CONVERSATION:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_ENERGY: double_energy
INFERENCE: inference
DOUBLE_CONVO_START: double_convo_start
DOUBLE_CONVO_END: double_convo_end
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_ENERGY: double_energy
INFERENCE: inference
DOUBLE_CONVO_START: double_convo_start
DOUBLE_CONVO_END: double_convo_end
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
- "src/data/streams/mutations/phone/aware/conversation_ios_timestamp.R"
PHONE_ESM:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: double_esm_user_answer_timestamp
DEVICE_ID: device_id
ESM_STATUS: esm_status
ESM_USER_ANSWER: esm_user_answer
ESM_JSON: esm_json
ESM_TRIGGER: esm_trigger
ESM_SESSION: esm_session
ESM_NOTIFICATION_ID: esm_notification_id
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS:
PHONE_KEYBOARD:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
PACKAGE_NAME: package_name
BEFORE_TEXT: before_text
CURRENT_TEXT: current_text
IS_PASSWORD: is_password
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_LIGHT:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_LIGHT_LUX: double_light_lux
ACCURACY: accuracy
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_LOCATIONS:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_LATITUDE: double_latitude
DOUBLE_LONGITUDE: double_longitude
DOUBLE_BEARING: double_bearing
DOUBLE_SPEED: double_speed
DOUBLE_ALTITUDE: double_altitude
PROVIDER: provider
ACCURACY: accuracy
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
DOUBLE_LATITUDE: double_latitude
DOUBLE_LONGITUDE: double_longitude
DOUBLE_BEARING: double_bearing
DOUBLE_SPEED: double_speed
DOUBLE_ALTITUDE: double_altitude
PROVIDER: provider
ACCURACY: accuracy
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_LOG:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
LOG_MESSAGE: log_message
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
LOG_MESSAGE: log_message
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_MESSAGES:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
MESSAGE_TYPE: message_type
TRACE: trace
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_SCREEN:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SCREEN_STATUS: screen_status
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SCREEN_STATUS: FLAG_TO_MUTATE
MUTATION:
COLUMN_MAPPINGS:
SCREEN_STATUS: screen_status
SCRIPTS: # List any python or r scripts that mutate your raw data
- "src/data/streams/mutations/phone/aware/screen_ios_unification.R"
PHONE_WIFI_CONNECTED:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
MAC_ADDRESS: mac_address
SSID: ssid
BSSID: bssid
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
MAC_ADDRESS: mac_address
SSID: ssid
BSSID: bssid
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_WIFI_VISIBLE:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SSID: ssid
BSSID: bssid
SECURITY: security
FREQUENCY: frequency
RSSI: rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SSID: ssid
BSSID: bssid
SECURITY: security
FREQUENCY: frequency
RSSI: rssi
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
PHONE_SPEECH:
ANDROID:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SPEECH_PROPORTION: speech_proportion
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data
IOS:
RAPIDS_COLUMN_MAPPINGS:
TIMESTAMP: timestamp
DEVICE_ID: device_id
SPEECH_PROPORTION: speech_proportion
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data

View File

@ -2,11 +2,16 @@ from zipfile import ZipFile
import warnings
from pathlib import Path
import pandas as pd
import numpy as np
from pandas.core import indexing
import yaml
import csv
from collections import OrderedDict
from io import BytesIO, StringIO
import sys, os
from cr_features.hrv import get_HRV_features, get_patched_ibi_with_bvp
from cr_features.helper_functions import empatica1d_to_array, empatica2d_to_array
def processAcceleration(x, y, z):
x = float(x)
@ -52,6 +57,8 @@ def extract_empatica_data(data, sensor):
df = pd.DataFrame.from_dict(ddict, orient='index', columns=[column])
df[column] = df[column].astype(float)
df.index.name = 'timestamp'
if df.empty:
return df
elif sensor == 'EMPATICA_ACCELEROMETER':
ddict = readFile(sensor_data_file, sensor)
@ -60,15 +67,22 @@ def extract_empatica_data(data, sensor):
df['y'] = df['y'].astype(float)
df['z'] = df['z'].astype(float)
df.index.name = 'timestamp'
if df.empty:
return df
elif sensor == 'EMPATICA_INTER_BEAT_INTERVAL':
df = pd.read_csv(sensor_data_file, names=['timestamp', column], header=None)
df = pd.read_csv(sensor_data_file, names=['timings', column], header=None)
df['timestamp'] = df['timings']
if df.empty:
df = df.set_index('timestamp')
return df
timestampstart = float(df['timestamp'][0])
df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
df = df.drop([0])
df[column] = df[column].astype(float)
df = df.set_index('timestamp')
else:
raise ValueError(
"sensor has an invalid name: {}".format(sensor))
@ -84,6 +98,10 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
participant_data = pd.DataFrame(columns=columns_to_download.values())
participant_data.set_index('timestamp', inplace=True)
with open('config.yaml', 'r') as stream:
config = yaml.load(stream, Loader=yaml.FullLoader)
cr_ibi_provider = config['EMPATICA_INTER_BEAT_INTERVAL']['PROVIDERS']['CR']
available_zipfiles = list((Path(data_configuration["FOLDER"]) / Path(device)).rglob("*.zip"))
if len(available_zipfiles) == 0:
warnings.warn("There were no zip files in: {}. If you were expecting data for this participant the [EMPATICA][DEVICE_IDS] key in their participant file is missing the pid".format((Path(data_configuration["FOLDER"]) / Path(device))))
@ -94,7 +112,13 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
listOfFileNames = zipFile.namelist()
for fileName in listOfFileNames:
if fileName == sensor_csv:
participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName), sensor)], axis=0)
if sensor == "EMPATICA_INTER_BEAT_INTERVAL" and cr_ibi_provider.get('PATCH_WITH_BVP', False):
participant_data = \
pd.concat([participant_data, patch_ibi_with_bvp(zipFile.read('IBI.csv'), zipFile.read('BVP.csv'))], axis=0)
#print("patch with ibi")
else:
participant_data = pd.concat([participant_data, extract_empatica_data(zipFile.read(fileName), sensor)], axis=0)
#print("no patching")
warning = False
if warning:
warnings.warn("We could not find a zipped file for {} in {} (we tried to find {})".format(sensor, zipFile, sensor_csv))
@ -105,4 +129,54 @@ def pull_data(data_configuration, device, sensor, container, columns_to_download
participant_data["device_id"] = device
return(participant_data)
def patch_ibi_with_bvp(ibi_data, bvp_data):
ibi_data_file = BytesIO(ibi_data).getvalue().decode('utf-8')
ibi_data_file = StringIO(ibi_data_file)
# Begin with the cr-features part
try:
ibi_data, ibi_start_timestamp = empatica2d_to_array(ibi_data_file)
except (IndexError, KeyError) as e:
# Checks whether IBI.csv is empty
# It may raise a KeyError if df is empty here: startTimeStamp = df.time[0]
df_test = pd.read_csv(ibi_data_file, names=['timings', 'inter_beat_interval'], header=None)
if df_test.empty:
df_test['timestamp'] = df_test['timings']
df_test = df_test.set_index('timestamp')
return df_test
else:
raise IndexError("Something went wrong with indices. Error that was previously caught:\n", repr(e))
bvp_data_file = BytesIO(bvp_data).getvalue().decode('utf-8')
bvp_data_file = StringIO(bvp_data_file)
bvp_data, bvp_start_timestamp, sample_rate = empatica1d_to_array(bvp_data_file)
hrv_time_and_freq_features, sample, bvp_rr, bvp_timings, peak_indx = \
get_HRV_features(bvp_data, ma=False,
detrend=False, m_deternd=False, low_pass=False, winsorize=True,
winsorize_value=25, hampel_fiter=False, median_filter=False,
mod_z_score_filter=True, sampling=64, feature_names=['meanHr'])
ibi_timings, ibi_rr = get_patched_ibi_with_bvp(ibi_data[0], ibi_data[1], bvp_timings, bvp_rr)
df = \
pd.DataFrame(np.array([ibi_timings, ibi_rr]).transpose(), columns=['timestamp', 'inter_beat_interval'])
df.loc[-1] = [ibi_start_timestamp, 'IBI'] # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
# Repeated as in extract_empatica_data for IBI
df['timings'] = df['timestamp']
timestampstart = float(df['timestamp'][0])
df['timestamp'] = (df['timestamp'][1:len(df)]).astype(float) + timestampstart
df = df.drop([0])
df['inter_beat_interval'] = df['inter_beat_interval'].astype(float)
df = df.set_index('timestamp')
# format timestamps
df.index *= 1000
df.index = df.index.astype(int)
return(df)
# print(pull_data({'FOLDER': 'data/external/empatica'}, "e01", "EMPATICA_accelerometer", {'TIMESTAMP': 'timestamp', 'DEVICE_ID': 'device_id', 'DOUBLE_VALUES_0': 'x', 'DOUBLE_VALUES_1': 'y', 'DOUBLE_VALUES_2': 'z'}))
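
The timestamp handling shared by extract_empatica_data and patch_ibi_with_bvp can be illustrated without pandas. The sketch below assumes an IBI.csv layout in which the first row holds the session start timestamp (in seconds) and every later row holds an offset in seconds followed by the inter-beat interval; the function name `parse_ibi_csv` is hypothetical.

```python
import csv
from io import StringIO


def parse_ibi_csv(text):
    """Parse IBI.csv content into (timestamp_ms, timings, inter_beat_interval) rows.

    Assumed layout: the first row carries the session start time in seconds,
    and each following row carries (offset in seconds, interval in seconds).
    """
    rows = list(csv.reader(StringIO(text)))
    if not rows:
        # An empty file yields no records, like the df.empty branches above.
        return []
    start_seconds = float(rows[0][0])
    records = []
    for offset, interval in rows[1:]:
        # Absolute time = session start + offset, converted to whole milliseconds.
        timestamp_ms = int((start_seconds + float(offset)) * 1000)
        records.append((timestamp_ms, float(offset), float(interval)))
    return records


sample = "1600000000.0, IBI\n1.5,0.75\n2.25,0.75\n"
for record in parse_ibi_csv(sample):
    print(record)
```

Keeping the raw offset next to the absolute timestamp corresponds to the new `timings` column, which preserves the relative timing after `timestamp` has been shifted by the start time.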

View File

@ -50,6 +50,7 @@ EMPATICA_INTER_BEAT_INTERVAL:
TIMESTAMP: timestamp
DEVICE_ID: device_id
INTER_BEAT_INTERVAL: inter_beat_interval
TIMINGS: timings
MUTATION:
COLUMN_MAPPINGS:
SCRIPTS: # List any python or r scripts that mutate your raw data

View File

@ -18,7 +18,7 @@ def parseCaloriesData(calories_data):
dataset = record["activities-calories-intraday"]["dataset"]
for data in dataset:
d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
d_datetime = datetime.combine(curr_date, d_time)
d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
row_intraday = (device_id, data["level"], data["mets"], data["value"], d_datetime, 0)
records_intraday.append(row_intraday)

View File

@ -32,13 +32,13 @@ def parseHeartrateZones(heartrate_data):
def parseHeartrateIntradayData(records_intraday, dataset, device_id, curr_date, heartrate_zones_range):
for data in dataset:
d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
d_datetime = datetime.combine(curr_date, d_time)
d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
d_hr = data["value"]
# Get heartrate zone by range: min <= heartrate < max
d_hrzone = None
for hrzone, hrrange in heartrate_zones_range.items():
if d_hr >= hrrange[0] and d_hr < hrrange[1]:
if d_hr >= hrrange[0] and d_hr <= hrrange[1]:
d_hrzone = hrzone
break
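
The effect of switching the upper bound from `<` to `<=` can be seen in a small standalone sketch. The zone names and ranges below are made-up illustration values, and `find_heartrate_zone` is a hypothetical helper, not a function from this repository.

```python
def find_heartrate_zone(heart_rate, zones_range):
    """Return the first zone whose [min, max] range contains heart_rate, else None."""
    for zone, (low, high) in zones_range.items():
        if low <= heart_rate <= high:
            return zone
    return None


# Illustrative ranges only; real ranges come from the Fitbit heart rate summary.
zones = {
    "outofrange": (30, 95),
    "fatburn": (95, 133),
    "cardio": (133, 162),
    "peak": (162, 220),
}

print(find_heartrate_zone(220, zones))  # the top boundary now belongs to "peak"
print(find_heartrate_zone(95, zones))   # shared boundary: the first matching zone wins
```

With an inclusive upper bound, a value equal to the maximum of the highest zone is no longer left without a zone. A value that sits on a boundary shared by two zones is assigned to whichever zone is iterated first.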

View File

@ -1,6 +1,5 @@
import json
import pandas as pd
from datetime import datetime
HR_SUMMARY_COLUMNS = ("device_id",
@ -55,7 +54,7 @@ def parseHeartrateData(heartrate_data):
for record in heartrate_data.json_fitbit_column:
record = json.loads(record) # Parse text into JSON
if "activities-heart" in record:
curr_date = datetime.strptime(record["activities-heart"][0]["dateTime"], "%Y-%m-%d")
curr_date = record["activities-heart"][0]["dateTime"] + " 00:00:00"
record_summary = record["activities-heart"][0]
row_summary = parseHeartrateSummaryData(record_summary, device_id, curr_date)

View File

@ -64,7 +64,7 @@ def parseOneRecordForV1(record, device_id, d_is_main_sleep, records_intraday, ty
d_time = datetime.strptime(data["dateTime"], '%H:%M:%S').time()
if is_before_midnight and d_time.hour == 0:
curr_date = end_date
d_datetime = datetime.combine(curr_date, d_time)
d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
# API 1.2 stores original_level as strings, so we convert original_levels of API 1 to strings too
# (1: "asleep", 2: "restless", 3: "awake")
@ -86,7 +86,7 @@ def parseOneRecordForV12(record, device_id, d_is_main_sleep, records_intraday, t
if sleep_record_type == "classic":
for data in record["levels"]["data"]:
d_datetime = dateutil.parser.parse(data["dateTime"])
d_datetime = data["dateTime"][:19].replace("T", " ")
row_intraday = (device_id, type_episode_id, data["seconds"],
data["level"], d_is_main_sleep, sleep_record_type,
@ -95,9 +95,10 @@ def parseOneRecordForV12(record, device_id, d_is_main_sleep, records_intraday, t
else:
# For sleep type "stages"
for data in mergeLongAndShortData(record["levels"]):
d_datetime = data[0].strftime("%Y-%m-%d %H:%M:%S")
row_intraday = (device_id, type_episode_id, 30,
data[1], d_is_main_sleep, sleep_record_type,
data[0], 0)
d_datetime, 0)
records_intraday.append(row_intraday)
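
The Fitbit parsers in this comparison now emit local datetimes as plain "YYYY-MM-DD HH:MM:SS" strings instead of datetime objects. A minimal sketch of the two conversions used, assuming ISO 8601 input such as "2020-10-07T23:59:30.000" (the helper names are hypothetical):

```python
from datetime import datetime

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def iso_to_local_string(iso_datetime):
    """Keep the first 19 characters (date and time) and replace the 'T' separator."""
    return iso_datetime[:19].replace("T", " ")


def combine_to_local_string(date_string, time_string):
    """Combine a date and a time-of-day string into the same textual format."""
    current_date = datetime.strptime(date_string, "%Y-%m-%d")
    time_of_day = datetime.strptime(time_string, "%H:%M:%S").time()
    return datetime.combine(current_date, time_of_day).strftime(DATETIME_FORMAT)


print(iso_to_local_string("2020-10-07T23:59:30.000"))
print(combine_to_local_string("2020-10-07", "23:59:30"))
```

Both routes produce the same string, so intraday rows built from either a full ISO datetime or a separate date and time end up in one consistent format.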

View File

@ -1,8 +1,5 @@
import json, yaml
import json
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import dateutil.parser
SLEEP_SUMMARY_COLUMNS = ("device_id", "efficiency",
"minutes_after_wakeup", "minutes_asleep", "minutes_awake", "minutes_to_fall_asleep", "minutes_in_bed",
@ -16,8 +13,8 @@ def parseOneSleepRecord(record, device_id, d_is_main_sleep, records_summary, epi
sleep_record_type = episode_type
d_start_datetime = datetime.strptime(record["startTime"][:18], "%Y-%m-%dT%H:%M:%S")
d_end_datetime = datetime.strptime(record["endTime"][:18], "%Y-%m-%dT%H:%M:%S")
d_start_datetime = record["startTime"][:19].replace("T", " ")
d_end_datetime = record["endTime"][:19].replace("T", " ")
# Summary data
row_summary = (device_id, record["efficiency"],
record["minutesAfterWakeup"], record["minutesAsleep"], record["minutesAwake"], record["minutesToFallAsleep"], record["timeInBed"],

View File

@ -23,7 +23,7 @@ def parseStepsData(steps_data):
dataset = record["activities-steps-intraday"]["dataset"]
for data in dataset:
d_time = datetime.strptime(data["time"], '%H:%M:%S').time()
d_datetime = datetime.combine(curr_date, d_time)
d_datetime = datetime.combine(curr_date, d_time).strftime("%Y-%m-%d %H:%M:%S")
row_intraday = (device_id,
data["value"],

View File

@ -1,6 +1,5 @@
import json
import pandas as pd
from datetime import datetime
STEPS_COLUMNS = ("device_id", "steps", "local_date_time", "timestamp")
@ -16,7 +15,7 @@ def parseStepsData(steps_data):
for record in steps_data.json_fitbit_column:
record = json.loads(record) # Parse text into JSON
if "activities-steps" in record.keys():
curr_date = datetime.strptime(record["activities-steps"][0]["dateTime"], "%Y-%m-%d")
curr_date = record["activities-steps"][0]["dateTime"] + " 00:00:00"
row_summary = (device_id,
record["activities-steps"][0]["value"],

View File

@ -36,7 +36,8 @@ unify_ios_activity_recognition <- function(ios_gar){
activities == "cycling" ~ "on_bicycle",
activities == "walking" ~ "walking",
activities == "running" ~ "running",
activities == "stationary" ~ "still"),
activities == "stationary" ~ "still",
activities == "unknown" ~ "unknown"),
activity_type = case_when(activities == "automotive" ~ 0,
activities == "cycling" ~ 1,
activities == "walking" ~ 7,

View File

@ -39,7 +39,7 @@ unify_ios_calls <- function(ios_calls){
assigned_segments = first(assigned_segments))
}
else {
ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(call_duration), timestamp = first(timestamp), device_id = first(device_id))
ios_calls <- ios_calls %>% summarise(call_type_sequence = paste(call_type, collapse = ","), call_duration = sum(as.numeric(call_duration)), timestamp = first(timestamp), device_id = first(device_id))
}
ios_calls <- ios_calls %>% mutate(call_type = case_when(
call_type_sequence == "1,2,4" | call_type_sequence == "2,1,4" ~ 1, # incoming

View File

@ -0,0 +1,8 @@
source("renv/activate.R") # needed to use RAPIDS renv environment
library(dplyr)
main <- function(data, stream_parameters){
data <- data %>%
mutate(application_name = "hashed")
return(data)
}

View File

@ -0,0 +1,5 @@
import pandas as pd
def main(data, stream_parameters):
data["application_name"] = "hashed"
return(data)

View File

@ -35,11 +35,8 @@ PHONE_APPLICATIONS_NOTIFICATIONS:
- DEVICE_ID
- PACKAGE_NAME
- APPLICATION_NAME
- TEXT
- SOUND
- VIBRATE
- DEFAULTS
- FLAGS
PHONE_BATTERY:
- TIMESTAMP
@ -70,6 +67,16 @@ PHONE_CONVERSATION:
- DOUBLE_CONVO_START
- DOUBLE_CONVO_END
PHONE_ESM:
- TIMESTAMP
- DEVICE_ID
- ESM_STATUS
- ESM_USER_ANSWER
- ESM_JSON
- ESM_TRIGGER
- ESM_SESSION
- ESM_NOTIFICATION_ID
PHONE_KEYBOARD:
- TIMESTAMP
- DEVICE_ID
@ -111,6 +118,11 @@ PHONE_SCREEN:
- DEVICE_ID
- SCREEN_STATUS
PHONE_SPEECH:
- TIMESTAMP
- DEVICE_ID
- SPEECH_PROPORTION
PHONE_WIFI_CONNECTED:
- TIMESTAMP
- DEVICE_ID
@ -220,6 +232,7 @@ EMPATICA_INTER_BEAT_INTERVAL:
- TIMESTAMP
- DEVICE_ID
- INTER_BEAT_INTERVAL
- TIMINGS
EMPATICA_TAGS:
- TIMESTAMP

View File

@ -0,0 +1,62 @@
source("renv/activate.R")
source("src/data/streams/aware_postgresql/container.R")
library(RPostgres)
library(magrittr)
library(tidyverse)
library(lubridate)
prepare_participants_file <- function() {
username_list_csv_location <- snakemake@input[["username_list"]]
data_configuration <- snakemake@params[["data_configuration"]]
participants_container <- snakemake@params[["participants_table"]]
device_id_container <- snakemake@params[["device_id_table"]]
start_end_date_container <- snakemake@params[["start_end_date_table"]]
output_data_file <- snakemake@output[["participants_file"]]
platform <- "android"
pid_format <- "p%03d"
datetime_format <- "%Y-%m-%d %H:%M:%S"
participant_data <- read_csv(username_list_csv_location, col_types = "cc", progress = FALSE)
usernames <- participant_data$label
participant_ids <- pull_participants_ids(data_configuration, usernames, participants_container)
participant_data %<>%
left_join(participant_ids, by = c("label" = "username")) %>%
rename(participant_id = id)
device_ids <- pull_participants_device_ids(data_configuration, participant_data$participant_id, device_id_container)
device_ids %<>%
filter(device_id != "") %>%
group_by(participant_id) %>%
summarise(device_ids = list(unique(device_id)))
participant_data %<>%
left_join(device_ids, by = "participant_id")
start_end_datetimes <- pull_participants_start_end_dates(data_configuration, participant_data$participant_id, start_end_date_container)
participant_data %<>%
left_join(start_end_datetimes, by = "participant_id")
participant_data %<>%
mutate(
pid = sprintf(pid_format, participant_id),
start_date = strftime(datetime_start, format=datetime_format, tz = "UTC", usetz = FALSE), #TODO Check what timezone is expected
end_date = strftime(datetime_end, format=datetime_format, tz = "UTC", usetz = FALSE),
device_id = map_chr(device_ids, str_c, collapse = ";"),
number_of_devices = map_int(device_ids, length),
fitbit_id = ""
) %>%
rowwise() %>%
mutate(platform = str_c(replicate(number_of_devices, platform), collapse = ";")) %>%
ungroup() %>%
arrange(pid) %>%
select(pid, label, start_date, end_date, empatica_id, device_id, platform, fitbit_id)
write_csv(participant_data, output_data_file)
}
prepare_participants_file()

View File


@ -0,0 +1,89 @@
source("renv/activate.R")
library(tidyr)
library("dplyr", warn.conflicts = F)
library(tidyverse)
library(caret)
library(corrr)
rapids_cleaning <- function(sensor_data_files, provider){
clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
# Impute selected event features
if(as.logical(impute_selected_event_features$COMPUTE)){
if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
}
column_names <- colnames(clean_features)
selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
}
# Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
if(!data_yield_column %in% colnames(clean_features)){
stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
}
if (data_yield_ratio_threshold > 0) {
clean_features <- clean_features %>%
filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
}
# Drop columns with a percentage of NA values above cols_nan_threshold
if(nrow(clean_features))
clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
# Drop columns with zero variance
if(drop_zero_variance_columns)
clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
# Drop highly correlated features
if(as.logical(drop_highly_correlated_features$COMPUTE)){
min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
features_for_corr <- clean_features %>%
select_if(is.numeric) %>%
select_if(sapply(., n_distinct, na.rm = T) > 1)
valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
if((nrow(features_for_corr) != 0) & (ncol(features_for_corr) != 0)){
highly_correlated_features <- features_for_corr %>%
correlate(use = "pairwise.complete.obs", method = "spearman") %>%
column_to_rownames(., var = "term") %>%
as.matrix() %>%
replace(!valid_pairs | is.na(.), 0) %>%
findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
}
}
# Drop rows with a percentage of NA values above rows_nan_threshold
clean_features <- clean_features %>%
mutate(percentage_na = rowSums(is.na(.)) / ncol(.)) %>%
filter(percentage_na <= rows_nan_threshold) %>%
select(-percentage_na)
return(clean_features)
}
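
The final step of rapids_cleaning, dropping rows whose share of missing values exceeds ROWS_NAN_THRESHOLD, can be sketched in plain Python. This is an illustration under simplifying assumptions: rows are non-empty lists, missing values are represented by `None` or a float NaN, and `drop_sparse_rows` is a hypothetical name.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_sparse_rows(rows, rows_nan_threshold):
    """Keep rows whose fraction of missing values is at most rows_nan_threshold."""
    kept = []
    for row in rows:
        missing_ratio = sum(is_missing(value) for value in row) / len(row)
        if missing_ratio <= rows_nan_threshold:
            kept.append(row)
    return kept


rows = [
    [1, None, 3, 4],               # 25% missing -> kept
    [None, float("nan"), None, 4], # 75% missing -> dropped
    [1, 2, 3, 4],                  # 0% missing  -> kept
]
print(drop_sparse_rows(rows, rows_nan_threshold=0.3))
```

The comparison is inclusive, matching `filter(percentage_na <= rows_nan_threshold)` above, so a threshold of 1 keeps every row and a threshold of 0 keeps only complete rows.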

View File

@ -0,0 +1,180 @@
import pandas as pd
import numpy as np
import math, sys, random
import yaml
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
sys.path.append('/rapids/')
from src.features import empatica_data_yield as edy
pd.set_option('display.max_columns', 20)
def straw_cleaning(sensor_data_files, provider):
features = pd.read_csv(sensor_data_files["sensor_data"][0])
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
with open('config.yaml', 'r') as stream:
config = yaml.load(stream, Loader=yaml.FullLoader)
excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
# (1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
target = config['PARAMS_FOR_ANALYSIS']['TARGET']['LABEL'] # get target label from config
if 'phone_esm_straw_' + target in features:
features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
else:
return features
# (2.1) QUALITY CHECK (DATA YIELD COLUMN) deletes the rows where E4 or phone data is low quality
phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
if features.empty:
return features
features = edy.calculate_empatica_data_yield(features)
if phone_data_yield_column not in features.columns and "empatica_data_yield" not in features.columns:
raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
# Drop rows where phone data yield is less than the given threshold
if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
# Drop rows where empatica data yield is less than the given threshold
if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
if features.empty:
return features
# (2.2) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
# (3) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows)
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
# Preserve esm cols if deleted (has to come after drop cols operations)
for esm in esm_cols:
if esm not in features:
features[esm] = esm_cols[esm]
# (4) CONTEXTUAL IMPUTATION
# Impute selected phone features with a high number
impute_w_hn = [col for col in features.columns if \
"timeoffirstuse" in col or
"timeoflastuse" in col or
"timefirstcall" in col or
"timelastcall" in col or
"firstuseafter" in col or
"timefirstmessages" in col or
"timelastmessages" in col]
features[impute_w_hn] = features[impute_w_hn].fillna(1500)
# Impute special case (mostcommonactivity) and (homelabel)
impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
# Impute selected phone features with 0
impute_zero = [col for col in features if \
col.startswith('phone_applications_foreground_rapids_') or
col.startswith('phone_battery_rapids_') or
col.startswith('phone_bluetooth_rapids_') or
col.startswith('phone_light_rapids_') or
col.startswith('phone_calls_rapids_') or
col.startswith('phone_messages_rapids_') or
col.startswith('phone_screen_rapids_') or
col.startswith('phone_wifi_visible')]
features[impute_zero+list(esm_cols.columns)] = features[impute_zero+list(esm_cols.columns)].fillna(0)
## (5) STANDARDIZATION
if provider["STANDARDIZATION"]:
features.loc[:, ~features.columns.isin(excluded_columns)] = StandardScaler().fit_transform(features.loc[:, ~features.columns.isin(excluded_columns)])
# (6) IMPUTATION: IMPUTE DATA WITH KNN METHOD
impute_cols = [col for col in features.columns if col not in excluded_columns]
features.reset_index(drop=True, inplace=True)
features[impute_cols] = impute(features[impute_cols], method="knn")
# (7) REMOVE COLS WHERE VARIANCE IS 0
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
if provider["COLS_VAR_THRESHOLD"]:
features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
fe5 = features.copy()
# (8) DROP HIGHLY CORRELATED FEATURES
drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
if drop_corr_features["COMPUTE"] and features.shape[0]: # If small amount of segments (rows) is present, do not execute correlation check
numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
# Remove columns where NaN count threshold is passed
valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
corr_matrix = valid_features.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
features.drop(to_drop, axis=1, inplace=True)
# Preserve esm cols if deleted (has to come after drop cols operations)
for esm in esm_cols:
if esm not in features:
features[esm] = esm_cols[esm]
# (9) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
if features.isna().any().any():
raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
return features
def k_nearest(df):
pd.set_option('display.max_columns', None)
imputer = KNNImputer(n_neighbors=3)
return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
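def impute(df, method='zero'):
    # Dispatch lazily so that only the requested method runs (the KNN imputer is expensive)
    methods = {
        'zero': lambda: df.fillna(0),
        'high_number': lambda: df.fillna(1500),
        'mean': lambda: df.fillna(df.mean()),
        'median': lambda: df.fillna(df.median()),
        'knn': lambda: k_nearest(df)
    }
    return methods[method]()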
def graph_bf_af(features, phase_name, plt_flag=False):
if plt_flag:
sns.set(rc={"figure.figsize":(16, 8)})
sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
print(f"\n-------------{phase_name}-------------")
print("Rows number:", features.shape[0])
print("Columns number:", len(features.columns))
print("---------------------------------------------\n")
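
Step (8) above drops one feature from every highly correlated pair by scanning the strict upper triangle of the absolute correlation matrix. The following is a minimal, self-contained sketch of that idea on toy data. The function name `drop_highly_correlated` and the demo frame are illustrative assumptions, not part of the repository.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, corr_threshold: float) -> pd.DataFrame:
    """Drop one column from each pair whose absolute Pearson correlation exceeds the threshold.

    Hypothetical helper for illustration only; it mirrors the upper-triangle approach
    used in the cleaning script but is not the project's function.
    """
    corr_matrix = df.select_dtypes(include=np.number).corr().abs()
    # Keep only the strict upper triangle so that each pair is inspected once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # A column is dropped if it correlates too strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # only weakly correlated with "a" and "b"
})
reduced = drop_highly_correlated(demo, corr_threshold=0.95)
```

Because only the upper triangle is scanned, the later column of a correlated pair is the one removed, so the result depends on column order.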


@ -0,0 +1,89 @@
source("renv/activate.R")
library(tidyr)
library("dplyr", warn.conflicts = F)
library(tidyverse)
library(caret)
library(corrr)
rapids_cleaning <- function(sensor_data_files, provider){
clean_features <- read.csv(sensor_data_files[["sensor_data"]], stringsAsFactors = FALSE)
impute_selected_event_features <- provider[["IMPUTE_SELECTED_EVENT_FEATURES"]]
cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
data_yield_unit <- tolower(str_split_fixed(provider[["DATA_YIELD_FEATURE"]], "_", 4)[[4]])
data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
# Impute selected event features
if(as.logical(impute_selected_event_features$COMPUTE)){
if(!"phone_data_yield_rapids_ratiovalidyieldedminutes" %in% colnames(clean_features)){
stop("Error: RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].")
}
column_names <- colnames(clean_features)
selected_apps_features <- column_names[grepl("^phone_applications_foreground_rapids_(countevent|countepisode|minduration|maxduration|meanduration|sumduration)", column_names)]
selected_battery_features <- column_names[grepl("^phone_battery_rapids_", column_names)]
selected_calls_features <- column_names[grepl("^phone_calls_rapids_.*_(count|distinctcontacts|sumduration|minduration|maxduration|meanduration|modeduration)", column_names)]
selected_keyboard_features <- column_names[grepl("^phone_keyboard_rapids_(sessioncount|averagesessionlength|changeintextlengthlessthanminusone|changeintextlengthequaltominusone|changeintextlengthequaltoone|changeintextlengthmorethanone|maxtextlength|totalkeyboardtouches)", column_names)]
selected_messages_features <- column_names[grepl("^phone_messages_rapids_.*_(count|distinctcontacts)", column_names)]
selected_screen_features <- column_names[grepl("^phone_screen_rapids_(sumduration|maxduration|minduration|avgduration|countepisode)", column_names)]
selected_wifi_features <- column_names[grepl("^phone_wifi_(connected|visible)_rapids_", column_names)]
selected_columns <- c(selected_apps_features, selected_battery_features, selected_calls_features, selected_keyboard_features, selected_messages_features, selected_screen_features, selected_wifi_features)
clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
}
# Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
if(!data_yield_column %in% colnames(clean_features)){
stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
}
if (data_yield_ratio_threshold > 0) {
clean_features <- clean_features %>%
filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
}
# Drop columns with a percentage of NA values above cols_nan_threshold
if(nrow(clean_features))
clean_features <- clean_features %>% select(where(~ sum(is.na(.)) / length(.) <= cols_nan_threshold ), starts_with("phone_esm"))
# Drop columns with zero variance
if(drop_zero_variance_columns)
clean_features <- clean_features %>% select_if(grepl("pid|local_segment|local_segment_label|local_segment_start_datetime|local_segment_end_datetime|phone_esm",names(.)) | sapply(., n_distinct, na.rm = T) > 1)
# Drop highly correlated features
if(as.logical(drop_highly_correlated_features$COMPUTE)){
min_overlap_for_corr_threshold <- as.numeric(drop_highly_correlated_features$MIN_OVERLAP_FOR_CORR_THRESHOLD)
corr_threshold <- as.numeric(drop_highly_correlated_features$CORR_THRESHOLD)
features_for_corr <- clean_features %>%
select_if(is.numeric) %>%
select_if(sapply(., n_distinct, na.rm = T) > 1)
valid_pairs <- crossprod(!is.na(features_for_corr)) >= min_overlap_for_corr_threshold * nrow(features_for_corr)
if((nrow(features_for_corr) != 0) & (ncol(features_for_corr) != 0)){
highly_correlated_features <- features_for_corr %>%
correlate(use = "pairwise.complete.obs", method = "spearman") %>%
column_to_rownames(., var = "term") %>%
as.matrix() %>%
replace(!valid_pairs | is.na(.), 0) %>%
findCorrelation(., cutoff = corr_threshold, verbose = F, names = T)
clean_features <- clean_features[, !names(clean_features) %in% highly_correlated_features]
}
}
# Drop rows with a percentage of NA values above rows_nan_threshold
clean_features <- clean_features %>%
mutate(percentage_na = rowSums(is.na(.)) / ncol(.)) %>%
filter(percentage_na <= rows_nan_threshold) %>%
select(-percentage_na)
return(clean_features)
}
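
The R function above only trusts a correlation when the two columns share enough non-missing rows (`valid_pairs`, computed with `crossprod(!is.na(...))`). A rough Python equivalent of that overlap mask is sketched below. The name `pairwise_overlap_mask` and the demo data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_overlap_mask(df: pd.DataFrame, min_overlap_ratio: float) -> pd.DataFrame:
    """Return a boolean matrix marking column pairs with enough jointly observed rows.

    Hypothetical helper: entry (i, j) is True when columns i and j are both
    non-missing in at least min_overlap_ratio * n_rows rows.
    """
    present = df.notna().to_numpy().astype(int)
    # (present.T @ present)[i, j] counts the rows where both columns are observed
    overlap_counts = present.T @ present
    mask = overlap_counts >= min_overlap_ratio * len(df)
    return pd.DataFrame(mask, index=df.columns, columns=df.columns)


demo = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [np.nan, 1.0, 2.0, 3.0],
    "z": [1.0, np.nan, np.nan, np.nan],
})
valid_pairs = pairwise_overlap_mask(demo, min_overlap_ratio=0.5)
```

Here `x` and `y` share two of four rows and pass the 0.5 ratio, while every pair involving `z` fails it.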


@ -0,0 +1,275 @@
import pandas as pd
import numpy as np
import math, sys, random, warnings, yaml
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, minmax_scale
import matplotlib.pyplot as plt
import seaborn as sns
sys.path.append('/rapids/')
from src.features import empatica_data_yield as edy
def straw_cleaning(sensor_data_files, provider, target):
features = pd.read_csv(sensor_data_files["sensor_data"][0])
with open('config.yaml', 'r') as stream:
config = yaml.load(stream, Loader=yaml.FullLoader)
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
graph_bf_af(features, "1target_rows_before")
# (1.0) OVERRIDE STRESSFULNESS EVENT TARGETS IF ERS SEGMENTING_METHOD IS "STRESS_EVENT"
if config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] == "stress_event":
stress_events_targets = pd.read_csv("data/external/stress_event_targets.csv")
if "appraisal_stressfulness_event_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
features.drop(columns=['phone_esm_straw_appraisal_stressfulness_event_mean'], inplace=True)
features = features.merge(stress_events_targets[["label", "appraisal_stressfulness_event"]] \
.rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
.rename(columns={'appraisal_stressfulness_event': 'phone_esm_straw_appraisal_stressfulness_event_mean'})
if "appraisal_threat_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
features.drop(columns=['phone_esm_straw_appraisal_threat_mean'], inplace=True)
features = features.merge(stress_events_targets[["label", "appraisal_threat"]] \
.rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
.rename(columns={'appraisal_threat': 'phone_esm_straw_appraisal_threat_mean'})
if "appraisal_challenge_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']:
features.drop(columns=['phone_esm_straw_appraisal_challenge_mean'], inplace=True)
features = features.merge(stress_events_targets[["label", "appraisal_challenge"]] \
.rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \
.rename(columns={'appraisal_challenge': 'phone_esm_straw_appraisal_challenge_mean'})
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
# (1.1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE
if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']:
features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True)
if features.empty:
return pd.DataFrame(columns=excluded_columns)
graph_bf_af(features, "2target_rows_after")
# (2) QUALITY CHECK (DATA YIELD COLUMN) drops the rows where E4 or phone data is low quality
phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower()
phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit
features = edy.calculate_empatica_data_yield(features)
if phone_data_yield_column not in features.columns or "empatica_data_yield" not in features.columns:
raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{phone_data_yield_unit}' in [FEATURES].")
hist = features[["empatica_data_yield", phone_data_yield_column]].hist()
plt.savefig(f'phone_E4_histogram.png', bbox_inches='tight')
# Drop rows where phone data yield is less than given threshold
if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]:
hist = features[phone_data_yield_column].hist(bins=5)
plt.close()
features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
# Drop rows where empatica data yield is less than given threshold
if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]:
features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True)
if features.empty:
return pd.DataFrame(columns=excluded_columns)
graph_bf_af(features, "3data_yield_drop_rows")
if features.empty:
return pd.DataFrame(columns=excluded_columns)
# (3) CONTEXTUAL IMPUTATION
# Impute selected phone features with a high number
impute_w_hn = [col for col in features.columns if \
"timeoffirstuse" in col or
"timeoflastuse" in col or
"timefirstcall" in col or
"timelastcall" in col or
"firstuseafter" in col or
"timefirstmessages" in col or
"timelastmessages" in col]
features[impute_w_hn] = features[impute_w_hn].fillna(1500)
# Impute special case (mostcommonactivity) and (homelabel)
impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col]
features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value
impute_w_sn2 = [col for col in features.columns if "homelabel" in col]
features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value
impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col]
features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation
# Impute location features
impute_locations = [col for col in features \
if col.startswith('phone_locations_doryab_') and
'radiusgyration' not in col
]
# Impute selected phone, location, and esm features with 0
impute_zero = [col for col in features if \
col.startswith('phone_applications_foreground_rapids_') or
col.startswith('phone_activity_recognition_') or
col.startswith('phone_battery_rapids_') or
col.startswith('phone_bluetooth_rapids_') or
col.startswith('phone_light_rapids_') or
col.startswith('phone_calls_rapids_') or
col.startswith('phone_messages_rapids_') or
col.startswith('phone_screen_rapids_') or
col.startswith('phone_bluetooth_doryab_') or
col.startswith('phone_wifi_visible')
]
features[impute_zero+impute_locations+list(esm_cols.columns)] = features[impute_zero+impute_locations+list(esm_cols.columns)].fillna(0)
pd.set_option('display.max_rows', None)
graph_bf_af(features, "4context_imp")
# (4) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows)
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns
features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]]
graph_bf_af(features, "5too_much_nans_cols")
# (5) REMOVE COLS WHERE VARIANCE IS 0
if provider["COLS_VAR_THRESHOLD"]:
features.drop(features.std(numeric_only=True)[features.std(numeric_only=True) == 0].index.values, axis=1, inplace=True)
graph_bf_af(features, "6variance_drop")
# Preserve esm cols if deleted (has to come after drop cols operations)
for esm in esm_cols:
if esm not in features:
features[esm] = esm_cols[esm]
# (6) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES?
min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row
features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans
graph_bf_af(features, "7too_much_nans_rows")
if features.empty:
return pd.DataFrame(columns=excluded_columns)
# (7) STANDARDIZATION
if provider["STANDARDIZATION"]:
nominal_cols = [col for col in features.columns if "mostcommonactivity" in col or "homelabel" in col] # Excluded nominal features
# Expected warning within this code block
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=RuntimeWarning)
if provider["TARGET_STANDARDIZATION"]:
features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols)] = \
features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols)].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
else:
features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols + ['phone_esm_straw_' + target])] = \
features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols + ['phone_esm_straw_' + target])].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
graph_bf_af(features, "8standardization")
# (8) IMPUTATION: IMPUTE DATA WITH KNN METHOD
features.reset_index(drop=True, inplace=True)
impute_cols = [col for col in features.columns if col not in excluded_columns and col != "pid"]
features[impute_cols] = impute(features[impute_cols], method="knn")
graph_bf_af(features, "9knn_after")
# (9) DROP HIGHLY CORRELATED FEATURES
esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')]
drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"]
if drop_corr_features["COMPUTE"] and features.shape[0] > 5: # If small amount of segments (rows) is present, do not execute correlation check
numerical_cols = features.select_dtypes(include=np.number).columns.tolist()
# Remove columns where NaN count threshold is passed
valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]]
corr_matrix = valid_features.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])]
# sns.heatmap(corr_matrix, cmap="YlGnBu")
# plt.savefig(f'correlation_matrix.png', bbox_inches='tight')
# plt.close()
# s = corr_matrix.unstack()
# so = s.sort_values(ascending=False)
# pd.set_option('display.max_rows', None)
# sorted_upper = upper.unstack().sort_values(ascending=False)
# print(sorted_upper[sorted_upper > drop_corr_features["CORR_THRESHOLD"]])
features.drop(to_drop, axis=1, inplace=True)
# Preserve esm cols if deleted (has to come after drop cols operations)
for esm in esm_cols:
if esm not in features:
features[esm] = esm_cols[esm]
graph_bf_af(features, "10correlation_drop")
# Transform categorical columns to category dtype
cat1 = [col for col in features.columns if "mostcommonactivity" in col]
if cat1: # Transform columns to category dtype (mostcommonactivity)
features[cat1] = features[cat1].astype(int).astype('category')
cat2 = [col for col in features.columns if "homelabel" in col]
if cat2: # Transform columns to category dtype (homelabel)
features[cat2] = features[cat2].astype(int).astype('category')
# (10) DROP ALL WINDOW RELATED COLUMNS
win_count_cols = [col for col in features if "SO_windowsCount" in col]
if win_count_cols:
features.drop(columns=win_count_cols, inplace=True)
# (11) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME
if features.isna().any().any():
raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.")
return features
def k_nearest(df):
imputer = KNNImputer(n_neighbors=3)
return pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
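def impute(df, method='zero'):
    # Dispatch lazily so that only the requested method runs (the KNN imputer is expensive)
    methods = {
        'zero': lambda: df.fillna(0),
        'high_number': lambda: df.fillna(1500),
        'mean': lambda: df.fillna(df.mean()),
        'median': lambda: df.fillna(df.median()),
        'knn': lambda: k_nearest(df)
    }
    return methods[method]()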
def graph_bf_af(features, phase_name, plt_flag=False):
if plt_flag:
sns.set(rc={"figure.figsize":(16, 8)})
sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number)
plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight')
print(f"\n-------------{phase_name}-------------")
print("Rows number:", features.shape[0])
print("Columns number:", len(features.columns))
print("NaN values:", features.isna().sum().sum())
print("---------------------------------------------\n")
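
Step (7) above standardizes every feature separately within each participant (`groupby('pid').transform(...)`). The sketch below shows the same idea with plain pandas on toy data. It is an illustration under assumptions: the helper name `standardize_within_participant` is made up, and it divides by the population standard deviation (`ddof=0`) to match what `StandardScaler` computes.

```python
import pandas as pd


def standardize_within_participant(df: pd.DataFrame, id_col: str = "pid") -> pd.DataFrame:
    """Z-score every non-id column separately for each participant.

    Hypothetical helper for illustration. Unlike StandardScaler, a constant
    group produces NaN here (0 / 0) rather than 0.
    """
    value_cols = [col for col in df.columns if col != id_col]
    scaled = df.groupby(id_col)[value_cols].transform(
        lambda x: (x - x.mean()) / x.std(ddof=0)  # population standard deviation
    )
    out = df.copy()
    out[value_cols] = scaled
    return out


demo = pd.DataFrame({
    "pid": ["p01", "p01", "p02", "p02"],
    "steps": [10.0, 20.0, 100.0, 300.0],
})
standardized = standardize_within_participant(demo)
```

After scaling, both participants occupy the same range even though their raw values differ by an order of magnitude, which is the point of standardizing per participant rather than globally.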


@ -0,0 +1,59 @@
import pandas as pd
import numpy as np
import math as m
import sys
def extract_second_order_features(intraday_features, so_features_names, prefix=""):
if prefix:
groupby_cols = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime']
else:
groupby_cols = ['local_segment']
if not intraday_features.empty:
so_features = pd.DataFrame()
#print(intraday_features.drop("level_1", axis=1).groupby(["local_segment"]).nsmallest())
if "mean" in so_features_names:
so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).mean(numeric_only=True).add_suffix("_SO_mean")], axis=1)
if "median" in so_features_names:
so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).median(numeric_only=True).add_suffix("_SO_median")], axis=1)
if "sd" in so_features_names:
so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std(numeric_only=True).fillna(0).add_suffix("_SO_sd")], axis=1)
if "nlargest" in so_features_names: # largest 5 -- maybe there is a faster groupby solution?
for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
so_features[column+"_SO_nlargest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nlargest(5).mean())
if "nsmallest" in so_features_names: # smallest 5 -- maybe there is a faster groupby solution?
for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]:
so_features[column+"_SO_nsmallest"] = intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols)[column].apply(lambda x: x.nsmallest(5).mean())
if "count_windows" in so_features_names:
so_features["SO_windowsCount"] = intraday_features.groupby(groupby_cols).count()[prefix+"level_1"]
# numPeaksNonZero specialized for EDA sensor
if "eda_num_peaks_non_zero" in so_features_names and prefix+"numPeaks" in intraday_features.columns:
so_features[prefix+"SO_numPeaksNonZero"] = intraday_features.groupby(groupby_cols)[prefix+"numPeaks"].apply(lambda x: (x!=0).sum())
# numWindowsNonNaN specialized for BVP and IBI sensors
if "hrv_num_windows_non_nan" in so_features_names and prefix+"meanHr" in intraday_features.columns:
so_features[prefix+"SO_numWindowsNonNaN"] = intraday_features.groupby(groupby_cols)[prefix+"meanHr"].apply(lambda x: (~np.isnan(x)).sum())
so_features.reset_index(inplace=True)
else:
so_features = pd.DataFrame(columns=groupby_cols)
return so_features
def get_sample_rate(data): # To-Do get the sample rate information from the file's metadata
try:
timestamps_diff = data['timestamp'].diff().dropna().mean()
print("Timestamp diff:", timestamps_diff)
except Exception as e:
raise Exception("Error occurred while trying to get the mean sample rate from the data.") from e
return m.ceil(1000/timestamps_diff)
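
`extract_second_order_features` above aggregates per-window features into one row per time segment and suffixes each statistic with `_SO_<name>`. A reduced sketch of that pattern follows. The helper name `second_order_summary` and the demo data are assumptions, and it counts rows with `size()` instead of counting a `level_1` column as the original does.

```python
import pandas as pd


def second_order_summary(windows: pd.DataFrame, group_col: str = "local_segment") -> pd.DataFrame:
    """Aggregate window-level features to one row per segment (illustrative helper)."""
    grouped = windows.groupby(group_col)
    summary = pd.concat(
        [
            grouped.mean(numeric_only=True).add_suffix("_SO_mean"),
            # A segment with a single window has an undefined sample sd, so fill it with 0
            grouped.std(numeric_only=True).fillna(0).add_suffix("_SO_sd"),
        ],
        axis=1,
    )
    summary["SO_windowsCount"] = grouped.size()
    return summary.reset_index()


demo = pd.DataFrame({
    "local_segment": ["s1", "s1", "s2"],
    "meanHr": [60.0, 80.0, 70.0],
})
so = second_order_summary(demo)
```

The `SO_windowsCount` column matters downstream: the Empatica data yield is derived from it before the cleaning script drops all window-count columns.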


@ -0,0 +1,75 @@
import pandas as pd
from scipy.stats import entropy
from cr_features.helper_functions import convert_to2d, accelerometer_features, frequency_features
from cr_features.calculate_features_old import calculateFeatures
from cr_features.calculate_features import calculate_features
from cr_features_helper_methods import extract_second_order_features
import sys
def extract_acc_features_from_intraday_data(acc_intraday_data, features, window_length, time_segment, filter_data_by_segment):
acc_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
if not acc_intraday_data.empty:
sample_rate = 32
acc_intraday_data = filter_data_by_segment(acc_intraday_data, time_segment)
if not acc_intraday_data.empty:
acc_intraday_features = pd.DataFrame()
# apply methods from calculate features module
if window_length is None:
acc_intraday_features = \
acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
convert_to2d(x['double_values_0'], x.shape[0]), \
convert_to2d(x['double_values_1'], x.shape[0]), \
convert_to2d(x['double_values_2'], x.shape[0]), \
fs=sample_rate, feature_names=features, show_progress=False))
else:
acc_intraday_features = \
acc_intraday_data.groupby('local_segment').apply(lambda x: calculate_features( \
convert_to2d(x['double_values_0'], window_length*sample_rate), \
convert_to2d(x['double_values_1'], window_length*sample_rate), \
convert_to2d(x['double_values_2'], window_length*sample_rate), \
fs=sample_rate, feature_names=features, show_progress=False))
acc_intraday_features.reset_index(inplace=True)
return acc_intraday_features
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'double_values_0': 'float64',
'double_values_1': 'float64', 'double_values_2': 'float64', 'local_date_time': 'str', 'local_date': "str",
'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
requested_intraday_features = provider["FEATURES"]
calc_windows = kwargs.get('calc_windows', False)
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
else:
requested_window_length = None
# name of the features this function can compute
base_intraday_features_names = accelerometer_features + frequency_features
# the subset of requested features this function can compute
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
# extract features from intraday data
acc_intraday_features = extract_acc_features_from_intraday_data(acc_intraday_data, intraday_features_to_compute,
requested_window_length, time_segment, filter_data_by_segment)
if calc_windows:
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
acc_second_order_features = extract_second_order_features(acc_intraday_features, so_features_names)
return acc_intraday_features, acc_second_order_features
return acc_intraday_features
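
When windows are requested, the signal of each segment is split into chunks of `window_length*sample_rate` samples before features are computed. The sketch below illustrates that reshaping with NumPy. It is not the implementation of `convert_to2d` from `cr_features`: the function `to_windows` is hypothetical, and dropping the trailing partial window is an assumption (the library may handle the remainder differently).

```python
import numpy as np


def to_windows(signal, window_length_s: float, sample_rate_hz: int) -> np.ndarray:
    """Reshape a 1-D signal into rows of fixed-length windows (illustrative only).

    Assumption: samples that do not fill a complete window are discarded.
    """
    samples_per_window = int(window_length_s * sample_rate_hz)
    values = np.asarray(signal, dtype=float)
    n_windows = len(values) // samples_per_window
    return values[: n_windows * samples_per_window].reshape(n_windows, samples_per_window)


acc_x = np.arange(100, dtype=float)  # 100 samples of a fake accelerometer axis
windows = to_windows(acc_x, window_length_s=1, sample_rate_hz=32)
```

With 100 samples at 32 Hz and one-second windows, three full windows are kept and the last four samples are discarded.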


@ -0,0 +1,73 @@
import pandas as pd
from sklearn.preprocessing import StandardScaler
from cr_features.helper_functions import convert_to2d, hrv_features
from cr_features.hrv import extract_hrv_features_2d_wrapper
from cr_features_helper_methods import extract_second_order_features
import sys
# pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', None)
def extract_bvp_features_from_intraday_data(bvp_intraday_data, features, window_length, time_segment, filter_data_by_segment):
bvp_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
if not bvp_intraday_data.empty:
sample_rate = 64
bvp_intraday_data = filter_data_by_segment(bvp_intraday_data, time_segment)
if not bvp_intraday_data.empty:
bvp_intraday_features = pd.DataFrame()
# apply methods from calculate features module
if window_length is None:
bvp_intraday_features = \
bvp_intraday_data.groupby('local_segment').apply(\
lambda x:
extract_hrv_features_2d_wrapper(
convert_to2d(x['blood_volume_pulse'], x.shape[0]),
sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
else:
bvp_intraday_features = \
bvp_intraday_data.groupby('local_segment').apply(\
lambda x:
extract_hrv_features_2d_wrapper(
convert_to2d(x['blood_volume_pulse'], window_length*sample_rate),
sampling=sample_rate, hampel_fiter=False, median_filter=False, mod_z_score_filter=True, feature_names=features))
bvp_intraday_features.reset_index(inplace=True)
return bvp_intraday_features
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
bvp_intraday_data = pd.read_csv(sensor_data_files["sensor_data"])
requested_intraday_features = provider["FEATURES"]
calc_windows = kwargs.get('calc_windows', False)
if provider["WINDOWS"]["COMPUTE"] and calc_windows:
requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
else:
requested_window_length = None
# name of the features this function can compute
base_intraday_features_names = hrv_features
# the subset of requested features this function can compute
intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
# extract features from intraday data
bvp_intraday_features = extract_bvp_features_from_intraday_data(bvp_intraday_data, intraday_features_to_compute,
requested_window_length, time_segment, filter_data_by_segment)
if calc_windows:
so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
bvp_second_order_features = extract_second_order_features(bvp_intraday_features, so_features_names)
return bvp_intraday_features, bvp_second_order_features
return bvp_intraday_features


@ -0,0 +1,32 @@
import pandas as pd
import numpy as np
from datetime import datetime
import sys, yaml
def calculate_empatica_data_yield(features): # TODO
# Get time segment duration in seconds from all segments in features dataframe
datetime_start = pd.to_datetime(features['local_segment_start_datetime'], format='%Y-%m-%d %H:%M:%S')
datetime_end = pd.to_datetime(features['local_segment_end_datetime'], format='%Y-%m-%d %H:%M:%S')
tseg_duration = (datetime_end - datetime_start).dt.total_seconds()
with open('config.yaml', 'r') as stream:
config = yaml.load(stream, Loader=yaml.FullLoader)
sensors = ["EMPATICA_ACCELEROMETER", "EMPATICA_TEMPERATURE", "EMPATICA_ELECTRODERMAL_ACTIVITY", "EMPATICA_INTER_BEAT_INTERVAL"]
for sensor in sensors:
features[f"{sensor.lower()}_data_yield"] = \
(features[f"{sensor.lower()}_cr_SO_windowsCount"] * config[sensor]["PROVIDERS"]["CR"]["WINDOWS"]["WINDOW_LENGTH"]) / tseg_duration \
if f'{sensor.lower()}_cr_SO_windowsCount' in features else 0
empatica_data_yield_cols = [sensor.lower() + "_data_yield" for sensor in sensors]
pd.set_option('display.max_rows', None)
# Assigns 1 to values that are over 1 (in case of windows not being filled fully)
features[empatica_data_yield_cols] = features[empatica_data_yield_cols].apply(lambda x: [y if y <= 1 or np.isnan(y) else 1 for y in x])
features["empatica_data_yield"] = features[empatica_data_yield_cols].mean(axis=1, numeric_only=True).fillna(0)
features.drop(empatica_data_yield_cols, axis=1, inplace=True) # Drop the per-sensor columns; keep them instead if more advanced operations (e.g., a weighted average) are needed later
return features
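
The data yield above is the share of a time segment covered by computed windows: window count times window length, divided by the segment duration, capped at 1. A small sketch of that arithmetic for a single sensor follows. The function name `data_yield`, its parameters, and the demo values are assumptions for illustration.

```python
import pandas as pd


def data_yield(windows_count: pd.Series, window_length_s: float,
               start: pd.Series, end: pd.Series) -> pd.Series:
    """Ratio of segment time covered by windows, capped at 1 (illustrative helper)."""
    duration_s = (pd.to_datetime(end) - pd.to_datetime(start)).dt.total_seconds()
    ratio = windows_count * window_length_s / duration_s
    # Partially filled windows can push the ratio above 1, so cap it; NaN stays NaN
    return ratio.clip(upper=1)


demo = pd.DataFrame({
    "start": ["2023-04-18 10:00:00", "2023-04-18 11:00:00"],
    "end": ["2023-04-18 10:30:00", "2023-04-18 11:30:00"],
    "windows": [30, 70],
})
yield_ratio = data_yield(demo["windows"], 30, demo["start"], demo["end"])
```

For a 30-minute segment with 30-second windows, 30 windows cover half the segment, while 70 windows would nominally cover more than all of it and are capped at 1.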


@ -0,0 +1,82 @@
import pandas as pd
import numpy as np
from scipy.stats import entropy
from cr_features.helper_functions import convert_to2d, gsr_features
from cr_features.calculate_features import calculate_features
from cr_features.gsr import extractGsrFeatures2D
from cr_features_helper_methods import extract_second_order_features
import sys
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
#np.seterr(invalid='ignore')
def extract_eda_features_from_intraday_data(eda_intraday_data, features, window_length, time_segment, filter_data_by_segment):
    eda_intraday_features = pd.DataFrame(columns=["local_segment"] + features)
    if not eda_intraday_data.empty:
        sample_rate = 4
        eda_intraday_data = filter_data_by_segment(eda_intraday_data, time_segment)
        if not eda_intraday_data.empty:
            eda_intraday_features = pd.DataFrame()
            # apply methods from calculate features module
            if window_length is None:
                eda_intraday_features = \
                    eda_intraday_data.groupby('local_segment').apply(\
                        lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], x.shape[0]), sampleRate=sample_rate, featureNames=features,
                                                       threshold=.01, offset=1, riseTime=5, decayTime=15))
            else:
                eda_intraday_features = \
                    eda_intraday_data.groupby('local_segment').apply(\
                        lambda x: extractGsrFeatures2D(convert_to2d(x['electrodermal_activity'], window_length*sample_rate), sampleRate=sample_rate, featureNames=features,
                                                       threshold=.01, offset=1, riseTime=5, decayTime=15))
            eda_intraday_features.reset_index(inplace=True)
    return eda_intraday_features
def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
    data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'electrodermal_activity': 'float64', 'local_date_time': 'str',
                  'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"}
    eda_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types)
    requested_intraday_features = provider["FEATURES"]
    calc_windows = kwargs.get('calc_windows', False)
    if provider["WINDOWS"]["COMPUTE"] and calc_windows:
        requested_window_length = provider["WINDOWS"]["WINDOW_LENGTH"]
    else:
        requested_window_length = None
    # name of the features this function can compute
    base_intraday_features_names = gsr_features
    # the subset of requested features this function can compute
    intraday_features_to_compute = list(set(requested_intraday_features) & set(base_intraday_features_names))
    # extract features from intraday data
    eda_intraday_features = extract_eda_features_from_intraday_data(eda_intraday_data, intraday_features_to_compute,
                                                                    requested_window_length, time_segment, filter_data_by_segment)
    if calc_windows:
        if provider["WINDOWS"]["IMPUTE_NANS"]:
            eda_intraday_features[eda_intraday_features["numPeaks"] == 0] = \
                eda_intraday_features[eda_intraday_features["numPeaks"] == 0].fillna(0)
        pd.set_option('display.max_columns', None)
        so_features_names = provider["WINDOWS"]["SECOND_ORDER_FEATURES"]
        eda_second_order_features = extract_second_order_features(eda_intraday_features, so_features_names)
        return eda_intraday_features, eda_second_order_features
    return eda_intraday_features
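
The `IMPUTE_NANS` branch above fills NaN with 0 only in windows where `numPeaks` is 0, and leaves NaN in windows that did contain peaks. A minimal sketch of that masked imputation on toy data follows; the feature column names other than `numPeaks` and all values are hypothetical, and `.loc` is used here for explicit row selection rather than the boolean-indexing assignment in the diff.

```python
import numpy as np
import pandas as pd

# Toy window-level features; only "numPeaks" is a name taken from the diff.
windows = pd.DataFrame({
    "numPeaks": [0, 2, 0],
    "meanPeakAmplitude": [np.nan, 0.3, np.nan],   # hypothetical feature
    "maxPeakAmplitude": [np.nan, np.nan, np.nan], # hypothetical feature
})

# Select windows without any detected peaks.
no_peaks = windows["numPeaks"] == 0

# Fill NaN with 0 only in those windows; other rows keep their NaN values.
windows.loc[no_peaks] = windows.loc[no_peaks].fillna(0)
```

The window with two peaks keeps its NaN in `maxPeakAmplitude`, so a genuinely missing value there is not mistaken for a zero.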

Some files were not shown because too many files have changed in this diff.