From 87e5209a9fb5078fabe702ccbf743e0612a2edcd Mon Sep 17 00:00:00 2001
From: Primoz
Date: Thu, 8 Dec 2022 16:04:39 +0000
Subject: [PATCH] Squashed commit of the following:

All commits below were authored by Primoz; all timestamps are +0000 (UTC).

commit 8a6b52a97c95dcd8b70b980b4f46421b1a847905 (Tue Nov 29 11:35:49 2022): Switch to 30_before ERS with corresponding targets.
commit 244a05373014b14bc4c75db8ceb68a04dc5328df (Tue Nov 29 11:19:43 2022): Change output file settings to non-standardized.
commit be0324fd01d70c58a9eefd84ccb23d06a42ab57c (Mon Nov 28 12:44:25 2022): Fix some bugs and set categorical columns to the category dtype.
commit 99c2fab8f9ab9cf2ed40a952019f994299b78c05 (Wed Nov 16 09:50:18 2022): Fix a bug in building the individual model (when there is no target in the participant's columns).
commit 286de93bfd55710d77c2e4bab899e73381f9a4a3 (Tue Nov 15 11:21:51 2022): Fix some bugs and extend the ERS and cleaning scripts with logic for multiple stress-event targets.
commit ab803ee49c2898d2dd4d49de64f763f549c077e3 (Tue Nov 15 10:14:07 2022): Add additional appraisal targets.
commit 621f11b2d98cb3e17d86c3be902ebd40f94d3079 (Tue Nov 15 09:53:31 2022): Fix a bug related to wrong user input (duplicated events).
commit bd41f42a5da7e28b9de190a53b5f2195fad0920d (Mon Nov 14 15:07:36 2022): Rename the target_ method to the segmenting_ method.
commit a543ce372f1fd6cdd402be8ebdbe54610e449ce2 (Mon Nov 14 15:04:16 2022): Add comments to aid understanding of the event-related segments script.
commit 74b454b07bf69bdcd0517a940c5296e5f77c492e (Fri Nov 11 09:15:12 2022): Apply changes to string answers to make them language-generic.
commit 6ebe83e47ea4da0066f4cd9dedbdb726cef5d06c (Thu Nov 10 12:42:52 2022): Improve the ERS extract method with a couple of validations.
commit 00350ef8ca2ef43aed71f76609917021ce920f31 (Thu Nov 10 10:32:58 2022): Change config for the stressfulness event target method.
commit e4985c91214ac6ecd43501d52b67816d26d4fdef (Thu Nov 10 10:29:11 2022): Override the stressfulness event target with values extracted from csv.
commit a668b6e8dad4dd393802ed4d5e948d6d42d60a01 (Thu Nov 10 09:37:27 2022): Extract ERS and stress event targets to csv files (completed).
commit 9199b53ded1c5d858882d9826aa15e5e2102ab08 (Wed Nov 9 15:11:51 2022): Get, join, and start processing the required ERS stress event data.
commit f3c6a66da9a7ce92d4296436927f48b3d1a467b8 (Tue Nov 8 15:53:43 2022): Begin with stress events in the ERS script.
commit 0b3e9226b3c683e87ef758756f0d83350642a716 (Tue Nov 8 14:44:24 2022): Make small corrections in the ERS file.
commit 2d83f7ddecbc3880e8d901fc2aae25621a280f75 (Tue Nov 8 11:32:05 2022): Begin the ERS logic for 90-minute events.
commit 1da72a7cbe3178ef911ddb263ea30e5da79407ad (Tue Nov 8 09:45:37 2022): Rename the targets method in config.
commit 9f441afc16ff9b75e131eb47d41edb570b51190c (Fri Nov 4 15:09:04 2022): Begin ERS logic for 90-minute events.
commit c1c9f4d05ac8bdf03f4759623165f8e795eaa5a9 (Fri Nov 4 09:11:58 2022, merge of 62f46ea3 and 7ab0280d): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit 62f46ea3763ba47af2a7c62d6b0b453cfd549929 (Fri Nov 4 09:11:53 2022): Prepare method-based logic for ERS generation.
commit 7ab0280d7ed23022a4d2de0caa7bba3a43dff9d4 (Fri Nov 4 08:58:08 2022): Correctly rename the stressful event target variable.
commit eefa9f3f4d4c95374dd31907787ed767c636fe37 (Thu Nov 3 14:49:54 2022): Add new target: stressfulness_event.
commit 5e8174dd41f5a4c2aa35c74fb4bd9b19a918d1c4 (Thu Nov 3 13:52:45 2022): Add new target: stressfulness_period.
commit 35c1a762e7179c7b11f8f1154c1a4f4133402324 (Thu Nov 3 13:51:18 2022): Improve filtering by esm_session and device_id.
commit 02264b21fd43212dd4636d300321e254da392149 (Thu Nov 3 09:30:12 2022): Add logic for target selection in ERS processing.
commit 0ce8723bdb72bafa40924ff4fd3a2328e81808eb (Wed Nov 2 14:01:21 2022): Extend imputation logic within the cleaning script.
commit 30b38bfc028b6e3261701feb915f9e3820a77c75 (Fri Oct 28 09:00:13 2022): Fix the ERS file generation procedure for participants with multiple devices.
commit cd137af15a9f03724593595a152f34a24323088f (Thu Oct 27 14:20:15 2022): Config for 30-minute EMA segments.
commit 3c0585a566ede91dd97e2ba0a6617a4f42e72617 (Thu Oct 27 14:12:56 2022): Remove obsolete comments.
commit 6b487fcf7b64a722e9c5486764155764332ba73c (Thu Oct 27 14:11:42 2022): Set E4 data yield to 1 if it is over 1. Optimize the E4 data_yield script.
commit 5d17c92e54427cc70a7f747be398ad59af98c5bf (Wed Oct 26 14:18:20 2022, merge of a31fdd14 and 0d143e6a): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit a31fdd1479193a67b6a6985cc42fdc66ac7be1a4 (Wed Oct 26 14:18:08 2022): Start to test the perceived empatica_data_yield error.
commit 936324d234b6636366b1a569793452a4cca54847 (Wed Oct 26 14:17:27 2022): Switch config to 30-minute event-related segments.
commit da0a4596f814a40997efe782239cb296e89fe606 (Wed Oct 26 14:16:25 2022): Add additional ESM processing logic for ERS csv extraction.
commit d4d74818e69eab2755f988391978400096902b5d (Wed Oct 26 14:14:32 2022): Fix a bug: missing time_segment column when the df is empty.
commit 14ff59914b4b3071ff7a40bf9daaba558ff66746 (Wed Oct 26 09:59:46 2022): Fix to correct dtypes.
commit 6ab0ac5329d0be06bbe8357203422c7d981c325a (Wed Oct 26 09:57:26 2022): Optimize memory consumption by defining dtypes when reading the csv file.
commit 0d143e6aadaf9b67b2fbfd6cdf8c218d5720f157 (Tue Oct 25 15:28:27 2022, merge of 8acac501 and b92a3aa3): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit 8acac501251105332d1d3d4863e88744b9528844 (Tue Oct 25 15:26:43 2022): Add a safety net for when the features dataframe is empty.
commit b92a3aa37a1968b7129cdb2b7ebaf9beda981428 (Tue Oct 25 15:25:22 2022): Remove unwanted output and other error-producing code.
commit bfd637eb9c40872b954b608aef420b24393d3e6f (Tue Oct 25 08:53:44 2022): Improve string formatting in the straw_events file.
commit 0d81ad5756c54f03b2b200b801672b9eb39f27d5 (Wed Oct 19 13:35:04 2022): Debug the assignment of segments to rows.
commit cea451d344e3b0bcc110c77f2845a9077bbe5d75 (Tue Oct 18 09:15:06 2022, merge of e88bbd54 and cf38d9f1): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit e88bbd548fcd2bb1f9057b27b572afa5a6d028a5 (Tue Oct 18 09:15:00 2022): Add a new daily segment and filter by segment in the cleaning script.
commit cf38d9f175c5bcc9d0f8c70f49e59d2b56d8ada7 (Mon Oct 17 15:07:33 2022): Implement ERS generation logic.
commit f3ca56cdbf22d7eed38f920889c4b19f49da7760 (Fri Oct 14 14:46:28 2022): Start integrating the ERS logic within Snakemake.
commit 797aa98f4fe2faf411c5130d27c51c5c606e5c0e (Wed Oct 12 15:51:50 2022): Config for ERS testing.
commit 9baff159cd1a61019410838efefa4b59a6c4981c (Wed Oct 12 15:51:23 2022): Changes needed for testing and starting the Event-Related Segments.
commit 0f21273508654133a51d1e20e485c3de33dc2779 (Wed Oct 12 12:32:51 2022): Bug fixes.
commit 55517eb737463e21c6f6ad18f4711d38bd02fec5 (Wed Oct 12 12:23:11 2022): Necessary commit before proceeding.
commit de15a52dba43325c171584464995fa4311403e7e (Tue Oct 11 08:36:23 2022): Bug fix.
commit 1ad25bb5727d30affd2ee063386f2f5ca52e0d63 (Tue Oct 11 08:26:17 2022): A few modifications of some imputation values in the cleaning script and feature extraction.
commit 9884b383cf6e1b339738deea8cc54a6d264e5c87 (Mon Oct 10 16:45:38 2022): Testing new data with AutoML.
commit 2dc89c083c99d43da5e1eb9cbc1c07d39901f27d (Fri Oct 7 08:52:12 2022): Small changes in overall cleaning.
commit 001d40072973797f159b94b48ac3add10138f58e (Thu Oct 6 14:28:12 2022): Clean features and create input files based on all possible targets.
commit 1e38d9bf1e4d1caa10ff8c5245a59a9b20d4d7f9 (Thu Oct 6 13:27:38 2022): Standardization and correlation visualization in the overall cleaning script.
commit a34412a18dbd75aafa9f3bba9303ab6a962cec03 (Wed Oct 5 14:16:55 2022): E4 data yield corrections. Changes in the overall cleaning script (standardization).
commit 437459648f16f71acec6794a07404bfeb781f907 (Wed Oct 5 13:35:05 2022): Error fixes: individual script, handle participants' missing data.
commit 53f6cc60d5d8262a3bbb3cad82e2ebf49cb9e2a5 (Mon Oct 3 13:06:39 2022): Necessary changes to the config and cleaning script ...
commit bbeabeee6ff7e3f8cc429470404c2c4b246b6a22 (Mon Oct 3 12:53:31 2022): Last changes before processing on the server.
commit 44531c6d94705757110df6d087f3762699d7211a (Fri Sep 30 10:04:07 2022): Code cleanup; rework individual cleaning based on changes in the overall script. Changes in thresholds.
commit 7ac7cd5a3714521fc9a36a8371554a921df132fe (Thu Sep 29 14:33:21 2022): Preparation of the overall cleaning script.
commit 68fd69dadab8c7e14095469d5ed1d9921b4c0ce7 (Thu Sep 29 11:55:25 2022): Cleaning script for individuals: corrections and comments.
commit a4f0d056a047aaf3458b043eba5e98df76da039a (Thu Sep 29 11:44:27 2022): Fillna for app foreground and activity recognition.
commit 6286e7a44c7e2e99116b83fc3c06ce4398d022e5 (Wed Sep 28 12:47:08 2022): Remove the firstuseafter column from contextual imputation.
commit 9b3447febd075f4dc5ede6a4071f72821835e7a2 (Wed Sep 28 12:40:05 2022): Contextual imputation correction.
commit d6adda30cf95e6d9660e6001225dee461bbe3704 (Wed Sep 28 12:37:51 2022): Contextual imputation on time (first/last) features.
commit 8af4ef11dc711dffd86c71c99b15e9451aa94ecd (Wed Sep 28 10:02:47 2022): Contextual imputation by feature type.
commit 536b9494cdfbc6b31d41a74126ac6af1d8fe62e0 (Tue Sep 27 14:12:08 2022): Cleaning script corrections.
commit f0b87c9dd02d31223c2b69c2586221ffe202e21f (Tue Sep 27 09:54:15 2022): Debugging of the empatica data yield integration.
commit 7fcdb873fe910f7243fb71ffe56abfdd9ebb81e3 (Tue Sep 27 07:50:29 2022, merge of 5c7bb0f4 and bd53dc16): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit 5c7bb0f4c14ab4413ed3682268bb7e65ac56fc83 (Tue Sep 27 07:48:32 2022): Config changes.
commit bd53dc1684b2b6fadf5653e2a9cc00c953410149 (Mon Sep 26 15:54:00 2022): Empatica data yield usage in the cleaning script.
commit d9a574c550f8beb1cf5662e164ffa66048106dc9 (Fri Sep 23 13:24:50 2022): Changes in the cleaning script and preparation of the empatica data yield method.
commit 19aa8707c0b61790d9d4c1834944c1711d8777aa (Thu Sep 22 13:45:51 2022): Redefine cleaning steps after revision.
commit 247d758cb7eaa93ed9c7df2b0711325ffa27d985 (Wed Sep 21 07:18:01 2022, merge of 90ee99e4 and 7493aaa6): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit 90ee99e4b99eeb32650df06d60e43c0ee1e66a87 (Wed Sep 21 07:16:00 2022): Remove TODO comments.
commit 7493aaa64368f81adbf478b43411063f56a9a87e (Tue Sep 20 12:57:55 2022): Small changes in the cleaning script and missing-values testing.
commit eaf4340afd7db8a43174a4789cfb9ba5a831dac4 (Tue Sep 20 08:03:48 2022): Small imputation and cleaning corrections.
commit a96ea508c628d8c8c736126b65772c06c7cf4ddc (Mon Sep 19 07:34:02 2022): Fill NaNs of Empatica's SD second-order feature (must be tested).
commit 52e11cdcab51ec8ef97d379304e9297dd233773f (Mon Sep 19 07:25:54 2022): Configurations for the new standardization path.
commit 92aff93e65f765ca531bf414c2bdb2e6d7ca1240 (Mon Sep 19 07:25:16 2022): Remove the standardization script.
commit 18b63127deefb7cee0061ed741af179a4d6fdb01 (Mon Sep 19 06:16:26 2022): Remove all standardization rules and configurations.
commit 62982866cd9f3f45a335d26b67c3fdf3035c6d1e (Fri Sep 16 13:24:21 2022): Phone wifi visible inspection (WIP).
commit 0ce6da5444b7d0e3a30c8f6ca73422dc88d928b2 (Fri Sep 16 11:30:08 2022): Relocate kNN imputation and execute it only on specific columns.
commit e3b78c8a85084b4cd873350df5bc2b75009fae83 (Fri Sep 16 10:58:57 2022): Impute selected phone features (wifi visible, screen, and light) with 0.
commit 7d85f75d218ef9b3af5ae077343a58e62bb5fa2c (Fri Sep 16 09:03:30 2022): Changes in the phone features NaN values script.
commit 385e21409d1d3978588fea56cc26a8d241e3af35 (Thu Sep 15 14:16:58 2022): Changes in the NaN values testing script.
commit 18002f59e1c0e161f68b2a4cc217e62a6a86e467 (Thu Sep 15 10:48:59 2022): Fill in NaN values for Doryab bluetooth and locations features.
commit 3cf7ca41aac7772c89ac90ad0a26e938d39f3f34 (Wed Sep 14 15:38:32 2022, merge of d27a4a71 and d5ab5a03): Merge branch 'imputation_and_cleaning' of https://repo.ijs.si/junoslukan/rapids into imputation_and_cleaning.
commit d5ab5a0394cdd395088723ff80d78f2257f8e37c (Wed Sep 14 14:13:03 2022): Write testing scripts to determine the point of manual imputation.
commit dfbb758902bebdbe86b42753b0252b96ec6e791f (Tue Sep 13 13:54:06 2022): Changes in AutoML params and environment.yml.
commit 4ec371ed96102d4b1f3b85557cb5fc3c388c1cef (Tue Sep 13 09:51:03 2022): Testing auto-sklearn.
commit d27a4a71c81a50f4d30e5eb5bd9b664610da268a (Mon Sep 12 13:44:17 2022): Reorganisation and reordering of the cleaning script.
commit 15d792089d3b37dd381b1cb02b11f1106920d304 (Thu Sep 1 10:33:36 2022): Changes in the cleaning script:
    - target extracted from config to remove rows where the target is NaN
    - prepared sns.heatmap for further missing-values analysis
    - necessary changes in config and participant p01
    - picture of the heatmap which shows the state of values after cleaning
commit cb351e0ff6f6325e86fb2867116a44505a44eae1 (Thu Sep 1 10:06:57 2022): Remove an unnecessary line (rows with no target value will be removed in the cleaning script).
commit 86299d346b5c3d706190db0dc14555309d17629f (Thu Sep 1 09:57:21 2022): Impute phone and sms NAs with 0.
commit 3f7ec80c18b1237c49dfdd9a6736cdf518e0b7fb (Wed Aug 31 10:18:50 2022): Preparation: a) phone_calls 0 imputation, b) remove rows with NaN target.
---
 Snakefile                                          | 130 +-------
 automl_test.py                                     |  57 ++++
 config.yaml                                        |  93 ++----
 data/external/participant_files/p01.yaml           |  18 +-
 data/external/timesegments_daily.csv               |   1 +
 environment.yml                                    |  29 +-
 rules/common.smk                                   |  21 --
 rules/features.smk                                 | 146 +--------
 rules/models.smk                                   |  30 +-
 rules/preprocessing.smk                            |  26 ++
 src/data/datetime/assign_to_time_segment.R         |   3 +
 .../all_cleaning_individual/straw/__init__.py      |   0
 .../all_cleaning_individual/straw/main.py          | 198 ++++++++----
 .../all_cleaning_overall/straw/__init__.py         |   0
 .../all_cleaning_overall/straw/main.py             | 275 +++++++++++++----
 src/features/cr_features_helper_methods.py         |   2 +-
 .../empatica_accelerometer/cr/main.py              |   6 +-
 src/features/empatica_data_yield.py                |  32 ++
 .../cr/main.py                                     |   6 +-
 .../empatica_inter_beat_interval/cr/main.py        |   5 +
 src/features/empatica_temperature/cr/main.py       |   5 +-
 src/features/entry.py                              |   5 +-
 .../phone_activity_recognition/rapids/main.py      |   2 +-
 .../rapids/main.py                                 |   7 +-
 src/features/phone_bluetooth/doryab/main.py        |   4 +-
 src/features/phone_calls/rapids/main.R             |  12 +-
 src/features/phone_esm/straw/esm.py                | 274 +++++++++++++++++
 src/features/phone_esm/straw/main.py               |   7 +-
 .../process_user_event_related_segments.py         | 220 ++++++++++++++
 src/features/phone_light/rapids/main.py            |   2 +-
 src/features/phone_locations/doryab/main.py        |   3 +-
 src/features/phone_messages/rapids/main.R          |  11 +-
 src/features/phone_screen/rapids/main.py           |   2 +-
 src/features/phone_wifi_visible/rapids/main.R      |  13 +-
 src/features/standardization/main.py               |  50 ---
 src/features/utils/utils.py                        |   8 +-
 src/models/helper.py                               |   9 +-
 ...atures_and_targets_for_population_model.py      |  12 +-
 src/models/select_targets.py                       |   6 +-
 tests/scripts/NaN.png                              | Bin 0 -> 12199 bytes
 tests/scripts/missing_vals.py                      |   4 +-
 tests/scripts/phone_feats.py                       | 285 ++++++++++++++++++
 tests/scripts/standardization_methods_test.py      |  70 +++++
 tests/scripts/test_acc.py                          |  38 +++
 44 files changed, 1547 insertions(+), 580 deletions(-)
 create mode 100644 automl_test.py
 create mode 100644 src/features/all_cleaning_individual/straw/__init__.py
 create mode 100644 src/features/all_cleaning_overall/straw/__init__.py
 create mode 100644 src/features/empatica_data_yield.py
 create mode 100644 src/features/phone_esm/straw/esm.py
 create mode 100644 src/features/phone_esm/straw/process_user_event_related_segments.py
 delete mode 100644 src/features/standardization/main.py
 create mode 100644 tests/scripts/NaN.png
 create mode 100644 tests/scripts/phone_feats.py
 create mode 100644 tests/scripts/standardization_methods_test.py
 create mode 100644 tests/scripts/test_acc.py

diff --git a/Snakefile b/Snakefile
index 7c5749d7..44e9b5bf 100644
--- a/Snakefile
+++ b/Snakefile
@@ -33,12 +33,6 @@ for provider in config["PHONE_DATA_YIELD"]["PROVIDERS"].keys():
         files_to_compute.extend(expand("data/processed/features/{pid}/phone_data_yield.csv", pid=config["PIDS"]))
         files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]))
         files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv")
-        if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \
-            and
config["PHONE_DATA_YIELD"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_data_yield.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_MESSAGES"]["PROVIDERS"].keys(): if config["PHONE_MESSAGES"]["PROVIDERS"][provider]["COMPUTE"]: @@ -48,12 +42,6 @@ for provider in config["PHONE_MESSAGES"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_messages.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_MESSAGES"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_messages.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_CALLS"]["PROVIDERS"].keys(): if config["PHONE_CALLS"]["PROVIDERS"][provider]["COMPUTE"]: @@ -68,12 +56,6 @@ for provider in config["PHONE_CALLS"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_calls.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_CALLS"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_calls.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_BLUETOOTH"]["PROVIDERS"].keys(): if config["PHONE_BLUETOOTH"]["PROVIDERS"][provider]["COMPUTE"]: @@ -83,12 +65,6 @@ for provider in config["PHONE_BLUETOOTH"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_bluetooth.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_BLUETOOTH"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_bluetooth.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - 
files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_ACTIVITY_RECOGNITION"]["PROVIDERS"].keys(): if config["PHONE_ACTIVITY_RECOGNITION"]["PROVIDERS"][provider]["COMPUTE"]: @@ -101,12 +77,6 @@ for provider in config["PHONE_ACTIVITY_RECOGNITION"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_activity_recognition.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_ACTIVITY_RECOGNITION"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_activity_recognition.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_BATTERY"]["PROVIDERS"].keys(): if config["PHONE_BATTERY"]["PROVIDERS"][provider]["COMPUTE"]: @@ -118,12 +88,6 @@ for provider in config["PHONE_BATTERY"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_battery.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_BATTERY"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_battery.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_SCREEN"]["PROVIDERS"].keys(): if config["PHONE_SCREEN"]["PROVIDERS"][provider]["COMPUTE"]: @@ -140,12 +104,6 @@ for provider in config["PHONE_SCREEN"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_screen.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_SCREEN"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_screen.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in 
config["PHONE_LIGHT"]["PROVIDERS"].keys(): if config["PHONE_LIGHT"]["PROVIDERS"][provider]["COMPUTE"]: @@ -155,12 +113,6 @@ for provider in config["PHONE_LIGHT"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_light.csv", pid=config["PIDS"],)) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_LIGHT"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_light.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_ACCELEROMETER"]["PROVIDERS"].keys(): if config["PHONE_ACCELEROMETER"]["PROVIDERS"][provider]["COMPUTE"]: @@ -184,12 +136,6 @@ for provider in config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_applications_foreground.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_APPLICATIONS_FOREGROUND"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_applications_foreground.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_WIFI_VISIBLE"]["PROVIDERS"].keys(): if config["PHONE_WIFI_VISIBLE"]["PROVIDERS"][provider]["COMPUTE"]: @@ -199,12 +145,6 @@ for provider in config["PHONE_WIFI_VISIBLE"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_wifi_visible.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_WIFI_VISIBLE"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_wifi_visible.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["PHONE_WIFI_CONNECTED"]["PROVIDERS"].keys(): if config["PHONE_WIFI_CONNECTED"]["PROVIDERS"][provider]["COMPUTE"]: @@ -233,12 +173,6 @@ for provider in config["PHONE_ESM"]["PROVIDERS"].keys(): 
files_to_compute.extend(expand("data/processed/features/{pid}/phone_esm.csv", pid=config["PIDS"])) # files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv",pid=config["PIDS"])) # files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_ESM"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_esm.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") # We can delete these if's as soon as we add feature PROVIDERS to any of these sensors if isinstance(config["PHONE_APPLICATIONS_CRASHES"]["PROVIDERS"], dict): @@ -304,12 +238,6 @@ for provider in config["PHONE_LOCATIONS"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/phone_locations.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["LIST"] and config["STANDARDIZATION"]["PROVIDERS"]["OTHER"]["COMPUTE"] \ - and config["PHONE_LOCATIONS"]["PROVIDERS"][provider]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_phone_locations.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"].keys(): if config["FITBIT_CALORIES_INTRADAY"]["PROVIDERS"][provider]["COMPUTE"]: @@ -400,13 +328,6 @@ for provider in config["EMPATICA_ACCELEROMETER"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/empatica_accelerometer.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"] and config["STANDARDIZATION"]["PROVIDERS"][provider]["COMPUTE"] \ - and config["EMPATICA_ACCELEROMETER"]["PROVIDERS"][provider]["WINDOWS"]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/interim/{pid}/empatica_accelerometer_features/z_empatica_accelerometer_{language}_{provider_key}_windows.csv", pid=config["PIDS"], language=get_script_language(config["STANDARDIZATION"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower())) - files_to_compute.extend(expand("data/processed/features/{pid}/z_empatica_accelerometer.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["EMPATICA_HEARTRATE"]["PROVIDERS"].keys(): if 
config["EMPATICA_HEARTRATE"]["PROVIDERS"][provider]["COMPUTE"]: @@ -426,13 +347,6 @@ for provider in config["EMPATICA_TEMPERATURE"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/empatica_temperature.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"] and config["STANDARDIZATION"]["PROVIDERS"][provider]["COMPUTE"] \ - and config["EMPATICA_TEMPERATURE"]["PROVIDERS"][provider]["WINDOWS"]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/interim/{pid}/empatica_temperature_features/z_empatica_temperature_{language}_{provider_key}_windows.csv", pid=config["PIDS"], language=get_script_language(config["STANDARDIZATION"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower())) - files_to_compute.extend(expand("data/processed/features/{pid}/z_empatica_temperature.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["EMPATICA_ELECTRODERMAL_ACTIVITY"]["PROVIDERS"].keys(): if config["EMPATICA_ELECTRODERMAL_ACTIVITY"]["PROVIDERS"][provider]["COMPUTE"]: @@ -442,13 +356,6 @@ for provider in config["EMPATICA_ELECTRODERMAL_ACTIVITY"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/empatica_electrodermal_activity.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"] and config["STANDARDIZATION"]["PROVIDERS"][provider]["COMPUTE"] \ - and config["EMPATICA_ELECTRODERMAL_ACTIVITY"]["PROVIDERS"][provider]["WINDOWS"]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/interim/{pid}/empatica_electrodermal_activity_features/z_empatica_electrodermal_activity_{language}_{provider_key}_windows.csv", pid=config["PIDS"], language=get_script_language(config["STANDARDIZATION"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower())) - files_to_compute.extend(expand("data/processed/features/{pid}/z_empatica_electrodermal_activity.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["EMPATICA_BLOOD_VOLUME_PULSE"]["PROVIDERS"].keys(): if config["EMPATICA_BLOOD_VOLUME_PULSE"]["PROVIDERS"][provider]["COMPUTE"]: @@ -458,13 +365,6 @@ for provider in config["EMPATICA_BLOOD_VOLUME_PULSE"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/empatica_blood_volume_pulse.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"] and config["STANDARDIZATION"]["PROVIDERS"][provider]["COMPUTE"] \ - and 
config["EMPATICA_BLOOD_VOLUME_PULSE"]["PROVIDERS"][provider]["WINDOWS"]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/interim/{pid}/empatica_blood_volume_pulse_features/z_empatica_blood_volume_pulse_{language}_{provider_key}_windows.csv", pid=config["PIDS"], language=get_script_language(config["STANDARDIZATION"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower())) - files_to_compute.extend(expand("data/processed/features/{pid}/z_empatica_blood_volume_pulse.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") for provider in config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"].keys(): if config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"][provider]["COMPUTE"]: @@ -474,13 +374,6 @@ for provider in config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"].keys(): files_to_compute.extend(expand("data/processed/features/{pid}/empatica_inter_beat_interval.csv", pid=config["PIDS"])) files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"])) files_to_compute.append("data/processed/features/all_participants/all_sensor_features.csv") - if provider in config["STANDARDIZATION"]["PROVIDERS"] and config["STANDARDIZATION"]["PROVIDERS"][provider]["COMPUTE"] \ - and config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"][provider]["WINDOWS"]["STANDARDIZE_FEATURES"]: - files_to_compute.extend(expand("data/interim/{pid}/empatica_inter_beat_interval_features/z_empatica_inter_beat_interval_{language}_{provider_key}_windows.csv", pid=config["PIDS"], language=get_script_language(config["STANDARDIZATION"]["PROVIDERS"][provider]["SRC_SCRIPT"]), provider_key=provider.lower())) - files_to_compute.extend(expand("data/processed/features/{pid}/z_empatica_inter_beat_interval.csv", pid=config["PIDS"])) - if config["STANDARDIZATION"]["MERGE_ALL"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"])) - files_to_compute.append("data/processed/features/all_participants/z_all_sensor_features.csv") if isinstance(config["EMPATICA_TAGS"]["PROVIDERS"], dict): for provider in config["EMPATICA_TAGS"]["PROVIDERS"].keys(): @@ -517,24 +410,16 @@ for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys(): if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]: if provider == "STRAW": files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"])) - if config["ALL_CLEANING_INDIVIDUAL"]["CLEAN_STANDARDIZED"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features_cleaned_" + provider.lower() + "_py.csv", pid=config["PIDS"])) else: files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"])) - if config["ALL_CLEANING_INDIVIDUAL"]["CLEAN_STANDARDIZED"]: - files_to_compute.extend(expand("data/processed/features/{pid}/z_all_sensor_features_cleaned_" + provider.lower() + "_R.csv", pid=config["PIDS"])) for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys(): if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]: if provider == "STRAW": - files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + 
provider.lower() +"_py.csv")) - if config["ALL_CLEANING_OVERALL"]["CLEAN_STANDARDIZED"]: - files_to_compute.extend(expand("data/processed/features/all_participants/z_all_sensor_features_cleaned_" + provider.lower() +"_py.csv")) + for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]: + files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_py_(" + target + ").csv")) else: - files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv")) - if config["ALL_CLEANING_OVERALL"]["CLEAN_STANDARDIZED"]: - files_to_compute.extend(expand("data/processed/features/all_participants/z_all_sensor_features_cleaned_" + provider.lower() +"_R.csv")) - + files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +"_R.csv")) # Baseline features if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]: @@ -545,12 +430,9 @@ if config["PARAMS_FOR_ANALYSIS"]["BASELINE"]["COMPUTE"]: # Targets (labels) if config["PARAMS_FOR_ANALYSIS"]["TARGET"]["COMPUTE"]: - # files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"])) - # files_to_compute.extend(expand("data/processed/models/population_model/input.csv")) - files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/z_input.csv", pid=config["PIDS"])) - files_to_compute.extend(expand("data/processed/models/population_model/z_input.csv")) - -#files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/output_{cv_method}/baselines.csv", pid=config["PIDS"], cv_method=config["PARAMS_FOR_ANALYSIS"]["CV_METHODS"])) + files_to_compute.extend(expand("data/processed/models/individual_model/{pid}/input.csv", pid=config["PIDS"])) + for target in config["PARAMS_FOR_ANALYSIS"]["TARGET"]["ALL_LABELS"]: + files_to_compute.extend(expand("data/processed/models/population_model/input_" + target + ".csv")) rule all: input: diff --git a/automl_test.py b/automl_test.py new file mode 100644 index 00000000..3e2e3b84 --- /dev/null +++ b/automl_test.py @@ -0,0 +1,57 @@ +from pprint import pprint +import sklearn.metrics +import autosklearn.regression + +import datetime +import importlib +import os +import sys + +import numpy as np +import matplotlib.pyplot as plt +import pandas as pd +import seaborn as sns +import yaml + +from sklearn import linear_model, svm, kernel_ridge, gaussian_process +from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split +from sklearn.metrics import mean_squared_error, r2_score +from sklearn.impute import SimpleImputer + +model_input = pd.read_csv("data/processed/models/population_model/input_PANAS_negative_affect_mean.csv") # Standardizirani podatki + +model_input.dropna(axis=1, how="all", inplace=True) +model_input.dropna(axis=0, how="any", subset=["target"], inplace=True) + +categorical_feature_colnames = ["gender", "startlanguage"] +categorical_feature_colnames += [col for col in model_input.columns if "mostcommonactivity" in col or "homelabel" in col] +categorical_features = model_input[categorical_feature_colnames].copy() +mode_categorical_features = categorical_features.mode().iloc[0] +categorical_features = categorical_features.fillna(mode_categorical_features) +categorical_features = categorical_features.apply(lambda col: col.astype("category")) +if not categorical_features.empty: + categorical_features = 
pd.get_dummies(categorical_features) +numerical_features = model_input.drop(categorical_feature_colnames, axis=1) +model_in = pd.concat([numerical_features, categorical_features], axis=1) + +index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"] +model_in.set_index(index_columns, inplace=True) + +X_train, X_test, y_train, y_test = train_test_split(model_in.drop(["target", "pid"], axis=1), model_in["target"], test_size=0.30) + +automl = autosklearn.regression.AutoSklearnRegressor( + time_left_for_this_task=7200, + per_run_time_limit=120 +) +automl.fit(X_train, y_train, dataset_name='straw') + +print(automl.leaderboard()) +pprint(automl.show_models(), indent=4) + +train_predictions = automl.predict(X_train) +print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions)) +test_predictions = automl.predict(X_test) +print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions)) + +import sys +sys.exit() diff --git a/config.yaml b/config.yaml index 770305e4..50b7dd41 100644 --- a/config.yaml +++ b/config.yaml @@ -21,9 +21,12 @@ CREATE_PARTICIPANT_FILES: # See https://www.rapids.science/latest/setup/configuration/#time-segments TIME_SEGMENTS: &time_segments - TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT - FILE: "data/external/timesegments_daily.csv" + TYPE: EVENT # FREQUENCY, PERIODIC, EVENT + FILE: "data/external/straw_events.csv" INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs + TAILORED_EVENTS: # Only relevant if TYPE=EVENT + COMPUTE: True + SEGMENTING_METHOD: "30_before" # 30_before, 90_before, stress_event # See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study TIMEZONE: @@ -70,7 +73,6 @@ PHONE_ACCELEROMETER: COMPUTE: False FEATURES: ["maxmagnitude", "minmagnitude", "avgmagnitude", "medianmagnitude", "stdmagnitude"] SRC_SCRIPT: src/features/phone_accelerometer/rapids/main.py - PANDA: COMPUTE: False VALID_SENSED_MINUTES: False @@ -93,7 +95,6 @@ PHONE_ACTIVITY_RECOGNITION: STATIONARY: ["still", "tilting"] MOBILE: ["on_foot", "walking", "running", "on_bicycle"] VEHICLE: ["in_vehicle"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_activity_recognition/rapids/main.py # See https://www.rapids.science/latest/features/phone-applications-crashes/ @@ -134,7 +135,6 @@ PHONE_APPLICATIONS_FOREGROUND: APP_EPISODES: ["countepisode", "minduration", "maxduration", "meanduration", "sumduration"] IGNORE_EPISODES_SHORTER_THAN: 0 # in minutes, set to 0 to disable IGNORE_EPISODES_LONGER_THAN: 300 # in minutes, set to 0 to disable - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_applications_foreground/rapids/main.py # See https://www.rapids.science/latest/features/phone-applications-notifications/ @@ -155,7 +155,6 @@ PHONE_BATTERY: RAPIDS: COMPUTE: True FEATURES: ["countdischarge", "sumdurationdischarge", "countcharge", "sumdurationcharge", "avgconsumptionrate", "maxconsumptionrate"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_battery/rapids/main.py # See https://www.rapids.science/latest/features/phone-bluetooth/ @@ -163,9 +162,8 @@ PHONE_BLUETOOTH: CONTAINER: bluetooth PROVIDERS: RAPIDS: - COMPUTE: True + COMPUTE: False FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_bluetooth/rapids/main.R DORYAB: @@ -183,7 +181,6 @@ PHONE_BLUETOOTH: DEVICES: ["countscans", "uniquedevices", "meanscans", "stdscans"] SCANS_MOST_FREQUENT_DEVICE: 
["withinsegments", "acrosssegments", "acrossdataset"] SCANS_LEAST_FREQUENT_DEVICE: ["withinsegments", "acrosssegments", "acrossdataset"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_bluetooth/doryab/main.py # See https://www.rapids.science/latest/features/phone-calls/ @@ -198,7 +195,6 @@ PHONE_CALLS: missed: [count, distinctcontacts, timefirstcall, timelastcall, countmostfrequentcontact] incoming: [count, distinctcontacts, meanduration, sumduration, minduration, maxduration, stdduration, modeduration, entropyduration, timefirstcall, timelastcall, countmostfrequentcontact] outgoing: [count, distinctcontacts, meanduration, sumduration, minduration, maxduration, stdduration, modeduration, entropyduration, timefirstcall, timelastcall, countmostfrequentcontact] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_calls/rapids/main.R # See https://www.rapids.science/latest/features/phone-conversation/ @@ -238,7 +234,6 @@ PHONE_DATA_YIELD: COMPUTE: True FEATURES: [ratiovalidyieldedminutes, ratiovalidyieldedhours] MINUTE_RATIO_THRESHOLD_FOR_VALID_YIELDED_HOURS: 0.5 # 0 to 1, minimum percentage of valid minutes in an hour to be considered valid. - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_data_yield/rapids/main.R PHONE_ESM: @@ -246,9 +241,9 @@ PHONE_ESM: PROVIDERS: STRAW: COMPUTE: True - SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support"] + SCALES: ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support", + "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"] FEATURES: [mean] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_esm/straw/main.py # See https://www.rapids.science/latest/features/phone-keyboard/ @@ -267,7 +262,6 @@ PHONE_LIGHT: RAPIDS: COMPUTE: True FEATURES: ["count", "maxlux", "minlux", "avglux", "medianlux", "stdlux"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_light/rapids/main.py # See https://www.rapids.science/latest/features/phone-locations/ @@ -292,7 +286,6 @@ PHONE_LOCATIONS: MINIMUM_DAYS_TO_DETECT_HOME_CHANGES: 3 CLUSTERING_ALGORITHM: DBSCAN # DBSCAN, OPTICS RADIUS_FOR_HOME: 100 - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_locations/doryab/main.py BARNETT: @@ -300,7 +293,6 @@ PHONE_LOCATIONS: FEATURES: ["hometime","disttravelled","rog","maxdiam","maxhomedist","siglocsvisited","avgflightlen","stdflightlen","avgflightdur","stdflightdur","probpause","siglocentropy","circdnrtn","wkenddayrtn"] IF_MULTIPLE_TIMEZONES: USE_MOST_COMMON MINUTES_DATA_USED: False # Use this for quality control purposes, how many minutes of data (location coordinates gruped by minute) were used to compute features - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_locations/barnett/main.R # See https://www.rapids.science/latest/features/phone-log/ @@ -320,7 +312,6 @@ PHONE_MESSAGES: FEATURES: received: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact] sent: [count, distinctcontacts, timefirstmessage, timelastmessage, countmostfrequentcontact] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_messages/rapids/main.R # See https://www.rapids.science/latest/features/phone-screen/ @@ -334,7 +325,6 @@ PHONE_SCREEN: IGNORE_EPISODES_LONGER_THAN: 360 # in minutes, set to 0 to disable FEATURES: ["countepisode", "sumduration", "maxduration", "minduration", 
"avgduration", "stdduration", "firstuseafter"] # "episodepersensedminutes" needs to be added later EPISODE_TYPES: ["unlock"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_screen/rapids/main.py # See https://www.rapids.science/latest/features/phone-wifi-connected/ @@ -353,7 +343,6 @@ PHONE_WIFI_VISIBLE: RAPIDS: COMPUTE: True FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/phone_wifi_visible/rapids/main.R @@ -455,7 +444,6 @@ FITBIT_SLEEP_INTRADAY: UNIFIED: [awake, asleep] SLEEP_TYPES: [main, nap, all] SRC_SCRIPT: src/features/fitbit_sleep_intraday/rapids/main.py - PRICE: COMPUTE: False FEATURES: [avgduration, avgratioduration, avgstarttimeofepisodemain, avgendtimeofepisodemain, avgmidpointofepisodemain, stdstarttimeofepisodemain, stdendtimeofepisodemain, stdmidpointofepisodemain, socialjetlag, rmssdmeanstarttimeofepisodemain, rmssdmeanendtimeofepisodemain, rmssdmeanmidpointofepisodemain, rmssdmedianstarttimeofepisodemain, rmssdmedianendtimeofepisodemain, rmssdmedianmidpointofepisodemain] @@ -528,7 +516,6 @@ EMPATICA_ACCELEROMETER: COMPUTE: True WINDOW_LENGTH: 15 # specify window length in seconds SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows'] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/empatica_accelerometer/cr/main.py @@ -557,7 +544,6 @@ EMPATICA_TEMPERATURE: COMPUTE: True WINDOW_LENGTH: 300 # specify window length in seconds SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows'] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/empatica_temperature/cr/main.py # See https://www.rapids.science/latest/features/empatica-electrodermal-activity/ @@ -579,7 +565,6 @@ EMPATICA_ELECTRODERMAL_ACTIVITY: COMPUTE: True WINDOW_LENGTH: 60 # specify window length in seconds SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', count_windows, eda_num_peaks_non_zero] - STANDARDIZE_FEATURES: True IMPUTE_NANS: True SRC_SCRIPT: src/features/empatica_electrodermal_activity/cr/main.py @@ -599,7 +584,6 @@ EMPATICA_BLOOD_VOLUME_PULSE: COMPUTE: True WINDOW_LENGTH: 300 # specify window length in seconds SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan'] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/empatica_blood_volume_pulse/cr/main.py # See https://www.rapids.science/latest/features/empatica-inter-beat-interval/ @@ -619,7 +603,6 @@ EMPATICA_INTER_BEAT_INTERVAL: COMPUTE: True WINDOW_LENGTH: 300 # specify window length in seconds SECOND_ORDER_FEATURES: ['mean', 'median', 'sd', 'nlargest', 'nsmallest', 'count_windows', 'hrv_num_windows_non_nan'] - STANDARDIZE_FEATURES: True SRC_SCRIPT: src/features/empatica_inter_beat_interval/cr/main.py # See https://www.rapids.science/latest/features/empatica-tags/ @@ -667,10 +650,9 @@ HEATMAP_FEATURE_CORRELATION_MATRIX: ######################################################################################################################## ALL_CLEANING_INDIVIDUAL: - CLEAN_STANDARDIZED: True PROVIDERS: RAPIDS: - COMPUTE: True + COMPUTE: False IMPUTE_SELECTED_EVENT_FEATURES: COMPUTE: False MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33 @@ -684,28 +666,25 @@ ALL_CLEANING_INDIVIDUAL: MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5 CORR_THRESHOLD: 0.95 SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R - STRAW: # currently the same as RAPIDS provider with a change in selecting the imputation type + STRAW: COMPUTE: True - 
IMPUTE_PHONE_SELECTED_EVENT_FEATURES: - COMPUTE: False - TYPE: median # options: zero, mean, median or k-nearest - MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33 - COLS_NAN_THRESHOLD: 1 # set to 1 to disable + PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES + PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable + EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable + ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable + COLS_NAN_THRESHOLD: 0.9 # set to 1 to remove only columns that contains all (100% of) NaN COLS_VAR_THRESHOLD: True - ROWS_NAN_THRESHOLD: 1 # set to 1 to disable - DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES - DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable DROP_HIGHLY_CORRELATED_FEATURES: COMPUTE: True MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5 CORR_THRESHOLD: 0.95 + STANDARDIZATION: True SRC_SCRIPT: src/features/all_cleaning_individual/straw/main.py ALL_CLEANING_OVERALL: - CLEAN_STANDARDIZED: True PROVIDERS: RAPIDS: - COMPUTE: True + COMPUTE: False IMPUTE_SELECTED_EVENT_FEATURES: COMPUTE: False MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33 @@ -719,40 +698,22 @@ ALL_CLEANING_OVERALL: MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5 CORR_THRESHOLD: 0.95 SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R - STRAW: # currently the same as RAPIDS provider with a change in selecting the imputation type + STRAW: COMPUTE: True - IMPUTE_PHONE_SELECTED_EVENT_FEATURES: - COMPUTE: False - TYPE: median # options: zero, mean, median or k-nearest - MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33 - COLS_NAN_THRESHOLD: 1 # set to 1 to disable + PHONE_DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_MINUTES # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES + PHONE_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable + EMPATICA_DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable + ROWS_NAN_THRESHOLD: 0.33 # set to 1 to disable + COLS_NAN_THRESHOLD: 0.8 # set to 1 to remove only columns that contains all (100% of) NaN COLS_VAR_THRESHOLD: True - ROWS_NAN_THRESHOLD: 1 # set to 1 to disable - DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES - DATA_YIELD_RATIO_THRESHOLD: 0 # set to 0 to disable DROP_HIGHLY_CORRELATED_FEATURES: COMPUTE: True MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5 CORR_THRESHOLD: 0.95 + STANDARDIZATION: False SRC_SCRIPT: src/features/all_cleaning_overall/straw/main.py -######################################################################################################################## -# Z-score standardization # -######################################################################################################################## - -STANDARDIZATION: # Standardization for both providers is executed if only one of two providers is marked COMPUTE: TRUE - MERGE_ALL: True # Creates the joint standardized file for each participant and all participants. 
Similar to merge_sensor_features_for_all_participants rule - PROVIDERS: - CR: - COMPUTE: True - SRC_SCRIPT: src/features/standardization/main.py - OTHER: - COMPUTE: True - LIST: [RAPIDS, DORYAB, BARNETT, STRAW] - SRC_SCRIPT: src/features/standardization/main.py - - ######################################################################################################################## # Baseline # ######################################################################################################################## @@ -771,4 +732,8 @@ PARAMS_FOR_ANALYSIS: TARGET: COMPUTE: True - LABEL: PANAS_negative_affect_mean + LABEL: appraisal_stressfulness_event_mean + ALL_LABELS: [PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, + JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean] + # PANAS_positive_affect_mean, PANAS_negative_affect_mean, JCQ_job_demand_mean, JCQ_job_control_mean, JCQ_supervisor_support_mean, + # JCQ_coworker_support_mean, appraisal_stressfulness_period_mean, appraisal_stressfulness_event_mean, appraisal_threat_mean, appraisal_challenge_mean diff --git a/data/external/participant_files/p01.yaml b/data/external/participant_files/p01.yaml index fe394a76..add25858 100644 --- a/data/external/participant_files/p01.yaml +++ b/data/external/participant_files/p01.yaml @@ -1,11 +1,11 @@ PHONE: - DEVICE_IDS: [a748ee1a-1d0b-4ae9-9074-279a2b6ba524] # the participant's AWARE device id - PLATFORMS: [android] # or ios - LABEL: MyTestP01 # any string - START_DATE: 2020-01-01 # this can also be empty - END_DATE: 2021-01-01 # this can also be empty + DEVICE_IDS: [4b62a655-cbf0-4ac0-a448-06726f45b56a] + PLATFORMS: [android] + LABEL: uploader_53573 + START_DATE: 2021-05-21 09:21:24 + END_DATE: 2021-07-12 17:32:07 EMPATICA: - DEVICE_IDS: [empatica1] - LABEL: test01 - START_DATE: - END_DATE: + DEVICE_IDS: [uploader_53573] + LABEL: uploader_53573 + START_DATE: 2021-05-21 09:21:24 + END_DATE: 2021-07-12 17:32:07 diff --git a/data/external/timesegments_daily.csv b/data/external/timesegments_daily.csv index 605a4a53..183245b9 100644 --- a/data/external/timesegments_daily.csv +++ b/data/external/timesegments_daily.csv @@ -1,2 +1,3 @@ label,start_time,length,repeats_on,repeats_value daily,04:00:00,23H 59M 59S,every_day,0 +working_day,04:00:00,18H 00M 00S,every_day,0 diff --git a/environment.yml b/environment.yml index cba49edc..536149bc 100644 --- a/environment.yml +++ b/environment.yml @@ -86,8 +86,6 @@ dependencies: - readline=8.0 - requests=2.25.0 - retrying=1.3.3 - - scikit-learn=0.23.2 - - scipy=1.5.2 - setuptools=51.0.0 - six=1.15.0 - smmap=3.0.4 @@ -107,34 +105,61 @@ dependencies: - zlib=1.2.11 - pip: - amply==0.1.4 + - auto-sklearn==0.14.7 - bidict==0.22.0 - biosppy==0.8.0 + - build==0.8.0 - cached-property==1.5.2 + - cloudpickle==2.2.0 - configargparse==0.15.1 + - configspace==0.4.21 - cr-features==0.2.1 - cycler==0.11.0 + - cython==0.29.32 + - dask==2022.2.0 - decorator==4.4.2 + - distributed==2022.2.0 + - distro==1.7.0 + - emcee==3.1.2 - fonttools==4.33.2 + - fsspec==2022.8.2 - h5py==3.6.0 + - heapdict==1.0.1 - hmmlearn==0.2.7 - ipython-genutils==0.2.0 - jupyter-core==4.6.3 - kiwisolver==1.4.2 + - liac-arff==2.5.0 + - locket==1.0.0 - matplotlib==3.5.1 + - msgpack==1.0.4 - nbformat==5.0.7 - opencv-python==4.5.5.64 - packaging==21.3 + - partd==1.3.0 - peakutils==1.3.3 + - pep517==0.13.0 - pillow==9.1.0 - pulp==2.4 + - 
pynisher==0.6.4 - pyparsing==2.4.7 + - pyrfr==0.8.3 - pyrsistent==0.15.5 - pywavelets==1.3.0 - ratelimiter==1.2.0.post0 + - scikit-learn==0.24.2 + - scipy==1.7.3 - seaborn==0.11.2 - shortuuid==1.0.8 + - smac==1.2 - snakemake==5.30.2 + - sortedcontainers==2.4.0 + - tblib==1.7.0 + - tomli==2.0.1 + - toolz==0.12.0 - toposort==1.5 + - tornado==6.2 - traitlets==4.3.3 - typing-extensions==4.2.0 + - zict==2.2.0 prefix: /opt/conda/envs/rapids diff --git a/rules/common.smk b/rules/common.smk index a470012f..80b18485 100644 --- a/rules/common.smk +++ b/rules/common.smk @@ -40,15 +40,6 @@ def find_features_files(wildcards): feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower())) return(feature_files) -def find_empaticas_standardized_features_files(wildcards): - feature_files = [] - if "empatica" in wildcards.sensor_key: - for provider_key, provider in config[(wildcards.sensor_key).upper()]["PROVIDERS"].items(): - if provider["COMPUTE"] and provider.get("WINDOWS", False) and provider["WINDOWS"]["COMPUTE"]: - if "empatica" in wildcards.sensor_key: - feature_files.extend(expand("data/interim/{{pid}}/{sensor_key}_features/z_{sensor_key}_{language}_{provider_key}.csv", sensor_key=wildcards.sensor_key.lower(), language=get_script_language(provider["SRC_SCRIPT"]), provider_key=provider_key.lower())) - return(feature_files) - def find_joint_non_empatica_sensor_files(wildcards): joined_files = [] for config_key in config.keys(): @@ -82,18 +73,6 @@ def input_merge_sensor_features_for_individual_participants(wildcards): break return feature_files -def input_merge_standardized_sensor_features_for_individual_participants(wildcards): - feature_files = [] - for config_key in config.keys(): - if config_key.startswith(("PHONE", "FITBIT", "EMPATICA")) and "PROVIDERS" in config[config_key] and isinstance(config[config_key]["PROVIDERS"], dict): - for provider_key, provider in config[config_key]["PROVIDERS"].items(): - if "COMPUTE" in provider.keys() and provider["COMPUTE"] and ("STANDARDIZE_FEATURES" in provider.keys() and provider["STANDARDIZE_FEATURES"] or - "WINDOWS" in provider.keys() and "STANDARDIZE_FEATURES" in provider["WINDOWS"].keys() and provider["WINDOWS"]["STANDARDIZE_FEATURES"]): - feature_files.append("data/processed/features/{pid}/z_" + config_key.lower() + ".csv") - break - - return feature_files - def get_phone_sensor_names(): phone_sensor_names = [] for config_key in config.keys(): diff --git a/rules/features.smk b/rules/features.smk index 5331d827..2638a8f3 100644 --- a/rules/features.smk +++ b/rules/features.smk @@ -796,20 +796,6 @@ rule empatica_accelerometer_python_features: script: "../src/features/entry.py" -rule empatica_accelerometer_python_features_standardization: - input: - windows_features_data = "data/interim/{pid}/empatica_accelerometer_features/empatica_accelerometer_python_{provider_key}_windows.csv" - params: - provider = config["STANDARDIZATION"]["PROVIDERS"]["CR"], - provider_key = "{provider_key}", - sensor_key = "empatica_accelerometer", - provider_main = config["EMPATICA_ACCELEROMETER"]["PROVIDERS"]["CR"] - output: - "data/interim/{pid}/empatica_accelerometer_features/z_empatica_accelerometer_python_{provider_key}.csv", - "data/interim/{pid}/empatica_accelerometer_features/z_empatica_accelerometer_python_{provider_key}_windows.csv" - script: - "../src/features/standardization/main.py" - rule 
empatica_accelerometer_r_features: input: sensor_data = "data/raw/{pid}/empatica_accelerometer_with_datetime.csv", @@ -864,20 +850,6 @@ rule empatica_temperature_python_features: script: "../src/features/entry.py" -rule empatica_temperature_python_features_standardization: - input: - windows_features_data = "data/interim/{pid}/empatica_temperature_features/empatica_temperature_python_{provider_key}_windows.csv" - params: - provider = config["STANDARDIZATION"]["PROVIDERS"]["CR"], - provider_key = "{provider_key}", - sensor_key = "empatica_temperature", - provider_main = config["EMPATICA_TEMPERATURE"]["PROVIDERS"]["CR"] - output: - "data/interim/{pid}/empatica_temperature_features/z_empatica_temperature_python_{provider_key}.csv", - "data/interim/{pid}/empatica_temperature_features/z_empatica_temperature_python_{provider_key}_windows.csv" - script: - "../src/features/standardization/main.py" - rule empatica_temperature_r_features: input: sensor_data = "data/raw/{pid}/empatica_temperature_with_datetime.csv", @@ -905,20 +877,6 @@ rule empatica_electrodermal_activity_python_features: script: "../src/features/entry.py" -rule empatica_electrodermal_activity_python_features_standardization: - input: - windows_features_data = "data/interim/{pid}/empatica_electrodermal_activity_features/empatica_electrodermal_activity_python_{provider_key}_windows.csv" - params: - provider = config["STANDARDIZATION"]["PROVIDERS"]["CR"], - provider_key = "{provider_key}", - sensor_key = "empatica_electrodermal_activity", - provider_main = config["EMPATICA_ELECTRODERMAL_ACTIVITY"]["PROVIDERS"]["CR"] - output: - "data/interim/{pid}/empatica_electrodermal_activity_features/z_empatica_electrodermal_activity_python_{provider_key}.csv", - "data/interim/{pid}/empatica_electrodermal_activity_features/z_empatica_electrodermal_activity_python_{provider_key}_windows.csv" - script: - "../src/features/standardization/main.py" - rule empatica_electrodermal_activity_r_features: input: sensor_data = "data/raw/{pid}/empatica_electrodermal_activity_with_datetime.csv", @@ -946,20 +904,6 @@ rule empatica_blood_volume_pulse_python_features: script: "../src/features/entry.py" -rule empatica_blood_volume_pulse_python_cr_features_standardization: - input: - windows_features_data = "data/interim/{pid}/empatica_blood_volume_pulse_features/empatica_blood_volume_pulse_python_{provider_key}_windows.csv" - params: - provider = config["STANDARDIZATION"]["PROVIDERS"]["CR"], - provider_key = "{provider_key}", - sensor_key = "empatica_blood_volume_pulse", - provider_main = config["EMPATICA_BLOOD_VOLUME_PULSE"]["PROVIDERS"]["CR"] - output: - "data/interim/{pid}/empatica_blood_volume_pulse_features/z_empatica_blood_volume_pulse_python_{provider_key}.csv", - "data/interim/{pid}/empatica_blood_volume_pulse_features/z_empatica_blood_volume_pulse_python_{provider_key}_windows.csv" - script: - "../src/features/standardization/main.py" - rule empatica_blood_volume_pulse_r_features: input: sensor_data = "data/raw/{pid}/empatica_blood_volume_pulse_with_datetime.csv", @@ -987,20 +931,6 @@ rule empatica_inter_beat_interval_python_features: script: "../src/features/entry.py" -rule empatica_inter_beat_interval_python_features_standardization: - input: - windows_features_data = "data/interim/{pid}/empatica_inter_beat_interval_features/empatica_inter_beat_interval_python_{provider_key}_windows.csv" - params: - provider = config["STANDARDIZATION"]["PROVIDERS"]["CR"], - provider_key = "{provider_key}", - sensor_key = "empatica_inter_beat_interval", - provider_main = 
config["EMPATICA_INTER_BEAT_INTERVAL"]["PROVIDERS"]["CR"] - output: - "data/interim/{pid}/empatica_inter_beat_interval_features/z_empatica_inter_beat_interval_python_{provider_key}.csv", - "data/interim/{pid}/empatica_inter_beat_interval_features/z_empatica_inter_beat_interval_python_{provider_key}_windows.csv" - script: - "../src/features/standardization/main.py" - rule empatica_inter_beat_interval_r_features: input: sensor_data = "data/raw/{pid}/empatica_inter_beat_interval_with_datetime.csv", @@ -1048,38 +978,6 @@ rule merge_sensor_features_for_individual_participants: script: "../src/features/utils/merge_sensor_features_for_individual_participants.R" -rule join_standardized_features_from_empatica: - input: - sensor_features = find_empaticas_standardized_features_files - wildcard_constraints: - sensor_key = '(empatica).*' - output: - "data/processed/features/{pid}/z_{sensor_key}.csv" - script: - "../src/features/utils/join_features_from_providers.R" - -rule standardize_features_from_providers_no_empatica: - input: - sensor_features = find_joint_non_empatica_sensor_files - wildcard_constraints: - sensor_key = '(phone|fitbit).*' - params: - provider = config["STANDARDIZATION"]["PROVIDERS"]["OTHER"], - provider_key = "OTHER", - sensor_key = "{sensor_key}" - output: - "data/processed/features/{pid}/z_{sensor_key}.csv" - script: - "../src/features/standardization/main.py" - -rule merge_standardized_sensor_features_for_individual_participants: - input: - feature_files = input_merge_standardized_sensor_features_for_individual_participants - output: - "data/processed/features/{pid}/z_all_sensor_features.csv" - script: - "../src/features/utils/merge_sensor_features_for_individual_participants.R" - rule merge_sensor_features_for_all_participants: input: feature_files = expand("data/processed/features/{pid}/all_sensor_features.csv", pid=config["PIDS"]) @@ -1088,14 +986,6 @@ rule merge_sensor_features_for_all_participants: script: "../src/features/utils/merge_sensor_features_for_all_participants.R" -rule merge_standardized_sensor_features_for_all_participants: - input: - feature_files = expand("data/processed/features/{pid}/z_all_sensor_features.csv", pid=config["PIDS"]) - output: - "data/processed/features/all_participants/z_all_sensor_features.csv" - script: - "../src/features/utils/merge_standardized_sensor_features_for_all_participants.R" - rule clean_sensor_features_for_individual_participants: input: sensor_data = rules.merge_sensor_features_for_individual_participants.output @@ -1107,7 +997,7 @@ rule clean_sensor_features_for_individual_participants: script_extension = "{script_extension}", sensor_key = "all_cleaning_individual" output: - "data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv" # bo predstavljalo probleme za naprej (kako iskati datoteke + standardizacija itd.) 
+ "data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv" script: "../src/features/entry.{params.script_extension}" @@ -1118,37 +1008,9 @@ rule clean_sensor_features_for_all_participants: provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()], provider_key = "{provider_key}", script_extension = "{script_extension}", - sensor_key = "all_cleaning_overall" + sensor_key = "all_cleaning_overall", + target = "{target}" output: - "data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}.csv" + "data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}_{script_extension}_({target}).csv" script: "../src/features/entry.{params.script_extension}" - -rule clean_standardized_sensor_features_for_individual_participants: - input: - sensor_data = rules.merge_standardized_sensor_features_for_individual_participants.output - wildcard_constraints: - pid = "("+"|".join(config["PIDS"])+")" - params: - provider = lambda wildcards: config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][wildcards.provider_key.upper()], - provider_key = "{provider_key}", - script_extension = "{script_extension}", - sensor_key = "all_cleaning_individual" - output: - "data/processed/features/{pid}/z_all_sensor_features_cleaned_{provider_key}_{script_extension}.csv" - script: - "../src/features/entry.{params.script_extension}" - -rule clean_standardized_sensor_features_for_all_participants: - input: - sensor_data = rules.merge_standardized_sensor_features_for_all_participants.output - params: - provider = lambda wildcards: config["ALL_CLEANING_OVERALL"]["PROVIDERS"][wildcards.provider_key.upper()], - provider_key = "{provider_key}", - script_extension = "{script_extension}", - sensor_key = "all_cleaning_overall" - output: - "data/processed/features/all_participants/z_all_sensor_features_cleaned_{provider_key}_{script_extension}.csv" - script: - "../src/features/entry.{params.script_extension}" - diff --git a/rules/models.smk b/rules/models.smk index 3aade69a..2875297b 100644 --- a/rules/models.smk +++ b/rules/models.smk @@ -30,43 +30,23 @@ rule baseline_features: rule select_target: input: - cleaned_sensor_features = "data/processed/features/{pid}/z_all_sensor_features_cleaned_straw_py.csv" + cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv" params: target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"] output: - "data/processed/models/individual_model/{pid}/z_input.csv" + "data/processed/models/individual_model/{pid}/input.csv" script: "../src/models/select_targets.py" rule merge_features_and_targets_for_population_model: input: - cleaned_sensor_features = "data/processed/features/all_participants/z_all_sensor_features_cleaned_straw_py.csv", + cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py_({target}).csv", demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]), params: - target_variable=config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"] + target_variable="{target}" output: - "data/processed/models/population_model/z_input.csv" + "data/processed/models/population_model/input_{target}.csv" script: "../src/models/merge_features_and_targets_for_population_model.py" -# rule select_target: -# input: -# cleaned_sensor_features = "data/processed/features/{pid}/all_sensor_features_cleaned_straw_py.csv" -# params: -# 
target_variable = config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"] -# output: -# "data/processed/models/individual_model/{pid}/input.csv" -# script: -# "../src/models/select_targets.py" - -# rule merge_features_and_targets_for_population_model: -# input: -# cleaned_sensor_features = "data/processed/features/all_participants/all_sensor_features_cleaned_straw_py.csv", -# demographic_features = expand("data/processed/features/{pid}/baseline_features.csv", pid=config["PIDS"]), -# params: -# target_variable=config["PARAMS_FOR_ANALYSIS"]["TARGET"]["LABEL"] -# output: -# "data/processed/models/population_model/input.csv" -# script: -# "../src/models/merge_features_and_targets_for_population_model.py" diff --git a/rules/preprocessing.smk b/rules/preprocessing.smk index fb583459..bf641c68 100644 --- a/rules/preprocessing.smk +++ b/rules/preprocessing.smk @@ -249,3 +249,29 @@ rule empatica_readable_datetime: "data/raw/{pid}/empatica_{sensor}_with_datetime.csv" script: "../src/data/datetime/readable_datetime.R" + + +rule extract_event_information_from_esm: + input: + esm_raw_input = "data/raw/{pid}/phone_esm_raw.csv", + pid_file = "data/external/participant_files/{pid}.yaml" + params: + stage = "extract", + pid = "{pid}" + output: + "data/raw/ers/{pid}_ers.csv", + "data/raw/ers/{pid}_stress_event_targets.csv" + script: + "../src/features/phone_esm/straw/process_user_event_related_segments.py" + +rule merge_event_related_segments_files: + input: + ers_files = expand("data/raw/ers/{pid}_ers.csv", pid=config["PIDS"]), + se_files = expand("data/raw/ers/{pid}_stress_event_targets.csv", pid=config["PIDS"]) + params: + stage = "merge" + output: + "data/external/straw_events.csv", + "data/external/stress_event_targets.csv" + script: + "../src/features/phone_esm/straw/process_user_event_related_segments.py" \ No newline at end of file diff --git a/src/data/datetime/assign_to_time_segment.R b/src/data/datetime/assign_to_time_segment.R index 4375cb73..b7b9d8da 100644 --- a/src/data/datetime/assign_to_time_segment.R +++ b/src/data/datetime/assign_to_time_segment.R @@ -5,13 +5,16 @@ options(scipen=999) assign_rows_to_segments <- function(data, segments){ # This function is used by all segment types, we use data.tables because they are fast + data <- data.table::as.data.table(data) data[, assigned_segments := ""] for(i in seq_len(nrow(segments))) { segment <- segments[i,] + data[segment$segment_start_ts<= timestamp & segment$segment_end_ts >= timestamp, assigned_segments := stringi::stri_c(assigned_segments, segment$segment_id, sep = "|")] } + data[,assigned_segments:=substring(assigned_segments, 2)] data } diff --git a/src/features/all_cleaning_individual/straw/__init__.py b/src/features/all_cleaning_individual/straw/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/src/features/all_cleaning_individual/straw/main.py b/src/features/all_cleaning_individual/straw/main.py index 6ba3ba5e..5578aac7 100644 --- a/src/features/all_cleaning_individual/straw/main.py +++ b/src/features/all_cleaning_individual/straw/main.py @@ -1,88 +1,174 @@ import pandas as pd import numpy as np -import math, sys +import math, sys, random +import yaml + +from sklearn.impute import KNNImputer +from sklearn.preprocessing import StandardScaler +import matplotlib.pyplot as plt +import seaborn as sns + +sys.path.append('/rapids/') +from src.features import empatica_data_yield as edy + +pd.set_option('display.max_columns', 20) def straw_cleaning(sensor_data_files, provider): features = 
pd.read_csv(sensor_data_files["sensor_data"][0]) - - # TODO: reorder the cleaning steps so it makes sense for the analysis - # TODO: add conditions that differentiates cleaning steps for standardized and nonstandardized features, for this - # the snakemake rules will also have to come with additional parameter (in rules/features.smk) - - # Impute selected features event - impute_phone_features = provider["IMPUTE_PHONE_SELECTED_EVENT_FEATURES"] - if impute_phone_features["COMPUTE"]: - if not 'phone_data_yield_rapids_ratiovalidyieldedminutes' in features.columns: - raise KeyError("RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].") - - # TODO: if the type of the imputation will vary for different groups of features make conditional imputations here - phone_cols = [col for col in features if \ - col.startswith('phone_applications_foreground_rapids_') or - col.startswith('phone_battery_rapids_') or - col.startswith('phone_calls_rapids_') or - col.startswith('phone_keyboard_rapids_') or - col.startswith('phone_messages_rapids_') or - col.startswith('phone_screen_rapids_') or - col.startswith('phone_wifi_')] - - mask = features['phone_data_yield_rapids_ratiovalidyieldedminutes'] > impute_phone_features['MIN_DATA_YIELDED_MINUTES_TO_IMPUTE'] - features.loc[mask, phone_cols] = impute(features[mask][phone_cols], method=impute_phone_features["TYPE"].lower()) - - # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold - data_yield_unit = provider["DATA_YIELD_FEATURE"].split("_")[3].lower() - data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + data_yield_unit - - if not data_yield_column in features.columns: - raise KeyError(f"RAPIDS provider needs to impute the selected event features based on {data_yield_column} column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{data_yield_unit}' in [FEATURES].") - - if provider["DATA_YIELD_RATIO_THRESHOLD"]: - features = features[features[data_yield_column] >= provider["DATA_YIELD_RATIO_THRESHOLD"]] - - esm_cols = features.loc[:, features.columns.str.startswith('phone_esm')] # For later preservation of esm_cols - - # Remove cols if threshold of NaN values is passed - features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]] - # Remove cols where variance is 0 - if provider["COLS_VAR_THRESHOLD"]: - features.drop(features.std()[features.std() == 0].index.values, axis=1, inplace=True) + esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns + + with open('config.yaml', 'r') as stream: + config = yaml.load(stream, Loader=yaml.FullLoader) + + excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime'] + + # (1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE + if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']: + target = config['PARAMS_FOR_ANALYSIS']['TARGET']['LABEL'] # get target label from config + if 'phone_esm_straw_' + target in features: + features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True) + else: + return features + + # (2.1) QUALITY CHECK (DATA YIELD COLUMN) deletes the rows where E4 or phone data is low quality + phone_data_yield_unit = 
provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower() + phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit + + features = edy.calculate_empatica_data_yield(features) + + if not phone_data_yield_column in features.columns and not "empatica_data_yield" in features.columns: + raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{data_yield_unit}' in [FEATURES].") + + # Drop rows where phone data yield is less then given threshold + if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]: + features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True) + # Drop rows where empatica data yield is less then given threshold + if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]: + features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True) + + if features.empty: + return features + + # (2.2) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES? + min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row + features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans + + # (3) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows) + esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns + + features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]] + # Preserve esm cols if deleted (has to come after drop cols operations) for esm in esm_cols: if esm not in features: features[esm] = esm_cols[esm] - - # Drop highly correlated features - To-Do še en thershold var, ki je v config + kako se tretirajo NaNs? 
+ + # (4) CONTEXTUAL IMPUTATION + + # Impute selected phone features with a high number + impute_w_hn = [col for col in features.columns if \ + "timeoffirstuse" in col or + "timeoflastuse" in col or + "timefirstcall" in col or + "timelastcall" in col or + "firstuseafter" in col or + "timefirstmessages" in col or + "timelastmessages" in col] + features[impute_w_hn] = features[impute_w_hn].fillna(1500) + + + # Impute special case (mostcommonactivity) and (homelabel) + impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col] + features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value + + impute_w_sn2 = [col for col in features.columns if "homelabel" in col] + features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value + + impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col] + features[impute_w_sn2] = features[impute_w_sn2].fillna(-1000000) # Special case of imputation - nominal/ordinal value + + + # Impute selected phone features with 0 + impute_zero = [col for col in features if \ + col.startswith('phone_applications_foreground_rapids_') or + col.startswith('phone_battery_rapids_') or + col.startswith('phone_bluetooth_rapids_') or + col.startswith('phone_light_rapids_') or + col.startswith('phone_calls_rapids_') or + col.startswith('phone_messages_rapids_') or + col.startswith('phone_screen_rapids_') or + col.startswith('phone_wifi_visible')] + + features[impute_zero+list(esm_cols.columns)] = features[impute_zero+list(esm_cols.columns)].fillna(0) + + ## (5) STANDARDIZATION + if provider["STANDARDIZATION"]: + features.loc[:, ~features.columns.isin(excluded_columns)] = StandardScaler().fit_transform(features.loc[:, ~features.columns.isin(excluded_columns)]) + + # (6) IMPUTATION: IMPUTE DATA WITH KNN METHOD + impute_cols = [col for col in features.columns if col not in excluded_columns] + features.reset_index(drop=True, inplace=True) + features[impute_cols] = impute(features[impute_cols], method="knn") + + # (7) REMOVE COLS WHERE VARIANCE IS 0 + esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] + + if provider["COLS_VAR_THRESHOLD"]: + features.drop(features.std()[features.std() == 0].index.values, axis=1, inplace=True) + + fe5 = features.copy() + + # (8) DROP HIGHLY CORRELATED FEATURES drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"] - if drop_corr_features["COMPUTE"]: + if drop_corr_features["COMPUTE"] and features.shape[0]: # If small amount of segments (rows) is present, do not execute correlation check numerical_cols = features.select_dtypes(include=np.number).columns.tolist() # Remove columns where NaN count threshold is passed valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]] - cor_matrix = valid_features.corr(method='spearman').abs() - upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(np.bool)) - to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > drop_corr_features["CORR_THRESHOLD"])] + corr_matrix = valid_features.corr().abs() + upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool)) + to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])] features.drop(to_drop, axis=1, inplace=True) - # Remove rows if threshold of NaN values is passed 
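# A small, untested illustration of the upper-triangle correlation filter used just above, on toy
# data (relies on the module-level pandas/numpy imports; 0.95 mirrors CORR_THRESHOLD in the config).
# Because only the upper triangle is scanned, the earlier column of each highly correlated pair is
# kept and the later one is dropped.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})  # "b" is a multiple of "a"
toy_corr = toy.corr().abs()
toy_upper = toy_corr.where(np.triu(np.ones(toy_corr.shape), k=1).astype(bool))
toy_to_drop = [c for c in toy_upper.columns if (toy_upper[c] > 0.95).any()]  # evaluates to ["b"]
toy = toy.drop(columns=toy_to_drop)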
- min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row - features.dropna(axis=0, thresh=min_count, inplace=True) + # Preserve esm cols if deleted (has to come after drop cols operations) + for esm in esm_cols: + if esm not in features: + features[esm] = esm_cols[esm] + + # (9) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME + if features.isna().any().any(): + raise ValueError("There are still some NaNs present in the dataframe. Please check for implementation errors.") return features def impute(df, method='zero'): - def k_nearest(df): # TODO: if needed, implement k-nearest imputation / interpolation - pass + def k_nearest(df): + pd.set_option('display.max_columns', None) + imputer = KNNImputer(n_neighbors=3) + return pd.DataFrame(imputer.fit_transform(df), columns=df.columns) - return { # rest of the columns should be imputed with the selected method + return { 'zero': df.fillna(0), + 'high_number': df.fillna(1500), 'mean': df.fillna(df.mean()), 'median': df.fillna(df.median()), - 'k-nearest': k_nearest(df) + 'knn': k_nearest(df) }[method] - + +def graph_bf_af(features, phase_name, plt_flag=False): + if plt_flag: + sns.set(rc={"figure.figsize":(16, 8)}) + sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number) + plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight') + + print(f"\n-------------{phase_name}-------------") + print("Rows number:", features.shape[0]) + print("Columns number:", len(features.columns)) + print("---------------------------------------------\n") diff --git a/src/features/all_cleaning_overall/straw/__init__.py b/src/features/all_cleaning_overall/straw/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/src/features/all_cleaning_overall/straw/main.py b/src/features/all_cleaning_overall/straw/main.py index 6ba3ba5e..1362358d 100644 --- a/src/features/all_cleaning_overall/straw/main.py +++ b/src/features/all_cleaning_overall/straw/main.py @@ -1,88 +1,261 @@ import pandas as pd import numpy as np -import math, sys +import math, sys, random, warnings, yaml + +from sklearn.impute import KNNImputer +from sklearn.preprocessing import StandardScaler, minmax_scale +import matplotlib.pyplot as plt +import seaborn as sns + +sys.path.append('/rapids/') +from src.features import empatica_data_yield as edy + +def straw_cleaning(sensor_data_files, provider, target): -def straw_cleaning(sensor_data_files, provider): - features = pd.read_csv(sensor_data_files["sensor_data"][0]) - # TODO: reorder the cleaning steps so it makes sense for the analysis - # TODO: add conditions that differentiates cleaning steps for standardized and nonstandardized features, for this - # the snakemake rules will also have to come with additional parameter (in rules/features.smk) + with open('config.yaml', 'r') as stream: + config = yaml.load(stream, Loader=yaml.FullLoader) - # Impute selected features event - impute_phone_features = provider["IMPUTE_PHONE_SELECTED_EVENT_FEATURES"] - if impute_phone_features["COMPUTE"]: - if not 'phone_data_yield_rapids_ratiovalidyieldedminutes' in features.columns: - raise KeyError("RAPIDS provider needs to impute the selected event features based on phone_data_yield_rapids_ratiovalidyieldedminutes column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedminutes' in [FEATURES].") - - # TODO: if the type of the imputation will vary for different groups of features make conditional imputations 
here - phone_cols = [col for col in features if \ - col.startswith('phone_applications_foreground_rapids_') or - col.startswith('phone_battery_rapids_') or - col.startswith('phone_calls_rapids_') or - col.startswith('phone_keyboard_rapids_') or - col.startswith('phone_messages_rapids_') or - col.startswith('phone_screen_rapids_') or - col.startswith('phone_wifi_')] + esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns - mask = features['phone_data_yield_rapids_ratiovalidyieldedminutes'] > impute_phone_features['MIN_DATA_YIELDED_MINUTES_TO_IMPUTE'] - features.loc[mask, phone_cols] = impute(features[mask][phone_cols], method=impute_phone_features["TYPE"].lower()) + excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime'] - # Drop rows with the value of data_yield_column less than data_yield_ratio_threshold - data_yield_unit = provider["DATA_YIELD_FEATURE"].split("_")[3].lower() - data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + data_yield_unit + graph_bf_af(features, "1target_rows_before") - if not data_yield_column in features.columns: - raise KeyError(f"RAPIDS provider needs to impute the selected event features based on {data_yield_column} column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{data_yield_unit}' in [FEATURES].") - - if provider["DATA_YIELD_RATIO_THRESHOLD"]: - features = features[features[data_yield_column] >= provider["DATA_YIELD_RATIO_THRESHOLD"]] - - esm_cols = features.loc[:, features.columns.str.startswith('phone_esm')] # For later preservation of esm_cols - - # Remove cols if threshold of NaN values is passed - features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]] + # (1.0) OVERRIDE STRESSFULNESS EVENT TARGETS IF ERS SEGMENTING_METHOD IS "STRESS_EVENT" + if config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] == "stress_event": - # Remove cols where variance is 0 + stress_events_targets = pd.read_csv("data/external/stress_event_targets.csv") + + if "appraisal_stressfulness_event_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']: + features.drop(columns=['phone_esm_straw_appraisal_stressfulness_event_mean'], inplace=True) + features = features.merge(stress_events_targets[["label", "appraisal_stressfulness_event"]] \ + .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \ + .rename(columns={'appraisal_stressfulness_event': 'phone_esm_straw_appraisal_stressfulness_event_mean'}) + + if "appraisal_threat_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']: + features.drop(columns=['phone_esm_straw_appraisal_threat_mean'], inplace=True) + features = features.merge(stress_events_targets[["label", "appraisal_threat"]] \ + .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \ + .rename(columns={'appraisal_threat': 'phone_esm_straw_appraisal_threat_mean'}) + + if "appraisal_challenge_mean" in config['PARAMS_FOR_ANALYSIS']['TARGET']['ALL_LABELS']: + features.drop(columns=['phone_esm_straw_appraisal_challenge_mean'], inplace=True) + features = features.merge(stress_events_targets[["label", "appraisal_challenge"]] \ + .rename(columns={'label': 'local_segment_label'}), on=['local_segment_label'], how='inner') \ + .rename(columns={'appraisal_challenge': 'phone_esm_straw_appraisal_challenge_mean'}) + + esm_cols = features.loc[:, 
features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns + + # (1.1) FILTER_OUT THE ROWS THAT DO NOT HAVE THE TARGET COLUMN AVAILABLE + if config['PARAMS_FOR_ANALYSIS']['TARGET']['COMPUTE']: + features = features[features['phone_esm_straw_' + target].notna()].reset_index(drop=True) + + if features.empty: + return pd.DataFrame(columns=excluded_columns) + + graph_bf_af(features, "2target_rows_after") + + # (2) QUALITY CHECK (DATA YIELD COLUMN) drops the rows where E4 or phone data is low quality + phone_data_yield_unit = provider["PHONE_DATA_YIELD_FEATURE"].split("_")[3].lower() + phone_data_yield_column = "phone_data_yield_rapids_ratiovalidyielded" + phone_data_yield_unit + + features = edy.calculate_empatica_data_yield(features) + + if not phone_data_yield_column in features.columns and not "empatica_data_yield" in features.columns: + raise KeyError(f"RAPIDS provider needs to clean the selected event features based on {phone_data_yield_column} and empatica_data_yield columns. For phone data yield, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded{data_yield_unit}' in [FEATURES].") + + hist = features[["empatica_data_yield", phone_data_yield_column]].hist() + plt.savefig(f'phone_E4_histogram.png', bbox_inches='tight') + + # Drop rows where phone data yield is less then given threshold + if provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]: + hist = features[phone_data_yield_column].hist(bins=5) + plt.close() + features = features[features[phone_data_yield_column] >= provider["PHONE_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True) + + # Drop rows where empatica data yield is less then given threshold + if provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]: + features = features[features["empatica_data_yield"] >= provider["EMPATICA_DATA_YIELD_RATIO_THRESHOLD"]].reset_index(drop=True) + + if features.empty: + return pd.DataFrame(columns=excluded_columns) + + graph_bf_af(features, "3data_yield_drop_rows") + + if features.empty: + return pd.DataFrame(columns=excluded_columns) + + # (3) CONTEXTUAL IMPUTATION + + # Impute selected phone features with a high number + impute_w_hn = [col for col in features.columns if \ + "timeoffirstuse" in col or + "timeoflastuse" in col or + "timefirstcall" in col or + "timelastcall" in col or + "firstuseafter" in col or + "timefirstmessages" in col or + "timelastmessages" in col] + features[impute_w_hn] = features[impute_w_hn].fillna(1500) + + # Impute special case (mostcommonactivity) and (homelabel) + impute_w_sn = [col for col in features.columns if "mostcommonactivity" in col] + features[impute_w_sn] = features[impute_w_sn].fillna(4) # Special case of imputation - nominal/ordinal value + + impute_w_sn2 = [col for col in features.columns if "homelabel" in col] + features[impute_w_sn2] = features[impute_w_sn2].fillna(1) # Special case of imputation - nominal/ordinal value + + impute_w_sn3 = [col for col in features.columns if "loglocationvariance" in col] + features[impute_w_sn3] = features[impute_w_sn3].fillna(-1000000) # Special case of imputation - loglocation + + # Impute location features + impute_locations = [col for col in features \ + if col.startswith('phone_locations_doryab_') and + 'radiusgyration' not in col + ] + + # Impute selected phone, location, and esm features with 0 + impute_zero = [col for col in features if \ + col.startswith('phone_applications_foreground_rapids_') or + col.startswith('phone_activity_recognition_') or + col.startswith('phone_battery_rapids_') or + 
col.startswith('phone_bluetooth_rapids_') or + col.startswith('phone_light_rapids_') or + col.startswith('phone_calls_rapids_') or + col.startswith('phone_messages_rapids_') or + col.startswith('phone_screen_rapids_') or + col.startswith('phone_bluetooth_doryab_') or + col.startswith('phone_wifi_visible') + ] + + features[impute_zero+impute_locations+list(esm_cols.columns)] = features[impute_zero+impute_locations+list(esm_cols.columns)].fillna(0) + + pd.set_option('display.max_rows', None) + + graph_bf_af(features, "4context_imp") + + # (4) REMOVE COLS IF THEIR NAN THRESHOLD IS PASSED (should be <= if even all NaN columns must be preserved - this solution now drops columns with all NaN rows) + esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] # Get target (esm) columns + + features = features.loc[:, features.isna().sum() < provider["COLS_NAN_THRESHOLD"] * features.shape[0]] + + graph_bf_af(features, "5too_much_nans_cols") + # (5) REMOVE COLS WHERE VARIANCE IS 0 + if provider["COLS_VAR_THRESHOLD"]: features.drop(features.std()[features.std() == 0].index.values, axis=1, inplace=True) - + + graph_bf_af(features, "6variance_drop") + # Preserve esm cols if deleted (has to come after drop cols operations) for esm in esm_cols: if esm not in features: features[esm] = esm_cols[esm] - # Drop highly correlated features - To-Do še en thershold var, ki je v config + kako se tretirajo NaNs? + # (6) DO THE ROWS CONSIST OF ENOUGH NON-NAN VALUES? + min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row + features.dropna(axis=0, thresh=min_count, inplace=True) # Thresh => at least this many not-nans + + graph_bf_af(features, "7too_much_nans_rows") + + if features.empty: + return pd.DataFrame(columns=excluded_columns) + + # (7) STANDARDIZATION + if provider["STANDARDIZATION"]: + nominal_cols = [col for col in features.columns if "mostcommonactivity" in col or "homelabel" in col] # Excluded nominal features + # Expected warning within this code block + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=RuntimeWarning) + features.loc[:, ~features.columns.isin(excluded_columns + ["pid"] + nominal_cols)] = \ + features.loc[:, ~features.columns.isin(excluded_columns + nominal_cols)].groupby('pid').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel()) + + graph_bf_af(features, "8standardization") + + # (8) IMPUTATION: IMPUTE DATA WITH KNN METHOD + features.reset_index(drop=True, inplace=True) + impute_cols = [col for col in features.columns if col not in excluded_columns and col != "pid"] + + features[impute_cols] = impute(features[impute_cols], method="knn") + + graph_bf_af(features, "9knn_after") + + + # (9) DROP HIGHLY CORRELATED FEATURES + esm_cols = features.loc[:, features.columns.str.startswith('phone_esm_straw')] + drop_corr_features = provider["DROP_HIGHLY_CORRELATED_FEATURES"] - if drop_corr_features["COMPUTE"]: + if drop_corr_features["COMPUTE"] and features.shape[0] > 5: # If small amount of segments (rows) is present, do not execute correlation check numerical_cols = features.select_dtypes(include=np.number).columns.tolist() # Remove columns where NaN count threshold is passed valid_features = features[numerical_cols].loc[:, features[numerical_cols].isna().sum() < drop_corr_features['MIN_OVERLAP_FOR_CORR_THRESHOLD'] * features[numerical_cols].shape[0]] - cor_matrix = valid_features.corr(method='spearman').abs() - upper_tri = 
cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(np.bool)) - to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > drop_corr_features["CORR_THRESHOLD"])] + corr_matrix = valid_features.corr().abs() + upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool)) + to_drop = [column for column in upper.columns if any(upper[column] > drop_corr_features["CORR_THRESHOLD"])] + + # sns.heatmap(corr_matrix, cmap="YlGnBu") + # plt.savefig(f'correlation_matrix.png', bbox_inches='tight') + # plt.close() + + # s = corr_matrix.unstack() + # so = s.sort_values(ascending=False) + + # pd.set_option('display.max_rows', None) + # sorted_upper = upper.unstack().sort_values(ascending=False) + # print(sorted_upper[sorted_upper > drop_corr_features["CORR_THRESHOLD"]]) features.drop(to_drop, axis=1, inplace=True) - # Remove rows if threshold of NaN values is passed - min_count = math.ceil((1 - provider["ROWS_NAN_THRESHOLD"]) * features.shape[1]) # minimal not nan values in row - features.dropna(axis=0, thresh=min_count, inplace=True) + # Preserve esm cols if deleted (has to come after drop cols operations) + for esm in esm_cols: + if esm not in features: + features[esm] = esm_cols[esm] + + graph_bf_af(features, "10correlation_drop") + + # Transform categorical columns to category dtype + + cat1 = [col for col in features.columns if "mostcommonactivity" in col] + if cat1: # Transform columns to category dtype (mostcommonactivity) + features[cat1] = features[cat1].astype(int).astype('category') + + cat2 = [col for col in features.columns if "homelabel" in col] + if cat2: # Transform columns to category dtype (homelabel) + features[cat2] = features[cat2].astype(int).astype('category') + + # (10) VERIFY IF THERE ARE ANY NANS LEFT IN THE DATAFRAME + if features.isna().any().any(): + raise ValueError("There are still some NaNs present in the dataframe. 
Please check for implementation errors.") return features def impute(df, method='zero'): - def k_nearest(df): # TODO: if needed, implement k-nearest imputation / interpolation - pass + def k_nearest(df): + imputer = KNNImputer(n_neighbors=3) + return pd.DataFrame(imputer.fit_transform(df), columns=df.columns) - return { # rest of the columns should be imputed with the selected method + return { 'zero': df.fillna(0), + 'high_number': df.fillna(1500), 'mean': df.fillna(df.mean()), 'median': df.fillna(df.median()), - 'k-nearest': k_nearest(df) + 'knn': k_nearest(df) }[method] - + +def graph_bf_af(features, phase_name, plt_flag=False): + if plt_flag: + sns.set(rc={"figure.figsize":(16, 8)}) + sns.heatmap(features.isna(), cbar=False) #features.select_dtypes(include=np.number) + plt.savefig(f'features_overall_nans_{phase_name}.png', bbox_inches='tight') + + print(f"\n-------------{phase_name}-------------") + print("Rows number:", features.shape[0]) + print("Columns number:", len(features.columns)) + print("NaN values:", features.isna().sum().sum()) + print("---------------------------------------------\n") diff --git a/src/features/cr_features_helper_methods.py b/src/features/cr_features_helper_methods.py index 9e96c497..7bf02254 100644 --- a/src/features/cr_features_helper_methods.py +++ b/src/features/cr_features_helper_methods.py @@ -21,7 +21,7 @@ def extract_second_order_features(intraday_features, so_features_names, prefix=" so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).median().add_suffix("_SO_median")], axis=1) if "sd" in so_features_names: - so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std().add_suffix("_SO_sd")], axis=1) + so_features = pd.concat([so_features, intraday_features.drop(prefix+"level_1", axis=1).groupby(groupby_cols).std().fillna(0).add_suffix("_SO_sd")], axis=1) if "nlargest" in so_features_names: # largest 5 -- maybe there is a faster groupby solution? 
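# On the "faster groupby solution?" question above: one possible alternative, shown as an untested
# sketch on toy data (assumes pandas is imported as pd, as elsewhere in this file). It aggregates the
# 5 largest values per group, here via their mean, with SeriesGroupBy.apply instead of a column loop.
toy = pd.DataFrame({"local_segment": ["s1"] * 6 + ["s2"] * 6, "value": list(range(12))})
top5_mean_per_segment = toy.groupby("local_segment")["value"].apply(lambda s: s.nlargest(5).mean())
# top5_mean_per_segment holds one value per segment, analogous to the nlargest second-order
# features built by the loop below.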
for column in intraday_features.loc[:, ~intraday_features.columns.isin(groupby_cols+[prefix+"level_1"])]: diff --git a/src/features/empatica_accelerometer/cr/main.py b/src/features/empatica_accelerometer/cr/main.py index 77d18bfe..372a0e03 100644 --- a/src/features/empatica_accelerometer/cr/main.py +++ b/src/features/empatica_accelerometer/cr/main.py @@ -43,7 +43,11 @@ def extract_acc_features_from_intraday_data(acc_intraday_data, features, window_ def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs): - acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"]) + + data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'double_values_0': 'float64', + 'double_values_1': 'float64', 'double_values_2': 'float64', 'local_date_time': 'str', 'local_date': "str", + 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"} + acc_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types) requested_intraday_features = provider["FEATURES"] diff --git a/src/features/empatica_data_yield.py b/src/features/empatica_data_yield.py new file mode 100644 index 00000000..2df2bcd2 --- /dev/null +++ b/src/features/empatica_data_yield.py @@ -0,0 +1,32 @@ +import pandas as pd +import numpy as np +from datetime import datetime + +import sys, yaml + +def calculate_empatica_data_yield(features): # TODO + + # Get time segment duration in seconds from all segments in features dataframe + datetime_start = pd.to_datetime(features['local_segment_start_datetime'], format='%Y-%m-%d %H:%M:%S') + datetime_end = pd.to_datetime(features['local_segment_end_datetime'], format='%Y-%m-%d %H:%M:%S') + tseg_duration = (datetime_end - datetime_start).dt.total_seconds() + + with open('config.yaml', 'r') as stream: + config = yaml.load(stream, Loader=yaml.FullLoader) + + sensors = ["EMPATICA_ACCELEROMETER", "EMPATICA_TEMPERATURE", "EMPATICA_ELECTRODERMAL_ACTIVITY", "EMPATICA_INTER_BEAT_INTERVAL"] + for sensor in sensors: + features[f"{sensor.lower()}_data_yield"] = \ + (features[f"{sensor.lower()}_cr_SO_windowsCount"] * config[sensor]["PROVIDERS"]["CR"]["WINDOWS"]["WINDOW_LENGTH"]) / tseg_duration \ + if f'{sensor.lower()}_cr_SO_windowsCount' in features else 0 + + empatica_data_yield_cols = [sensor.lower() + "_data_yield" for sensor in sensors] + pd.set_option('display.max_rows', None) + + # Assigns 1 to values that are over 1 (in case of windows not being filled fully) + features[empatica_data_yield_cols] = features[empatica_data_yield_cols].apply(lambda x: [y if y <= 1 or np.isnan(y) else 1 for y in x]) + + features["empatica_data_yield"] = features[empatica_data_yield_cols].mean(axis=1).fillna(0) + features.drop(empatica_data_yield_cols, axis=1, inplace=True) # In case of if the advanced operations will later not be needed (e.g., weighted average) + + return features diff --git a/src/features/empatica_electrodermal_activity/cr/main.py b/src/features/empatica_electrodermal_activity/cr/main.py index 0b09f02b..4f8e8379 100644 --- a/src/features/empatica_electrodermal_activity/cr/main.py +++ b/src/features/empatica_electrodermal_activity/cr/main.py @@ -44,7 +44,11 @@ def extract_eda_features_from_intraday_data(eda_intraday_data, features, window_ def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs): - eda_intraday_data = pd.read_csv(sensor_data_files["sensor_data"]) + + data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 
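# Supplying an explicit dtype map like this to pd.read_csv skips per-column type inference,
# which keeps memory use predictable when the intraday CSVs are large.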
'electrodermal_activity': 'float64', 'local_date_time': 'str', + 'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"} + + eda_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types) requested_intraday_features = provider["FEATURES"] diff --git a/src/features/empatica_inter_beat_interval/cr/main.py b/src/features/empatica_inter_beat_interval/cr/main.py index 803bf3a8..7797e329 100644 --- a/src/features/empatica_inter_beat_interval/cr/main.py +++ b/src/features/empatica_inter_beat_interval/cr/main.py @@ -50,6 +50,11 @@ def extract_ibi_features_from_intraday_data(ibi_intraday_data, features, window_ def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs): + + data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'inter_beat_interval': 'float64', 'timings': 'float64', 'local_date_time': 'str', + 'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"} + + temperature_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types) ibi_intraday_data = pd.read_csv(sensor_data_files["sensor_data"]) requested_intraday_features = provider["FEATURES"] diff --git a/src/features/empatica_temperature/cr/main.py b/src/features/empatica_temperature/cr/main.py index 36e720bd..4158e7ee 100644 --- a/src/features/empatica_temperature/cr/main.py +++ b/src/features/empatica_temperature/cr/main.py @@ -37,7 +37,10 @@ def extract_temp_features_from_intraday_data(temperature_intraday_data, features def cr_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs): - temperature_intraday_data = pd.read_csv(sensor_data_files["sensor_data"]) + data_types = {'local_timezone': 'str', 'device_id': 'str', 'timestamp': 'int64', 'temperature': 'float64', 'local_date_time': 'str', + 'local_date': "str", 'local_time': "str", 'local_hour': "str", 'local_minute': "str", 'assigned_segments': "str"} + + temperature_intraday_data = pd.read_csv(sensor_data_files["sensor_data"], dtype=data_types) requested_intraday_features = provider["FEATURES"] diff --git a/src/features/entry.py b/src/features/entry.py index 288ba168..2b995fc7 100644 --- a/src/features/entry.py +++ b/src/features/entry.py @@ -13,7 +13,10 @@ calc_windows = True if (provider.get("WINDOWS", False) and provider["WINDOWS"].g if sensor_key == "all_cleaning_individual" or sensor_key == "all_cleaning_overall": # Data cleaning - sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files) + if "overall" in sensor_key: + sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files, snakemake.params["target"]) + else: + sensor_features = run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files) else: # Extract sensor features del sensor_data_files["time_segments_labels"] diff --git a/src/features/phone_activity_recognition/rapids/main.py b/src/features/phone_activity_recognition/rapids/main.py index 063061fe..e5752388 100644 --- a/src/features/phone_activity_recognition/rapids/main.py +++ b/src/features/phone_activity_recognition/rapids/main.py @@ -37,6 +37,6 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se ar_features.index.names = ["local_segment"] ar_features = ar_features.reset_index() - ar_features.fillna(value={"count": 0, "countuniqueactivities": 0, "durationstationary": 
0, "durationmobile": 0, "durationvehicle": 0}, inplace=True) + ar_features.fillna(value={"count": 0, "countuniqueactivities": 0, "durationstationary": 0, "durationmobile": 0, "durationvehicle": 0, "mostcommonactivity": 4}, inplace=True) return ar_features diff --git a/src/features/phone_applications_foreground/rapids/main.py b/src/features/phone_applications_foreground/rapids/main.py index f35fa066..d9204547 100644 --- a/src/features/phone_applications_foreground/rapids/main.py +++ b/src/features/phone_applications_foreground/rapids/main.py @@ -9,19 +9,19 @@ def compute_features(filtered_data, apps_type, requested_features, apps_features if "timeoffirstuse" in requested_features: time_first_event = filtered_data.sort_values(by="timestamp", ascending=True).drop_duplicates(subset="local_segment", keep="first").set_index("local_segment") if time_first_event.empty: - apps_features["timeoffirstuse" + apps_type] = np.nan + apps_features["timeoffirstuse" + apps_type] = 1500 # np.nan else: apps_features["timeoffirstuse" + apps_type] = time_first_event["local_hour"] * 60 + time_first_event["local_minute"] if "timeoflastuse" in requested_features: time_last_event = filtered_data.sort_values(by="timestamp", ascending=False).drop_duplicates(subset="local_segment", keep="first").set_index("local_segment") if time_last_event.empty: - apps_features["timeoflastuse" + apps_type] = np.nan + apps_features["timeoflastuse" + apps_type] = 1500 # np.nan else: apps_features["timeoflastuse" + apps_type] = time_last_event["local_hour"] * 60 + time_last_event["local_minute"] if "frequencyentropy" in requested_features: apps_with_count = filtered_data.groupby(["local_segment","application_name"]).count().sort_values(by="timestamp", ascending=False).reset_index() if (len(apps_with_count.index) < 2 ): - apps_features["frequencyentropy" + apps_type] = np.nan + apps_features["frequencyentropy" + apps_type] = 0 # np.nan else: apps_features["frequencyentropy" + apps_type] = apps_with_count.groupby("local_segment")["timestamp"].agg(entropy) if "countevent" in requested_features: @@ -43,6 +43,7 @@ def compute_features(filtered_data, apps_type, requested_features, apps_features apps_features["sumduration" + apps_type] = filtered_data.groupby(by = ["local_segment"])["duration"].sum() apps_features.index.names = ["local_segment"] + return apps_features def process_app_features(data, requested_features, time_segment, provider, filter_data_by_segment): diff --git a/src/features/phone_bluetooth/doryab/main.py b/src/features/phone_bluetooth/doryab/main.py index 6efec19a..bbc7a47c 100644 --- a/src/features/phone_bluetooth/doryab/main.py +++ b/src/features/phone_bluetooth/doryab/main.py @@ -14,8 +14,8 @@ def deviceFeatures(devices, ownership, common_devices, features_to_compute, feat features = features.join(device_value_counts.groupby("local_segment")["bt_address"].nunique().to_frame("uniquedevices" + ownership), how="outer") if "meanscans" in features_to_compute: features = features.join(device_value_counts.groupby("local_segment")["scans"].mean().to_frame("meanscans" + ownership), how="outer") - if "stdscans" in features_to_compute: - features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership), how="outer") + if "stdscans" in features_to_compute: + features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership).fillna(0), how="outer") # Most frequent device within segments, across segments, and across dataset if 
"countscansmostfrequentdevicewithinsegments" in features_to_compute: features = features.join(device_value_counts.groupby("local_segment")["scans"].max().to_frame("countscansmostfrequentdevicewithinsegments" + ownership), how="outer") diff --git a/src/features/phone_calls/rapids/main.R b/src/features/phone_calls/rapids/main.R index d793f706..d6c8ab88 100644 --- a/src/features/phone_calls/rapids/main.R +++ b/src/features/phone_calls/rapids/main.R @@ -88,6 +88,16 @@ rapids_features <- function(sensor_data_files, time_segment, provider){ features <- call_features_of_type(calls_of_type, features_type, call_type, time_segment, requested_features) call_features <- merge(call_features, features, all=TRUE) } - call_features <- call_features %>% mutate_at(vars(contains("countmostfrequentcontact") | contains("distinctcontacts") | contains("count") | contains("sumduration") | contains("minduration") | contains("maxduration") | contains("meanduration") | contains("modeduration")), list( ~ replace_na(., 0))) + + # Fill seleted columns with a high number + time_cols <- select(call_features, contains("timefirstcall") | contains("timelastcall")) %>% + colnames(.) + + call_features <- call_features %>% + mutate_at(., time_cols, ~replace(., is.na(.), 1500)) + + # Fill NA values with 0 + call_features <- call_features %>% mutate_all(~replace(., is.na(.), 0)) + return(call_features) } \ No newline at end of file diff --git a/src/features/phone_esm/straw/esm.py b/src/features/phone_esm/straw/esm.py new file mode 100644 index 00000000..6b6d9d7c --- /dev/null +++ b/src/features/phone_esm/straw/esm.py @@ -0,0 +1,274 @@ +from collections.abc import Collection + +import numpy as np +import pandas as pd +from pytz import timezone +import datetime, json + +# from config.models import ESM, Participant +# from features import helper + +ESM_STATUS_ANSWERED = 2 + +GROUP_SESSIONS_BY = ["device_id", "esm_session"] # 'participant_id + +SESSION_STATUS_UNANSWERED = "ema_unanswered" +SESSION_STATUS_DAY_FINISHED = "day_finished" +SESSION_STATUS_COMPLETE = "ema_completed" + +ANSWER_DAY_FINISHED = "DayFinished3421" +ANSWER_DAY_OFF = "DayOff3421" +ANSWER_SET_EVENING = "DayFinishedSetEvening" + +MAX_MORNING_LENGTH = 3 +# When the participants was not yet at work at the time of the first (morning) EMA, +# only three items were answered. +# Two sleep related items and one indicating NOT starting work yet. +# Daytime EMAs are all longer, in fact they always consist of at least 6 items. + + +TZ_LJ = timezone("Europe/Ljubljana") +COLUMN_TIMESTAMP = "timestamp" +COLUMN_TIMESTAMP_ESM = "double_esm_user_answer_timestamp" + + +def get_date_from_timestamp(df_aware) -> pd.DataFrame: + """ + Transform a UNIX timestamp into a datetime (with Ljubljana timezone). + Additionally, extract only the date part, where anything until 4 AM is considered the same day. + + Parameters + ---------- + df_aware: pd.DataFrame + Any AWARE-type data as defined in models.py. + + Returns + ------- + df_aware: pd.DataFrame + The same dataframe with datetime_lj and date_lj columns added. 
+ + """ + if COLUMN_TIMESTAMP_ESM in df_aware: + column_timestamp = COLUMN_TIMESTAMP_ESM + else: + column_timestamp = COLUMN_TIMESTAMP + + df_aware["datetime_lj"] = df_aware[column_timestamp].apply( + lambda x: datetime.datetime.fromtimestamp(x / 1000.0, tz=TZ_LJ) + ) + df_aware = df_aware.assign( + date_lj=lambda x: (x.datetime_lj - datetime.timedelta(hours=4)).dt.date + ) + # Since daytime EMAs could *theoretically* last beyond midnight, but never after 4 AM, + # the datetime is first translated to 4 h earlier. + + return df_aware + + +def preprocess_esm(df_esm: pd.DataFrame) -> pd.DataFrame: + """ + Convert timestamps into human-readable datetimes and dates + and expand the JSON column into several Pandas DF columns. + + Parameters + ---------- + df_esm: pd.DataFrame + A dataframe of esm data. + + Returns + ------- + df_esm_preprocessed: pd.DataFrame + A dataframe with added columns: datetime in Ljubljana timezone and all fields from ESM_JSON column. + """ + df_esm = get_date_from_timestamp(df_esm) + + df_esm_json = df_esm["esm_json"].apply(json.loads) + df_esm_json = pd.json_normalize(df_esm_json).drop( + columns=["esm_trigger"] + ) # The esm_trigger column is already present in the main df. + return df_esm.join(df_esm_json) + + +def classify_sessions_by_completion(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame: + """ + For each distinct EMA session, determine how the participant responded to it. + Possible outcomes are: SESSION_STATUS_UNANSWERED, SESSION_STATUS_DAY_FINISHED, and SESSION_STATUS_COMPLETE + + This is done in three steps. + + First, the esm_status is considered. + If any of the ESMs in a session has a status *other than* "answered", then this session is taken as unfinished. + + Second, the sessions which do not represent full questionnaires are identified. + These are sessions where participants only marked they are finished with the day or have not yet started working. + + Third, the sessions with only one item are marked with their trigger. + We never offered questionnaires with single items, so we can be sure these are unfinished. + + Finally, all sessions that remain are marked as completed. + By going through different possibilities in expl_esm_adherence.ipynb, this turned out to be a reasonable option. + + Parameters + ---------- + df_esm_preprocessed: pd.DataFrame + A preprocessed dataframe of esm data, which must include the session ID (esm_session). + + Returns + ------- + df_session_counts: pd.Dataframe + A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with their statuses and the number of items. + """ + sessions_grouped = df_esm_preprocessed.groupby(GROUP_SESSIONS_BY) + + # 0. First, assign all session statuses as NaN. + df_session_counts = pd.DataFrame(sessions_grouped.count()["timestamp"]).rename( + columns={"timestamp": "esm_session_count"} + ) + df_session_counts["session_response"] = np.nan + + # 1. Identify all ESMs with status other than answered. + esm_not_answered = sessions_grouped.apply( + lambda x: (x.esm_status != ESM_STATUS_ANSWERED).any() + ) + df_session_counts.loc[ + esm_not_answered, "session_response" + ] = SESSION_STATUS_UNANSWERED + + # 2. Identify non-sessions, i.e. answers about the end of the day. + non_session = sessions_grouped.apply( + lambda x: ( + (x.esm_user_answer == ANSWER_DAY_FINISHED) # I finished working for today. + | (x.esm_user_answer == ANSWER_DAY_OFF) # I am not going to work today. + | ( + x.esm_user_answer == ANSWER_SET_EVENING + ) # When would you like to answer the evening EMA? 
+ ).any() + ) + df_session_counts.loc[non_session, "session_response"] = SESSION_STATUS_DAY_FINISHED + + # 3. Identify sessions appearing only once, as those were not true EMAs for sure. + singleton_sessions = (df_session_counts.esm_session_count == 1) & ( + df_session_counts.session_response.isna() + ) + df_session_1 = df_session_counts[singleton_sessions] + df_esm_unique_session = df_session_1.join( + df_esm_preprocessed.set_index(GROUP_SESSIONS_BY), how="left" + ) + df_esm_unique_session = df_esm_unique_session.assign( + session_response=lambda x: x.esm_trigger + )["session_response"] + df_session_counts.loc[ + df_esm_unique_session.index, "session_response" + ] = df_esm_unique_session + + # 4. Mark the remaining sessions as completed. + df_session_counts.loc[ + df_session_counts.session_response.isna(), "session_response" + ] = SESSION_STATUS_COMPLETE + + return df_session_counts + + +def classify_sessions_by_time(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame: + """ + For each EMA session, determine the time of the first user answer and its time type (morning, workday, or evening). + + Parameters + ---------- + df_esm_preprocessed: pd.DataFrame + A preprocessed dataframe of esm data, which must include the session ID (esm_session). + + Returns + ------- + df_session_time: pd.DataFrame + A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with their time type and timestamp of first answer. + """ + df_session_time = ( + df_esm_preprocessed.sort_values(["datetime_lj"]) # "participant_id" + .groupby(GROUP_SESSIONS_BY) + .first()[["time", "datetime_lj"]] + ) + return df_session_time + + +def classify_sessions_by_completion_time( + df_esm_preprocessed: pd.DataFrame, +) -> pd.DataFrame: + """ + This function not only classifies sessions using the previously defined functions, + it also serves to "correct" the time type of some EMA sessions. + + A morning questionnaire could seamlessly transition into a daytime questionnaire, + if the participant was already at work. + In this case, the "time" label changed mid-session. + Because of the way classify_sessions_by_time works, this questionnaire was classified as "morning". + But for all intents and purposes, it can be treated as a "daytime" EMA. + + The way this scenario is differentiated from a true "morning" questionnaire, + where the participant was NOT yet at work, is by considering their length. + + Parameters + ---------- + df_esm_preprocessed: pd.DataFrame + A preprocessed dataframe of esm data, which must include the session ID (esm_session). + + Returns + ------- + df_session_counts_time: pd.DataFrame + A dataframe of all sessions (grouped by GROUP_SESSIONS_BY) with statuses, the number of items, + their time type (with some morning EMAs reclassified) and timestamp of first answer. + + """ + df_session_counts = classify_sessions_by_completion(df_esm_preprocessed) + df_session_time = classify_sessions_by_time(df_esm_preprocessed) + + df_session_counts_time = df_session_time.join(df_session_counts) + + morning_transition_to_daytime = (df_session_counts_time.time == "morning") & ( + df_session_counts_time.esm_session_count > MAX_MORNING_LENGTH + ) + + df_session_counts_time.loc[morning_transition_to_daytime, "time"] = "daytime" + + return df_session_counts_time + + +# def clean_up_esm(df_esm_preprocessed: pd.DataFrame) -> pd.DataFrame: +# """ +# This function eliminates invalid ESM responses. +# It removes unanswered ESMs and those that indicate end of work and similar.
+# It also extracts a numeric answer from strings such as "4 - I strongly agree". + +# Parameters +# ---------- +# df_esm_preprocessed: pd.DataFrame +# A preprocessed dataframe of esm data. + +# Returns +# ------- +# df_esm_clean: pd.DataFrame +# A subset of the original dataframe. + +# """ +# df_esm_clean = df_esm_preprocessed[ +# df_esm_preprocessed["esm_status"] == ESM_STATUS_ANSWERED +# ] +# df_esm_clean = df_esm_clean[ +# ~df_esm_clean["esm_user_answer"].isin( +# [ANSWER_DAY_FINISHED, ANSWER_DAY_OFF, ANSWER_SET_EVENING] +# ) +# ] +# df_esm_clean["esm_user_answer_numeric"] = np.nan +# esm_type_numeric = [ +# ESM.ESM_TYPE.get("radio"), +# ESM.ESM_TYPE.get("scale"), +# ESM.ESM_TYPE.get("number"), +# ] +# df_esm_clean.loc[ +# df_esm_clean["esm_type"].isin(esm_type_numeric) +# ] = df_esm_clean.loc[df_esm_clean["esm_type"].isin(esm_type_numeric)].assign( +# esm_user_answer_numeric=lambda x: x.esm_user_answer.str.slice(stop=1).astype( +# int +# ) +# ) +# return df_esm_clean diff --git a/src/features/phone_esm/straw/main.py b/src/features/phone_esm/straw/main.py index 306bb681..8a55b8eb 100644 --- a/src/features/phone_esm/straw/main.py +++ b/src/features/phone_esm/straw/main.py @@ -42,7 +42,8 @@ def straw_features(sensor_data_files, time_segment, provider, filter_data_by_seg requested_features = provider["FEATURES"] # name of the features this function can compute requested_scales = provider["SCALES"] - base_features_names = ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support"] + base_features_names = ["PANAS_positive_affect", "PANAS_negative_affect", "JCQ_job_demand", "JCQ_job_control", "JCQ_supervisor_support", "JCQ_coworker_support", + "appraisal_stressfulness_period", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"] #TODO Check valid questionnaire and feature names. # the subset of requested features this function can compute features_to_compute = list(set(requested_features) & set(base_features_names)) @@ -52,7 +53,6 @@ def straw_features(sensor_data_files, time_segment, provider, filter_data_by_seg if not esm_data.empty: esm_features = pd.DataFrame() - for scale in requested_scales: questionnaire_id = QUESTIONNAIRE_IDS[scale] mask = esm_data["questionnaire_id"] == questionnaire_id @@ -60,4 +60,7 @@ def straw_features(sensor_data_files, time_segment, provider, filter_data_by_seg #TODO Create the column esm_user_score in esm_clean. Currently, this is only done when reversing. esm_features = esm_features.reset_index() + if 'index' in esm_features: # In case of an empty esm_features df + esm_features.rename(columns={'index': 'local_segment'}, inplace=True) + return esm_features diff --git a/src/features/phone_esm/straw/process_user_event_related_segments.py b/src/features/phone_esm/straw/process_user_event_related_segments.py new file mode 100644 index 00000000..353e714c --- /dev/null +++ b/src/features/phone_esm/straw/process_user_event_related_segments.py @@ -0,0 +1,220 @@ +import pandas as pd +import numpy as np +import datetime + +import math, sys, yaml + +from esm_preprocess import clean_up_esm +from esm import classify_sessions_by_completion_time, preprocess_esm + +input_data_files = dict(snakemake.input) + +def format_timestamp(x): + """This method formats a duration in seconds into the "HH MM SS" format (including spaces). If there are no hours or minutes present, + that part is omitted, e.g., "MM SS" or just "SS".
+ + Args: + x (int): time duration in seconds + + Returns: + str: formatted duration using the "HH MM SS" syntax, e.g., format_timestamp(5400) returns "1H 30M" + """ + tstring="" + space = False + if x//3600 > 0: + tstring += f"{x//3600}H" + space = True + if x % 3600 // 60 > 0: + tstring += f" {x % 3600 // 60}M" if "H" in tstring else f"{x % 3600 // 60}M" + if x % 60 > 0: + tstring += f" {x % 60}S" if "M" in tstring or "H" in tstring else f"{x % 60}S" + + return tstring + + +def extract_ers(esm_df): + """This method has two major functionalities: + (1) It prepares the STRAW event-related segments (ERS) file from the ESM file. The execution protocol depends on + the segmenting method specified in the config.yaml file. + (2) It prepares and writes a csv with targets and the corresponding time segment labels. This is later used + in the overall cleaning script (straw). + + Details about each segmenting method are listed below under the corresponding condition. Refer to the RAPIDS documentation for the + ERS file format: https://www.rapids.science/1.9/setup/configuration/#time-segments -> event segments + + Args: + esm_df (DataFrame): the ESM data read for the current participant. + + Returns: + extracted_ers (DataFrame): a dataframe with all the information necessary to write the event-related segments file + in the correct format. + """ + pd.set_option("display.max_rows", 20) + pd.set_option("display.max_columns", None) + + with open('config.yaml', 'r') as stream: + config = yaml.load(stream, Loader=yaml.FullLoader) + + pd.DataFrame(columns=["label", "intensity"]).to_csv(snakemake.output[1]) # Create an empty stress_events_targets file + + esm_preprocessed = clean_up_esm(preprocess_esm(esm_df)) + + # Take only responses from ema_completed sessions + classified = classify_sessions_by_completion_time(esm_preprocessed) + esm_filtered_sessions = classified[classified["session_response"] == 'ema_completed'].reset_index()[['device_id', 'esm_session']] + esm_df = esm_preprocessed.loc[(esm_preprocessed['device_id'].isin(esm_filtered_sessions['device_id'])) & (esm_preprocessed['esm_session'].isin(esm_filtered_sessions['esm_session']))] + + segmenting_method = config["TIME_SEGMENTS"]["TAILORED_EVENTS"]["SEGMENTING_METHOD"] + + if segmenting_method in ["30_before", "90_before"]: # takes an x-minute period before the questionnaire + the duration of the questionnaire + """ '30-minutes and 90-minutes before' have the same fundamental logic, with a couple of deviations that are explained below. + Both take an x-minute period before the questionnaire and add the questionnaire duration to it. + All questionnaire durations over 15 minutes are excluded from the querying. + """ + # Extract time-relevant information + extracted_ers = esm_df.groupby(["device_id", "esm_session"])['timestamp'].apply(lambda x: math.ceil((x.max() - x.min()) / 1000)).reset_index() # questionnaire length + extracted_ers["label"] = f"straw_event_{segmenting_method}_" + snakemake.params["pid"] + "_" + extracted_ers.index.astype(str).str.zfill(3) + extracted_ers[['event_timestamp', 'device_id']] = esm_df.groupby(["device_id", "esm_session"])['timestamp'].min().reset_index()[['timestamp', 'device_id']] + extracted_ers = extracted_ers[extracted_ers["timestamp"] <= 15 * 60].reset_index(drop=True) # ensure that the longest duration of the questionnaire answering is 15 min + extracted_ers["shift_direction"] = -1 + + if segmenting_method == "30_before": + """The 30-minutes-before method simply takes the 30 minutes before the questionnaire and adds the questionnaire duration to it.
+ The timestamps are formatted with the help of the format_timestamp() method. + """ + time_before_questionnaire = 30 * 60 # in seconds (30 minutes) + + extracted_ers["length"] = (extracted_ers["timestamp"] + time_before_questionnaire).apply(lambda x: format_timestamp(x)) + extracted_ers["shift"] = time_before_questionnaire + extracted_ers["shift"] = extracted_ers["shift"].apply(lambda x: format_timestamp(x)) + + elif segmenting_method == "90_before": + """The 90-minutes-before method has an important condition. If the time between the current and the previous questionnaire is + longer than 90 minutes, it takes 90 minutes; otherwise it takes the original time difference between the questionnaires. + """ + time_before_questionnaire = 90 * 60 # in seconds (90 minutes) + + extracted_ers[['end_event_timestamp', 'device_id']] = esm_df.groupby(["device_id", "esm_session"])['timestamp'].max().reset_index()[['timestamp', 'device_id']] + + extracted_ers['diffs'] = extracted_ers['event_timestamp'].astype('int64') - extracted_ers['end_event_timestamp'].shift(1, fill_value=0).astype('int64') + extracted_ers.loc[extracted_ers['diffs'] > time_before_questionnaire * 1000, 'diffs'] = time_before_questionnaire * 1000 + + extracted_ers["diffs"] = (extracted_ers["diffs"] / 1000).apply(lambda x: math.ceil(x)) + + extracted_ers["length"] = (extracted_ers["timestamp"] + extracted_ers["diffs"]).apply(lambda x: format_timestamp(x)) + extracted_ers["shift"] = extracted_ers["diffs"].apply(lambda x: format_timestamp(x)) + + elif segmenting_method == "stress_event": + """This is a special case of the method as it consists of two important parts: + (1) generating the ERS file (same as the methods above) and + (2) generating the targets file along with the correct time segment labels. + + This extracts event-related segments based on the event time and duration specified by the participant in the next + questionnaire. Additionally, the 5 minutes before the specified start time of the event are included, to account for the + possibility of the participant not remembering the start time precisely; this parameter can be adjusted with the variable + "time_before_event" which is defined below. + + By default, this method also excludes all events that are longer than 2.5 hours so that the segments are easily comparable. + """ + # Get and join required data + extracted_ers = esm_df.groupby(["device_id", "esm_session"])['timestamp'].apply(lambda x: math.ceil((x.max() - x.min()) / 1000)).reset_index().rename(columns={'timestamp': 'session_length'}) # questionnaire length in seconds + extracted_ers = extracted_ers[extracted_ers["session_length"] <= 15 * 60].reset_index(drop=True) # ensure that the longest duration of the questionnaire answering is 15 min + session_end_timestamp = esm_df.groupby(['device_id', 'esm_session'])['timestamp'].max().to_frame().rename(columns={'timestamp': 'session_end_timestamp'}) # questionnaire end timestamp + se_time = esm_df[esm_df.questionnaire_id == 90.].set_index(['device_id', 'esm_session'])['esm_user_answer'].to_frame().rename(columns={'esm_user_answer': 'se_time'}) + se_duration = esm_df[esm_df.questionnaire_id == 91.].set_index(['device_id', 'esm_session'])['esm_user_answer'].to_frame().rename(columns={'esm_user_answer': 'se_duration'}) + + # Extract the 3 targets that will be transferred with the csv file to the cleaning script.
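For orientation, the questionnaire_id values 87, 88, and 89 used in the following patch lines correspond to the stressfulness, threat, and challenge appraisal items; threat and challenge are averaged per session with groupby().mean(), presumably because they consist of several items, while the single stressfulness item is only re-indexed. A minimal, self-contained sketch of that averaging step with made-up data (the ids and values are illustrative, not taken from the study):

# Illustrative only: per-session averaging of a multi-item appraisal scale.
import pandas as pd

# Hypothetical answers: two threat items answered in session 4, one in session 5.
esm_df = pd.DataFrame({
    "device_id": ["dev1", "dev1", "dev1"],
    "esm_session": [4, 4, 5],
    "questionnaire_id": [88.0, 88.0, 88.0],
    "esm_user_answer_numeric": [2, 4, 1],
})

se_threat_tg = (
    esm_df[esm_df.questionnaire_id == 88.0]
    .groupby(["device_id", "esm_session"])
    .mean()["esm_user_answer_numeric"]
    .to_frame()
    .rename(columns={"esm_user_answer_numeric": "appraisal_threat"})
)
# Session 4 -> 3.0 (mean of 2 and 4), session 5 -> 1.0
print(se_threat_tg)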
+ se_stressfulness_event_tg = esm_df[esm_df.questionnaire_id == 87.].set_index(['device_id', 'esm_session'])['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_stressfulness_event'}) + se_threat_tg = esm_df[esm_df.questionnaire_id == 88.].groupby(["device_id", "esm_session"]).mean()['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_threat'}) + se_challenge_tg = esm_df[esm_df.questionnaire_id == 89.].groupby(["device_id", "esm_session"]).mean()['esm_user_answer_numeric'].to_frame().rename(columns={'esm_user_answer_numeric': 'appraisal_challenge'}) + + # All relevant features are joined by inner join to remove standalone columns (e.g., the stressfulness event target has a larger count) + extracted_ers = extracted_ers.join(session_end_timestamp, on=['device_id', 'esm_session'], how='inner') \ + .join(se_time, on=['device_id', 'esm_session'], how='inner') \ + .join(se_duration, on=['device_id', 'esm_session'], how='inner') \ + .join(se_stressfulness_event_tg, on=['device_id', 'esm_session'], how='inner') \ + .join(se_threat_tg, on=['device_id', 'esm_session'], how='inner') \ + .join(se_challenge_tg, on=['device_id', 'esm_session'], how='inner') + + + # Filter out sessions that are not useful. Because of the ambiguity this excludes: + # (1) straw event times that are marked as "0 - I don't remember" + # (2) straw event durations that are marked as "0 - I don't remember" + extracted_ers = extracted_ers[(~extracted_ers.se_time.str.startswith("0 - ")) & (~extracted_ers.se_duration.str.startswith("0 - "))] + + # Transform data into its final form, ready for the extraction + extracted_ers.reset_index(drop=True, inplace=True) + + time_before_event = 5 * 60 # in seconds (5 minutes) + extracted_ers['event_timestamp'] = pd.to_datetime(extracted_ers['se_time']).apply(lambda x: x.timestamp() * 1000).astype('int64') + extracted_ers['shift_direction'] = -1 + + # Check whether the duration is marked with "1 - It's still ongoing", which means that the end of the current questionnaire + # is taken as the end time of the segment. Otherwise the user-provided duration is taken. + extracted_ers['se_duration'] = \ + np.where( + extracted_ers['se_duration'].str.startswith("1 - "), + extracted_ers['session_end_timestamp'] - extracted_ers['event_timestamp'], + extracted_ers['se_duration'] + ) + + # This converts rows holding millisecond spans, and rows holding a clock-time duration string, into durations in seconds.
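The apply() on the following patch line has to handle two shapes of se_duration at once: integer millisecond spans produced by the "still ongoing" branch above, and clock-time strings entered by the participant. A standalone restatement of that conversion, for readability only (the helper name, constant, and example values are illustrative, not part of the patch):

# Illustrative only: equivalent of the se_duration apply() on the next patch line.
import math
import pandas as pd

TIME_BEFORE_EVENT = 5 * 60  # seconds; mirrors time_before_event defined above

def se_duration_to_seconds(value):
    if isinstance(value, int):
        # Integer values are millisecond spans (session end minus event start); round up to seconds.
        return math.ceil(value / 1000) + TIME_BEFORE_EVENT
    # String values such as "01:30" are clock-time durations: hours and minutes converted to seconds.
    parsed = pd.to_datetime(value)
    return (parsed.hour * 60 + parsed.minute) * 60 + TIME_BEFORE_EVENT

# e.g. se_duration_to_seconds(90 * 60 * 1000) -> 5700, se_duration_to_seconds("01:30") -> 5700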
+ extracted_ers['se_duration'] = \ + extracted_ers['se_duration'].apply(lambda x: math.ceil(x / 1000) if isinstance(x, int) else (pd.to_datetime(x).hour * 60 + pd.to_datetime(x).minute) * 60) + time_before_event + + extracted_ers['shift'] = format_timestamp(time_before_event) + extracted_ers['length'] = extracted_ers['se_duration'].apply(lambda x: format_timestamp(x)) + + # Drop event_timestamp duplicates in case the user referenced the same event over multiple questionnaires + extracted_ers.drop_duplicates(subset=["event_timestamp"], keep='first', inplace=True) + extracted_ers.reset_index(drop=True, inplace=True) + + extracted_ers["label"] = f"straw_event_{segmenting_method}_" + snakemake.params["pid"] + "_" + extracted_ers.index.astype(str).str.zfill(3) + + # Write the csv of extracted ERS labels with the targets related to the stressfulness event + extracted_ers[["label", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]].to_csv(snakemake.output[1], index=False) + + else: + raise Exception("Please select a valid segmenting method for the event-related segments.") + extracted_ers = pd.DataFrame(columns=["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"]) + + return extracted_ers[["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"]] + + +""" +Here the code is executed - this .py file is used both for the extraction of the STRAW time_segments file for the individual +participant, and also for merging all participants' files into one combined file, which is later used to assign time segments +to all sensors. + +There are two files involved (see rules extract_event_information_from_esm and merge_event_related_segments_files in preprocessing.smk): +(1) the ERS file, which contains all the information about the time segment timings, and +(2) the targets file, which holds the corresponding target value for each segment label and is later merged with other features in the cleaning script. +For more information, see the comment in the method above.
+""" +if snakemake.params["stage"] == "extract": + esm_df = pd.read_csv(input_data_files['esm_raw_input']) + + extracted_ers = extract_ers(esm_df) + + extracted_ers.to_csv(snakemake.output[0], index=False) + +elif snakemake.params["stage"] == "merge": + + input_data_files = dict(snakemake.input) + straw_events = pd.DataFrame(columns=["label", "event_timestamp", "length", "shift", "shift_direction", "device_id"]) + stress_events_targets = pd.DataFrame(columns=["label", "appraisal_stressfulness_event", "appraisal_threat", "appraisal_challenge"]) + + for input_file in input_data_files["ers_files"]: + ers_df = pd.read_csv(input_file) + straw_events = pd.concat([straw_events, ers_df], axis=0, ignore_index=True) + + straw_events.to_csv(snakemake.output[0], index=False) + + for input_file in input_data_files["se_files"]: + se_df = pd.read_csv(input_file) + stress_events_targets = pd.concat([stress_events_targets, se_df], axis=0, ignore_index=True) + + stress_events_targets.to_csv(snakemake.output[1], index=False) + + + diff --git a/src/features/phone_light/rapids/main.py b/src/features/phone_light/rapids/main.py index 32df47ef..f5644c7e 100644 --- a/src/features/phone_light/rapids/main.py +++ b/src/features/phone_light/rapids/main.py @@ -29,7 +29,7 @@ def rapids_features(sensor_data_files, time_segment, provider, filter_data_by_se if "medianlux" in features_to_compute: light_features["medianlux"] = light_data.groupby(["local_segment"])["double_light_lux"].median() if "stdlux" in features_to_compute: - light_features["stdlux"] = light_data.groupby(["local_segment"])["double_light_lux"].std() + light_features["stdlux"] = light_data.groupby(["local_segment"])["double_light_lux"].std().fillna(0) light_features = light_features.reset_index() diff --git a/src/features/phone_locations/doryab/main.py b/src/features/phone_locations/doryab/main.py index 19a7b8d5..e4dc0117 100644 --- a/src/features/phone_locations/doryab/main.py +++ b/src/features/phone_locations/doryab/main.py @@ -37,7 +37,8 @@ def variance_and_logvariance_features(location_data, location_features): location_data["longitude_for_wvar"] = (location_data["double_longitude"] - location_data["longitude_wavg"]) ** 2 * location_data["duration"] * 60 location_features["locationvariance"] = ((location_data_grouped["latitude_for_wvar"].sum() + location_data_grouped["longitude_for_wvar"].sum()) / (location_data_grouped["duration"].sum() * 60 - 1)).fillna(0) - location_features["loglocationvariance"] = np.log10(location_features["locationvariance"]).replace(-np.inf, np.nan) + + location_features["loglocationvariance"] = np.log10(location_features["locationvariance"]).replace(-np.inf, -1000000) return location_features diff --git a/src/features/phone_messages/rapids/main.R b/src/features/phone_messages/rapids/main.R index b92769fd..d5dddc73 100644 --- a/src/features/phone_messages/rapids/main.R +++ b/src/features/phone_messages/rapids/main.R @@ -65,6 +65,15 @@ rapids_features <- function(sensor_data_files, time_segment, provider){ features <- message_features_of_type(messages_of_type, message_type, time_segment, requested_features) messages_features <- merge(messages_features, features, all=TRUE) } - messages_features <- messages_features %>% mutate_at(vars(contains("countmostfrequentcontact") | contains("distinctcontacts") | contains("count")), list( ~ replace_na(., 0))) + # Fill seleted columns with a high number + time_cols <- select(messages_features, contains("timefirstmessages") | contains("timelastmessages")) %>% + colnames(.) 
+ + messages_features <- messages_features %>% + mutate_at(., time_cols, ~replace(., is.na(.), 1500)) + + # Fill NA values with 0 + messages_features <- messages_features %>% mutate_all(~replace(., is.na(.), 0)) + return(messages_features) } \ No newline at end of file diff --git a/src/features/phone_screen/rapids/main.py b/src/features/phone_screen/rapids/main.py index 5740e430..26580640 100644 --- a/src/features/phone_screen/rapids/main.py +++ b/src/features/phone_screen/rapids/main.py @@ -15,7 +15,7 @@ def getEpisodeDurationFeatures(screen_data, time_segment, episode, features, ref if "avgduration" in features: duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].mean().rename(columns = {"duration":"avgduration" + episode})], axis = 1) if "stdduration" in features: - duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].std().rename(columns = {"duration":"stdduration" + episode})], axis = 1) + duration_helper = pd.concat([duration_helper, screen_data_episode.groupby(["local_segment"])[["duration"]].std().fillna(0).rename(columns = {"duration":"stdduration" + episode})], axis = 1) if "firstuseafter" + "{0:0=2d}".format(reference_hour_first_use) in features: screen_data_episode_after_hour = screen_data_episode.copy() screen_data_episode_after_hour["hour"] = pd.to_datetime(screen_data_episode["local_start_date_time"]).dt.hour diff --git a/src/features/phone_wifi_visible/rapids/main.R b/src/features/phone_wifi_visible/rapids/main.R index 8a973809..a8a433fa 100644 --- a/src/features/phone_wifi_visible/rapids/main.R +++ b/src/features/phone_wifi_visible/rapids/main.R @@ -9,21 +9,26 @@ compute_wifi_feature <- function(data, feature, time_segment){ "countscans" = data %>% summarise(!!feature := n()), "uniquedevices" = data %>% summarise(!!feature := n_distinct(bssid))) return(data) + } else if(feature == "countscansmostuniquedevice"){ # Get the most scanned device - mostuniquedevice <- data %>% + mostuniquedevice <- data %>% + filter(bssid != "") %>% group_by(bssid) %>% mutate(N=n()) %>% ungroup() %>% filter(N == max(N)) %>% head(1) %>% # if there are multiple device with the same amount of scans pick the first one only pull(bssid) + data <- data %>% filter_data_by_segment(time_segment) + return(data %>% filter(bssid == mostuniquedevice) %>% group_by(local_segment) %>% - summarise(!!feature := n()) %>% - replace(is.na(.), 0)) + summarise(!!feature := n()) + ) + } } @@ -43,6 +48,6 @@ rapids_features <- function(sensor_data_files, time_segment, provider){ feature <- compute_wifi_feature(wifi_data, feature_name, time_segment) features <- merge(features, feature, by="local_segment", all = TRUE) } - + features <- features %>% mutate_all(~replace(., is.na(.), 0)) return(features) } diff --git a/src/features/standardization/main.py b/src/features/standardization/main.py deleted file mode 100644 index d91fca6d..00000000 --- a/src/features/standardization/main.py +++ /dev/null @@ -1,50 +0,0 @@ -import pandas as pd -import numpy as np -from sklearn.preprocessing import StandardScaler - -import sys - -sensor_data_files = dict(snakemake.input) - -provider = snakemake.params["provider"] -provider_key = snakemake.params["provider_key"] -sensor_key = snakemake.params["sensor_key"] - -pd.set_option('display.max_columns', None) - -if provider_key == "cr": - sys.path.append('/rapids/src/features/') - from cr_features_helper_methods import extract_second_order_features - - provider_main = 
snakemake.params["provider_main"] - prefix = sensor_key + "_" + provider_key + "_" - - windows_features_data = pd.read_csv(sensor_data_files["windows_features_data"]) - excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime', prefix + "level_1"] - - if windows_features_data.empty: - windows_features_data.to_csv(snakemake.output[1], index=False) - windows_features_data.to_csv(snakemake.output[0], index=False) - else: - windows_features_data.loc[:, ~windows_features_data.columns.isin(excluded_columns)] = StandardScaler().fit_transform(windows_features_data.loc[:, ~windows_features_data.columns.isin(excluded_columns)]) - - windows_features_data.to_csv(snakemake.output[1], index=False) - - if provider_main["WINDOWS"]["COMPUTE"] and "SECOND_ORDER_FEATURES" in provider_main["WINDOWS"]: - so_features_names = provider_main["WINDOWS"]["SECOND_ORDER_FEATURES"] - windows_so_features_data = extract_second_order_features(windows_features_data, so_features_names, prefix) - windows_so_features_data.to_csv(snakemake.output[0], index=False) - else: - pd.DataFrame().to_csv(snakemake.output[0], index=False) - -else: - for sensor_features in sensor_data_files["sensor_features"]: - if "/" + sensor_key + ".csv" in sensor_features: - sensor_data = pd.read_csv(sensor_features) - excluded_columns = ['local_segment', 'local_segment_label', 'local_segment_start_datetime', 'local_segment_end_datetime'] - - if not sensor_data.empty: - sensor_data.loc[:, ~sensor_data.columns.isin(excluded_columns)] = StandardScaler().fit_transform(sensor_data.loc[:, ~sensor_data.columns.isin(excluded_columns)]) - - sensor_data.to_csv(snakemake.output[0], index=False) - break \ No newline at end of file diff --git a/src/features/utils/utils.py b/src/features/utils/utils.py index 7303ac86..8a4d2130 100644 --- a/src/features/utils/utils.py +++ b/src/features/utils/utils.py @@ -160,12 +160,16 @@ def fetch_provider_features(provider, provider_key, sensor_key, sensor_data_file return sensor_features -def run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files): +def run_provider_cleaning_script(provider, provider_key, sensor_key, sensor_data_files, target=False): from importlib import import_module, util print("{} Processing {} {}".format(rapids_log_tag, sensor_key, provider_key)) cleaning_module = import_path(provider["SRC_SCRIPT"]) cleaning_function = getattr(cleaning_module, provider_key.lower() + "_cleaning") - sensor_features = cleaning_function(sensor_data_files, provider) + + if target: + sensor_features = cleaning_function(sensor_data_files, provider, target) + else: + sensor_features = cleaning_function(sensor_data_files, provider) return sensor_features \ No newline at end of file diff --git a/src/models/helper.py b/src/models/helper.py index 61f9f666..2e007810 100644 --- a/src/models/helper.py +++ b/src/models/helper.py @@ -1,5 +1,6 @@ import pandas as pd - +import sys +import warnings def retain_target_column(df_input: pd.DataFrame, target_variable_name: str): column_names = df_input.columns @@ -8,9 +9,9 @@ def retain_target_column(df_input: pd.DataFrame, target_variable_name: str): esm_names = column_names[esm_names_index] target_variable_index = esm_names.str.contains(target_variable_name) if all(~target_variable_index): - raise ValueError("The requested target (", target_variable_name, - ")cannot be found in the dataset.", - "Please check the names of phone_esm_ columns in z_all_sensor_features_cleaned_straw_py.csv") + 
warnings.warn(f"The requested target ({target_variable_name}) cannot be found in the dataset. Please check the names of phone_esm_ columns in the cleaned features file.") + return None + + sensor_features_plus_target = df_input.drop(esm_names, axis=1) + sensor_features_plus_target["target"] = df_input[esm_names[target_variable_index]] + # We will only keep one column related to phone_esm and that will be our target variable. diff --git a/src/models/merge_features_and_targets_for_population_model.py b/src/models/merge_features_and_targets_for_population_model.py index f9e9acd2..0ede61f8 100644 --- a/src/models/merge_features_and_targets_for_population_model.py +++ b/src/models/merge_features_and_targets_for_population_model.py @@ -12,9 +12,13 @@ for baseline_features_path in snakemake.input["demographic_features"]: all_baseline_features = pd.concat([all_baseline_features, baseline_features], axis=0) # merge sensor features and baseline features -features = sensor_features.merge(all_baseline_features, on="pid", how="left") +if not sensor_features.empty: + features = sensor_features.merge(all_baseline_features, on="pid", how="left") -target_variable_name = snakemake.params["target_variable"] -model_input = retain_target_column(features, target_variable_name) + target_variable_name = snakemake.params["target_variable"] + model_input = retain_target_column(features, target_variable_name) -model_input.to_csv(snakemake.output[0], index=False) + model_input.to_csv(snakemake.output[0], index=False) + +else: + sensor_features.to_csv(snakemake.output[0], index=False) diff --git a/src/models/select_targets.py b/src/models/select_targets.py index 4dc4a252..c6abe687 100644 --- a/src/models/select_targets.py +++ b/src/models/select_targets.py @@ -6,6 +6,8 @@ cleaned_sensor_features = pd.read_csv(snakemake.input["cleaned_sensor_features"] target_variable_name = snakemake.params["target_variable"] model_input = retain_target_column(cleaned_sensor_features, target_variable_name) -model_input.dropna(axis ="index", how="any", subset=["target"], inplace=True) -model_input.to_csv(snakemake.output[0], index=False) +if model_input is None: + pd.DataFrame().to_csv(snakemake.output[0]) +else: + model_input.to_csv(snakemake.output[0], index=False) diff --git a/tests/scripts/NaN.png b/tests/scripts/NaN.png new file mode 100644 index 0000000000000000000000000000000000000000..e1178739504132a4d30c2086d2a5b66a870db509 GIT binary patch literal 12199 [base85-encoded PNG data omitted]
st) & (df.index < en)] +plt.plot(df) + +plt.savefig(f'NaN.png') +sys.exit() + + +plt.plot(df) + +esm = pd.read_csv(f"/rapids/data/raw/p03/phone_esm_raw.csv") + +esm['date'] = pd.to_datetime(esm['timestamp'],unit='ms') +esm = esm[esm['date']] +esm.set_index('date', inplace=True) +print(esm) + +esm = esm['esm_session'].resample("2900ms").mean() + +plt.plot(esm) +plt.savefig(f'NaN.png')
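As a usage note on the retain_target_column and select_targets changes above: when the requested target is missing from the cleaned features, the pipeline now warns and writes an empty csv instead of raising. A condensed sketch of that control flow, simplified and illustrative rather than a copy of the actual scripts (the column prefix, file name, and example data are assumptions):

# Illustrative only: condensed control flow of the patched target-selection step.
import pandas as pd
import warnings

def retain_target_column(df_input, target_variable_name):
    # Assume target columns are the ESM feature columns, here identified by a "phone_esm_" prefix.
    esm_cols = df_input.columns[df_input.columns.str.startswith("phone_esm_")]
    matches = esm_cols.str.contains(target_variable_name)
    if not matches.any():
        warnings.warn(f"The requested target ({target_variable_name}) cannot be found in the dataset.")
        return None  # instead of raising ValueError, as before the patch
    out = df_input.drop(esm_cols, axis=1)
    out["target"] = df_input[esm_cols[matches][0]]  # keep a single ESM column as the target
    return out

# Hypothetical cleaned features without the requested target column.
features = pd.DataFrame({"phone_esm_straw_PANAS_positive_affect": [3.0], "f1": [0.5]})
model_input = retain_target_column(features, "appraisal_stressfulness_event")
if model_input is None:
    pd.DataFrame().to_csv("model_input.csv")  # downstream rules receive an empty file
else:
    model_input.to_csv("model_input.csv", index=False)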