Update step 2 to let users choose the data yield unit

data_cleaning
Meng Li 2021-11-17 18:27:18 -05:00
parent 10a57be839
commit 0e112d4f68
6 changed files with 44 additions and 25 deletions

@ -579,7 +579,8 @@ ALL_CLEANING_INDIVIDUAL:
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.5 # set to 0 to disable
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: True
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
@ -596,7 +597,8 @@ ALL_CLEANING_OVERALL:
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.5 # set to 0 to disable
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: True
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5

@ -1,7 +1,7 @@
Data Cleaning
=============
This module is to clean the extracted sensor features before merging it with the target labels.
The goal of this module is to perform basic cleaning tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. For example, if we have enough data for a participant, we can train a model for that participant. By setting parameters in the `[ALL_CLEANING_INDIVIDUAL]` section, we can get cleaned data for that participant. Similarly, if we want to train a model for all participants, we can get the cleaned data for all participants by setting parameters in the `[ALL_CLEANING_OVERALL]` section.
## Clean sensor features for individual participants
@ -17,20 +17,21 @@ Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
|Key                              | Description |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------
|`[COMPUTE]` | Set to `True` to clean sensor features for individual participants from the `RAPIDS` provider|
|`[IMPUTE_SELECTED_EVENT_FEATURES]` | Fill NA with 0 for the selected event features, see table below
|`[COMPUTE]` | Set to `True` to execute the cleaning tasks described below. You can use the parameters of each task to tweak them or deactivate them|
|`[IMPUTE_SELECTED_EVENT_FEATURES]` | Fill NAs with 0 only for event-based features, see table below
|`[COLS_NAN_THRESHOLD]` | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable
|`[COLS_VAR_THRESHOLD]` | Set to `True` to discard columns with zero variance
|`[ROWS_NAN_THRESHOLD]` | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable
|`[DATA_YIELDED_HOURS_RATIO_THRESHOLD]` | Discard rows with `phone_data_yield_rapids_ratiovalidyieldedhours` feature less than `[DATA_YIELDED_HOURS_RATIO_THRESHOLD]`. Set to 0 to disable
|`[DATA_YIELD_UNIT]` | `HOURS` or `MINUTES`. Set to `HOURS` to use the `ratiovalidyieldedhours` feature; set to `MINUTES` to use the `ratiovalidyieldedminutes` feature.
|`[DATA_YIELD_RATIO_THRESHOLD]` | Discard rows whose `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature is less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature name is determined by the `[DATA_YIELD_UNIT]` parameter. Set to 0 to disable
|`[DROP_HIGHLY_CORRELATED_FEATURES]` | Discard highly correlated features, see table below
Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
|Parameters | Description |
|-------------------------------------- |----------------------------------------------------------------|
|`[COMPUTE]` | Set to `True` to fill NA with 0 for the selected event features
|`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Assume the selected event sensor is working when phone_data_yield_rapids_ratiovalidyieldedminutes > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`. |
|`[COMPUTE]` | Set to `True` to fill NAs with 0 for phone event-based features
|`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Any missing (NA) event-based feature value in a time segment instance with phone data yield > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` will be replaced with a zero. See below for an explanation. |
Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][DROP_HIGHLY_CORRELATED_FEATURES]`:
@ -47,8 +48,18 @@ Steps to clean sensor features for individual participants. It only considers th
Take the phone calls sensor as an example. If there are no call records during a time segment for a participant, then either (1) the calls sensor was not working during that time segment, or (2) the calls sensor was working and the participant did not have any calls during that time segment. To differentiate these two situations, we assume the selected sensors are working when `phone_data_yield_rapids_ratiovalidyieldedminutes > [MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`.
The following phone event-based features are considered currently:
- Application foreground: countevent, countepisode, minduration, maxduration, meanduration, sumduration.
- Battery: all features.
- Calls: count, distinctcontacts, sumduration, minduration, maxduration, meanduration, modeduration.
- Keyboard: sessioncount, averagesessionlength, changeintextlengthlessthanminusone, changeintextlengthequaltominusone, changeintextlengthequaltoone, changeintextlengthmorethanone, maxtextlength, totalkeyboardtouches.
- Messages: count, distinctcontacts.
- Screen: sumduration, maxduration, minduration, avgduration, countepisode.
- WiFi: all connected and visible features.
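The imputation rule described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `impute_event_features` is a hypothetical helper (not a RAPIDS function), the calls column name and the 0.33 threshold are example values, and the real cleaning script selects its event columns differently.

```r
# Hypothetical helper for illustration; the RAPIDS cleaning script differs in detail.
impute_event_features <- function(features, event_columns, min_yield) {
  yield <- features$phone_data_yield_rapids_ratiovalidyieldedminutes
  # The phone is assumed to be sensing only when data yield exceeds the threshold
  sensing <- !is.na(yield) & yield > min_yield
  for (column in event_columns) {
    missing <- is.na(features[[column]])
    # Replace NA with 0 only where the sensor was presumably working
    features[[column]][missing & sensing] <- 0
  }
  features
}

# Toy data: the column name and values are made up for this example
toy <- data.frame(
  phone_data_yield_rapids_ratiovalidyieldedminutes = c(0.9, 0.1, 0.7),
  phone_calls_rapids_incoming_count = c(NA, NA, 3)
)
imputed <- impute_event_features(toy, "phone_calls_rapids_incoming_count", 0.33)
```

In this toy data the first row is imputed with 0 because the yield is above the threshold, while the second row stays NA because the phone was probably not sensing.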
??? info "2. Discard unreliable rows."
Extracted features might be not reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELDED_HOURS_RATIO_THRESHOLD]` parameter. We do not recommend you to skip this step, but you can do it by setting `[DATA_YIELDED_HOURS_RATIO_THRESHOLD]` to 0.
Extracted features might not be reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column (whichever `[DATA_YIELD_UNIT]` selects) is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using the `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_UNIT]` to `MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_UNIT]` to `HOURS`) for longer segments. We do not recommend skipping this step, but you can do so by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
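As a rough sketch of how the unit choice maps to a column and a row filter, the base R snippet below mirrors the logic of this step. `filter_by_data_yield` is a hypothetical helper written for this example, the unit validation is an addition that the cleaning script in this commit does not perform, and the toy values are made up.

```r
# Hypothetical helper, not part of RAPIDS: mirrors the step 2 row filter with base R only.
filter_by_data_yield <- function(features, unit = "HOURS", threshold = 0.5) {
  unit <- tolower(unit)
  # Assumption: reject anything other than the two documented units
  if (!unit %in% c("hours", "minutes")) {
    stop("DATA_YIELD_UNIT must be HOURS or MINUTES")
  }
  yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", unit)
  if (!yield_column %in% colnames(features)) {
    stop(paste0("Missing column: ", yield_column))
  }
  # which() ignores NA comparisons, so rows with an NA yield are dropped
  keep <- which(features[[yield_column]] >= threshold)
  features[keep, , drop = FALSE]
}

toy <- data.frame(
  local_segment = c("day1", "day2", "day3"),
  phone_data_yield_rapids_ratiovalidyieldedhours = c(0.9, 0.4, 0.6),
  phone_data_yield_rapids_ratiovalidyieldedminutes = c(0.8, 0.55, 0.3),
  stringsAsFactors = FALSE
)
kept_hours <- filter_by_data_yield(toy, "HOURS", 0.5)
kept_minutes <- filter_by_data_yield(toy, "MINUTES", 0.5)
```

With the same threshold, the two units can keep different rows, which is why the choice matters for short time segments.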
??? info "3. Discard columns (features) with too many missing values."
In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[COLS_NAN_THRESHOLD]` to 1.
@ -57,15 +68,15 @@ Steps to clean sensor features for individual participants. It only considers th
In this step, we discard columns with zero variance. We do not recommend skipping this step, but you can do so by setting `[COLS_VAR_THRESHOLD]` to `False`.
??? info "5. Drop highly correlated features."
As highly correlated features might not bring additional information and will increase the complexity of our model, we drop them in this step. The absolute values of pair-wise correlations are calculated. It is regarded as valid only if the ratio of this pair of columns (features) are less than `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a valid correlation higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation. This step can be skip by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
As highly correlated features might not bring additional information and will increase the complexity of a model, we drop them in this step. The absolute values of pair-wise correlations are calculated. Each correlation between two variables is regarded as valid only if the ratio of valid value pairs (i.e., pairs in which neither value is NA) is greater than or equal to `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a valid correlation coefficient higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. This step can be skipped by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
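To make the overlap and threshold rules more concrete, here is a simplified greedy sketch in base R. It is not the RAPIDS implementation: `drop_correlated` is a hypothetical helper, it uses Pearson correlations from `cor()`, the default values are arbitrary, and it does not recompute mean absolute correlations after each removal.

```r
# Hypothetical helper: a simplified, greedy version of step 5 using base R only.
drop_correlated <- function(features, min_overlap = 0.5, corr_threshold = 0.95) {
  numeric_features <- features[sapply(features, is.numeric)]
  present <- (!is.na(as.matrix(numeric_features))) * 1
  # Ratio of rows in which both features of a pair are non-NA
  overlap <- crossprod(present) / nrow(numeric_features)
  correlations <- abs(suppressWarnings(
    cor(numeric_features, use = "pairwise.complete.obs")
  ))
  # Correlations computed from too few value pairs are treated as invalid
  correlations[overlap < min_overlap] <- NA
  diag(correlations) <- NA
  mean_abs <- rowMeans(correlations, na.rm = TRUE)
  to_drop <- character(0)
  pairs <- which(correlations > corr_threshold & upper.tri(correlations), arr.ind = TRUE)
  for (i in seq_len(nrow(pairs))) {
    a <- rownames(correlations)[pairs[i, "row"]]
    b <- colnames(correlations)[pairs[i, "col"]]
    if (a %in% to_drop || b %in% to_drop) next
    # Remove the member of the pair with the larger mean absolute correlation
    to_drop <- c(to_drop, if (mean_abs[[a]] >= mean_abs[[b]]) a else b)
  }
  features[setdiff(colnames(features), to_drop)]
}

# Toy data: y duplicates x; w matches x perfectly but overlaps on only 2 of 6 rows
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, 6, 8, 10, 12),
  z = c(3, 1, 4, 1, 5, 9),
  w = c(1, 2, NA, NA, NA, NA)
)
reduced <- drop_correlated(toy, min_overlap = 0.5, corr_threshold = 0.95)
```

In this toy data one of `x` and `y` is removed, while `w` survives because its correlation with the other features rests on too few value pairs to count as valid.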
??? info "6. Discard rows with too many missing values."
In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. We do not recommend you to skip this step, but you can do it by setting `[ROWS_NAN_THRESHOLD]` to 1.
In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[ROWS_NAN_THRESHOLD]` to 1. In other words, we are discarding time segments (e.g. days) that did not have enough data to be considered reliable. This step is similar to step 2 except the ratio is computed based on NA values instead of a phone data yield threshold.
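A minimal base R sketch of this row filter is shown below. `drop_sparse_rows` is a hypothetical helper, and it assumes the NA ratio is computed over every column passed in; the actual script may exclude some columns from that ratio.

```r
# Hypothetical helper: keep rows whose share of NA values does not exceed the threshold.
drop_sparse_rows <- function(features, rows_nan_threshold = 0.3) {
  nan_ratio <- rowMeans(is.na(features))
  features[nan_ratio <= rows_nan_threshold, , drop = FALSE]
}

# Toy data: the NA ratios per row are 0.25, 0.75 and 0
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 2),
  c = c(1, NA, 5),
  d = c(1, 4, 2)
)
kept <- drop_sparse_rows(toy, rows_nan_threshold = 0.3)
kept_all <- drop_sparse_rows(toy, rows_nan_threshold = 1)
```

Setting the threshold to 1 keeps every row, since a ratio can never exceed 1, which matches how the step is disabled.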
## Clean sensor features for all participants.
## Clean sensor features for all participants
!!! info "File Sequence"
```bash

@ -1,7 +1,7 @@
# Change Log
## v1.7.0
- Add tests for phone battery features
- Fill NA with 0 for the selected event features: (1) each sensor's feature extraction script (2) data cleaning script
- Replace NA with 0 for selected event-based features. Done in each feature extraction script and the data cleaning module
- Refactor data cleaning module: update the structure and add dropping highly correlated features section
## v1.6.0
- Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events

@ -548,7 +548,8 @@ ALL_CLEANING_INDIVIDUAL:
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.75 # set to 0 to disable
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: False
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
@ -565,7 +566,8 @@ ALL_CLEANING_OVERALL:
COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
COLS_VAR_THRESHOLD: True
ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
DATA_YIELDED_HOURS_RATIO_THRESHOLD: 0.75 # set to 0 to disable
DATA_YIELD_UNIT: HOURS # HOURS or MINUTES
DATA_YIELD_RATIO_THRESHOLD: 0.75 # set to 0 to disable
DROP_HIGHLY_CORRELATED_FEATURES:
COMPUTE: False
MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5

@ -12,7 +12,9 @@ rapids_cleaning <- function(sensor_data_files, provider){
cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
data_yielded_hours_ratio_threshold <- as.numeric(provider[["DATA_YIELDED_HOURS_RATIO_THRESHOLD"]])
data_yield_unit <- tolower(provider[["DATA_YIELD_UNIT"]])
data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
# Impute selected event features
@ -33,12 +35,12 @@ rapids_cleaning <- function(sensor_data_files, provider){
clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
}
# Drop rows with the value of "phone_data_yield_rapids_ratiovalidyieldedhours" column less than data_yielded_hours_ratio_threshold
if(!"phone_data_yield_rapids_ratiovalidyieldedhours" %in% colnames(clean_features)){
stop("Error: RAPIDS provider needs to clean data based on phone_data_yield_rapids_ratiovalidyieldedhours column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedhours' in [FEATURES].")
# Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
if(!data_yield_column %in% colnames(clean_features)){
stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
}
clean_features <- clean_features %>%
filter(phone_data_yield_rapids_ratiovalidyieldedhours >= data_yielded_hours_ratio_threshold)
filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
# Drop columns with a percentage of NA values above cols_nan_threshold
if(nrow(clean_features))

@ -12,7 +12,9 @@ rapids_cleaning <- function(sensor_data_files, provider){
cols_nan_threshold <- as.numeric(provider[["COLS_NAN_THRESHOLD"]])
drop_zero_variance_columns <- as.logical(provider[["COLS_VAR_THRESHOLD"]])
rows_nan_threshold <- as.numeric(provider[["ROWS_NAN_THRESHOLD"]])
data_yielded_hours_ratio_threshold <- as.numeric(provider[["DATA_YIELDED_HOURS_RATIO_THRESHOLD"]])
data_yield_unit <- tolower(provider[["DATA_YIELD_UNIT"]])
data_yield_column <- paste0("phone_data_yield_rapids_ratiovalidyielded", data_yield_unit)
data_yield_ratio_threshold <- as.numeric(provider[["DATA_YIELD_RATIO_THRESHOLD"]])
drop_highly_correlated_features <- provider[["DROP_HIGHLY_CORRELATED_FEATURES"]]
# Impute selected event features
@ -33,12 +35,12 @@ rapids_cleaning <- function(sensor_data_files, provider){
clean_features[selected_columns][is.na(clean_features[selected_columns]) & (clean_features$phone_data_yield_rapids_ratiovalidyieldedminutes > impute_selected_event_features$MIN_DATA_YIELDED_MINUTES_TO_IMPUTE)] <- 0
}
# Drop rows with the value of "phone_data_yield_rapids_ratiovalidyieldedhours" column less than data_yielded_hours_ratio_threshold
if(!"phone_data_yield_rapids_ratiovalidyieldedhours" %in% colnames(clean_features)){
stop("Error: RAPIDS provider needs to clean data based on phone_data_yield_rapids_ratiovalidyieldedhours column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyieldedhours' in [FEATURES].")
# Drop rows with the value of data_yield_column less than data_yield_ratio_threshold
if(!data_yield_column %in% colnames(clean_features)){
stop(paste0("Error: RAPIDS provider needs to clean data based on ", data_yield_column, " column, please set config[PHONE_DATA_YIELD][PROVIDERS][RAPIDS][COMPUTE] to True and include 'ratiovalidyielded", data_yield_unit, "' in [FEATURES]."))
}
clean_features <- clean_features %>%
filter(phone_data_yield_rapids_ratiovalidyieldedhours >= data_yielded_hours_ratio_threshold)
filter(.[[data_yield_column]] >= data_yield_ratio_threshold)
# Drop columns with a percentage of NA values above cols_nan_threshold
if(nrow(clean_features))