Merge branch 'master' into ml_pipeline

# Conflicts:
#	.gitignore
#	rapids
ml_pipeline
junos 2022-11-16 17:46:01 +01:00
commit 848416bf6a
17 changed files with 4005 additions and 18 deletions

4
.gitignore vendored
View File

@ -6,4 +6,6 @@ __pycache__/
/config/*.ipynb /config/*.ipynb
/statistical_analysis/*.ipynb /statistical_analysis/*.ipynb
/machine_learning/intermediate_results/ /machine_learning/intermediate_results/
/data /data/features/
/data/baseline/
/data/*input*.csv

4
.gitmodules vendored 100644
View File

@ -0,0 +1,4 @@
[submodule "rapids"]
path = rapids
url = https://repo.ijs.si/junoslukan/rapids.git
branch = master

View File

@ -4,4 +4,17 @@
<component name="PyCharmProfessionalAdvertiser"> <component name="PyCharmProfessionalAdvertiser">
<option name="shown" value="true" /> <option name="shown" value="true" />
</component> </component>
<component name="RMarkdownSettings">
<option name="renderProfiles">
<map>
<entry key="file://$PROJECT_DIR$/rapids/src/visualization/merge_heatmap_sensors_per_minute_per_time_segment.Rmd">
<value>
<RMarkdownRenderProfile>
<option name="outputDirectoryUrl" value="file://$PROJECT_DIR$/rapids/src/visualization" />
</RMarkdownRenderProfile>
</value>
</entry>
</map>
</option>
</component>
</project> </project>

View File

@ -0,0 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="SmkProjectSettings" sdk="Python 3.10 (snakemake)" enabled="true" />
</project>

View File

@ -2,5 +2,6 @@
<project version="4"> <project version="4">
<component name="VcsDirectoryMappings"> <component name="VcsDirectoryMappings">
<mapping directory="$PROJECT_DIR$" vcs="Git" /> <mapping directory="$PROJECT_DIR$" vcs="Git" />
<mapping directory="$PROJECT_DIR$/rapids" vcs="Git" />
</component> </component>
</project> </project>

126
README.md
View File

@ -33,3 +33,129 @@ To install:
``` ```
DB_PASSWORD=database-password DB_PASSWORD=database-password
``` ```
# RAPIDS
To install RAPIDS, follow the [instructions on their webpage](https://www.rapids.science/1.6/setup/installation/).
Here, I include additional information related to the installation and specific to the STRAW2analysis project.
The installation was tested on Windows using Ubuntu 20.04 on Windows Subsystem for Linux ([WSL2](https://docs.microsoft.com/en-us/windows/wsl/install)).
## Custom configuration
### Credentials
As mentioned under [Database in RAPIDS documentation](https://www.rapids.science/1.6/snippets/database/), a `credentials.yaml` file is needed to connect to a database.
It should contain:
```yaml
PSQL_STRAW:
database: staw
host: 212.235.208.113
password: password
port: 5432
user: staw_db
```
where`password` needs to be specified as well.
## Possible installation issues
### Missing dependencies for RPostgres
To install `RPostgres` R package (used to connect to the PostgreSQL database), an error might occur:
```text
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libpq was not found. Try installing:
* deb: libpq-dev (Debian, Ubuntu, etc)
* rpm: postgresql-devel (Fedora, EPEL)
* rpm: postgreql8-devel, psstgresql92-devel, postgresql93-devel, or postgresql94-devel (Amazon Linux)
* csw: postgresql_dev (Solaris)
* brew: libpq (OSX)
If libpq is already installed, check that either:
(i) 'pkg-config' is in your PATH AND PKG_CONFIG_PATH contains a libpq.pc file; or
(ii) 'pg_config' is in your PATH.
If neither can detect , you can set INCLUDE_DIR
and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------[ ERROR MESSAGE ]----------------------------
<stdin>:1:10: fatal error: libpq-fe.h: No such file or directory
compilation terminated.
```
The library requires `libpq` for compiling from source, so install accordingly.
### Timezone environment variable for tidyverse (relevant for WSL2)
One of the R packages, `tidyverse` might need access to the `TZ` environment variable during the installation.
On Ubuntu 20.04 on WSL2 this triggers the following error:
```text
> install.packages('tidyverse')
ERROR: configuration failed for package xml2
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Warning in system("timedatectl", intern = TRUE) :
running command 'timedatectl' had status 1
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
namespace xml2 1.3.1 is already loaded, but >= 1.3.2 is required
Calls: <Anonymous> ... namespaceImportFrom -> asNamespace -> loadNamespace
Execution halted
ERROR: lazy loading failed for package tidyverse
```
This happens because WSL2 does not use the `timedatectl` service, which provides this variable.
```bash
~$ timedatectl
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
```
and later
```bash
Warning message:
In system("timedatectl", intern = TRUE) :
running command 'timedatectl' had status 1
Execution halted
```
This can be amended by setting the environment variable manually before attempting to install `tidyverse`:
```bash
export TZ='Europe/Ljubljana'
```
## Possible runtime issues
### Unix end of line characters
Upon running rapids, an error might occur:
```bash
/usr/bin/env: python3\r: No such file or directory
```
This is due to Windows style end of line characters.
To amend this, I added a `.gitattributes` files to force `git` to checkout `rapids` using Unix EOL characters.
If this still fails, `dos2unix` can be used to change them.
### System has not been booted with systemd as init system (PID 1)
See [the installation issue above](#Timezone-environment-variable-for-tidyverse-(relevant-for-WSL2)).
## Update RAPIDS
To update RAPIDS, first pull and merge [origin]( https://github.com/carissalow/rapids), such as with:
```commandline
git fetch --progress "origin" refs/heads/master
git merge --no-ff origin/master
```
Next, update the conda and R virtual environment.
```bash
R -e 'renv::restore(repos = c(CRAN = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))'
```

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,323 @@
# ---
# jupyter:
# jupytext:
# formats: ipynb,py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.13.0
# kernelspec:
# display_name: straw2analysis
# language: python
# name: straw2analysis
# ---
# %%
import os, sys
import importlib
import pandas as pd
import numpy as np
# import plotly.graph_objects as go
from importlib import util
from pathlib import Path
import yaml
# %%
phone_data_yield = pd.read_csv(
"../rapids/data/interim/p011/phone_yielded_timestamps_with_datetime.csv",
parse_dates=["local_date_time"],
)
time_segments_labels = pd.read_csv(
"../rapids/data/interim/time_segments/p011_time_segments_labels.csv"
)
# %%
phone_data_yield["assigned_segments"] = phone_data_yield[
"assigned_segments"
].str.replace(r"_RR\d+SS#", "#")
time_segments_labels["label"] = time_segments_labels["label"].str.replace(
r"_RR\d+SS$", ""
)
# %% tags=[]
def filter_data_by_segment(data, time_segment):
data.dropna(subset=["assigned_segments"], inplace=True)
if data.shape[0] == 0: # data is empty
data["local_segment"] = data["timestamps_segment"] = None
return data
datetime_regex = "[0-9]{4}[\-|\/][0-9]{2}[\-|\/][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"
timestamps_regex = "[0-9]{13}"
segment_regex = "\[({}#{},{};{},{})\]".format(
time_segment, datetime_regex, datetime_regex, timestamps_regex, timestamps_regex
)
data["local_segment"] = data["assigned_segments"].str.extract(
segment_regex, expand=True
)
data = data.drop(columns=["assigned_segments"])
data = data.dropna(subset=["local_segment"])
if (
data.shape[0] == 0
): # there are no rows belonging to time_segment after droping na
data["timestamps_segment"] = None
else:
data[["local_segment", "timestamps_segment"]] = data["local_segment"].str.split(
pat=";", n=1, expand=True
)
# chunk episodes
if (
(not data.empty)
and ("start_timestamp" in data.columns)
and ("end_timestamp" in data.columns)
):
data = chunk_episodes(data)
return data
# %% tags=[]
time_segment = "daily"
phone_data_yield_per_segment = filter_data_by_segment(phone_data_yield, time_segment)
# %%
phone_data_yield.tail()
# %%
phone_data_yield_per_segment.tail()
# %%
def getDataForPlot(phone_data_yield_per_segment):
# calculate the length (in minute) of per segment instance
phone_data_yield_per_segment["length"] = (
phone_data_yield_per_segment["timestamps_segment"]
.str.split(",")
.apply(lambda x: int((int(x[1]) - int(x[0])) / (1000 * 60)))
)
# calculate the number of sensors logged at least one row of data per minute.
phone_data_yield_per_segment = (
phone_data_yield_per_segment.groupby(
["local_segment", "length", "local_date", "local_hour", "local_minute"]
)[["sensor", "local_date_time"]]
.max()
.reset_index()
)
# extract local start datetime of the segment from "local_segment" column
phone_data_yield_per_segment["local_segment_start_datetimes"] = pd.to_datetime(
phone_data_yield_per_segment["local_segment"].apply(
lambda x: x.split("#")[1].split(",")[0]
)
)
# calculate the number of minutes after local start datetime of the segment
phone_data_yield_per_segment["minutes_after_segment_start"] = (
(
phone_data_yield_per_segment["local_date_time"]
- phone_data_yield_per_segment["local_segment_start_datetimes"]
)
/ pd.Timedelta(minutes=1)
).astype("int")
# impute missing rows with 0
columns_for_full_index = phone_data_yield_per_segment[
["local_segment_start_datetimes", "length"]
].drop_duplicates(keep="first")
columns_for_full_index = columns_for_full_index.apply(
lambda row: [
[row["local_segment_start_datetimes"], x] for x in range(row["length"] + 1)
],
axis=1,
)
full_index = []
for columns in columns_for_full_index:
full_index = full_index + columns
full_index = pd.MultiIndex.from_tuples(
full_index,
names=("local_segment_start_datetimes", "minutes_after_segment_start"),
)
phone_data_yield_per_segment = (
phone_data_yield_per_segment.set_index(
["local_segment_start_datetimes", "minutes_after_segment_start"]
)
.reindex(full_index)
.reset_index()
.fillna(0)
)
# transpose the dataframe per local start datetime of the segment and discard the useless index layer
phone_data_yield_per_segment = phone_data_yield_per_segment.groupby(
"local_segment_start_datetimes"
)[["minutes_after_segment_start", "sensor"]].apply(
lambda x: x.set_index("minutes_after_segment_start").transpose()
)
phone_data_yield_per_segment.index = phone_data_yield_per_segment.index.get_level_values(
"local_segment_start_datetimes"
)
return phone_data_yield_per_segment
# %%
data_for_plot_per_segment = getDataForPlot(phone_data_yield_per_segment)
# %%
# calculate the length (in minute) of per segment instance
phone_data_yield_per_segment["length"] = (
phone_data_yield_per_segment["timestamps_segment"]
.str.split(",")
.apply(lambda x: int((int(x[1]) - int(x[0])) / (1000 * 60)))
)
# %%
phone_data_yield_per_segment.tail()
# %%
# calculate the number of sensors logged at least one row of data per minute.
phone_data_yield_per_segment = (
phone_data_yield_per_segment.groupby(
["local_segment", "length", "local_date", "local_hour", "local_minute"]
)[["sensor", "local_date_time"]]
.max()
.reset_index()
)
# %%
# extract local start datetime of the segment from "local_segment" column
phone_data_yield_per_segment["local_segment_start_datetimes"] = pd.to_datetime(
phone_data_yield_per_segment["local_segment"].apply(
lambda x: x.split("#")[1].split(",")[0]
)
)
# %%
# calculate the number of minutes after local start datetime of the segment
phone_data_yield_per_segment["minutes_after_segment_start"] = (
(
phone_data_yield_per_segment["local_date_time"]
- phone_data_yield_per_segment["local_segment_start_datetimes"]
)
/ pd.Timedelta(minutes=1)
).astype("int")
# %%
columns_for_full_index = phone_data_yield_per_segment[
["local_segment_start_datetimes", "length"]
].drop_duplicates(keep="first")
columns_for_full_index = columns_for_full_index.apply(
lambda row: [
[row["local_segment_start_datetimes"], x] for x in range(row["length"] + 1)
],
axis=1,
)
# %%
full_index = []
for columns in columns_for_full_index:
full_index = full_index + columns
full_index = pd.MultiIndex.from_tuples(
full_index, names=("local_segment_start_datetimes", "minutes_after_segment_start")
)
# %%
phone_data_yield_per_segment.tail()
# %% [markdown]
# # A workaround
# %%
phone_data_yield_per_segment["local_segment_start_datetimes", "minutes_after_segment_start"] = phone_data_yield_per_segment[
["local_segment_start_datetimes", "minutes_after_segment_start"]
].drop_duplicates(keep="first")
# %%
phone_data_yield_per_segment.set_index(
["local_segment_start_datetimes", "minutes_after_segment_start"],
verify_integrity=True,
).reindex(full_index)
# %%
phone_data_yield_per_segment.head()
# %% [markdown]
# # Retry
# %%
def getDataForPlot(phone_data_yield_per_segment):
# calculate the length (in minute) of per segment instance
phone_data_yield_per_segment["length"] = (
phone_data_yield_per_segment["timestamps_segment"]
.str.split(",")
.apply(lambda x: int((int(x[1]) - int(x[0])) / (1000 * 60)))
)
# calculate the number of sensors logged at least one row of data per minute.
phone_data_yield_per_segment = (
phone_data_yield_per_segment.groupby(
["local_segment", "length", "local_date", "local_hour", "local_minute"]
)[["sensor", "local_date_time"]]
.max()
.reset_index()
)
# extract local start datetime of the segment from "local_segment" column
phone_data_yield_per_segment["local_segment_start_datetimes"] = pd.to_datetime(
phone_data_yield_per_segment["local_segment"].apply(
lambda x: x.split("#")[1].split(",")[0]
)
)
# calculate the number of minutes after local start datetime of the segment
phone_data_yield_per_segment["minutes_after_segment_start"] = (
(
phone_data_yield_per_segment["local_date_time"]
- phone_data_yield_per_segment["local_segment_start_datetimes"]
)
/ pd.Timedelta(minutes=1)
).astype("int")
# impute missing rows with 0
columns_for_full_index = phone_data_yield_per_segment[
["local_segment_start_datetimes", "length"]
].drop_duplicates(keep="first")
columns_for_full_index = columns_for_full_index.apply(
lambda row: [
[row["local_segment_start_datetimes"], x] for x in range(row["length"] + 1)
],
axis=1,
)
full_index = []
for columns in columns_for_full_index:
full_index = full_index + columns
full_index = pd.MultiIndex.from_tuples(
full_index,
names=("local_segment_start_datetimes", "minutes_after_segment_start"),
)
phone_data_yield_per_segment = phone_data_yield_per_segment.drop_duplicates(subset=["local_segment_start_datetimes", "minutes_after_segment_start"],keep="first")
phone_data_yield_per_segment = (
phone_data_yield_per_segment.set_index(
["local_segment_start_datetimes", "minutes_after_segment_start"]
)
.reindex(full_index)
.reset_index()
.fillna(0)
)
# transpose the dataframe per local start datetime of the segment and discard the useless index layer
phone_data_yield_per_segment = phone_data_yield_per_segment.groupby(
"local_segment_start_datetimes"
)[["minutes_after_segment_start", "sensor"]].apply(
lambda x: x.set_index("minutes_after_segment_start").transpose()
)
phone_data_yield_per_segment.index = phone_data_yield_per_segment.index.get_level_values(
"local_segment_start_datetimes"
)
return phone_data_yield_per_segment
# %%
phone_data_yield_per_segment = filter_data_by_segment(phone_data_yield, time_segment)
# %%
data_for_plot_per_segment = getDataForPlot(phone_data_yield_per_segment)
# %%

View File

@ -46,7 +46,7 @@ from features import esm, helper, proximity
# %% [markdown] tags=[] # %% [markdown] tags=[]
# # 1. Get the relevant data # # 1. Get the relevant data
# %% jupyter={"source_hidden": true} # %%
participants_inactive_usernames = participants.query_db.get_usernames( participants_inactive_usernames = participants.query_db.get_usernames(
collection_start=datetime.date.fromisoformat("2020-08-01") collection_start=datetime.date.fromisoformat("2020-08-01")
) )
@ -56,11 +56,11 @@ ptcp_2 = participants_inactive_usernames[0:2]
# %% [markdown] jp-MarkdownHeadingCollapsed=true tags=[] # %% [markdown] jp-MarkdownHeadingCollapsed=true tags=[]
# ## 1.1 Labels # ## 1.1 Labels
# %% jupyter={"source_hidden": true} # %%
df_esm = esm.get_esm_data(ptcp_2) df_esm = esm.get_esm_data(ptcp_2)
df_esm_preprocessed = esm.preprocess_esm(df_esm) df_esm_preprocessed = esm.preprocess_esm(df_esm)
# %% jupyter={"source_hidden": true} # %%
df_esm_PANAS = df_esm_preprocessed[ df_esm_PANAS = df_esm_preprocessed[
(df_esm_preprocessed["questionnaire_id"] == 8) (df_esm_preprocessed["questionnaire_id"] == 8)
| (df_esm_preprocessed["questionnaire_id"] == 9) | (df_esm_preprocessed["questionnaire_id"] == 9)
@ -70,7 +70,7 @@ df_esm_PANAS_clean = esm.clean_up_esm(df_esm_PANAS)
# %% [markdown] # %% [markdown]
# ## 1.2 Sensor data # ## 1.2 Sensor data
# %% jupyter={"source_hidden": true} # %%
df_proximity = proximity.get_proximity_data(ptcp_2) df_proximity = proximity.get_proximity_data(ptcp_2)
df_proximity = helper.get_date_from_timestamp(df_proximity) df_proximity = helper.get_date_from_timestamp(df_proximity)
df_proximity = proximity.recode_proximity(df_proximity) df_proximity = proximity.recode_proximity(df_proximity)
@ -81,7 +81,7 @@ df_proximity = proximity.recode_proximity(df_proximity)
# %% [markdown] # %% [markdown]
# # 2. Grouping/segmentation # # 2. Grouping/segmentation
# %% jupyter={"source_hidden": true} # %%
df_esm_PANAS_daily_means = ( df_esm_PANAS_daily_means = (
df_esm_PANAS_clean.groupby(["participant_id", "date_lj", "questionnaire_id"]) df_esm_PANAS_clean.groupby(["participant_id", "date_lj", "questionnaire_id"])
.esm_user_answer_numeric.agg("mean") .esm_user_answer_numeric.agg("mean")
@ -89,7 +89,7 @@ df_esm_PANAS_daily_means = (
.rename(columns={"esm_user_answer_numeric": "esm_numeric_mean"}) .rename(columns={"esm_user_answer_numeric": "esm_numeric_mean"})
) )
# %% jupyter={"source_hidden": true} # %%
df_esm_PANAS_daily_means = ( df_esm_PANAS_daily_means = (
df_esm_PANAS_daily_means.pivot( df_esm_PANAS_daily_means.pivot(
index=["participant_id", "date_lj"], index=["participant_id", "date_lj"],
@ -102,16 +102,16 @@ df_esm_PANAS_daily_means = (
) )
# %% jupyter={"source_hidden": true} # %%
df_proximity_daily_counts = proximity.count_proximity(df_proximity, ["date_lj"]) df_proximity_daily_counts = proximity.count_proximity(df_proximity, ["date_lj"])
# %% jupyter={"source_hidden": true} # %%
df_proximity_daily_counts df_proximity_daily_counts
# %% [markdown] # %% [markdown]
# # 3. Join features (and export to csv?) # # 3. Join features (and export to csv?)
# %% jupyter={"source_hidden": true} # %%
df_full_data_daily_means = df_esm_PANAS_daily_means.join( df_full_data_daily_means = df_esm_PANAS_daily_means.join(
df_proximity_daily_counts df_proximity_daily_counts
).reset_index() ).reset_index()
@ -119,13 +119,13 @@ df_full_data_daily_means = df_esm_PANAS_daily_means.join(
# %% [markdown] # %% [markdown]
# # 4. Machine learning model and parameters # # 4. Machine learning model and parameters
# %% jupyter={"source_hidden": true} # %%
lin_reg_proximity = linear_model.LinearRegression() lin_reg_proximity = linear_model.LinearRegression()
# %% [markdown] # %% [markdown]
# ## 4.1 Validation method # ## 4.1 Validation method
# %% jupyter={"source_hidden": true} # %%
logo = LeaveOneGroupOut() logo = LeaveOneGroupOut()
logo.get_n_splits( logo.get_n_splits(
df_full_data_daily_means[["freq_prox_near", "prop_prox_near"]], df_full_data_daily_means[["freq_prox_near", "prop_prox_near"]],
@ -136,7 +136,7 @@ logo.get_n_splits(
# %% [markdown] # %% [markdown]
# ## 4.2 Fit results (export?) # ## 4.2 Fit results (export?)
# %% jupyter={"source_hidden": true} # %%
cross_val_score( cross_val_score(
lin_reg_proximity, lin_reg_proximity,
df_full_data_daily_means[["freq_prox_near", "prop_prox_near"]], df_full_data_daily_means[["freq_prox_near", "prop_prox_near"]],
@ -147,7 +147,7 @@ cross_val_score(
scoring="r2", scoring="r2",
) )
# %% jupyter={"source_hidden": true} # %%
lin_reg_proximity.fit( lin_reg_proximity.fit(
df_full_data_daily_means[["freq_prox_near", "prop_prox_near"]], df_full_data_daily_means[["freq_prox_near", "prop_prox_near"]],
df_full_data_daily_means["PA"], df_full_data_daily_means["PA"],

View File

@ -6,7 +6,7 @@
# extension: .py # extension: .py
# format_name: percent # format_name: percent
# format_version: '1.3' # format_version: '1.3'
# jupytext_version: 1.11.4 # jupytext_version: 1.13.0
# kernelspec: # kernelspec:
# display_name: straw2analysis # display_name: straw2analysis
# language: python # language: python
@ -74,3 +74,29 @@ rows_os_manufacturer = df_category_not_found["package_name"].str.contains(
# %% # %%
with pd.option_context("display.max_rows", None, "display.max_columns", None): with pd.option_context("display.max_rows", None, "display.max_columns", None):
display(df_category_not_found.loc[~rows_os_manufacturer]) display(df_category_not_found.loc[~rows_os_manufacturer])
# %% [markdown]
# # Export categories
# %% [markdown]
# Rename all of "not_found" to "system" or "other".
# %%
df_app_categories_to_export = df_app_categories.copy()
rows_os_manufacturer_full = (df_app_categories_to_export["package_name"].str.contains(
"|".join(manufacturers + custom_rom + other), case=False
)) & (df_app_categories_to_export["play_store_genre"] == "not_found")
df_app_categories_to_export.loc[rows_os_manufacturer_full, "play_store_genre"] = "System"
# %%
rows_not_found = (df_app_categories_to_export["play_store_genre"] == "not_found")
df_app_categories_to_export.loc[rows_not_found, "play_store_genre"] = "Other"
# %%
df_app_categories_to_export["play_store_genre"].value_counts()
# %%
df_app_categories_to_export.rename(columns={"play_store_genre": "genre"},inplace=True)
df_app_categories_to_export.to_csv("../data/app_categories.csv", columns=["package_hash","genre"],index=False)
# %%

View File

@ -26,10 +26,11 @@ import pandas as pd
import seaborn as sns import seaborn as sns
import yaml import yaml
from pyprojroot import here from pyprojroot import here
from sklearn import linear_model, svm, kernel_ridge, gaussian_process from sklearn import linear_model, svm, kernel_ridge, gaussian_process, ensemble
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor
nb_dir = os.path.split(os.getcwd())[0] nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path: if nb_dir not in sys.path:
@ -269,3 +270,203 @@ np.median(
) )
) )
# %% # %%
def insert_row(df, row):
return pd.concat([df, pd.DataFrame([row], columns=df.columns)], ignore_index=True)
# %%
def run_all_models(input_csv):
# Prepare data
model_input = pd.read_csv(input_csv)
model_input.dropna(axis=1, how="all", inplace=True)
model_input.dropna(axis=0, how="any", subset=["target"], inplace=True)
index_columns = ["local_segment", "local_segment_label", "local_segment_start_datetime", "local_segment_end_datetime"]
model_input.set_index(index_columns, inplace=True)
data_x, data_y, data_groups = model_input.drop(["target", "pid"], axis=1), model_input["target"], model_input["pid"]
categorical_feature_colnames = ["gender", "startlanguage"]
categorical_features = data_x[categorical_feature_colnames].copy()
mode_categorical_features = categorical_features.mode().iloc[0]
# fillna with mode
categorical_features = categorical_features.fillna(mode_categorical_features)
# one-hot encoding
categorical_features = categorical_features.apply(lambda col: col.astype("category"))
if not categorical_features.empty:
categorical_features = pd.get_dummies(categorical_features)
numerical_features = data_x.drop(categorical_feature_colnames, axis=1)
train_x = pd.concat([numerical_features, categorical_features], axis=1)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train_x_imputed = imputer.fit_transform(train_x)
# Prepare cross validation
logo = LeaveOneGroupOut()
logo.get_n_splits(
train_x,
data_y,
groups=data_groups,
)
scores = pd.DataFrame(columns=["method", "median", "max"])
# Validate models
lin_reg_rapids = linear_model.LinearRegression()
lin_reg_scores = cross_val_score(
lin_reg_rapids,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring='r2'
)
print("Linear regression:")
print(np.median(lin_reg_scores))
scores = insert_row(scores, ["Linear regression",np.median(lin_reg_scores),np.max(lin_reg_scores)])
ridge_reg = linear_model.Ridge(alpha=.5)
ridge_reg_scores = cross_val_score(
ridge_reg,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Ridge regression")
print(np.median(ridge_reg_scores))
scores = insert_row(scores, ["Ridge regression",np.median(ridge_reg_scores),np.max(ridge_reg_scores)])
lasso_reg = linear_model.Lasso(alpha=0.1)
lasso_reg_score = cross_val_score(
lasso_reg,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Lasso regression")
print(np.median(lasso_reg_score))
scores = insert_row(scores, ["Lasso regression",np.median(lasso_reg_score),np.max(lasso_reg_score)])
bayesian_ridge_reg = linear_model.BayesianRidge()
bayesian_ridge_reg_score = cross_val_score(
bayesian_ridge_reg,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Bayesian Ridge")
print(np.median(bayesian_ridge_reg_score))
scores = insert_row(scores, ["Bayesian Ridge",np.median(bayesian_ridge_reg_score),np.max(bayesian_ridge_reg_score)])
ransac_reg = linear_model.RANSACRegressor()
ransac_reg_score = cross_val_score(
ransac_reg,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("RANSAC (outlier robust regression)")
print(np.median(ransac_reg_score))
scores = insert_row(scores, ["RANSAC",np.median(ransac_reg_score),np.max(ransac_reg_score)])
svr = svm.SVR()
svr_score = cross_val_score(
svr,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Support vector regression")
print(np.median(svr_score))
scores = insert_row(scores, ["Support vector regression",np.median(svr_score),np.max(svr_score)])
kridge = kernel_ridge.KernelRidge()
kridge_score = cross_val_score(
kridge,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Kernel Ridge regression")
print(np.median(kridge_score))
scores = insert_row(scores, ["Kernel Ridge regression",np.median(kridge_score),np.max(kridge_score)])
gpr = gaussian_process.GaussianProcessRegressor()
gpr_score = cross_val_score(
gpr,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Gaussian Process Regression")
print(np.median(gpr_score))
scores = insert_row(scores, ["Gaussian Process Regression",np.median(gpr_score),np.max(gpr_score)])
rfr = ensemble.RandomForestRegressor(max_features=0.3, n_jobs=-1)
rfr_score = cross_val_score(
rfr,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("Random Forest Regression")
print(np.median(rfr_score))
scores = insert_row(scores, ["Random Forest Regression",np.median(rfr_score),np.max(rfr_score)])
xgb = XGBRegressor()
xgb_score = cross_val_score(
xgb,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("XGBoost Regressor")
print(np.median(xgb_score))
scores = insert_row(scores, ["XGBoost Regressor",np.median(xgb_score),np.max(xgb_score)])
ada = ensemble.AdaBoostRegressor()
ada_score = cross_val_score(
ada,
X=train_x_imputed,
y=data_y,
groups=data_groups,
cv=logo,
n_jobs=-1,
scoring="r2"
)
print("ADA Boost Regressor")
print(np.median(ada_score))
scores = insert_row(scores, ["ADA Boost Regressor",np.median(ada_score),np.max(ada_score)])
return scores

View File

@ -0,0 +1,30 @@
from collections.abc import Collection
import pandas as pd
from config.models import Participant, Timezone
from setup import db_engine, session
def get_timezone_data(usernames: Collection) -> pd.DataFrame:
"""
Read the data from the proximity sensor table and return it in a dataframe.
Parameters
----------
usernames: Collection
A list of usernames to put into the WHERE condition.
Returns
-------
df_proximity: pd.DataFrame
A dataframe of proximity data.
"""
query_timezone = (
session.query(Timezone, Participant.username)
.filter(Participant.id == Timezone.participant_id)
.filter(Participant.username.in_(usernames))
)
with db_engine.connect() as connection:
df_timezone = pd.read_sql(query_timezone.statement, connection)
return df_timezone

View File

@ -0,0 +1,205 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- Generated by graphviz version 2.43.0 (0)
-->
<!-- Title: snakemake_dag Pages: 1 -->
<svg width="548pt" height="625pt"
viewBox="0.00 0.00 548.00 625.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 621)">
<title>snakemake_dag</title>
<polygon fill="white" stroke="transparent" points="-4,4 -4,-621 544,-621 544,4 -4,4"/>
<!-- 0 -->
<g id="node1" class="node">
<title>0</title>
<path fill="none" stroke="#565bd8" stroke-width="2" d="M202,-36C202,-36 172,-36 172,-36 166,-36 160,-30 160,-24 160,-24 160,-12 160,-12 160,-6 166,0 172,0 172,0 202,0 202,0 208,0 214,-6 214,-12 214,-12 214,-24 214,-24 214,-30 208,-36 202,-36"/>
<text text-anchor="middle" x="187" y="-15.5" font-family="sans" font-size="10.00">all</text>
</g>
<!-- 1 -->
<g id="node2" class="node">
<title>1</title>
<path fill="none" stroke="#56d8a9" stroke-width="2" d="M100,-617C100,-617 12,-617 12,-617 6,-617 0,-611 0,-605 0,-605 0,-588 0,-588 0,-582 6,-576 12,-576 12,-576 100,-576 100,-576 106,-576 112,-582 112,-588 112,-588 112,-605 112,-605 112,-611 106,-617 100,-617"/>
<text text-anchor="middle" x="56" y="-605" font-family="sans" font-size="10.00">pull_phone_data</text>
<text text-anchor="middle" x="56" y="-594" font-family="sans" font-size="10.00">pid: nokia_0000003</text>
<text text-anchor="middle" x="56" y="-583" font-family="sans" font-size="10.00">sensor: calls</text>
</g>
<!-- 1&#45;&gt;0 -->
<g id="edge1" class="edge">
<title>1&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M47.83,-575.78C37.21,-548.32 20,-496.76 20,-451 20,-451 20,-451 20,-161 20,-114.96 38.83,-102.85 73,-72 95.21,-51.94 126.33,-38.17 150.45,-29.7"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="151.61,-33 159.97,-26.5 149.38,-26.37 151.61,-33"/>
</g>
<!-- 2 -->
<g id="node3" class="node">
<title>2</title>
<path fill="none" stroke="#56d863" stroke-width="2" d="M124,-540C124,-540 60,-540 60,-540 54,-540 48,-534 48,-528 48,-528 48,-516 48,-516 48,-510 54,-504 60,-504 60,-504 124,-504 124,-504 130,-504 136,-510 136,-516 136,-516 136,-528 136,-528 136,-534 130,-540 124,-540"/>
<text text-anchor="middle" x="92" y="-519.5" font-family="sans" font-size="10.00">calls_episodes</text>
</g>
<!-- 1&#45;&gt;2 -->
<g id="edge9" class="edge">
<title>1&#45;&gt;2</title>
<path fill="none" stroke="grey" stroke-width="2" d="M65.84,-575.69C69.87,-567.56 74.6,-558.03 78.92,-549.33"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="82.09,-550.83 83.4,-540.32 75.82,-547.72 82.09,-550.83"/>
</g>
<!-- 2&#45;&gt;0 -->
<g id="edge2" class="edge">
<title>2&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M85.12,-503.83C75.18,-477.44 58,-425.14 58,-379 58,-379 58,-379 58,-161 58,-105.34 112.96,-61.84 151.14,-38.34"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="153.16,-41.21 159.96,-33.08 149.58,-35.2 153.16,-41.21"/>
</g>
<!-- 3 -->
<g id="node4" class="node">
<title>3</title>
<path fill="none" stroke="#d8a456" stroke-width="2" d="M187.5,-468C187.5,-468 98.5,-468 98.5,-468 92.5,-468 86.5,-462 86.5,-456 86.5,-456 86.5,-444 86.5,-444 86.5,-438 92.5,-432 98.5,-432 98.5,-432 187.5,-432 187.5,-432 193.5,-432 199.5,-438 199.5,-444 199.5,-444 199.5,-456 199.5,-456 199.5,-462 193.5,-468 187.5,-468"/>
<text text-anchor="middle" x="143" y="-453" font-family="sans" font-size="10.00">resample_episodes</text>
<text text-anchor="middle" x="143" y="-442" font-family="sans" font-size="10.00">sensor: phone_calls</text>
</g>
<!-- 2&#45;&gt;3 -->
<g id="edge10" class="edge">
<title>2&#45;&gt;3</title>
<path fill="none" stroke="grey" stroke-width="2" d="M104.61,-503.7C110.6,-495.47 117.88,-485.48 124.48,-476.42"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="127.48,-478.25 130.54,-468.1 121.82,-474.13 127.48,-478.25"/>
</g>
<!-- 3&#45;&gt;0 -->
<g id="edge3" class="edge">
<title>3&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M140.43,-432C136.64,-405.4 130,-352.3 130,-307 130,-307 130,-307 130,-161 130,-117.8 153,-72.19 169.78,-44.66"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="172.83,-46.37 175.19,-36.04 166.91,-42.65 172.83,-46.37"/>
</g>
<!-- 4 -->
<g id="node5" class="node">
<title>4</title>
<path fill="none" stroke="#56d8d8" stroke-width="2" d="M357.5,-396C357.5,-396 194.5,-396 194.5,-396 188.5,-396 182.5,-390 182.5,-384 182.5,-384 182.5,-372 182.5,-372 182.5,-366 188.5,-360 194.5,-360 194.5,-360 357.5,-360 357.5,-360 363.5,-360 369.5,-366 369.5,-372 369.5,-372 369.5,-384 369.5,-384 369.5,-390 363.5,-396 357.5,-396"/>
<text text-anchor="middle" x="276" y="-375.5" font-family="sans" font-size="10.00">resample_episodes_with_datetime</text>
</g>
<!-- 3&#45;&gt;4 -->
<g id="edge11" class="edge">
<title>3&#45;&gt;4</title>
<path fill="none" stroke="grey" stroke-width="2" d="M175.54,-431.88C193.25,-422.55 215.35,-410.92 234.32,-400.94"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="236.12,-403.94 243.34,-396.19 232.86,-397.75 236.12,-403.94"/>
</g>
<!-- 4&#45;&gt;0 -->
<g id="edge4" class="edge">
<title>4&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M250.68,-359.83C218.76,-335.92 168,-289.36 168,-235 168,-235 168,-235 168,-161 168,-120.86 175.55,-74.9 181.13,-46.4"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="184.61,-46.88 183.16,-36.39 177.74,-45.5 184.61,-46.88"/>
</g>
<!-- 8 -->
<g id="node9" class="node">
<title>8</title>
<path fill="none" stroke="#68d856" stroke-width="2" d="M353.5,-324C353.5,-324 248.5,-324 248.5,-324 242.5,-324 236.5,-318 236.5,-312 236.5,-312 236.5,-300 236.5,-300 236.5,-294 242.5,-288 248.5,-288 248.5,-288 353.5,-288 353.5,-288 359.5,-288 365.5,-294 365.5,-300 365.5,-300 365.5,-312 365.5,-312 365.5,-318 359.5,-324 353.5,-324"/>
<text text-anchor="middle" x="301" y="-309" font-family="sans" font-size="10.00">phone_calls_r_features</text>
<text text-anchor="middle" x="301" y="-298" font-family="sans" font-size="10.00">provider_key: rapids</text>
</g>
<!-- 4&#45;&gt;8 -->
<g id="edge15" class="edge">
<title>4&#45;&gt;8</title>
<path fill="none" stroke="grey" stroke-width="2" d="M282.18,-359.7C285,-351.81 288.39,-342.3 291.52,-333.55"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="294.82,-334.7 294.89,-324.1 288.23,-332.34 294.82,-334.7"/>
</g>
<!-- 5 -->
<g id="node6" class="node">
<title>5</title>
<path fill="none" stroke="#afd856" stroke-width="2" d="M475.5,-468C475.5,-468 364.5,-468 364.5,-468 358.5,-468 352.5,-462 352.5,-456 352.5,-456 352.5,-444 352.5,-444 352.5,-438 358.5,-432 364.5,-432 364.5,-432 475.5,-432 475.5,-432 481.5,-432 487.5,-438 487.5,-444 487.5,-444 487.5,-456 487.5,-456 487.5,-462 481.5,-468 475.5,-468"/>
<text text-anchor="middle" x="420" y="-453" font-family="sans" font-size="10.00">process_time_segments</text>
<text text-anchor="middle" x="420" y="-442" font-family="sans" font-size="10.00">pid: nokia_0000003</text>
</g>
<!-- 5&#45;&gt;4 -->
<g id="edge12" class="edge">
<title>5&#45;&gt;4</title>
<path fill="none" stroke="grey" stroke-width="2" d="M384.77,-431.88C365.42,-422.47 341.23,-410.71 320.57,-400.67"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="321.89,-397.41 311.36,-396.19 318.83,-403.71 321.89,-397.41"/>
</g>
<!-- 5&#45;&gt;8 -->
<g id="edge16" class="edge">
<title>5&#45;&gt;8</title>
<path fill="none" stroke="grey" stroke-width="2" d="M415.13,-431.72C409.07,-412.57 397.25,-381.55 379,-360 369.03,-348.23 355.82,-337.94 343.12,-329.64"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="344.79,-326.55 334.45,-324.21 341.08,-332.49 344.79,-326.55"/>
</g>
<!-- 6 -->
<g id="node7" class="node">
<title>6</title>
<path fill="none" stroke="#d86656" stroke-width="2" stroke-dasharray="5,2" d="M322.5,-468C322.5,-468 229.5,-468 229.5,-468 223.5,-468 217.5,-462 217.5,-456 217.5,-456 217.5,-444 217.5,-444 217.5,-438 223.5,-432 229.5,-432 229.5,-432 322.5,-432 322.5,-432 328.5,-432 334.5,-438 334.5,-444 334.5,-444 334.5,-456 334.5,-456 334.5,-462 328.5,-468 322.5,-468"/>
<text text-anchor="middle" x="276" y="-447.5" font-family="sans" font-size="10.00">prepare_tzcodes_file</text>
</g>
<!-- 6&#45;&gt;4 -->
<g id="edge13" class="edge">
<title>6&#45;&gt;4</title>
<path fill="none" stroke="grey" stroke-width="2" d="M276,-431.7C276,-423.98 276,-414.71 276,-406.11"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="279.5,-406.1 276,-396.1 272.5,-406.1 279.5,-406.1"/>
</g>
<!-- 7 -->
<g id="node8" class="node">
<title>7</title>
<path fill="none" stroke="#56d86b" stroke-width="2" stroke-dasharray="5,2" d="M370,-540C370,-540 182,-540 182,-540 176,-540 170,-534 170,-528 170,-528 170,-516 170,-516 170,-510 176,-504 182,-504 182,-504 370,-504 370,-504 376,-504 382,-510 382,-516 382,-516 382,-528 382,-528 382,-534 376,-540 370,-540"/>
<text text-anchor="middle" x="276" y="-519.5" font-family="sans" font-size="10.00">query_usernames_device_empatica_ids</text>
</g>
<!-- 7&#45;&gt;6 -->
<g id="edge14" class="edge">
<title>7&#45;&gt;6</title>
<path fill="none" stroke="grey" stroke-width="2" d="M276,-503.7C276,-495.98 276,-486.71 276,-478.11"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="279.5,-478.1 276,-468.1 272.5,-478.1 279.5,-478.1"/>
</g>
<!-- 8&#45;&gt;0 -->
<g id="edge5" class="edge">
<title>8&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M264.63,-287.8C250.06,-279.08 234.51,-267.11 225,-252 184.07,-186.97 182.71,-92.23 184.91,-46.17"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="188.41,-46.26 185.49,-36.07 181.42,-45.85 188.41,-46.26"/>
</g>
<!-- 9 -->
<g id="node10" class="node">
<title>9</title>
<path fill="none" stroke="#d87556" stroke-width="2" d="M382,-252C382,-252 246,-252 246,-252 240,-252 234,-246 234,-240 234,-240 234,-228 234,-228 234,-222 240,-216 246,-216 246,-216 382,-216 382,-216 388,-216 394,-222 394,-228 394,-228 394,-240 394,-240 394,-246 388,-252 382,-252"/>
<text text-anchor="middle" x="314" y="-237" font-family="sans" font-size="10.00">join_features_from_providers</text>
<text text-anchor="middle" x="314" y="-226" font-family="sans" font-size="10.00">sensor_key: phone_calls</text>
</g>
<!-- 8&#45;&gt;9 -->
<g id="edge17" class="edge">
<title>8&#45;&gt;9</title>
<path fill="none" stroke="grey" stroke-width="2" d="M304.21,-287.7C305.65,-279.98 307.37,-270.71 308.96,-262.11"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="312.44,-262.58 310.82,-252.1 305.56,-261.3 312.44,-262.58"/>
</g>
<!-- 9&#45;&gt;0 -->
<g id="edge6" class="edge">
<title>9&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M294.15,-215.87C283.81,-206.16 271.58,-193.31 263,-180 235.01,-136.57 243.3,-118.11 220,-72 215.36,-62.81 209.61,-53.14 204.23,-44.62"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="207.17,-42.72 198.81,-36.21 201.29,-46.51 207.17,-42.72"/>
</g>
<!-- 10 -->
<g id="node11" class="node">
<title>10</title>
<path fill="none" stroke="#56d8d0" stroke-width="2" d="M526,-180C526,-180 284,-180 284,-180 278,-180 272,-174 272,-168 272,-168 272,-156 272,-156 272,-150 278,-144 284,-144 284,-144 526,-144 526,-144 532,-144 538,-150 538,-156 538,-156 538,-168 538,-168 538,-174 532,-180 526,-180"/>
<text text-anchor="middle" x="405" y="-159.5" font-family="sans" font-size="10.00">merge_sensor_features_for_individual_participants</text>
</g>
<!-- 9&#45;&gt;10 -->
<g id="edge18" class="edge">
<title>9&#45;&gt;10</title>
<path fill="none" stroke="grey" stroke-width="2" d="M336.49,-215.7C347.96,-206.88 362.06,-196.03 374.48,-186.47"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="376.97,-188.98 382.76,-180.1 372.7,-183.43 376.97,-188.98"/>
</g>
<!-- 10&#45;&gt;0 -->
<g id="edge7" class="edge">
<title>10&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M366.3,-143.85C346.21,-134.31 321.62,-121.63 301,-108 280.21,-94.25 277.55,-87.46 258,-72 245.35,-62 231.16,-51.3 218.81,-42.16"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="220.72,-39.22 210.59,-36.1 216.57,-44.85 220.72,-39.22"/>
</g>
<!-- 11 -->
<g id="node12" class="node">
<title>11</title>
<path fill="none" stroke="#56d892" stroke-width="2" d="M528,-108C528,-108 322,-108 322,-108 316,-108 310,-102 310,-96 310,-96 310,-84 310,-84 310,-78 316,-72 322,-72 322,-72 528,-72 528,-72 534,-72 540,-78 540,-84 540,-84 540,-96 540,-96 540,-102 534,-108 528,-108"/>
<text text-anchor="middle" x="425" y="-87.5" font-family="sans" font-size="10.00">merge_sensor_features_for_all_participants</text>
</g>
<!-- 10&#45;&gt;11 -->
<g id="edge19" class="edge">
<title>10&#45;&gt;11</title>
<path fill="none" stroke="grey" stroke-width="2" d="M409.94,-143.7C412.17,-135.9 414.85,-126.51 417.33,-117.83"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="420.73,-118.68 420.11,-108.1 414,-116.76 420.73,-118.68"/>
</g>
<!-- 11&#45;&gt;0 -->
<g id="edge8" class="edge">
<title>11&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M367.08,-71.97C322.5,-58.85 262.21,-41.12 223.96,-29.87"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="224.84,-26.48 214.26,-27.02 222.87,-33.2 224.84,-26.48"/>
</g>
</g>
</svg>

After

Width:  |  Height:  |  Size: 13 KiB

File diff suppressed because it is too large Load Diff

After

Width:  |  Height:  |  Size: 135 KiB

View File

@ -0,0 +1,68 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- Generated by graphviz version 2.43.0 (0)
-->
<!-- Title: snakemake_dag Pages: 1 -->
<svg width="414pt" height="396pt"
viewBox="0.00 0.00 414.00 396.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 392)">
<title>snakemake_dag</title>
<polygon fill="white" stroke="transparent" points="-4,4 -4,-392 410,-392 410,4 -4,4"/>
<!-- 0 -->
<g id="node1" class="node">
<title>0</title>
<text text-anchor="start" x="81" y="-71.6" font-family="sans" font-weight="bold" font-size="18.00">create_participants_files</text>
<text text-anchor="start" x="81" y="-47.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="85" y="-47.8" font-family="sans" font-weight="bold" font-size="14.00">↪ input</text>
<text text-anchor="start" x="143" y="-47.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="81" y="-28" font-family="monospace" font-size="10.00">data/external/example_participants.csv</text>
<text text-anchor="start" x="319" y="-10" font-family="sans" font-size="10.00"> &#160;</text>
<polygon fill="#acd957" stroke="#acd957" points="75,-62 75,-62 333,-62 333,-62 75,-62"/>
<polygon fill="#acd957" stroke="#acd957" points="75,-22 75,-22 333,-22 333,-22 75,-22"/>
<polygon fill="none" stroke="#acd957" stroke-width="2" points="74.5,-1 74.5,-91 331.5,-91 331.5,-1 74.5,-1"/>
</g>
<!-- 1 -->
<g id="node2" class="node">
<title>1</title>
<text text-anchor="start" x="77" y="-221.6" font-family="sans" font-weight="bold" font-size="18.00">prepare_participants_csv</text>
<text text-anchor="start" x="77" y="-197.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="81" y="-197.8" font-family="sans" font-weight="bold" font-size="14.00">↪ input</text>
<text text-anchor="start" x="139" y="-197.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="77" y="-178" font-family="monospace" font-size="10.00">data/external/example_usernames.csv</text>
<text text-anchor="start" x="251" y="-157.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="255" y="-157.8" font-family="sans" font-weight="bold" font-size="14.00">output →</text>
<text text-anchor="start" x="325" y="-157.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="77" y="-138" font-family="monospace" font-size="10.00">data/external/example_participants.csv</text>
<polygon fill="#57d99e" stroke="#57d99e" points="71,-212 71,-212 336,-212 336,-212 71,-212"/>
<polygon fill="#57d99e" stroke="#57d99e" points="71,-172 71,-172 336,-172 336,-172 71,-172"/>
<polygon fill="none" stroke="#57d99e" stroke-width="2" points="71,-129 71,-241 335,-241 335,-129 71,-129"/>
</g>
<!-- 1&#45;&gt;0 -->
<g id="edge1" class="edge">
<title>1&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M203,-127.88C203,-119.48 203,-110.81 203,-102.42"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="206.5,-102.36 203,-92.36 199.5,-102.36 206.5,-102.36"/>
</g>
<!-- 2 -->
<g id="node3" class="node">
<title>2</title>
<text text-anchor="start" x="7" y="-367.6" font-family="sans" font-weight="bold" font-size="18.00">query_usernames_device_empatica_ids</text>
<text text-anchor="start" x="7" y="-346" font-family="sans" font-size="10.00"> &#160;</text>
<text text-anchor="start" x="321" y="-325.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="325" y="-325.8" font-family="sans" font-weight="bold" font-size="14.00">output →</text>
<text text-anchor="start" x="395" y="-325.8" font-family="sans" font-size="10.00"> </text>
<text text-anchor="start" x="7" y="-306" font-family="monospace" font-size="10.00">data/external/example_usernames.csv</text>
<text text-anchor="start" x="7" y="-288" font-family="monospace" font-size="10.00">data/external/timezone.csv</text>
<polygon fill="#86d957" stroke="#86d957" points="1,-358 1,-358 406,-358 406,-358 1,-358"/>
<polygon fill="#86d957" stroke="#86d957" points="1,-340 1,-340 406,-340 406,-340 1,-340"/>
<polygon fill="none" stroke="#86d957" stroke-width="2" points="1,-279 1,-387 405,-387 405,-279 1,-279"/>
</g>
<!-- 2&#45;&gt;1 -->
<g id="edge2" class="edge">
<title>2&#45;&gt;1</title>
<path fill="none" stroke="grey" stroke-width="2" d="M203,-277.63C203,-269.45 203,-260.93 203,-252.53"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="206.5,-252.36 203,-242.36 199.5,-252.36 206.5,-252.36"/>
</g>
</g>
</svg>

After

Width:  |  Height:  |  Size: 4.6 KiB

View File

@ -0,0 +1,69 @@
import datetime
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
sys.path.append(nb_dir)
import pandas as pd
from features.timezone import get_timezone_data
from pyprojroot import here
import participants.query_db
participants_inactive_usernames = participants.query_db.get_usernames(
tester=False, # True participants are wanted.
active=False, # They have all finished their participation.
collection_start=datetime.date.fromisoformat(
"2020-08-01"
), # This is the timeframe of the main study.
last_upload=datetime.date.fromisoformat("2021-09-01"),
)
participants_overview_si = pd.read_csv(
snakemake.params["baseline_folder"] + "Participants_overview_Slovenia.csv", sep=";"
)
participants_overview_be = pd.read_csv(
snakemake.params["baseline_folder"]+ "Participants_overview_Belgium.csv", sep=";"
)
participants_true_si = participants_overview_si[
participants_overview_si["Wristband_SerialNo"] != "DECLINED"
]
participants_true_be = participants_overview_be[
participants_overview_be["SmartphoneBrand+Generation"].str.slice(0, 3) != "Not"
]
# Concatenate participants from both countries.
participants_usernames_empatica = pd.concat(
[participants_true_be, participants_true_si]
)
# Filter only the participants from the main study (queried from the database).
participants_usernames_empatica = participants_usernames_empatica[
participants_usernames_empatica["Username"].isin(participants_inactive_usernames)
]
# Rename and select columns.
participants_usernames_empatica = participants_usernames_empatica.rename(
columns={"Username": "label", "Wristband_SerialNo": "empatica_id"}
)[["label", "empatica_id"]]
# Adapt for csv export.
participants_usernames_empatica["empatica_id"] = participants_usernames_empatica[
"empatica_id"
].str.replace(",", ";")
participants_usernames_empatica.to_csv(
snakemake.output["usernames_file"],
header=True,
index=False,
line_terminator="\n",
)
timezone_df = get_timezone_data(participants_inactive_usernames)
timezone_df.to_csv(
snakemake.output["timezone_file"],
header=True,
index=False,
line_terminator="\n",
)

2
rapids

@ -1 +1 @@
Subproject commit a620def209e9f043f852097d0318cb45bde74467 Subproject commit f78aa3e7b3567423b44045766b230cd60d557cb0