2021-06-01 12:10:42 +02:00
# ---
# jupyter:
# jupytext:
# formats: ipynb,py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.11.2
# kernelspec:
# display_name: straw2analysis
# language: python
# name: straw2analysis
# ---
# %%
import os
import sys
2021-06-02 18:35:00 +02:00
import seaborn as sns
2021-06-01 12:10:42 +02:00
nb_dir = os . path . split ( os . getcwd ( ) ) [ 0 ]
if nb_dir not in sys . path :
sys . path . append ( nb_dir )
import participants . query_db
from features . esm import *
2021-06-02 18:35:00 +02:00
# %% [markdown]
# # ESM data
# %% [markdown]
# Only take data from the main part of the study. The pilot data have different structure, there were especially many additions to ESM_JSON.
# %%
2021-06-07 19:32:38 +02:00
participants_inactive_usernames = participants . query_db . get_usernames (
collection_start = datetime . date . fromisoformat ( " 2020-08-01 " )
)
2021-06-02 18:35:00 +02:00
df_esm_inactive = get_esm_data ( participants_inactive_usernames )
# %%
df_esm_preprocessed = preprocess_esm ( df_esm_inactive )
df_esm_preprocessed . head ( )
# %%
df_esm_preprocessed . columns
# %% [markdown]
# # Concordance
# %% [markdown]
# The purpose of concordance is to count the number of EMA sessions that a participant answered in a day and possibly compare it to some maximum number of EMAs that could theoretically be presented for that day.
2021-06-07 19:32:38 +02:00
# Traditionally, concordance (adherence) in EMA study is simply calculated as the ratio of (daily) answered EMAs.
2021-06-02 18:42:39 +02:00
# This is possible for studies with simple EMA design, such that they are presented at fixed schedule and expired within a certain limit.
#
# Since EMAs were triggered more flexibly in our study, a different approach is needed.
2021-06-02 18:35:00 +02:00
# %% [markdown]
# ## Session IDs
# %% [markdown]
# One approach would be to count distinct session IDs which are incremented for each group of EMAs. However, since not every question answered counts as a fulfilled EMA, some unique session IDs should be eliminated first.
# %%
2021-07-03 18:45:46 +02:00
session_counts = df_esm_preprocessed . groupby ( [ " participant_id " , " esm_session " ] ) . count ( ) [
" id "
]
2021-06-07 12:01:41 +02:00
# %% [markdown]
# Group data by participant_id and esm_session and count the number of instances (by id). Session counts are therefore counts of how many times a specific session ID appears *within* a specific participant.
#
# In the plot below, it is impossible to distinguish whether a specific count appears many times within the same or across different participants.
2021-06-02 18:35:00 +02:00
# %%
sns . displot ( session_counts . to_numpy ( ) , binwidth = 1 , height = 8 )
# %% [markdown]
# ### Unique session IDs
# %%
2021-06-07 19:32:38 +02:00
df_session_counts = pd . DataFrame ( session_counts ) . rename (
columns = { " id " : " esm_session_count " }
)
2021-06-07 12:01:41 +02:00
df_session_1 = df_session_counts [ ( df_session_counts [ " esm_session_count " ] == 1 ) ]
2021-06-07 19:32:38 +02:00
df_esm_unique_session = df_session_1 . join (
2021-06-09 17:29:42 +02:00
df_esm_preprocessed . set_index ( [ " participant_id " , " device_id " , " esm_session " ] )
2021-06-07 19:32:38 +02:00
)
2021-06-02 18:35:00 +02:00
# %%
df_esm_unique_session [ " esm_user_answer " ] . value_counts ( )
# %% [markdown]
# The "DayFinished3421" tag marks the last EMA, where the participant only marked "I finished with work for today" and did not answer any questions.
# What do the answers "Ne" represent?
# %%
2021-06-07 19:32:38 +02:00
df_esm_unique_session . query ( " esm_user_answer == ' Ne ' " ) [
[ " esm_trigger " , " esm_instructions " , " esm_user_answer " ]
] . head ( )
2021-06-02 18:35:00 +02:00
# %%
2021-06-07 19:32:38 +02:00
df_esm_unique_session . loc [
df_esm_unique_session [ " esm_user_answer " ] . str . contains ( " Ne " ) , " esm_trigger "
] . value_counts ( )
2021-06-02 18:35:00 +02:00
# %% [markdown]
# These are all "first" questions of EMAs which serve as a way to postpone the daytime or evening EMAs.
# %% [markdown]
# The other answers signify expired or interrupted EMAs.
# %% [markdown]
# ### "Almost" unique session IDs
# %% [markdown]
# There are some session IDs that only appear twice or three times.
2021-06-01 12:10:42 +02:00
# %%
2021-06-07 19:32:38 +02:00
df_session_counts [
( df_session_counts [ " esm_session_count " ] < 4 )
& ( df_session_counts [ " esm_session_count " ] > 1 )
]
2021-06-02 18:35:00 +02:00
# %% [markdown]
# Some represent the morning EMAs that only contained three questions.
2021-06-01 12:10:42 +02:00
# %%
2021-06-07 19:32:38 +02:00
df_esm_preprocessed . query ( " participant_id == 89 & esm_session == 158 " ) [
[ " esm_trigger " , " esm_instructions " , " esm_user_answer " ]
]
2021-06-01 12:10:42 +02:00
# %%
2021-06-07 19:32:38 +02:00
df_esm_preprocessed . query ( " participant_id == 89 & esm_session == 157 " ) [
[ " esm_trigger " , " esm_instructions " , " esm_user_answer " ]
]
2021-06-02 18:35:00 +02:00
# %% [markdown]
# Others represent interrupted EMA sessions.
2021-06-01 12:10:42 +02:00
# %%
2021-06-07 19:32:38 +02:00
df_esm_preprocessed . query ( " participant_id == 31 & esm_session == 77 " ) [
[ " esm_trigger " , " esm_instructions " , " esm_user_answer " ]
]
2021-06-07 12:01:41 +02:00
2021-06-11 13:50:24 +02:00
# %% tags=[]
2021-06-11 14:50:14 +02:00
df_esm_2 = (
df_session_counts [ df_session_counts [ " esm_session_count " ] == 2 ]
. reset_index ( )
2021-07-03 18:45:46 +02:00
. merge ( df_esm_preprocessed , how = " left " , on = [ " participant_id " , " esm_session " ] , )
2021-06-11 14:50:14 +02:00
)
# with pd.option_context('display.max_rows', None, 'display.max_columns', None): # more options can be specified also
# display(df_esm_2)
2021-06-09 17:29:42 +02:00
2021-06-11 13:50:24 +02:00
# %% [markdown] tags=[]
2021-06-07 12:01:41 +02:00
# ### Long sessions
# %%
df_session_counts [ ( df_session_counts [ " esm_session_count " ] > 40 ) ]
# %%
2021-06-07 19:32:38 +02:00
df_esm_preprocessed . query ( " participant_id == 83 " ) . sort_values ( " _id " ) [
[ " esm_trigger " , " datetime_lj " , " _id " , " username " , " device_id " ]
]
2021-06-07 12:01:41 +02:00
# %% [markdown]
# Both, session ID and \_ID (and others) reset on application reinstall. Here, it can be seen that the application was reinstalled on 2 April (actually, the phone was replaced as reported by the participant).
#
2021-06-07 16:44:42 +02:00
# Session IDs should therefore be grouped while taking the device ID into account.
# %%
2021-06-07 19:32:38 +02:00
session_counts_device = df_esm_preprocessed . groupby (
[ " participant_id " , " device_id " , " esm_session " ]
) . count ( ) [ " id " ]
2021-06-07 16:44:42 +02:00
sns . displot ( session_counts_device . to_numpy ( ) , binwidth = 1 , height = 8 )
2021-06-02 18:35:00 +02:00
# %% [markdown]
# ## Other possibilities
# %% [markdown]
2021-06-07 19:32:38 +02:00
# Prepare a dataframe with session response as determined from other indices.
# %%
import numpy as np
df_session_counts = pd . DataFrame ( session_counts_device ) . rename (
columns = { " id " : " esm_session_count " }
)
2021-07-04 14:34:13 +02:00
df_session_counts [ " session_response " ] = np . nan
2021-06-07 19:32:38 +02:00
session_group_by = df_esm_preprocessed . groupby (
[ " participant_id " , " device_id " , " esm_session " ]
)
df_session_counts . count ( )
# %% [markdown]
# ### ESM statuses
# %% [markdown]
# The status of the ESM can be: 0-new, 1-dismissed, 2-answered, 3-expired, 4-visible, or 5-branched.
#
# Which statuses appear in the data?
# %%
df_esm_preprocessed [ " esm_status " ] . value_counts ( )
# %% [markdown]
# Most of the ESMs were answered (2). We can group all others as unanswered.
# %%
contains_status_not_2 = session_group_by . apply ( lambda x : ( x . esm_status != 2 ) . any ( ) )
df_session_counts . loc [ contains_status_not_2 , " session_response " ] = " esm_unanswered "
# %%
df_session_counts . count ( )
# %% [markdown]
# ### Day finished or off
# %%
non_session = session_group_by . apply (
lambda x : (
2021-06-11 14:50:14 +02:00
( x . esm_user_answer == " DayFinished3421 " )
| ( x . esm_user_answer == " DayOff3421 " )
| ( x . esm_user_answer == " DayFinishedSetEvening " )
2021-06-07 19:32:38 +02:00
) . any ( )
)
df_session_counts . loc [ non_session , " session_response " ] = " day_finished "
# %%
df_session_counts . count ( )
# %% [markdown]
# ### Removed
# %% [markdown]
# There are also answers that explicitly describe what happened to a pending question that start with "Removed%".
# %%
esm_removed = session_group_by . apply (
lambda x : ( x . esm_user_answer . str . contains ( " Removed " ) ) . any ( )
)
# %%
df_session_counts . loc [ esm_removed ]
# %%
df_session_counts . loc [ esm_removed , " session_response " ] . value_counts ( )
# %% [markdown]
# It turns out that these had been accounted for with ESM statuses.
2021-06-11 14:50:14 +02:00
# %% [markdown]
# ### Singleton sessions
# %%
df_session_counts . count ( )
# %%
df_session_counts [
( df_session_counts . esm_session_count == 1 )
& df_session_counts . session_response . isna ( )
]
# %%
df_session_1 = df_session_counts [
( df_session_counts [ " esm_session_count " ] == 1 )
& df_session_counts . session_response . isna ( )
]
df_esm_unique_session = df_session_1 . join (
df_esm_preprocessed . set_index ( [ " participant_id " , " device_id " , " esm_session " ] )
)
df_esm_unique_session = df_esm_unique_session [ " esm_trigger " ] . rename ( " session_response " )
# %%
df_session_counts . loc [
df_esm_unique_session . index , " session_response "
] = df_esm_unique_session
# %%
df_session_counts . count ( )
2021-06-07 19:32:38 +02:00
# %% [markdown]
# ### Evening_last
# %% [markdown]
# When the evening EMA session comes to an end, the trigger should reflect this, that is, it should say `evening_last`.
# %%
finished_sessions = session_group_by . apply (
lambda x : ( x . esm_trigger . str . endswith ( " _last " ) ) . any ( )
)
df_session_counts . loc [ finished_sessions , " session_response " ] = " esm_finished "
# %%
df_session_counts . count ( )
# %%
df_esm_preprocessed [ " esm_trigger " ] . value_counts ( )
# %%
sns . displot (
df_session_counts [ df_session_counts . session_response . isna ( ) ] ,
x = " esm_session_count " ,
binwidth = 1 ,
height = 8 ,
)
2021-06-11 13:50:24 +02:00
# %% [markdown]
2021-06-11 14:50:14 +02:00
# ### Repeated sessions
2021-06-11 13:50:24 +02:00
2021-06-11 14:50:14 +02:00
# %% [markdown]
# The sessions lengths that repeat often can probably be used as filled in EMAs. Let's only review the session lengths that are rare.
2021-06-11 13:50:24 +02:00
# %%
2021-06-11 14:50:14 +02:00
df_session_counts . loc [
df_session_counts . session_response . isna ( ) , " esm_session_count "
] . value_counts ( ) . sort_index ( )
2021-06-11 13:50:24 +02:00
2021-07-03 18:45:46 +02:00
# %% tags=[]
2021-06-11 14:50:14 +02:00
df_session_7 = df_session_counts [
( df_session_counts [ " esm_session_count " ] == 7 )
& df_session_counts . session_response . isna ( )
]
df_esm_session_7 = df_session_7 . join (
df_esm_preprocessed . set_index ( [ " participant_id " , " device_id " , " esm_session " ] ) ,
how = " left " ,
2021-06-11 13:50:24 +02:00
)
2021-06-11 20:28:24 +02:00
# %% tags=[]
2021-06-11 14:50:14 +02:00
with pd . option_context (
" display.max_rows " , None , " display.max_columns " , None
) : # more options can be specified also
display ( df_esm_session_7 [ [ " esm_trigger " , " esm_instructions " , " esm_user_answer " ] ] )
2021-06-11 13:50:24 +02:00
2021-06-11 14:50:14 +02:00
# %% [markdown]
# These are all morning questionnaires with "commute" selected or rarely "long break" in the morning.
2021-06-11 13:50:24 +02:00
# %%
2021-06-11 14:50:14 +02:00
df_session_27 = df_session_counts [
( df_session_counts [ " esm_session_count " ] == 27 )
& df_session_counts . session_response . isna ( )
]
df_esm_session_27 = df_session_27 . join (
df_esm_preprocessed . set_index ( [ " participant_id " , " device_id " , " esm_session " ] ) ,
how = " left " ,
)
2021-06-11 13:50:24 +02:00
2021-06-11 20:28:24 +02:00
# %% tags=[]
2021-06-11 14:50:14 +02:00
with pd . option_context (
" display.max_rows " , None , " display.max_columns " , None
) : # more options can be specified also
display ( df_esm_session_27 [ [ " esm_trigger " , " esm_instructions " , " esm_user_answer " ] ] )
2021-06-11 13:50:24 +02:00
2021-06-11 14:50:14 +02:00
# %% [markdown]
# These are all morning questionnaires with morning *and* workday items, with the feedback added and also branched in the longest possible way.
2021-06-11 13:50:24 +02:00
2021-06-07 19:32:38 +02:00
# %%
2021-06-11 20:28:24 +02:00
df_session_6 = df_session_counts [
( df_session_counts [ " esm_session_count " ] == 6 )
& df_session_counts . session_response . isna ( )
]
df_esm_session_6 = df_session_6 . join (
df_esm_preprocessed . set_index ( [ " participant_id " , " device_id " , " esm_session " ] ) ,
how = " left " ,
)
# %%
display ( df_esm_session_6 [ [ " esm_trigger " , " esm_instructions " , " esm_user_answer " ] ] )
2021-07-03 18:45:46 +02:00
# %% [markdown]
# The 6-question sessions are long interruptions of work during daytime.
# %% [markdown]
# # Count and classify sessions
2021-06-11 20:28:24 +02:00
# %%
df_session_counts = classify_sessions_by_completion ( df_esm_preprocessed )
df_session_time = classify_sessions_by_time ( df_esm_preprocessed )
2021-07-03 18:45:46 +02:00
# %%
df_session_time
2021-06-11 20:28:24 +02:00
# %% [markdown]
# The sessions were classified by time by taking the **first** record in a session.
# However, a morning questionnaire could seamlessly transition into a daytime questionnaire, if the participant was already at work.
# In this case, the "time" label changed mid-session.
#
# Because of the way classify_sessions_by_time works, this questionnaire was classified as "morning".
# But for all intents and purposes, it can be treated as a "daytime" EMA.
#
# This is corrected in `classify_sessions_by_completion_time`