Add bluetooth doryab features

2020-12-11 12:03:22 -05:00 · 2020-12-11 12:03:22 -05:00 · 266dd28d02
parent 9ff75b10ba
commit 266dd28d02
4 changed files with 195 additions and 0 deletions
--- a/config.yaml
+++ b/config.yaml
@@ -127,6 +127,14 @@ PHONE_BLUETOOTH:
      FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
      SRC_FOLDER: "rapids" # inside src/features/phone_bluetooth
      SRC_LANGUAGE: "r"
+    DORYAB:
+      COMPUTE: False
+      FEATURES: 
+        ALL: ["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"]
+        OWN: ["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"]
+        OTHERS: ["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"]
+      SRC_FOLDER: "doryab" # inside src/features/phone_bluetooth
+      SRC_LANGUAGE: "python"

 # See https://www.rapids.science/latest/features/phone-calls/
 PHONE_CALLS:
--- a/docs/citation.md
+++ b/docs/citation.md
@@ -28,6 +28,13 @@ If you computed applications foreground features using the app category (genre)
 !!! cite "Stachl et al. citation"
    Clemens Stachl, Quay Au, Ramona Schoedel, Samuel D. Gosling, Gabriella M. Harari, Daniel Buschek, Sarah Theres Völkel, Tobias Schuwerk, Michelle Oldemeier, Theresa Ullmann, Heinrich Hussmann, Bernd Bischl, Markus Bühner. Proceedings of the National Academy of Sciences Jul 2020, 117 (30) 17680-17687; DOI: 10.1073/pnas.1920484117 

+## Doryab (bluetooth)
+
+If you computed bluetooth features using the provider `[PHONE_BLUETOOTH][DORYAB]` cite [this paper](https://arxiv.org/abs/1812.10394) in addition to RAPIDS.
+
+!!! cite "Doryab et al. citation"
+    Doryab, A., Chikersal, P., Liu, X., & Dey, A. K. (2019). Extraction of Behavioral Features from Smartphone and Wearable Data. ArXiv:1812.10394 [Cs, Stat]. http://arxiv.org/abs/1812.10394
+
 ## Barnett (locations)

 If you computed locations features using the provider `[PHONE_LOCATIONS][BARNETT]` cite [this paper](https://doi.org/10.1093/biostatistics/kxy059) and [this paper](https://doi.org/10.1145/2750858.2805845) in addition to RAPIDS.
--- a/docs/features/phone-bluetooth.md
+++ b/docs/features/phone-bluetooth.md
@@ -39,3 +39,77 @@ Features description for `[PHONE_BLUETOOTH][PROVIDERS][RAPIDS]`:

 !!! note "Assumptions/Observations"
    NA
+
+## DORYAB provider
+
+!!! info "Available time segments and platforms"
+    - Available for all time segments
+    - Available for Android only
+
+!!! info "File Sequence"
+    ```bash
+    - data/raw/{pid}/phone_bluetooth_raw.csv
+    - data/raw/{pid}/phone_bluetooth_with_datetime.csv
+    - data/interim/{pid}/phone_bluetooth_features/phone_bluetooth_{language}_{provider_key}.csv
+    - data/processed/features/{pid}/phone_bluetooth.csv
+    ```
+
+
+Parameters description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:
+
+|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
+|----------------|-----------------------------------------------------------------------------------------------------------------------------------|
+|`[COMPUTE]`| Set to `True` to extract `PHONE_BLUETOOTH` features from the `DORYAB` provider|
+|`[FEATURES]` |         Features to be computed, see table below. These features are computed for three device categories: `all` devices, `own` devices and `others` devices.|
+
+
+Features description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:
+
+|Feature                    |Units      |Description|
+|-------------------------- |---------- |---------------------------|
+| countscans                 | scans | Number of scans (rows) from the devices sensed during a time segment instance. The more scans a bluetooth device has, the longer it remained within range of the participant's phone |
+| uniquedevices              | devices | Number of unique bluetooth devices sensed during a time segment instance as identified by their hardware addresses (`bt_address`) |
+| countscansmostuniquedevice | scans   | Number of scans of the most sensed device within each time segment instance|
+| countscansleastuniquedevice| scans| Number of scans of the least sensed device within each time segment instance |
+| meanscans | scans| Mean of the scans of every sensed device within each time segment instance|
+| stdscans | scans| Standard deviation of the scans of every sensed device within each time segment instance|
+
+!!! note "Assumptions/Observations"
+    - This provider is adapted from the work by [Doryab et al.](../../citation#doryab-bluetooth). Devices are classified as belonging to the participant (`own`) or to other people (`others`) using k-means based on the number of times and the number of days each device was detected across each participant's dataset.
+    - If ownership cannot be computed because all devices were detected on only one day, they are all considered as `others`. Thus `all` and `others` features will be equal.
+    - These features are computed for devices detected within each time segment instance. For example, let's say that we logged the following devices on three different time segment instances (days) for `p01`:
+    ```csv
+    local_date                            bt_address
+    2016-11-29  55C836F5-487E-405F-8E28-21DBD40FA4FF
+    2016-11-29  55C836F5-487E-405F-8E28-21DBD40FA4FF
+    2016-11-29  55C836F5-487E-405F-8E28-21DBD40FA4FF
+    2016-11-29  48872A52-68DE-420D-98DA-73339A1C4685
+    2016-11-29  48872A52-68DE-420D-98DA-73339A1C4685
+    2016-11-30  55C836F5-487E-405F-8E28-21DBD40FA4FF
+    2016-11-30  55C836F5-487E-405F-8E28-21DBD40FA4FF
+    2016-11-30  48872A52-68DE-420D-98DA-73339A1C4685
+    2017-05-07  5C5A9C41-2F68-4CEB-96D0-77DE3729B729
+    2017-05-07  25262DC7-780C-4AD5-AD3A-D9776AEF7FC1
+    2017-05-07  5B1E6981-2E50-4D9A-99D8-67AED430C5A8
+    2017-05-07  6C444841-FE64-4375-BC3F-FA410CDC0AC7
+    2017-05-07  5B1E6981-2E50-4D9A-99D8-67AED430C5A8
+    2017-05-07  4DC7A22D-9F1F-4DEF-8576-086910AABCB5
+    ```
+    - For each device we compute `days_scanned` (the number of days on which the device was detected), `scans` (the number of times the device was detected), `scans_per_day` (equal to `scans/days_scanned`), and whether the device is labelled as `own` or `others` (note the last device is labelled as an `own` device because it was detected 5 times over two time segment instances):
+    ```csv
+    bt_address                            days_scanned  scans  scans_per_day own_device
+    25262DC7-780C-4AD5-AD3A-D9776AEF7FC1             1      1            1.0          0
+    4DC7A22D-9F1F-4DEF-8576-086910AABCB5             1      1            1.0          0
+    5C5A9C41-2F68-4CEB-96D0-77DE3729B729             1      1            1.0          0
+    6C444841-FE64-4375-BC3F-FA410CDC0AC7             1      1            1.0          0
+    5B1E6981-2E50-4D9A-99D8-67AED430C5A8             1      2            2.0          0
+    48872A52-68DE-420D-98DA-73339A1C4685             2      3            1.5          0
+    55C836F5-487E-405F-8E28-21DBD40FA4FF             2      5            2.5          1
+    ```
+    - These are the metrics for each time segment instance (day) for `own` and `others` devices (we ignore `all` for brevity). The only `own` device (`55C836F5-487E-405F-8E28-21DBD40FA4FF`) was detected on the first two days, 3 and 2 times respectively, while `others` devices were detected on all three days. On the last day (`2017-05-07`) there were 6 scans from 5 unique devices, the most frequent device for that day was `5B1E6981-2E50-4D9A-99D8-67AED430C5A8` with 2 scans, and the mean number of scans among all devices was 1.2 (`[1 + 1 + 1 + 1 + 2] / 5`).
+    ```csv
+    local_segment countscansown uniquedevicesown countscansmostuniquedeviceown countscansleastuniquedeviceown meanscansown stdscansown countscansothers uniquedevicesothers countscansmostuniquedeviceothers countscansleastuniquedeviceothers meanscansothers stdscansothers
+    2016-11-29 3.0 1.0 3.0 3.0 3.0 NaN 2 1 2 2 2.0 NaN
+    2016-11-30 2.0 1.0 2.0 2.0 2.0 NaN 1 1 1 1 1.0 NaN
+    2017-05-07 NaN NaN NaN NaN NaN NaN 6 5 2 1 1.2 0.447214
+    ```
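The `others` row for `2017-05-07` in the documented example can be checked by hand. The following is a minimal sketch, assuming only pandas; it mirrors the per-segment aggregation done in `deviceFeatures` but is not the provider code itself, and the variable names (`scans`, `per_device`, `summary`) are illustrative.

```python
import pandas as pd

# Scans from the 2017-05-07 segment of the documentation example; every
# device seen that day was labelled `others`.
scans = pd.DataFrame({
    "local_segment": ["2017-05-07"] * 6,
    "bt_address": [
        "5C5A9C41-2F68-4CEB-96D0-77DE3729B729",
        "25262DC7-780C-4AD5-AD3A-D9776AEF7FC1",
        "5B1E6981-2E50-4D9A-99D8-67AED430C5A8",
        "6C444841-FE64-4375-BC3F-FA410CDC0AC7",
        "5B1E6981-2E50-4D9A-99D8-67AED430C5A8",
        "4DC7A22D-9F1F-4DEF-8576-086910AABCB5",
    ],
})

# One row per (segment, device) holding that device's scan count.
per_device = (
    scans.groupby(["local_segment", "bt_address"])
    .size()
    .rename("scans")
    .reset_index()
)

# Aggregate the per-device counts into the six segment-level features.
summary = per_device.groupby("local_segment")["scans"].agg(
    countscans="sum",
    uniquedevices="count",
    countscansmostuniquedevice="max",
    countscansleastuniquedevice="min",
    meanscans="mean",
    stdscans="std",
)
print(summary.round(6))
```

Note that `std` here is the sample standard deviation (pandas default, `ddof=1`), which is why a segment with a single device yields `NaN` for `stdscans` in the documented table.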
--- a/src/features/phone_bluetooth/doryab/main.py
+++ b/src/features/phone_bluetooth/doryab/main.py
@@ -0,0 +1,106 @@
+import pandas as pd
+import numpy as np
+from sklearn.cluster import KMeans
+
+def deviceFeatures(devices, ownership, features_to_compute, features):
+    if devices.shape[0] == 0:
+        device_value_counts = pd.DataFrame(columns=["local_segment", "bt_address", "scans"], dtype=int)
+    else:
+        device_value_counts = devices.groupby(["local_segment"])["bt_address"].value_counts().to_frame("scans").reset_index()
+
+    if "countscans" in features_to_compute:
+        features = features.join(device_value_counts.groupby("local_segment")["scans"].sum().to_frame("countscans" + ownership), how="outer")
+    if "uniquedevices" in features_to_compute:
+        features = features.join(device_value_counts.groupby("local_segment")["bt_address"].nunique().to_frame("uniquedevices" + ownership), how="outer")
+    if "countscansmostuniquedevice" in features_to_compute:
+        features = features.join(device_value_counts.groupby("local_segment")["scans"].max().to_frame("countscansmostuniquedevice" + ownership), how="outer")
+    if "countscansleastuniquedevice" in features_to_compute:
+        features = features.join(device_value_counts.groupby("local_segment")["scans"].min().to_frame("countscansleastuniquedevice" + ownership), how="outer")
+    if "meanscans" in features_to_compute:
+        features = features.join(device_value_counts.groupby("local_segment")["scans"].mean().to_frame("meanscans" + ownership), how="outer")
+    if "stdscans" in features_to_compute:
+        features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership), how="outer")
+    return(features)
+
+def deviceFrequency(bt_data):
+    bt_data = bt_data[["local_date", "bt_address"]].dropna(subset=["bt_address"])
+    bt_data = bt_data.groupby("bt_address").agg({"local_date": pd.Series.nunique, "bt_address" : 'count'})
+    bt_data = bt_data.rename(columns={"local_date" : "days_scanned", "bt_address" : "scans"})
+    bt_data["scans_per_day"] = bt_data["scans"] / bt_data["days_scanned"]
+    return bt_data
+
+def ownership_based_on_clustering(bt_frequency):
+    bt_frequency = bt_frequency.reset_index()
+    for col in ["scans_per_day", "days_scanned", "scans"]:
+        col_zscore = col + '_z'
+        bt_frequency[col_zscore] = (bt_frequency[col] - bt_frequency[col].mean()) / bt_frequency[col].std(ddof=0)
+
+    bt_frequency = bt_frequency.dropna(how='any')
+    if len(bt_frequency) == 0:
+        bt_frequency["own_device"] = None
+        return bt_frequency[["bt_address", "own_device"]]
+
+    avgfreq_z = bt_frequency["scans_per_day_z"]
+    numdays_z = bt_frequency["days_scanned_z"]
+    score = avgfreq_z + numdays_z
+    maxscore = np.max(score)
+    minscore = np.min(score)
+    midscore = (maxscore + minscore) / 2
+    initial_k2 = np.array([[maxscore], [minscore]], np.int32)
+    initial_k3 = np.array([[maxscore], [midscore], [minscore]], np.int32)
+    X_array = score.values
+    X = np.reshape(X_array, (len(score), 1))
+
+    # K = 2, devices I own VS devices other people own
+    kmeans_k2 = KMeans(n_clusters=2, init = initial_k2, n_init = 1).fit(X)
+    labels_k2 = kmeans_k2.labels_
+    centers_k2 = [c[0] for c in kmeans_k2.cluster_centers_]
+    diff_k2 = [(X_array[xi] - centers_k2[labels_k2[xi]])**2 for xi in range(0, len(X_array))]
+    sum_dist_k2 = sum(diff_k2)
+
+    # K = 3, devices I own VS devices my partner/roommate owns (can also be other devices I own though) VS devices other people own
+    kmeans_k3 = KMeans(n_clusters=3, init=initial_k3,  n_init = 1).fit(X)
+    labels_k3 = kmeans_k3.labels_
+    centers_k3 = [c[0] for c in kmeans_k3.cluster_centers_]
+    diff_k3 = [(X_array[xi] - centers_k3[labels_k3[xi]])**2 for xi in range(0, len(X_array))]
+    sum_dist_k3 = sum(diff_k3)
+
+    if sum_dist_k2 < sum_dist_k3: # K = 2 is better
+        labels = labels_k2
+        centers = centers_k2
+        numclust = 2
+    else:
+        labels = labels_k3
+        centers = centers_k3
+        numclust = 3
+    
+    maxcluster = np.where(labels == np.argmax(centers), 1, 0)
+    bt_frequency["own_device"] = maxcluster
+    return bt_frequency[["bt_address", "own_device"]]
+    
+
+def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):
+
+    bt_data = pd.read_csv(sensor_data_files["sensor_data"])
+    base_features = set(["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"])
+    ownership_keys = [x.lower() for x in provider["FEATURES"].keys()]
+    if set(ownership_keys) != set(["own", "others", "all"]):
+        raise ValueError("[PHONE_BLUETOOTH][DORYAB][FEATURES] config key can only have three lists called ALL, OWN and OTHERS, instead you provided {}".format(ownership_keys))
+    
+    device_ownership = ownership_based_on_clustering(deviceFrequency(bt_data)).set_index("bt_address")
+    bt_data = bt_data.set_index("bt_address").join(device_ownership, how="left").reset_index()
+    bt_data["own_device"] = bt_data["own_device"].fillna(0)  # devices without a cluster label count as "others"
+    segment_bt_data = filter_data_by_segment(bt_data, time_segment)
+    features = pd.DataFrame(columns=['local_segment']).set_index("local_segment")
+    for ownership in provider["FEATURES"].keys():
+        features_to_compute = list(set(provider["FEATURES"][ownership]) & base_features)
+        if ownership == "OWN":
+            owner_segment_bt_data = segment_bt_data.query("own_device == 1")
+        elif ownership == "OTHERS":
+            owner_segment_bt_data = segment_bt_data.query("own_device == 0")
+        else: #ALL
+            owner_segment_bt_data = segment_bt_data
+        features = deviceFeatures(owner_segment_bt_data, ownership.lower(), features_to_compute, features)
+        
+    features = features.reset_index()
+    return features
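The ownership score that `ownership_based_on_clustering` feeds into k-means can be illustrated on the frequency table from the documentation example. This is a minimal sketch, assuming only pandas: `zscore` is a hypothetical helper introduced here, and the k-means step itself is deliberately left out, so the sketch only shows which device ranks highest on the score, not the final cluster labels.

```python
import pandas as pd

# Per-device frequency table from the documentation example.
bt_frequency = pd.DataFrame({
    "bt_address": [
        "25262DC7-780C-4AD5-AD3A-D9776AEF7FC1",
        "4DC7A22D-9F1F-4DEF-8576-086910AABCB5",
        "5C5A9C41-2F68-4CEB-96D0-77DE3729B729",
        "6C444841-FE64-4375-BC3F-FA410CDC0AC7",
        "5B1E6981-2E50-4D9A-99D8-67AED430C5A8",
        "48872A52-68DE-420D-98DA-73339A1C4685",
        "55C836F5-487E-405F-8E28-21DBD40FA4FF",
    ],
    "days_scanned": [1, 1, 1, 1, 1, 2, 2],
    "scans": [1, 1, 1, 1, 2, 3, 5],
}).set_index("bt_address")
bt_frequency["scans_per_day"] = bt_frequency["scans"] / bt_frequency["days_scanned"]


def zscore(col):
    # Hypothetical helper: standardise with the population std (ddof=0),
    # matching the z-score computation in ownership_based_on_clustering.
    return (col - col.mean()) / col.std(ddof=0)


# Score used as the single clustering dimension: devices seen often per day
# and on many days score high.
score = zscore(bt_frequency["scans_per_day"]) + zscore(bt_frequency["days_scanned"])
print(score.sort_values(ascending=False))
```

On this data the device detected 5 times over two days has the highest score, which is consistent with it being the only device labelled `own` in the documented table.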