Add bluetooth doryab features

pull/107/head
JulioV 2020-12-11 12:03:22 -05:00
parent 9ff75b10ba
commit 266dd28d02
4 changed files with 195 additions and 0 deletions

@@ -127,6 +127,14 @@ PHONE_BLUETOOTH:
      FEATURES: ["countscans", "uniquedevices", "countscansmostuniquedevice"]
      SRC_FOLDER: "rapids" # inside src/features/phone_bluetooth
      SRC_LANGUAGE: "r"
    DORYAB:
      COMPUTE: False
      FEATURES:
        ALL: ["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"]
        OWN: ["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"]
        OTHERS: ["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"]
      SRC_FOLDER: "doryab" # inside src/features/phone_bluetooth
      SRC_LANGUAGE: "python"

# See https://www.rapids.science/latest/features/phone-calls/
PHONE_CALLS:

@@ -28,6 +28,13 @@ If you computed applications foreground features using the app category (genre)
!!! cite "Stachl et al. citation"
    Clemens Stachl, Quay Au, Ramona Schoedel, Samuel D. Gosling, Gabriella M. Harari, Daniel Buschek, Sarah Theres Völkel, Tobias Schuwerk, Michelle Oldemeier, Theresa Ullmann, Heinrich Hussmann, Bernd Bischl, Markus Bühner. Proceedings of the National Academy of Sciences Jul 2020, 117 (30) 17680-17687; DOI: 10.1073/pnas.1920484117

## Doryab (bluetooth)

If you computed bluetooth features using the provider `[PHONE_BLUETOOTH][DORYAB]` cite [this paper](https://arxiv.org/abs/1812.10394) in addition to RAPIDS.

!!! cite "Doryab et al. citation"
    Doryab, A., Chikarsel, P., Liu, X., & Dey, A. K. (2019). Extraction of Behavioral Features from Smartphone and Wearable Data. ArXiv:1812.10394 [Cs, Stat]. http://arxiv.org/abs/1812.10394

## Barnett (locations)

If you computed locations features using the provider `[PHONE_LOCATIONS][BARNETT]` cite [this paper](https://doi.org/10.1093/biostatistics/kxy059) and [this paper](https://doi.org/10.1145/2750858.2805845) in addition to RAPIDS.

@@ -39,3 +39,77 @@ Features description for `[PHONE_BLUETOOTH][PROVIDERS][RAPIDS]`:
!!! note "Assumptions/Observations"
    NA

## DORYAB provider

!!! info "Available time segments and platforms"
    - Available for all time segments
    - Available for Android only

!!! info "File Sequence"
    ```bash
    - data/raw/{pid}/phone_bluetooth_raw.csv
    - data/raw/{pid}/phone_bluetooth_with_datetime.csv
    - data/interim/{pid}/phone_bluetooth_features/phone_bluetooth_{language}_{provider_key}.csv
    - data/processed/features/{pid}/phone_bluetooth.csv
    ```

Parameters description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:

|Key | Description |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------|
|`[COMPUTE]`| Set to `True` to extract `PHONE_BLUETOOTH` features from the `DORYAB` provider|
|`[FEATURES]` | Features to be computed, see table below. These features are computed for three device categories, each configured under its own key: `ALL` (all devices), `OWN` (the participant's own devices) and `OTHERS` (other people's devices).|
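
As a rough illustration of how the three `FEATURES` lists map to output columns, the sketch below appends the lowercased category key to each feature name, which is the naming visible in the provider's output (for example `countscansown`). The helper `output_columns` and the sample configuration are hypothetical and not part of RAPIDS; the sketch also keeps the configured order, which the provider itself may not guarantee.

```python
# Hypothetical provider configuration with a subset of features per category
provider_features = {
    "ALL": ["countscans", "uniquedevices"],
    "OWN": ["countscans"],
    "OTHERS": ["meanscans", "stdscans"],
}


def output_columns(features_by_ownership):
    """Return the feature column names, e.g. 'countscans' + 'own'."""
    keys = {key.lower() for key in features_by_ownership}
    if keys != {"all", "own", "others"}:
        # The provider requires exactly the ALL, OWN and OTHERS lists
        raise ValueError("expected the keys ALL, OWN and OTHERS, got {}".format(sorted(keys)))
    return [
        feature + ownership.lower()
        for ownership, features in features_by_ownership.items()
        for feature in features
    ]


columns = output_columns(provider_features)
print(columns)
# ['countscansall', 'uniquedevicesall', 'countscansown', 'meanscansothers', 'stdscansothers']
```

Leaving a feature out of one list simply omits that column for that device category.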

Features description for `[PHONE_BLUETOOTH][PROVIDERS][DORYAB]`:

|Feature |Units |Description|
|-------------------------- |---------- |---------------------------|
| countscans | scans | Number of scans (rows) from the devices sensed during a time segment instance. The more scans a bluetooth device has, the longer it remained within range of the participant's phone |
| uniquedevices | devices | Number of unique bluetooth devices sensed during a time segment instance as identified by their hardware addresses (`bt_address`) |
| countscansmostuniquedevice | scans | Number of scans of the most sensed device within each time segment instance |
| countscansleastuniquedevice | scans | Number of scans of the least sensed device within each time segment instance |
| meanscans | scans | Mean of the scans of every sensed device within each time segment instance |
| stdscans | scans | Standard deviation of the scans of every sensed device within each time segment instance |
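
To make the six definitions concrete, here is a minimal sketch that computes them for a single time segment instance using only the Python standard library rather than the provider's pandas pipeline. The function name `segment_features` and the one-letter device addresses are hypothetical, and the sketch assumes the segment contains at least one scan.

```python
from collections import Counter
from statistics import mean, stdev


def segment_features(addresses):
    """Compute the six features from the bt_address of every scan in one segment."""
    scans_per_device = Counter(addresses)  # number of scans (rows) per device
    counts = list(scans_per_device.values())
    return {
        "countscans": sum(counts),
        "uniquedevices": len(counts),
        "countscansmostuniquedevice": max(counts),
        "countscansleastuniquedevice": min(counts),
        "meanscans": mean(counts),
        # The sample standard deviation is undefined for a single device
        "stdscans": stdev(counts) if len(counts) > 1 else float("nan"),
    }


# Six scans from five devices; device "C" was scanned twice
example = segment_features(["A", "B", "C", "D", "C", "E"])
print(example["countscans"], example["uniquedevices"])  # 6 5
print(example["meanscans"])  # 1.2
```

Note that `meanscans` and `stdscans` summarise the per-device scan counts, not the raw rows, so a segment with a single device has an undefined standard deviation.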

!!! note "Assumptions/Observations"
    - This provider is adapted from the work by [Doryab et al](../../citation#doryab-bluetooth). Devices are classified as belonging to the participant (`own`) or to other people (`others`) using k-means clustering based on how many times per day each device was scanned and on the number of days it was detected across each participant's dataset.
    - If ownership cannot be computed because all devices were detected on only one day, they are all considered as `others`. Thus, `all` and `others` features will be equal.
    - These features are computed for devices detected within each time segment instance. For example, let's say that we logged the following devices on three different time segment instances (days) for `p01`:
    ```csv
    local_date      bt_address
    2016-11-29      55C836F5-487E-405F-8E28-21DBD40FA4FF
    2016-11-29      55C836F5-487E-405F-8E28-21DBD40FA4FF
    2016-11-29      55C836F5-487E-405F-8E28-21DBD40FA4FF
    2016-11-29      48872A52-68DE-420D-98DA-73339A1C4685
    2016-11-29      48872A52-68DE-420D-98DA-73339A1C4685
    2016-11-30      55C836F5-487E-405F-8E28-21DBD40FA4FF
    2016-11-30      55C836F5-487E-405F-8E28-21DBD40FA4FF
    2016-11-30      48872A52-68DE-420D-98DA-73339A1C4685
    2017-05-07      5C5A9C41-2F68-4CEB-96D0-77DE3729B729
    2017-05-07      25262DC7-780C-4AD5-AD3A-D9776AEF7FC1
    2017-05-07      5B1E6981-2E50-4D9A-99D8-67AED430C5A8
    2017-05-07      6C444841-FE64-4375-BC3F-FA410CDC0AC7
    2017-05-07      5B1E6981-2E50-4D9A-99D8-67AED430C5A8
    2017-05-07      4DC7A22D-9F1F-4DEF-8576-086910AABCB5
    ```
    - For each device we compute `days_scanned` (the number of days on which the device was detected), `scans` (the number of times it was detected), `scans_per_day` (equal to `scans/days_scanned`), and whether the device is labelled as `own` or `others`. Note that the last device is labelled as an `own` device because it was detected 5 times over two time segment instances:
    ```csv
    bt_address                              days_scanned    scans   scans_per_day   own_device
    25262DC7-780C-4AD5-AD3A-D9776AEF7FC1    1               1       1.0             0
    4DC7A22D-9F1F-4DEF-8576-086910AABCB5    1               1       1.0             0
    5C5A9C41-2F68-4CEB-96D0-77DE3729B729    1               1       1.0             0
    6C444841-FE64-4375-BC3F-FA410CDC0AC7    1               1       1.0             0
    5B1E6981-2E50-4D9A-99D8-67AED430C5A8    1               2       2.0             0
    48872A52-68DE-420D-98DA-73339A1C4685    2               3       1.5             0
    55C836F5-487E-405F-8E28-21DBD40FA4FF    2               5       2.5             1
    ```
    - These are the metrics for each time segment instance (day) for `own` and `others` devices (we ignore `all` for brevity). The only `own` device (`55C836F5-487E-405F-8E28-21DBD40FA4FF`) was detected on the first two days, 3 and 2 times respectively, while `others` devices were detected on all three days. On the last day (`2017-05-07`) there were 6 scans from 5 unique devices; the most frequent device that day was `5B1E6981-2E50-4D9A-99D8-67AED430C5A8` with 2 scans, and the mean number of scans per device was 1.2 (`[1 + 1 + 1 + 1 + 2] / 5`). `stdscans` is `NaN` whenever a segment contains a single device, and every `own` feature is `NaN` on the last day because no `own` device was detected:
    ```csv
    local_segment countscansown uniquedevicesown countscansmostuniquedeviceown countscansleastuniquedeviceown meanscansown stdscansown countscansothers uniquedevicesothers countscansmostuniquedeviceothers countscansleastuniquedeviceothers meanscansothers stdscansothers
    2016-11-29 3.0 1.0 3.0 3.0 3.0 NaN 2 1 2 2 2.0 NaN
    2016-11-30 2.0 1.0 2.0 2.0 2.0 NaN 1 1 1 1 1.0 NaN
    2017-05-07 NaN NaN NaN NaN NaN NaN 6 5 2 1 1.2 0.447214
    ```
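
The ownership steps described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the provider's implementation: addresses are shortened to their first eight characters, `kmeans_1d` is a hand-written 1-D Lloyd iteration standing in for scikit-learn's `KMeans`, the initial centers are kept as floats, and all function names are hypothetical. Following the provider's source, the score is the sum of the z-scored `scans_per_day` and `days_scanned`, two and three clusters are both fitted, the fit with the lower within-cluster squared distance is kept, and the cluster with the highest center is labelled `own`.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (local_date, bt_address) rows from the example; addresses are abbreviated
rows = [
    ("2016-11-29", "55C836F5"), ("2016-11-29", "55C836F5"), ("2016-11-29", "55C836F5"),
    ("2016-11-29", "48872A52"), ("2016-11-29", "48872A52"),
    ("2016-11-30", "55C836F5"), ("2016-11-30", "55C836F5"), ("2016-11-30", "48872A52"),
    ("2017-05-07", "5C5A9C41"), ("2017-05-07", "25262DC7"), ("2017-05-07", "5B1E6981"),
    ("2017-05-07", "6C444841"), ("2017-05-07", "5B1E6981"), ("2017-05-07", "4DC7A22D"),
]


def device_frequency(rows):
    """Per-device days_scanned, scans and scans_per_day."""
    days, scans = defaultdict(set), defaultdict(int)
    for local_date, address in rows:
        days[address].add(local_date)
        scans[address] += 1
    return {
        address: {
            "days_scanned": len(days[address]),
            "scans": scans[address],
            "scans_per_day": scans[address] / len(days[address]),
        }
        for address in scans
    }


def zscores(values):
    """Population z-scores (ddof=0)."""
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]


def kmeans_1d(points, centers, max_iter=100):
    """Lloyd iterations on 1-D points; returns (labels, centers, squared distance)."""
    def assign(current):
        return [min(range(len(current)), key=lambda k: abs(p - current[k])) for p in points]

    for _ in range(max_iter):
        labels = assign(centers)
        updated = []
        for k, center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            updated.append(mean(members) if members else center)
        if updated == centers:  # assignments no longer move the centers
            break
        centers = updated
    labels = assign(centers)
    sse = sum((p - centers[label]) ** 2 for p, label in zip(points, labels))
    return labels, centers, sse


def own_devices(frequency):
    """Addresses in the highest-scoring cluster; empty when ownership cannot be computed."""
    addresses = list(frequency)
    columns = {name: [frequency[a][name] for a in addresses]
               for name in ("scans_per_day", "days_scanned", "scans")}
    # A constant column has no z-score, so every device counts as "others"
    if not addresses or any(pstdev(values) == 0 for values in columns.values()):
        return set()
    score = [s + d for s, d in zip(zscores(columns["scans_per_day"]),
                                   zscores(columns["days_scanned"]))]
    high, low = max(score), min(score)
    fit_k2 = kmeans_1d(score, [high, low])
    fit_k3 = kmeans_1d(score, [high, (high + low) / 2, low])
    labels, centers, _ = fit_k2 if fit_k2[2] < fit_k3[2] else fit_k3
    top = centers.index(max(centers))
    return {a for a, label in zip(addresses, labels) if label == top}


frequency = device_frequency(rows)
own = own_devices(frequency)
print(frequency["55C836F5"])  # {'days_scanned': 2, 'scans': 5, 'scans_per_day': 2.5}
print(own)  # {'55C836F5'}
```

On this data the three-cluster fit has the lower squared distance and only `55C836F5` lands in the top cluster, which matches the `own_device` column in the example.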

@@ -0,0 +1,106 @@
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

def deviceFeatures(devices, ownership, features_to_compute, features):
    # Count the scans of each device within each time segment instance
    if devices.shape[0] == 0:
        device_value_counts = pd.DataFrame(columns=["local_segment", "bt_address", "scans"], dtype=int)
    else:
        device_value_counts = devices.groupby(["local_segment"])["bt_address"].value_counts().to_frame("scans").reset_index()

    # Each feature becomes a column named <feature><ownership>, e.g. countscansown
    if "countscans" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].sum().to_frame("countscans" + ownership), how="outer")
    if "uniquedevices" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["bt_address"].nunique().to_frame("uniquedevices" + ownership), how="outer")
    if "countscansmostuniquedevice" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].max().to_frame("countscansmostuniquedevice" + ownership), how="outer")
    if "countscansleastuniquedevice" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].min().to_frame("countscansleastuniquedevice" + ownership), how="outer")
    if "meanscans" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].mean().to_frame("meanscans" + ownership), how="outer")
    if "stdscans" in features_to_compute:
        features = features.join(device_value_counts.groupby("local_segment")["scans"].std().to_frame("stdscans" + ownership), how="outer")
    return features

def deviceFrequency(bt_data):
    # Per device: number of distinct days it was seen, total scans, and scans per day
    bt_data = bt_data[["local_date", "bt_address"]].dropna(subset=["bt_address"])
    bt_data = bt_data.groupby("bt_address").agg({"local_date": pd.Series.nunique, "bt_address" : 'count'})
    bt_data = bt_data.rename(columns={"local_date" : "days_scanned", "bt_address" : "scans"})
    bt_data["scans_per_day"] = bt_data["scans"] / bt_data["days_scanned"]
    return bt_data

def ownership_based_on_clustering(bt_frequency):
    bt_frequency = bt_frequency.reset_index()
    # Z-score each frequency column; a constant column yields NaN and is dropped below
    for col in ["scans_per_day", "days_scanned", "scans"]:
        col_zscore = col + '_z'
        bt_frequency[col_zscore] = (bt_frequency[col] - bt_frequency[col].mean()) / bt_frequency[col].std(ddof=0)

    bt_frequency = bt_frequency.dropna(how='any')
    if len(bt_frequency) == 0:
        bt_frequency["own_device"] = None
        return bt_frequency[["bt_address", "own_device"]]

    avgfreq_z = bt_frequency["scans_per_day_z"]
    numdays_z = bt_frequency["days_scanned_z"]
    score = avgfreq_z + numdays_z
    maxscore = np.max(score)
    minscore = np.min(score)
    midscore = (maxscore + minscore) / 2
    # Note: the np.int32 dtype truncates the initial cluster centers to integers
    initial_k2 = np.array([[maxscore], [minscore]], np.int32)
    initial_k3 = np.array([[maxscore], [midscore], [minscore]], np.int32)
    X_array = score.values
    X = np.reshape(X_array, (len(score), 1))

    # K = 2, devices I own VS devices other people own
    kmeans_k2 = KMeans(n_clusters=2, init = initial_k2, n_init = 1).fit(X)
    labels_k2 = kmeans_k2.labels_
    centers_k2 = [c[0] for c in kmeans_k2.cluster_centers_]
    diff_k2 = [(X_array[xi] - centers_k2[labels_k2[xi]])**2 for xi in range(0, len(X_array))]
    sum_dist_k2 = sum(diff_k2)

    # K = 3, devices I own VS devices my partner/roommate owns (can also be other devices I own though) VS devices other people own
    kmeans_k3 = KMeans(n_clusters=3, init=initial_k3, n_init = 1).fit(X)
    labels_k3 = kmeans_k3.labels_
    centers_k3 = [c[0] for c in kmeans_k3.cluster_centers_]
    diff_k3 = [(X_array[xi] - centers_k3[labels_k3[xi]])**2 for xi in range(0, len(X_array))]
    sum_dist_k3 = sum(diff_k3)

    if sum_dist_k2 < sum_dist_k3: # K = 2 is better
        labels = labels_k2
        centers = centers_k2
        numclust = 2
    else:
        labels = labels_k3
        centers = centers_k3
        numclust = 3

    # Devices in the cluster with the highest center are labelled as own (1)
    maxcluster = np.where(labels == np.argmax(centers), 1, 0)
    bt_frequency["own_device"] = maxcluster
    return bt_frequency[["bt_address", "own_device"]]

def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_segment, *args, **kwargs):

    bt_data = pd.read_csv(sensor_data_files["sensor_data"])
    base_features = set(["countscans", "uniquedevices", "countscansmostuniquedevice", "countscansleastuniquedevice", "meanscans", "stdscans"])
    ownership_keys = [x.lower() for x in provider["FEATURES"].keys()]
    if set(ownership_keys) != set(["own", "others", "all"]):
        raise ValueError("[PHONE_BLUETOOTH][DORYAB][FEATURES] config key can only have three lists called ALL, OWN and OTHERS, instead you provided {}".format(ownership_keys))

    # Label every device as own (1) or other (0); devices without a label default to other
    device_ownership = ownership_based_on_clustering(deviceFrequency(bt_data)).set_index("bt_address")
    bt_data = bt_data.set_index("bt_address").join(device_ownership, how="left").reset_index()
    bt_data["own_device"] = bt_data["own_device"].fillna(0)
    segment_bt_data = filter_data_by_segment(bt_data, time_segment)
    features = pd.DataFrame(columns=['local_segment']).set_index("local_segment")
    for ownership in provider["FEATURES"].keys():

        features_to_compute = list(set(provider["FEATURES"][ownership]) & base_features)
        # Compare case-insensitively, matching the key validation above
        if ownership.upper() == "OWN":
            owner_segment_bt_data = segment_bt_data.query("own_device == 1")
        elif ownership.upper() == "OTHERS":
            owner_segment_bt_data = segment_bt_data.query("own_device == 0")
        else: #ALL
            owner_segment_bt_data = segment_bt_data
        features = deviceFeatures(owner_segment_bt_data, ownership.lower(), features_to_compute, features)

    features = features.reset_index()

    return features