In [1]:
import mysql.connector
import pandas as pd
import numpy as np
import warnings
import plotly.express as px
warnings.filterwarnings('ignore')
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
In [2]:
# establish connection to MySQL on the Pi Zero that hosts the enviro sensors
connection = mysql.connector.connect(host = 'yourIPAddr',
                                    database = 'enviro',
                                    user='root',
                                    password='ROOT')
In [3]:
# read data as pandas dataframe
df = pd.read_sql_query("SELECT * FROM enviro_log WHERE timestamp > '2019-07-30' and timestamp < '2019-08-02';", connection)

Data Quality Inspection

In [4]:
df.head()
Out[4]:
timestamp light rgb motion temp pressure cputemp
0 2019-07-30 00:00:07 1 0,0,0 0.00714111328125,-0.00225830078125,1.029479980... 29.429103 83865.300031 35.242
1 2019-07-30 00:00:14 1 0,0,0 0.00762939453125,-0.0048828125,1.046875 29.426641 83865.775515 35.780
2 2019-07-30 00:00:20 1 0,0,0 0.00823974609375,-0.0052490234375,1.0474853515625 29.421410 83864.907747 35.780
3 2019-07-30 00:00:27 1 0,0,0 0.0064697265625,-0.00323486328125,1.0487670898... 29.418641 83865.664800 36.856
4 2019-07-30 00:00:33 1 0,0,0 0.0068359375,-0.00274658203125,1.04754638671875 29.418333 83865.784819 37.394
In [5]:
# take a quick look at the data structure:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38773 entries, 0 to 38772
Data columns (total 7 columns):
timestamp    38773 non-null datetime64[ns]
light        38773 non-null int64
rgb          38773 non-null object
motion       38773 non-null object
temp         38773 non-null float64
pressure     38773 non-null float64
cputemp      38773 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(2)
memory usage: 2.1+ MB

Observation: Variables such as rgb and motion were recorded as multiple comma-separated components (R,G,B for color; x,y,z for motion). We may want to transform these into numerical variables for calculation.

In [6]:
# Check for null values
df.isnull().values.any()
Out[6]:
False
In [7]:
df.describe()
Out[7]:
light temp pressure cputemp
count 38773.000000 38773.000000 38773.000000 38773.000000
mean 80.276378 29.476964 83975.741610 32.543258
std 117.777596 0.881345 144.654731 0.915826
min 0.000000 26.351996 83676.052382 29.324000
25% 0.000000 28.941990 83861.410575 32.014000
50% 18.000000 29.798979 83965.913939 32.552000
75% 121.000000 30.144547 84103.762720 33.090000
max 737.000000 30.782451 84231.131556 39.008000

A quick inspection of the sensor data to ensure that the values make sense:

  1. The light values range from 0-737 lux (or lumens per square meter). This is consistent with indoor readings from a photometry sensor.

*Table: examples of the illuminance provided under various conditions. Retrieved from: https://en.wikipedia.org/wiki/Lux*
  2. The temperature values range around 25-30 degrees C. The readings are slightly higher than the actual temperature in my room. This is because the temperature sensor is positioned close to the Pi Zero's CPU, which can generate a non-trivial amount of heat. We will inspect this problem later.

  3. The CPU temperature is assumed to be calibrated correctly by the Raspberry Pi Foundation, and its readings are taken as reasonably accurate.

Temperature Sensor Analysis

In [8]:
# checking for correlation between sensor temperature and CPU temperature
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.axes_style("darkgrid")

fig, ax = plt.subplots()
sns.lineplot(x = 'timestamp', y = 'temp',
             data = df, label = "Ambient Temp")
sns.lineplot(x = 'timestamp', y = 'cputemp',
             data = df, label = "CPU Temp")
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x19f78de4c88>

Since both ambient temperature and CPU temperature are affected by the surrounding airflow, we expect the two measurements to correlate with one another. One of my concerns was that during periods of high CPU usage, the CPU temperature could get pushed up significantly, driving the ambient reading up and producing inaccurate values. However, upon closer inspection, there isn't a significant increase in ambient temperature when the CPU generates heat (the big vertical spikes). Therefore, no adjustment is needed for CPU-generated heat.

CPU Temperature Anomaly Detection

Anomaly detection is a family of methods for finding outlier data points relative to some standard or usual signal. In this case, we are looking for unexpected spikes in CPU temperature using unsupervised anomaly detection. After doing some research, I found an algorithm that works very well for this type of data. The algorithm combines a moving average with the number of standard deviations from that moving mean. The idea is to remove the trend and seasonality of the data with a low-pass filter, providing smoother parameters for peak detection.

Reference:

https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data

Perkins, P., Heber, S. (2018). Identification of Ribosome Pause Sites Using a Z-Score Based Peak Detection Algorithm. IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), ISBN: 978-1-5386-8520-

The algorithm takes 3 inputs:

  1. lag = the lag of the moving window (a lag of 5 will use the last 5 observations to smooth the data)
  2. threshold = the z-score at which the algorithm signals
  3. influence = the influence (between 0 and 1) of new signals on the mean and standard deviation (an influence of 0.5 gives signals half the influence that normal data points have)

How it works:

The lag parameter determines the smoothness of the data and affects how adaptive the algorithm is to the volatility of the data. The larger the lag, the smoother the rolling mean and the slower the algorithm adapts to new trends; too little lag, however, causes the algorithm to overestimate the number of signals. The standard deviation is then derived from the same moving window (n = lag data points). We multiply this standard deviation by a threshold z-score (simply a number that defines how many standard deviations from the moving mean a point must be). The threshold therefore directly controls how sensitive the algorithm is and thereby how often it signals: the smaller the threshold, the more signals we will get.

The influence parameter determines the influence of signals on the algorithm's detection threshold. Values range from 0 to 1, from no influence to full influence. An influence of 0 means future thresholds are not affected by past signals: no matter how many signals occurred previously, the time series always returns to the same average over the long term. This is the case for our data, where brief CPU temperature anomalies don't permanently shift the overall trend of CPU temperature.
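Before the full implementation below, here is a toy numeric illustration of the core check, using made-up values rather than the actual sensor data:

import numpy as np

# hypothetical moving window of the last 5 smoothed readings (lag = 5)
window = np.array([32.0, 32.1, 31.9, 32.0, 32.2])
moving_mean = window.mean()   # 32.04
moving_std = window.std()     # ~0.10
threshold = 3                 # signal if a new point is more than 3 std from the mean

new_point = 33.0
is_signal = abs(new_point - moving_mean) > threshold * moving_std
print(is_signal)              # True: |33.0 - 32.04| = 0.96 > 3 * 0.10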

In [9]:
#Implementation of algorithm from https://gist.github.com/ximeg/587011a65d05f067a29ce9c22894d1d2
import numpy as np
import pylab

def thresholding_algo(y, lag, threshold, influence):
    # signal initial state 0 - noise, 1 - signal (signals are peaks)
    signals = np.zeros(len(y))
    filteredY = np.array(y)
    # place holder for filters
    avgFilter = [0]*len(y)
    stdFilter = [0]*len(y)
    # calculate the base case (number of null filter value = lag - 1)
    avgFilter[lag - 1] = np.mean(y[0:lag])
    stdFilter[lag - 1] = np.std(y[0:lag])
    for i in range(lag, len(y)):
        # signal if the point deviates from the moving mean by more than `threshold` standard deviations
        if abs(y[i] - avgFilter[i-1]) > threshold * stdFilter[i-1]:
            if y[i] > avgFilter[i-1]:
                signals[i] = 1
            else:
                signals[i] = -1

            filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i-1]
            avgFilter[i] = np.mean(filteredY[(i-lag):i])
            stdFilter[i] = np.std(filteredY[(i-lag):i])
        else:
            signals[i] = 0
            filteredY[i] = y[i]
            avgFilter[i] = np.mean(filteredY[(i-lag):i])
            stdFilter[i] = np.std(filteredY[(i-lag):i])

    result_data = pd.DataFrame({'cputemp':np.array(y),
                         'signal':np.asarray(signals),
                         'mean':np.asarray(avgFilter),
                         'std':np.asarray(stdFilter)})

    return result_data
In [10]:
# Try CPU temp Data for 1 day
y = np.array(df[(df['timestamp'] > '2019-07-30 00:00:00') & (df['timestamp'] < '2019-07-31 00:00:00')]['cputemp'])

lag = 120
threshold= 6
influence= 0

# Run algo with settings from above
result = thresholding_algo(y, lag = lag, threshold= threshold , influence=influence)
result['index'] = result.index
result = result[result['mean'] != 0] # remove 0 for prettier plot


# Plot result
plt.title('Anomalies Detection for CPU Temperature', fontdict = {'fontsize' : 14})

baseline = sns.lineplot(x = result['index'], y = result['cputemp'], label = "CPU Temp")
moving_average_line = sns.lineplot(x = result['index'], y = result["mean"], color="cyan", lw=2, label = 'Moving average line')
upper_zstd = sns.lineplot(x = result['index'],y = result["mean"] + threshold * result["std"], color="yellow", lw=2, label = 'Upper/Lower zstd')
lower_zstd = sns.lineplot(x = result['index'],y = result["mean"] - threshold * result["std"], color="yellow", lw=2)

anomalies_values = result[(result['signal'] == 1) | (result['signal'] == -1)]['cputemp']
anomalies_index = anomalies_values.index.values

sns.scatterplot(anomalies_index, anomalies_values, marker='X', s=100, color = "red", label = "Anomalies")
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x19f79b4ecf8>

Interpretation: After playing with the parameters, I found a sweet spot that lets the algorithm identify the CPU temperature anomalies with reasonable accuracy. The anomalies are associated with the brief periods of CPU temperature increase caused by the crontab jobs that I created.

Conclusion: The process of learning and implementing an anomaly detection algorithm has been fun, and I have learned a tremendous amount about using pandas and seaborn. If time allows, I will implement this method in my web application for real-time anomaly detection.

Daily Ambient Temperature Analysis

In [11]:
# Let's take a quick look at ambient temperature readings in one day:
one_day_data = df[(df['timestamp'] > '2019-07-31 00:00:00') & (df['timestamp'] < '2019-08-01 00:00:00')]
sns.lineplot(x = 'timestamp', y = 'temp',
             data = one_day_data).set_title('Ambient Temperature Readings in One Day')
Out[11]:
Text(0.5, 1.0, 'Ambient Temperature Readings in One Day')

This plot displays the temperature variation during one summer day. There is an interesting pattern going on here.

The temperature is higher during the night than during the day, meaning the day is colder than the night. This is strange, since the AC is supposed to kick in during the day to cool the house off. However, it has the reverse effect on my room, which is located in the basement, next to the AC unit (I share the house with 5 other people). The AC tries to stabilize the temperature in the living room, where the AC's thermostat is located. While the living room is blazing hot during the day, my room is overly cool.

Since the AC automatically turns on and off to maintain a certain temperature, we see the rise and fall of temperature while the AC is operating. This is the prominent zigzag pattern running from 9am to 10pm. It leads me to an interesting question: can I count the AC cycles by detecting local maxima/minima in the trend?

Detecting AC Cycles by Counting Peaks

Zoom in AC-on period

In [12]:
AC_on_data = df[(df['timestamp'] > '2019-07-31 09:00:00') & (df['timestamp'] < '2019-07-31 22:00:00')]

# using plotly to add interactive capability to the graph
import plotly.graph_objects as go

fig = go.Figure(data = go.Scatter(
    x= AC_on_data['timestamp'],
    y = AC_on_data['temp'],
    name = "Original Ambient Temperature"))

fig.update_layout(
    height=500,
    title_text='Ambient Temperature Readings 9AM - 10PM'
)

fig.show()

# if the notebook raises an IOPub data rate error,
# reopen jupyter notebook with:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000

Due to the amount of noise in the data, we would likely pick up a lot of false local maxima. To remedy this, we turn back to our trusty moving average to filter out much of the noise while preserving the peaks - the points where the AC kicks in.

In [13]:
from scipy.signal import find_peaks

smoothed_data = AC_on_data['temp'].rolling(window = 50).mean()

peaks_indices = find_peaks(smoothed_data, prominence = 0.1)[0]

fig = go.Figure(data = go.Scatter(
    name = 'Original Data',
    x= AC_on_data['timestamp'],
    y = AC_on_data['temp']))

fig.add_trace(go.Scatter(
    name = "Moving Average (lag = 50)",
    x= AC_on_data['timestamp'],
    y = smoothed_data,
    line=dict(color='firebrick',dash='dot')))

fig.add_trace(go.Scatter(
    x= AC_on_data['timestamp'].iloc[peaks_indices],
    y= smoothed_data.iloc[peaks_indices],
    mode='markers',
    marker=dict(
        size=10,
        color='yellow',
        symbol='cross'
    ),
    name='Detected Peaks'
))

fig.update_layout(
    height=500,
    title_text='Detect AC Activations from 9AM - 10PM'
)

fig.show()
In [14]:
print("From 9am to 10p, AC turns on: " + str(len(peaks_indices)) +
      " times\nAveraging " + str(13/len(peaks_indices)*60) + " minutes per activation")
From 9am to 10p, AC turns on: 32 times
Averaging 24.375 minutes per activation

Discussion: As you can see from the two lines overlaying one another, the moving average line lags slightly behind the actual temperature line. Therefore, the predicted time of each AC activation is slightly behind the actual activation time. However, since we are mostly interested in the count of AC activation cycles, the lag is not a big concern.
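If the lag ever became a problem, a centered rolling window would remove most of it, at the cost of no longer being computable in real time - a minimal sketch, assuming the same AC_on_data frame and peak settings as above:

from scipy.signal import find_peaks

# center the 50-sample window so the smoothed value lines up with the middle
# of the window instead of trailing behind it
centered = AC_on_data['temp'].rolling(window = 50, center = True).mean()
centered_peaks = find_peaks(centered, prominence = 0.1)[0]
print("Peaks detected with the centered window: " + str(len(centered_peaks)))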

Light Sensor Analysis

In [15]:
sns.lineplot(x = 'timestamp', y = 'light', data = df)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x19f78de4630>

We can see the alternating periods of light-on and light-off, with the lowest lux values at night and the highest values during the day. We can also see periods of peaks and plateaus where the sun slowly sets until the indoor lighting takes over. This poses a great question: how can I tell what kind of lighting condition is present at any given time of day?

After examining the distribution of light intensity values, we can obtain an average threshold for each lighting condition: no light, LED on, low light with no LED, and high light with no LED. Any tangible threshold is likely to appear as a mode (a region of high density) in the distribution:

In [16]:
sns.distplot(df['light'], kde = False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x19f7d194550>

We can see that the LED-only light hovers around 100~130 lux. Therefore, we will use this threshold to classify the times when I turn on the LED (it does not cover the instances of LED plus natural light; we will attempt to address this issue later). For now, we have the following thresholds:

  1. no_light (0 lux): no light at all, sleepy time.

  2. low_light (1-20 lux): low light, LED turned off, with light leaking from an electronic source (TV screen).

  3. low_natural (20-100 lux): natural light, sunrise or sunset, or conditions with little natural light: cloudy, heavy rain, solar eclipse, apocalypse, neighbour set up new solar cells, etc.

  4. led_light (100-130 lux): LED on with no natural light.

  5. high_light (greater than 130 lux): intense natural light, sometimes with the LED on.

The prominent count of low values (0-20 lux) suggests that my room is frequently dimly lit, which comes as no surprise since a large chunk of my day is spent programming or binging TV shows in the dark.

In [17]:
# let's generate the new classification:
df.loc[df['light'] == 0, 'lighting_condition'] = 'no_light'
df.loc[(df['light'] > 0) & (df['light'] <= 100), 'lighting_condition'] = 'low_natural_light'
df.loc[(df['light'] > 100) & (df['light'] <= 130), 'lighting_condition'] = 'led_light'
df.loc[df['light'] > 130, 'lighting_condition'] = 'high_natural_light'
In [18]:
# count the readings in each lighting condition to estimate their proportions
df['lighting_condition'].value_counts()
Out[18]:
no_light              12636
low_natural_light     10426
led_light              8278
high_natural_light     7433
Name: lighting_condition, dtype: int64
In [19]:
# Convert absolute count to proportion:
df['lighting_condition'].value_counts(normalize = True)
Out[19]:
no_light              0.325897
low_natural_light     0.268898
led_light             0.213499
high_natural_light    0.191706
Name: lighting_condition, dtype: float64

Let's see the new light classification on our line graph:

In [20]:
import plotly.graph_objects as go

one_day = df[(df['timestamp'] > '2019-07-31 00:00:00') & (df['timestamp'] < '2019-08-01 00:00:00')]

one_day_led = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "led_light")]

one_day_no_light = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "no_light")]

one_day_low_natural_light = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "low_natural_light")]

one_day_high_natural_light = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "high_natural_light")]


fig = go.Figure(data = go.Scatter(x = one_day['timestamp'],
                                 y = one_day['light'],
                                 name = "One-day Light Data"))

fig.add_trace(go.Scatter(
    name = "LED-on",
    x= one_day_led['timestamp'],
    y = one_day_led['light'],
    mode = 'markers'))

fig.add_trace(go.Scatter(
    name = "no_light",
    x= one_day_no_light['timestamp'],
    y = one_day_no_light['light'],
    visible = "legendonly",
    mode = 'markers'))
              
fig.add_trace(go.Scatter(
    name = "low_natural_light",
    x= one_day_low_natural_light['timestamp'],
    y = one_day_low_natural_light['light'],
    visible = "legendonly",
    mode = 'markers'))
                           
fig.add_trace(go.Scatter(
    name = "high_natural_light",
    x= one_day_high_natural_light['timestamp'],
    y = one_day_high_natural_light['light'],
    visible = "legendonly",
    mode = 'markers'))
              
fig.update_layout(
    height=500,
    title_text='Lighting Condition Classification - 1 Day Data'
)

fig.show()
*You can toggle the classification that you want to see.*

The new interpretation of the light sensor makes a lot of sense. We can see the periods of pitch dark and the gentle transitions to daylight throughout the 24 hours. The LED-on periods match my real usage very well.

You might be puzzled by the sudden jump in light intensity from low_natural_light to high_natural_light at the beginning of the day. This is because I have a smart LED bulb that automatically turns on at 8am every day.

Estimate the LED-on duration (for 1 day)

The problem: We know which data points are classified as LED-on. If the first and last timestamps of a run of such points define an LED-on session, we can estimate the duration of that session, and summing the durations of all sessions gives the total LED usage in my room. From the plot, I can eyeball each session lasting roughly 2 hours, for a total of roughly 6 hours of LED-on time.

Challenge: Since the points are segmented, with large LED-off gaps between them, it is difficult to calculate the LED-on duration because the start and end times of each session are unknown. There are two approaches to this problem: the easy way and the hard way (hard and easy being relative terms).

The Easy Way

The easy way to think about this problem is to find the transition timestamps between LED-on sessions and the time gaps between them. Since the data is collected every 6 seconds, we can assume that the LED stays on for the full 6-second interval and continues to do so until the LED is turned off. We can calculate the difference between consecutive data points; if the difference is greater than 6 seconds, the LED was off during that gap (let's call this LED-off). Recording the times at which these gaps occur gives us the starting and ending time of each session.

In [21]:
# for calculating time difference, we should convert datetime to unix timestamp for easier calculation:
from datetime import datetime
one_day_led['unix_timestamp'] = (one_day_led['timestamp']- np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')

# calculate the timestamp diff
one_day_led['timestamp_diff'] = one_day_led['unix_timestamp'] - one_day_led['unix_timestamp'].shift(1)

one_day_led[['timestamp','light','timestamp_diff']].head()
Out[21]:
timestamp light timestamp_diff
19915 2019-07-31 12:51:59 130 NaN
19916 2019-07-31 12:52:05 130 6.0
19917 2019-07-31 12:52:12 130 7.0
19918 2019-07-31 12:52:18 130 6.0
19919 2019-07-31 12:52:25 129 7.0

The first row is NaN because there is no prior value to compare with.
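As an aside, pandas can compute the same gaps directly on the datetime column, without converting to Unix timestamps - a minimal equivalent sketch, assuming the same one_day_led frame:

# .diff() subtracts each timestamp from the previous one; .dt.total_seconds()
# converts the resulting Timedelta to seconds, matching timestamp_diff above
gap_seconds = one_day_led['timestamp'].diff().dt.total_seconds()
print(gap_seconds.head())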

Great! Now that we have the time differences, we should double-check their distribution to make sure that most data points fall within the 6-7 second interval:

In [22]:
one_day_led['timestamp_diff'].value_counts()
Out[22]:
7.0       1756
6.0       1082
8.0         65
9.0         25
13.0         8
20.0         3
16.0         3
21.0         2
34.0         1
19.0         1
76.0         1
286.0        1
692.0        1
17.0         1
14.0         1
1680.0       1
5962.0       1
3798.0       1
53.0         1
113.0        1
41.0         1
38.0         1
Name: timestamp_diff, dtype: int64

We can see that most values are approximately 6 or 7 seconds, with very few exceeding 10. A few seconds of delay was likely introduced by latency. The chance of a delay of 3 seconds or more is only about 0.008, i.e. 8 in 1,000 - not terrible for a computer system that costs $5. Let's use a threshold of 10 seconds to determine the session boundary timestamps.

Calculation:

Let $t_0$ and $t_1$ be the first and last timestamps across all LED-on sessions.

Let $x_0$ and $x_1$ be the starting and ending timestamps of each LED-off gap.

Then the total LED-on time equals:

$$ LED\:time = (t_1 - t_0) - \sum (x_1 - x_0) $$

$$ LED\:time = (t_1 - t_0) - \sum timestamp\:diff $$
In [23]:
led_off = one_day_led.query('timestamp_diff > 10')['timestamp_diff'].sum()

led_on = one_day_led.iloc[-1,:]['unix_timestamp'] - one_day_led.iloc[0,:]['unix_timestamp'] - led_off

print("Total LED-on time is {} seconds ({} hours)".format(led_on, led_on/3600))
Total LED-on time is 19529.0 seconds (5.424722222222222 hours)

Discussion: Since our data can be affected by latency, noise and fluctuations in light levels, it's likely that some of the LED-off time was actually noise (walking around the room can cast shadows on the sensor). This leads to an over-estimation of LED-off time (an under-estimation of LED-on time). However, despite some errors from the system, the result is reasonably close to the actual LED-on time.

The algorithm's strength, its simplicity, is also its weakness.

The Hard(er) Way

We can apply a clustering algorithm to retrieve the start and end points of each session. A good option when we don't know the number of clusters is mean shift.

Mean shift works by iteratively updating candidate centroids to be the mean of the points within a given region. For each candidate, a mean shift vector is computed that points towards the region of maximum increase in point density; mean shift thus aims to locate the maxima of a density function - the modes. The advantage of this algorithm is that we don't have to specify the number of clusters K, as the k-means algorithm requires. Instead, we supply a different parameter: the radius around a data point (the bandwidth) that mean shift uses to form clusters.

In [24]:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = one_day_led['unix_timestamp']

quantile = 0.3

X = np.array(list(zip(x,np.zeros(len(x)))))
bandwidth = estimate_bandwidth(X, quantile = quantile)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("Bandwidth: {} \nNumber of cluster: {}".format(bandwidth,n_clusters_))

one_day_led['cluster'] = labels
starts = one_day_led.groupby('cluster')['unix_timestamp'].min()
ends = one_day_led.groupby('cluster')['unix_timestamp'].max()
led_on_time = ends - starts

total_led_on_time = led_on_time.sum()/3600
print("Total LED-on time is {} hours".format(total_led_on_time))
Bandwidth: 5017.233186887462 
Number of cluster: 3
Total LED-on time is 6.346111111111111 hours
In [25]:
fig = go.Figure(data = go.Scatter(x = one_day['timestamp'],
                                 y = one_day['light'],
                                 name = "One-day Light Data"))

fig.add_trace(go.Scatter(
    name = "LED-on",
    x= one_day_led['timestamp'],
    y = one_day_led['light'],
    mode = 'markers',
    marker_color = one_day_led['cluster']))

fig.update_layout(
    height=500,
    title_text='Clustering LED-on Periods - Clusters = 3'
)

fig.show()

Discussion: The mean shift algorithm returned 3 clusters. While the two clusters on the far right are pretty much spot on, the cluster on the left failed to exclude the smaller cluster next to it. This leads to an over-estimation of LED-on time by a non-trivial amount, because the bandwidth is slightly too large to form smaller clusters. While this method is very flexible and results in elegant, interpretable clusters, the accuracy has been compromised. We can rectify this by decreasing the bandwidth:

In [26]:
x = one_day_led['unix_timestamp']

quantile = 0.1

X = np.array(list(zip(x,np.zeros(len(x)))))
bandwidth = estimate_bandwidth(X, quantile = quantile)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("Bandwidth: {} \nNumber of cluster: {}".format(bandwidth,n_clusters_))

one_day_led['cluster'] = labels
starts = one_day_led.groupby('cluster')['unix_timestamp'].min()
ends = one_day_led.groupby('cluster')['unix_timestamp'].max()
led_on_time = ends -starts

total_led_on_time = led_on_time.sum()/3600
print("Total LED-on time is {} hours".format(total_led_on_time))

fig = go.Figure(data = go.Scatter(x = one_day['timestamp'],
                                 y = one_day['light'],
                                 name = "One-day Light Data"))

fig.add_trace(go.Scatter(
    name = "LED-on",
    x= one_day_led['timestamp'],
    y = one_day_led['light'],
    mode = 'markers',
    marker_color = one_day_led['cluster']))

fig.update_layout(
    height=500,
    title_text='Clustering LED-on Periods - Clusters = 11'
)

fig.show()
Bandwidth: 1231.1213247718824 
Number of cluster: 11
Total LED-on time is 5.6755555555555555 hours

Discussion: By reducing the bandwidth, the number of clusters increases from 3 to 11. The LED-on time also becomes smaller (and arguably more accurate). The small cluster has been separated correctly, but we now have far more clusters than expected.

For a stable system of computers and sensors, the simple yet accurate approach is preferable for both accuracy and speed. A more complex approach such as clustering can be flexible and intuitive, but it introduces a fair amount of bias and error.

Improve Accuracy of Light Classification

Error source: The biggest issue with this classification is that it will always underestimate LED light, because LED light gets mixed with natural light whenever the natural light exceeds the LED brightness level. Let's see if we can separate natural light and LED light using the RGB values:

In [27]:
# separate RGB values into 3 separate components
rgb_cols = ['red', 'green', 'blue']
df[rgb_cols] = df['rgb'].str.split(',', expand=True)
df[rgb_cols] = df[rgb_cols].apply(pd.to_numeric, errors='coerce', axis=1)
df.head()
Out[27]:
timestamp light rgb motion temp pressure cputemp lighting_condition red green blue
0 2019-07-30 00:00:07 1 0,0,0 0.00714111328125,-0.00225830078125,1.029479980... 29.429103 83865.300031 35.242 low_natural_light 0 0 0
1 2019-07-30 00:00:14 1 0,0,0 0.00762939453125,-0.0048828125,1.046875 29.426641 83865.775515 35.780 low_natural_light 0 0 0
2 2019-07-30 00:00:20 1 0,0,0 0.00823974609375,-0.0052490234375,1.0474853515625 29.421410 83864.907747 35.780 low_natural_light 0 0 0
3 2019-07-30 00:00:27 1 0,0,0 0.0064697265625,-0.00323486328125,1.0487670898... 29.418641 83865.664800 36.856 low_natural_light 0 0 0
4 2019-07-30 00:00:33 1 0,0,0 0.0068359375,-0.00274658203125,1.04754638671875 29.418333 83865.784819 37.394 low_natural_light 0 0 0

Let's visualize rgb values through out the day:

In [28]:
fig = go.Figure(data = go.Scatter(x = df['timestamp'],
                                 y = df['red'],
                                 name = "red",
                                 line = dict(color = "red")))

fig.add_trace(go.Scatter(
    name = "green",
    x= df['timestamp'],
    y = df['green'],
    line = dict(color = "green")))

fig.add_trace(go.Scatter(
    name = "blue",
    x= df['timestamp'],
    y = df['blue'],
    line = dict(color = "blue")))

fig.update_layout(
    height=500,
    title_text='RGB Values'
)

fig.show()

Observation: We can see that the RGB values change throughout the day. During the night, all RGB values return 0 because there is no light. When the LED light is turned on, we start to see the separation of RGB into distinct lines. When the LED is turned off and there is only sunlight, the three lines overlap each other. This is exactly the clue we need to determine whether the LED light is turned on or not.

In [29]:
# Summarize RGB values by lighting_condition groups:
rgb_summary = df[['lighting_condition','red', 'green', 'blue']].groupby('lighting_condition').mean()
rgb_summary
Out[29]:
red green blue
lighting_condition
high_natural_light 122.913628 106.762142 86.231401
led_light 113.831602 90.519691 55.177821
low_natural_light 111.973336 107.706503 103.192404
no_light 0.000000 0.000000 0.000000
In [30]:
# calculate proportion of RGB values by lighting_condition groups
rgb_summary[rgb_cols] = rgb_summary[rgb_cols].div(rgb_summary[rgb_cols].sum(axis=1), axis=0)
rgb_summary
Out[30]:
red green blue
lighting_condition
high_natural_light 0.389081 0.337954 0.272964
led_light 0.438608 0.348784 0.212607
low_natural_light 0.346804 0.333589 0.319608
no_light NaN NaN NaN

Observation: We can see that my LED light has a much higher red-to-blue ratio (more than twice), which is consistent with the orange-yellow light that it emits. On the other hand, natural sunlight gives a much more balanced mix of colors; its slightly higher red values are likely due to the mixture of LED light. In the real world, light consists of all visible colors, not just red, green, and blue wavelengths; the RGB color system that we use in computer graphics arose out of a peculiarity of human perception. Under natural light, however, the proportions of the perceived color channels should be roughly equal - this is the idea behind white balancing. Here are some reproductions of the color of natural light:

*Image: color reproductions of natural light*

So if we get too much red relative to blue, that indicates a non-natural light source! But how much is too much?

Let's calculate the ratio of red to blue. We can then examine the distribution of this ratio under led_light to determine a better cut-off point:

In [31]:
# calculate the red-to-blue ratio for the whole dataframe
df['red_to_blue'] = df['red']/df['blue']

# 0s in rgb create some NaN values (0/0); fill these with 0
df['red_to_blue'].fillna(0, inplace = True)

px.histogram(df[df['red_to_blue'] > 1], 'red_to_blue', color = 'lighting_condition')

We can see that a lot of natural light overlaps the led_light region, despite having a high red-to-blue ratio (1.8 ~ 2.2). But where exactly should we place the cut-off for natural light? Let's use some statistics to help us:

In [32]:
df[df['lighting_condition'] == 'led_light']['red_to_blue'].describe()
Out[32]:
count    8278.000000
mean        2.095184
std         0.189799
min         1.100917
25%         2.017857
50%         2.111111
75%         2.235294
max         2.367347
Name: red_to_blue, dtype: float64

We have calculated the descriptive statistics for the distribution of the led_light red-to-blue ratio. Using the mean and std, we can construct an interval around the mean:

In [33]:
from scipy import stats

stats.norm.interval(0.95, loc=2.095184, scale=0.189799)
Out[33]:
(1.7231847956982824, 2.467183204301718)

Interpretation: Because we used the standard deviation of the ratios themselves (rather than the standard error of the mean), this interval covers roughly 95% of the red-to-blue ratios produced by the LED light, assuming they are approximately normal. We can use the lower bound as a filter to obtain our led_light classification.
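For reference, the lower bound is simply the mean minus 1.96 standard deviations:

$$ 2.095 - 1.96 \times 0.190 \approx 1.72 $$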

In [34]:
# lower bound of the led_light red-to-blue interval computed above
led_red_to_blue = 1.723

# Apply filter to high_natural_light to separate led_light:
df.loc[(df['red_to_blue'] > led_red_to_blue) & (df['light'] > 130), 'lighting_condition'] = 'led_light'

Examine the new classification after filtering:

In [35]:
one_day = df[(df['timestamp'] > '2019-07-31 00:00:00') & (df['timestamp'] < '2019-08-01 00:00:00')]

one_day_led = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "led_light")]

one_day_no_light = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "no_light")]

one_day_low_natural_light = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "low_natural_light")]

one_day_high_natural_light = df[(df['timestamp'] > '2019-07-31 00:00:00') &
                 (df['timestamp'] < '2019-08-01 00:00:00') & (df['lighting_condition'] == "high_natural_light")]


fig = go.Figure(data = go.Scatter(x = one_day['timestamp'],
                                 y = one_day['light'],
                                 name = "One-day Light Data"))

fig.add_trace(go.Scatter(
    name = "LED-on",
    x= one_day_led['timestamp'],
    y = one_day_led['light'],
    mode = 'markers'))

fig.add_trace(go.Scatter(
    name = "no_light",
    x= one_day_no_light['timestamp'],
    y = one_day_no_light['light'],
    visible = "legendonly",
    mode = 'markers'))
              
fig.add_trace(go.Scatter(
    name = "low_natural_light",
    x= one_day_low_natural_light['timestamp'],
    y = one_day_low_natural_light['light'],
    visible = "legendonly",
    mode = 'markers'))
                           
fig.add_trace(go.Scatter(
    name = "high_natural_light",
    x= one_day_high_natural_light['timestamp'],
    y = one_day_high_natural_light['light'],
    visible = "legendonly",
    mode = 'markers'))
              
fig.update_layout(
    height=500,
    title_text='Lighting Condition Classification - 1 Day Data'
)

fig.show()

The new filter identifies more LED-on instances with the help of the RGB readings. However, this method only works when the sunlight intensity is not too high, since strong sunlight lowers the red-to-blue ratio. We could relax the cut-off to catch even more led_light, but doing so risks making more false positive errors.

Compare Lighting Conditions of Different Days

In [36]:
# count readings per day for each lighting condition
mydf = df.set_index('timestamp').groupby([pd.Grouper(freq='D'), 'lighting_condition']).count()
# each reading spans roughly 6.7 seconds (approximately the average sampling interval), so convert counts to hours
mydf['duration'] = mydf['light'] * 6.7 /3600
summary_df = mydf.unstack()['duration'].reset_index()
summarized_df = pd.melt(summary_df, id_vars = 'timestamp', value_vars=['high_natural_light', 'led_light', 'low_natural_light', 'no_light'])
summarized_df.sort_values('timestamp', inplace = True)
In [37]:
px.bar(summarized_df, x = 'timestamp', y = 'value', color = "lighting_condition")

Pressure Data Analysis

This is another interesting sensor: it measures atmospheric pressure. Also known as barometric pressure, atmospheric pressure is simply the weight of the air above ground level. Pressure is lower at higher elevations because there is less air above you.

Atmospheric pressure is most commonly used in weather forecasting. An area of high pressure pushes air down, making the air warmer and therefore inhibiting the formation of clouds and storms, so high pressure is usually a sign of good or fair weather. On the contrary, when air hits an area of low pressure, it rises and condenses to form precipitation. Hence, low pressure is associated with poor weather (cloudy or rainy) (https://www.artofmanliness.com/articles/fair-or-foul-how-to-use-a-barometer/).


In [38]:
sns.lineplot(x = 'timestamp', y = 'pressure', data = df)
plt.title('Pressure Sensor - Unit: Pa')
Out[38]:
Text(0.5, 1.0, 'Pressure Sensor - Unit: Pa')

Estimate Altitude Based on Atmospheric Pressure

We can estimate the altitude of my current location from the pressure, using the elevation calculation derived from the "Barometric formula" (https://en.wikipedia.org/wiki/Barometric_formula):

$$h=44330\times\left(1-\left(\frac{P}{P_0}\right)^\frac{1}{5.255}\right)$$

$P$ : Atmospheric pressure at current location (in hPa)

$P_0$ : Atmospheric pressure at sea level = 1013.25 hPa (https://en.wikipedia.org/wiki/Atmospheric_pressure)

We can get the pressure at the current location by taking the mean of the pressure values.

In [39]:
p0 = 1013.2
p = df['pressure'].mean()/100

elevation = 44330*(1-(p/p0)**(1/5.255))

print("Elevation at current location is {} m".format(elevation))
Elevation at current location is 1555.9052825593296 m

The actual elevation turns out to be 1609.3 m (https://en.wikipedia.org/wiki/Denver), which is exactly 1 mile above sea level. Instead of recovering the Mile High City, we only got 0.96 of a Mile High City :(
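Out of curiosity, we can also invert the formula to see what mean pressure Denver's true elevation would imply, reusing p and p0 from the cell above:

# solve the barometric formula for pressure at the known elevation of Denver
h_actual = 1609.3
p_expected = p0 * (1 - h_actual / 44330) ** 5.255

print("Expected pressure at {} m: {:.1f} hPa (measured mean: {:.1f} hPa)".format(
    h_actual, p_expected, p))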

Weather Forecast

As we learned earlier, low pressure is often associated with bad weather conditions such as clouds, rain, or storms. However, it would be difficult to reliably predict any weather condition without a baseline for typical pressure values at a specific location. Furthermore, we also need ground-truth weather conditions from historical data, so we can attach those labels to our own pressure readings. The weather data is obtained from https://darksky.net/dev.

Warning: Since the returned times are in UTC, the API may report times that differ from your local time. Mine is Mountain time, so I need to add -7 hours to the UTC times.

In [40]:
import forecastio
import datetime

API_KEY = "19b3514a1b00d050b011770e16efd590"
LAT = 39.878294
LONG = -104.967082

# get number of days from the past to the current day:
start_date = datetime.datetime(2019, 7, 29)
today = datetime.datetime.today()
days = today - start_date
days_number = days.days

# define weather attributes we want to get:
attributes = ["pressure", "summary"]
times = []
data = {}
for attr in attributes:
    data[attr] = []

# loop api call to get all weather data for each day
for i in range(1,days_number):
    forecast = forecastio.load_forecast(key = API_KEY, lat = LAT, lng = LONG,
                                        time=start_date+datetime.timedelta(i), units="si")
    h = forecast.hourly()
    d = h.data
    for p in d:
        times.append(p.time)
        for attr in attributes:
            data[attr].append(p.d[attr])

# subtract -7 from UTC time to get MDT time
times = [i + datetime.timedelta(hours = -7) for i in times]

# Create a Pandas data frame for this data
weather = pd.DataFrame(data, index=times)
In [41]:
# Get my pressure data:
pressure_data = pd.read_sql_query("SELECT timestamp, pressure FROM enviro_log WHERE timestamp > '2019-07-30' and timestamp < '2019-08-09';", connection)
pressure_data.index = pressure_data['timestamp']
connection.close()
In [42]:
weather.head(10)
Out[42]:
pressure summary
2019-07-29 23:00:00 1012.47 Clear
2019-07-30 00:00:00 1012.46 Clear
2019-07-30 01:00:00 1012.57 Clear
2019-07-30 02:00:00 1012.21 Clear
2019-07-30 03:00:00 1011.47 Clear
2019-07-30 04:00:00 1011.72 Clear
2019-07-30 05:00:00 1012.65 Foggy
2019-07-30 06:00:00 1013.00 Foggy
2019-07-30 07:00:00 1013.29 Foggy
2019-07-30 08:00:00 1013.34 Foggy

Now we need to summarize our dataset in a way that lets us join the pressure values with the historical weather.

We can group the readings by hour and calculate the mean for each hour:

In [43]:
pressure_data_hourly = pressure_data.resample('H').mean()

# rename `pressure` variable to `pi_pressure`
pressure_data_hourly.rename(columns={'pressure':'pi_pressure'}, inplace = True)

# convert pa to hpa for consistent pressure values
pressure_data_hourly['pi_pressure'] = pressure_data_hourly['pi_pressure']/100

pressure_data_hourly.head()
Out[43]:
pi_pressure
timestamp
2019-07-30 00:00:00 838.808589
2019-07-30 01:00:00 838.676387
2019-07-30 02:00:00 838.194837
2019-07-30 03:00:00 838.153922
2019-07-30 04:00:00 838.280600
In [44]:
# Join Raspberry Pi pressure with historical weather table:
weather_merged = pd.merge(weather, pressure_data_hourly, how = 'inner',left_index=True, right_index=True)
weather_merged.head()
Out[44]:
pressure summary pi_pressure
2019-07-30 00:00:00 1012.46 Clear 838.808589
2019-07-30 01:00:00 1012.57 Clear 838.676387
2019-07-30 02:00:00 1012.21 Clear 838.194837
2019-07-30 03:00:00 1011.47 Clear 838.153922
2019-07-30 04:00:00 1011.72 Clear 838.280600

We can now compare the readings from our sensor with the readings from the weather station's API. If my sensor is any good, it should agree with the readings we got from forecastio.

In [45]:
fig = go.Figure(data = go.Scatter(
    name = 'Sensor Pressure',
    x = weather_merged.index,
    y = weather_merged['pi_pressure']))

fig.add_trace(go.Scatter(
    name = "API Pressue",
    x= weather_merged.index,
    y = weather_merged['pressure'],
    line=dict(color='firebrick',dash='dot')))

fig.update_layout(
    height=500,
    title_text='Compare Pressure Sources'
)

fig.show()

We can see right away that our readings are consistently lower than the API's pressure readings. A quick look at the documentation page reveals that their pressure values have been adjusted to sea level, i.e. reported as if the measurement were taken at sea level (https://darksky.net/dev/docs). I will try to correct this by adding 170 hPa (an arbitrary offset) to my pressure readings:

In [46]:
fig = go.Figure(data = go.Scatter(
    name = 'Sensor Pressure',
    x = weather_merged.index,
    y = weather_merged['pi_pressure'] + 170))

fig.add_trace(go.Scatter(
    name = "API Pressure",
    x= weather_merged.index,
    y = weather_merged['pressure'],
    line=dict(color='firebrick',dash='dot')))

fig.update_layout(
    height=500,
    title_text='Compare Pressure Sources'
)

fig.show()

We can see now that the two readings are very similar to one another. We can verify this intuition by calculating Pearson's correlation coefficient:

In [47]:
pressure_correlation = weather_merged['pi_pressure'].corr(weather_merged['pressure'])
print("Correlation coefficient between Pi-measured pressure and API pressure is: {}".format(pressure_correlation))
Correlation coefficient between Pi-measured pressure and API pressure is: 0.9478287870403225
In [48]:
sns.regplot(data = weather_merged, x = 'pi_pressure', y = 'pressure')
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x19f7fd4d550>

This is one of the most beautiful linear relationships and one of the highest degrees of correlation I have ever seen in real-world data. Comparing my sensor with another data source demonstrates the reliability and consistency of the Enviro pHAT as an instrument. This small circuit board packs an amazing set of sensors.
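As an aside, the 170 hPa offset used above was eyeballed; given how well the two series track each other, we could estimate it from the data instead - a minimal sketch, assuming the weather_merged frame from above:

# mean difference between the sea-level-adjusted API pressure and the raw sensor
# pressure gives a data-derived offset instead of the hand-picked 170
offset = (weather_merged['pressure'] - weather_merged['pi_pressure']).mean()
print("Estimated sea-level adjustment: {:.1f} hPa".format(offset))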

Let's explore the past 10 days of weather data:

In [49]:
weather_merged['summary'].value_counts()
Out[49]:
Partly Cloudy       89
Mostly Cloudy       82
Foggy               33
Clear               32
Possible Drizzle     3
Overcast             1
Name: summary, dtype: int64

The past 10 days have been mostly cloudy. Since we don't have a lot of other information to work with, we can simplify the weather conditions to cloudy and clear.

In [50]:
weather_merged['condition'] = np.where((weather_merged['summary']=='Clear') | (weather_merged['summary']=='Partly Cloudy'), 'Clear', 'Cloudy')
weather_merged['condition'].value_counts()
Out[50]:
Clear     121
Cloudy    119
Name: condition, dtype: int64

Let's add the labels to the pressure plot:

In [51]:
fig = go.Figure(data = go.Scatter(
    name = 'Sensor Pressure',
    x = weather_merged.index,
    y = weather_merged['pi_pressure']))

clear_condition = weather_merged[weather_merged['condition'] == "Clear"]
cloudy_condition = weather_merged[weather_merged['condition'] == "Cloudy"]

fig.add_trace(go.Scatter(
    name = "Clear",
    x= clear_condition.index,
    y = clear_condition['pi_pressure'],
    mode = 'markers'))

fig.add_trace(go.Scatter(
    name = "Cloudy",
    x= cloudy_condition.index,
    y = cloudy_condition['pi_pressure'],
    mode = 'markers'))


fig.update_layout(
    height=500,
    title_text='Pressure and Weather Conditions'
)

fig.show()

Observations: From the plot, the cloudy and clear conditions are mixed up. Initially, I expected the weather to separate into clear and cloudy based on some average pressure threshold. However, weather systems are much more complicated than that. After doing some research on cloud formation, I learned that it is the change in pressure over time, rather than the absolute pressure, that affects cloud formation, along with other variables such as humidity and temperature.

We can see from the graph that the state often switches between cloudy and clear after a steep slope in the pressure curve: quickly rising pressure often leads to clear weather, while quickly falling pressure often leads to cloudy weather. However, the signal is not well distinguishable from the noise, so I cannot use pressure as the single predictor of weather conditions. Weather forecasting is a project for another time.
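As a starting point for that future project, here is a minimal sketch of a pressure-tendency feature (the change over the previous three hours), assuming the hourly weather_merged frame and its condition column from above:

# pressure tendency: change in sensor pressure over the previous 3 hours;
# a strongly negative tendency hints at deteriorating weather, a strongly
# positive one at clearing weather
weather_merged['tendency_3h'] = weather_merged['pi_pressure'].diff(periods = 3)

# compare the average tendency under each observed condition
print(weather_merged.groupby('condition')['tendency_3h'].mean())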