Gordon Strodel, Archetype Consulting

Using Google BigQuery ML to Predict when to Water your Houseplant

Preamble

I suspect my journey to Data & Analytics was similar to many of yours. You start out interested in data and immediately need to learn SQL to get anything done. After a while, someone introduces you to ETL or BI tools, which make working with data or visualizing it a bit easier. Then along comes Data Science, and now there is another set of code and languages to learn. You quickly realize that Python isn’t a snake and R isn’t something pirates say when they agree with you. The industry shifts. It’s no longer enough to simply write a SQL query and put your data into the latest version of Tableau. People start asking you questions like “Why did that happen?” or “How can we be sure that X caused Y?” (Great album name there.)

Suddenly the data demands deeper analysis to determine and understand causal or correlated relationships. And as a SQL jockey reluctant to jump on the R and Python coding bandwagon, you are stuck. Excel gives you trend-lines and basic statistical functions. Tableau has clustering and time-series forecasting options, but only if you massage your data enough to make them usable. If only you could take the math you know and the SQL you are nearly fluent in, and apply that experience to answer those tough data science questions…

If none of this describes you, I welcome you to abandon ship at this point.

If you resonate with this perspective, this post is for you.

-Gordon


Data Science and SQL sitting in a tree…

When you think about Data Science, what pops to mind? Probably things like R, Python, maybe DataRobot, or your balding statistics professor from college. Rarely do people think SQL. Yet if there were one way to make data science techniques available to the masses of data analysts in the analytics industry, I hypothesize it would be via a SQL interface. We have our data sitting in databases, ready for analysis. What if we could do data science where our data already lives? And what if we could use familiar tools like SQL to do it?

Enter Google BigQuery ML - the first cloud-based, modern-data-warehouse-driven machine learning capability that natively uses SQL to create, train, evaluate, and predict outcomes.

Finally, we have data science and SQL sitting in a tree…let’s see some kissing! :)

Brief Intro for the Use Case

In an effort to use a simple and understandable dataset, but one that hasn’t already been exhausted in some tutorial, I decided to open-source some data I’ve been collecting as a personal hobby on my houseplants. I’ve had an IoT sensor running for a month measuring the soil moisture level of a particular plant. A reading is taken every 60 seconds or so, and I’ve published roughly 42k records into BigQuery, representing approximately 20 days’ worth of data. My goal is simple: given the current trajectory of the plant’s water absorption, when do I need to water it again?

Make sense? Ok, let’s dive in. First, let’s take a look at the data we’ll be working with.


Preview of the Data

Once you have data being captured (see the appendix below for how to do that), the resulting raw soil readings will most likely follow a pattern like the image below. High values (>2900) indicate soil dryness, with the maximum being 3100. If you were to dunk the sensor in pure water, the values typically bottom out around 1500-1800. The vertical drops on the chart indicate watering events for the houseplant, and the rise from “watered” at 1800 to “dry” at 3100 is fairly linear in most cases. (There are some local spikes around April 10 and April 13, most likely due to additional water draw on warm days. I hope to explore that potential correlation with ambient weather in future articles!) For the sake of our analysis, I will focus on data captured between 4/28 and 5/1, a period of sensor readings immediately following a watering event. Eyeballing the results and assuming a linear pattern, I’d expect to hit our 2900 dryness threshold around 5/2-5/3. Now let’s see what Google BigQuery predicts!

D3.js wasn’t working so here’s a picture of the data charted in Excel.


Google Cloud BigQuery ML

Now that you understand the data and the use case, let’s dive into the mechanics of BigQuery.

Based on the initial exploration of the data above, a linear regression model looks like a fairly accurate choice for predicting when to water. As of the time of writing, BQML supports linear regression, logistic regression, and K-means clustering. (More details on the specific models available can be found here.) From a technology perspective, we are covered. And now you know why I chose this dataset to demo. :)

Let’s assume you already have the data loaded into a BigQuery table. The first order of business is to identify the dataset we want to use to create and train the model. I’ve written a SQL statement to generate the results:

 
SELECT
  UNIX_SECONDS(TIMESTAMP(FORMAT_TIMESTAMP('%F %T', human_readable_date, "America/New_York"))) - 1556409605 AS UNIX_DATETIME,
  AVG(soil) AS raw_soil_value
FROM `smartplant31.plant_sensor.raw_sensor_data`
WHERE human_readable_date >= '2019-04-28 00:00:00-4'
  AND human_readable_date < '2019-05-01 00:00:00-4'
  AND soil IS NOT NULL
GROUP BY human_readable_date
ORDER BY human_readable_date ASC

A few explanations about the query:

  • The WHERE clause sets the date range for the input dataset. I offset by -4 hours to get Eastern time, since the data in BigQuery is stored in UTC.

  • We GROUP BY the timestamp and average the “soil” measure because a few timestamps have multiple readings; averaging smooths that out so there is at most one reading per timestamp.

  • The UNIX_SECONDS column translates the timestamp into an integer count of seconds since the first datetime in our selected dataset. Since the UNIX conversion baselines to 1/1/1970, I offset the results by subtracting 1556409605, which makes the first timestamp have a value of 0; the second timestamp, 60 seconds later, has a value of 60. The value 1556409605 is simply the Unix seconds of the first (minimum) timestamp in the selected data, in this case 2019-04-28 04:00:05 UTC after the Eastern-time shift. (See the sketch after this list for one way to look it up.)

  • The reason for using Unix seconds (an integer measure of time) and subtracting the offset was simply to reduce any issues with large values in the regression. Admittedly, I haven’t dug into the math of it, but this post on Stack Overflow inspired the approach.

  • We also filter on “soil is not null” to exclude any potential null values. I don’t believe there are any, but I wanted the filter in just in case.
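If you want to look up that offset value for your own data, a quick query like this should do it (a sketch; it assumes the same table and filters as the training query above):

#standardSQL
-- A sketch: find the offset value, i.e. the minimum Eastern-shifted
-- Unix seconds in the training window. Assumes the same table and
-- filters as the training query above.
SELECT
  MIN(UNIX_SECONDS(TIMESTAMP(FORMAT_TIMESTAMP('%F %T', human_readable_date, "America/New_York")))) AS offset_value
FROM `smartplant31.plant_sensor.raw_sensor_data`
WHERE human_readable_date >= '2019-04-28 00:00:00-4'
  AND soil IS NOT NULL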

After we have our query for the data and it returns the results we expect, let’s create the linear regression model. In BigQuery, models are stored in what look like tables and are created via a DDL-like CREATE MODEL statement. The screenshot below shows the model statement; the full create code can be accessed here on my Github project. Notice it took 16 seconds to generate the linear model from approximately 4,000 rows.
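For reference, the statement looks roughly like this (a sketch built from the training query above; the exact version lives in the Github repo):

#standardSQL
-- A sketch of the CREATE MODEL statement; input_label_cols tells BQML
-- which column to predict, and everything after AS is the training data.
CREATE OR REPLACE MODEL plant_sensor.linear_model
OPTIONS(model_type = 'linear_reg',
        input_label_cols = ['raw_soil_value']) AS
SELECT
  UNIX_SECONDS(TIMESTAMP(FORMAT_TIMESTAMP('%F %T', human_readable_date, "America/New_York"))) - 1556409605 AS UNIX_DATETIME,
  AVG(soil) AS raw_soil_value
FROM `smartplant31.plant_sensor.raw_sensor_data`
WHERE human_readable_date >= '2019-04-28 00:00:00-4'
  AND human_readable_date < '2019-05-01 00:00:00-4'
  AND soil IS NOT NULL
GROUP BY human_readable_date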

Creating the Linear Regression Model in BigQuery


After the model is created, you’ll want to evaluate its performance. The screenshot below shows the key statistics output as part of the linear regression; the SQL statement can be downloaded here. While I am not an expert on linear regression, an R^2 score of 0.9937776450587621 is pretty good and certainly good enough for me to assume the model is reasonable at forecasting soil dryness. Let’s proceed.
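The evaluation call itself is a one-liner (a sketch; with no data argument, ML.EVALUATE scores the model against the data BigQuery set aside during training):

#standardSQL
-- A sketch of the evaluation call; for linear regression it returns
-- metrics such as mean_absolute_error and r2_score.
SELECT * FROM ML.EVALUATE(MODEL plant_sensor.linear_model)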

Linear Regression Model Performance. R^2 of .993 is a good sign!


We have a linear regression model developed and know it is reasonably accurate based on the R^2 score. Now let’s use the model for predictions. If you run the model on the existing dates in your data, all you will get is a predicted soil value to compare to a reading you already have. Since this is a time-series problem, our dataset ends mid-day on 5/1, and we suspect the dryness threshold will be reached on 5/2 or 5/3, we’ll need another dataset to scaffold our linear model into the future.

To do this, I created a second table with just a date column (SQL code is here). This SQL merely creates a new table with one row per timestamp, at hourly increments after 5/1/2019 12am Eastern, as illustrated below.
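One way to generate such a scaffold (a sketch; the linked code may take a different approach) is with BigQuery’s GENERATE_TIMESTAMP_ARRAY. The column name shift_value is what the prediction query below expects:

#standardSQL
-- A sketch: build a skeleton table of hourly timestamps covering the
-- forecast window. The author's linked code may differ in approach.
CREATE OR REPLACE TABLE plant_sensor.date_time_table AS
SELECT shift_value
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
  TIMESTAMP('2019-05-01 00:00:00-4'),
  TIMESTAMP('2019-05-15 00:00:00-4'),
  INTERVAL 1 HOUR)) AS shift_value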

A skeleton table with future dates which I can leverage in my analysis


Once this table is created, I can write my SQL for the prediction. In this example, I want to know when (date + time) the linear model exceeds a value of 2900. (A full copy of the code can be downloaded here.) The code ignores the dataset with the actual sensor values we used earlier and merely leverages the date table and the saved linear model. Here is the SQL:

#standardSQL
SELECT
  TIMESTAMP(FORMAT_TIMESTAMP('%F %T', TIMESTAMP_SECONDS(CAST(UNIX_DATETIME AS INT64) + 1556409605), "America/New_York")) AS date_time,
  ROUND(predicted_raw_soil_value, 2) AS predicted_raw_soil_value
FROM
  ML.PREDICT(MODEL plant_sensor.linear_model,
    (
      SELECT
        TIMESTAMP(FORMAT_TIMESTAMP('%F %T', shift_value, "America/New_York")) AS shift_value_est,
        UNIX_SECONDS(TIMESTAMP(FORMAT_TIMESTAMP('%F %T', shift_value, "America/New_York"))) - 1556409605 AS UNIX_DATETIME
      FROM plant_sensor.date_time_table
      WHERE TIMESTAMP(FORMAT_TIMESTAMP('%F %T', shift_value, "America/New_York")) >= '2019-05-01 00:00:00-4'
        AND TIMESTAMP(FORMAT_TIMESTAMP('%F %T', shift_value, "America/New_York")) < '2019-05-15 00:00:00-4'
      GROUP BY shift_value
      ORDER BY shift_value ASC
    )
  )
WHERE predicted_raw_soil_value <= 3100

A few notes on this SQL:

  • We include all of the UNIX_SECONDS and timezone-shifting logic from earlier.

  • We filter out predicted soil values above 3100 because that is where the physical hardware maxes out. You won’t ever see a reading above 3100, but the linear model doesn’t know that, so we simply filter it out.

  • The results return the datetime in Eastern time along with the predicted soil moisture reading.

The output is very simple: the datetime from our reference table along with the predicted soil moisture level according to our linear model. The screenshot below shows that BigQuery ML predicts that by 2019-05-02 10:00:00 UTC (6am Eastern) the moisture sensor will cross our threshold of 2900, indicating it’s time to water the plant.
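If all you want is the single “water by” timestamp rather than the full series, you could condense the prediction to something like this (a sketch under the same assumptions; MIN works because the fitted line only rises):

#standardSQL
-- A sketch: return only the first predicted timestamp at or above the
-- 2900 dryness threshold, adding the offset back to recover a timestamp
-- (mirroring the arithmetic in the query above).
SELECT
  MIN(TIMESTAMP_SECONDS(CAST(UNIX_DATETIME AS INT64) + 1556409605)) AS water_by
FROM ML.PREDICT(MODEL plant_sensor.linear_model,
  (
    SELECT
      UNIX_SECONDS(TIMESTAMP(FORMAT_TIMESTAMP('%F %T', shift_value, "America/New_York"))) - 1556409605 AS UNIX_DATETIME
    FROM plant_sensor.date_time_table
  ))
WHERE predicted_raw_soil_value >= 2900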

Success!

Soil Moisture Prediction results in Google BigQuery ML


Conclusion

Google BigQuery ML is a SQL-native, advanced-analytics tool that empowers data analysts with a SQL background to apply data science techniques to their analyses. In aggregate, it took me about a day to go from uploading a dataset to having a prediction from BigQuery ML. I am not sure the same timeline would hold if I had to learn to do this in R. The SQL interface for BigQuery was intuitive, and the overall language for the modeling capability felt familiar, much like writing standard DDL and DML.

Obviously, I am writing and publishing this later in May 2019 and thus can compare actuals with the prediction. Google predicted the soil would be dry around 6am on 5/2/2019. Based on initial analysis, the sensor actually reached 2900 at approximately 12:41pm on 5/3/2019. That’s roughly a 30-hour difference, which isn’t great.

Pulling all the datasets into Excel and charting them, you can see the wide range of predictions linear models can produce. The green line shows the results from BigQuery ML, with the dry-line of 2900 reached after 10am on 5/2. The lighter blue indicates the actual sensor readings collected after the data was selected for analysis (orange). As you can see, there is a slight flattening of the slope along with a local peak around mid-day on 5/3, both of which hurt the accuracy of the linear model. Basically, after we stopped pulling in data for analysis after 4/30, the actual sensor readings became more quadratic than linear, contributing to the model’s too-early dryness prediction.

Comparison of results in Excel


Now, as to why the sensor readings started flattening out…I’m not sure yet. I suspect some cooler weather meant the plant didn’t absorb as much water. But that is a correlation analysis for another day!

Overall, I am impressed by the beta version of Google’s BigQuery ML. I’d highly recommend experimenting with it to answer questions and make simple predictions. I look forward to following Google’s development as they add time-series capabilities and other use cases in the future. Take BigQuery ML for a spin; you won’t be disappointed. Leave the Python at the zoo and the R to the pirates. Machine Learning SQL, baby! Awww yeah! 😎



Deep Dive!

  • BigQuery ML Tutorial on Linear Regression: Here

  • Link to Gordon’s Github with example code: Here

  • Sample of IoT data used in this tutorial: Link to CSV File (2.5mb)

  • This was a very helpful tutorial (piece of SQL) for generating a skeleton view in BigQuery: Here

This post is about Google Cloud’s new BigQuery ML capability within the SQL syntax of the BigQuery engine. To learn more about this capability and see other examples, check out this link. Also, it’s worth noting that this capability from Google is in *beta* status.


Appendix:

Hardware to Measure Soil Dryness

In other areas of the blog, I have covered IoT data and my propensity to leverage Particle.io hardware for IoT sensing applications. Since I’ve not covered the hardware side in previous posts, I will do so briefly here. If you have questions, the best place to reach me is via email.

The hardware is simple: a Particle Xenon board plus a capacitive soil moisture sensor. (I also had a DHT22 temperature/humidity sensor on the board, but it is not used in this tutorial.)

Connect the capacitive soil sensor to ground, 3.3/5V power, and one of the analog inputs (A1-A5). Next, place the sensor into the soil of your houseplant. Make sure you don’t insert the sensor deeper than the line on the PCB. Also, place the sensor in a spot where it will get zero or minimal splashing when the plant is being watered; the circuit components are exposed, and water-based corrosion will degrade them more quickly. (You can also wrap the component section of the PCB in electrical tape for further protection.)

In the picture below, there are some extra wires because I also had a DHT22 sensor on the board, though I removed that component for a clearer picture. The temperature/humidity sensor is not used in this tutorial, so feel free to ignore references to it in the code and photos.

(Photo of the assembled plant monitor hardware.)

Once you have the hardware set up per Particle’s documentation, go ahead and re-use my code as needed to flash the Xenon boards. Find the code here: https://github.com/gstrodel/mesh-smartplant/blob/master/xenon-plant-monitor .

Arrived! IoT data in Google Cloud
