The goal of this short blog post is to demystify Machine Learning for the Excel-only users out there. Let's imagine you are responsible for a small post office and have 4 different data points stored in an Excel for each week:
- how many customers showed up
- how much time it took on average to serve the customers
- a rating of how happy your customers were with the service incl. with the wait time
- how many hours your employees worked
We're going to pull data from an Excel model and then create a
RandomForestRegressor, a simple yet powerful ML algorithm to predict continuous values such as our customer rating (a percentage score).
We're going to take some shortcuts in the spirit of simplicity, dive in and follow along. The only 2 functions we'll need are below. Skip understanding them, they aren't central, just copy paste them (and the imports) in your code (one helps extract data from Excel, the other formats the data into a dataframe so that the ML library can read it)
import plotly.graph_objects as go
import openpyxl
import os
import pandas as pd
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
def get_data_from_excel_row(row_num, wb, column_start, column_stop):
return_list = []
for col_i in range(column_start, column_stop + 1):
return_list.append( wb.cell(row_num, col_i).value )
return return_list
def create_dataframe_from_list_of_lists(data_list, col_list):
transposed_tuples = list(zip(*data_list))
transposed_data = [list(sublist) for sublist in transposed_tuples]
df = pd.DataFrame (transposed_data, columns=col_list)
return df
Step 1: We're going to pull data from your Excel file (
here's an example) using our get_data_from_excel_row function
excel_path = <path of the excel file on your computer>
excel_file = openpyxl.load_workbook(excel_path, data_only=True)
wb = excel_file["Sheet1"]
dates = get_data_from_excel_row(2, wb, 3, 168)
customers_num = get_data_from_excel_row(3, wb, 3, 168)
avg_handling_time = get_data_from_excel_row(4, wb, 3, 168)
achievement_rating = get_data_from_excel_row(5, wb, 3, 168)
employee_work_hours = get_data_from_excel_row(6, wb, 3, 168)
Step 2: Now that we have our data loaded into Python, we can prepare it as is required by the RandomForestRegressor model:
x_data_list = [customers_num, avg_handling_time, employee_work_hours]
x_col_list = ["customers_num", "avg_handling_time", "employee_work_hours"]
y_data_list = [achievement_rating]
y_col_list = ["achievement_rating"]
df_x = create_dataframe_from_list_of_lists(x_data_list, x_col_list)
df_y = create_dataframe_from_list_of_lists(y_data_list, y_col_list)
This means that we want to train a machine learning model to predict our customer achievement rating using the 3 other values as input parameters: number of customers, handling time and work hours.
Step 3: Finally, we split our data into testing and training sets (ML tests its prediction-testing capabilities on data it has never seen before, that's why we separate the training data from the testing data):
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.25)
regr = RandomForestRegressor(max_depth=3)
regr.fit(x_train, y_train)
y_predict = regr.predict(x_test)
Note that the above 4 lines are the core of the program, all the rest is about importing, preparing and visualizing the data. Think about it,
4 lines of code is all you need to create an ML model and generate a prediction, we think that's amazing.
Step 4: Let's visualize the prediction accuracy by plotting the actual results versus the predicted test data:
num_vals = y_predict.shape[0]
plt.plot(range(num_vals), y_test, color="black", label="actuals")
plt.plot(range(num_vals), y_predict, color="red", label="prediction")
plt.ylim(0, 1)
plt.legend()
plt.show()
We'll be sending the full code to our mailing-list subscribers. We really hope you have a go at this, there's no better way than learning by doing.