Machine Learning – Finding the Method in the Madness

Disclaimer: I have taken the help of ChatGPT to describe some of the more technical concepts in this write-up, like definitions and explanations. Everything else is in my own words.

In this series of posts I will try to build out a toolbox for data analysts. A basic understanding of statistics is critical for any data analyst or data scientist. Statistics as a subject is vast, and its seemingly complex nature can induce cold sweats in most people. It is also misunderstood: the most commonly used concepts are simple enough for a person with basic math skills.

Take today’s topic, basic descriptive statistics: mean, median, mode, range, variance and standard deviation. Each one summarizes a single column of a dataset as one number:

Mean: The average of a dataset.

Median: The middle value of a dataset once the values are sorted.

Mode: The most frequent value in a dataset.

Range: The difference between the highest and lowest values.

Variance: A measure of how spread out the values are.

Standard Deviation: The square root of the variance.

These terms are pretty self-explanatory, but variance and standard deviation may need more explanation. Here are the definitions in a bit more detail.

Variance

Definition: Variance is a measure of how spread out the values in a dataset are around the mean. It is calculated as the average of the squared differences between each value and the mean, so values far from the mean contribute much more than values close to it.

Standard Deviation

Definition: Standard deviation is the square root of the variance. It gives a sense of the typical distance of a data point from the mean and, unlike the variance, is expressed in the same units as the data.
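To make the two definitions concrete, here is a minimal sketch that works through the arithmetic by hand and then checks it against Python's built-in statistics module. The five prices are made up for illustration; they are not from the Kaggle dataset. Note the two flavours of variance: dividing by n gives the population variance, dividing by n - 1 gives the sample variance.

```python
import math
import statistics

# Hypothetical sale prices (in thousands), made up for illustration only
prices = [100, 120, 130, 150, 200]

mean = sum(prices) / len(prices)
# Squared distance of each value from the mean
squared_diffs = [(p - mean) ** 2 for p in prices]

population_variance = sum(squared_diffs) / len(prices)    # divide by n
sample_variance = sum(squared_diffs) / (len(prices) - 1)  # divide by n - 1
sample_std_dev = math.sqrt(sample_variance)

# The standard library should agree with the hand calculation
assert math.isclose(population_variance, statistics.pvariance(prices))
assert math.isclose(sample_variance, statistics.variance(prices))
assert math.isclose(sample_std_dev, statistics.stdev(prices))

print(f"Mean: {mean}")
print(f"Population variance: {population_variance}")
print(f"Sample variance: {sample_variance}")
print(f"Sample standard deviation: {sample_std_dev:.2f}")
```

The distinction matters when comparing tools: pandas' var() and std() default to the sample version (ddof=1), so they divide by n - 1.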

How to calculate?

These metrics are extremely easy to calculate with a few lines of Python. In my example here, I am using the House Prices dataset downloaded from Kaggle. The column we are analyzing is SalePrice, which is the price of the house. The Python code can very easily be generated by ChatGPT. Here is an example:

[python]
import pandas as pd

# Load the dataset
data = pd.read_csv('train.csv')

# Select the column once instead of indexing the DataFrame on every line
sale_price = data['SalePrice']

# Calculate descriptive statistics
mean_price = sale_price.mean()
median_price = sale_price.median()
mode_price = sale_price.mode()[0]  # mode() returns all tied values, sorted; take the first
range_price = sale_price.max() - sale_price.min()
variance_price = sale_price.var()  # sample variance (pandas defaults to ddof=1)
std_dev_price = sale_price.std()   # sample standard deviation (ddof=1)

# Print the results
print(f"Mean: {mean_price}")
print(f"Median: {median_price}")
print(f"Mode: {mode_price}")
print(f"Range: {range_price}")
print(f"Variance: {variance_price}")
print(f"Standard Deviation: {std_dev_price}")
[/python]

Use Case and Pitfalls

Descriptive statistics may not be as “fancy” or “complex” as other statistical methods, but they are crucial. For example, the mean value of a dataset can give us a fair idea of what the data looks like. This is especially useful in pricing and sales. In my job, I use these measures to understand how the price of our product offering compares to other similar products. Median and mode are also helpful in understanding where our product price lies if I were to lay down all similar products in the market on a table in front of me. Are we close to the middle, or are we too pricey?

One thing to take note of is the word “similar”. When I say “similar” products, it’s important to understand what that means. Let’s say we are trying to understand whether an online course I am selling is priced correctly. If I compare it against all available online courses (irrespective of the course topic), then my descriptive statistics will be misleading. Not all courses are created equal. Not all courses deliver the same value. And I cannot really compare a course which teaches a technical skill with a course which is focused on soft skills. Measured against such a mixed average, my technical course may look priced too low (or too high) when it is not.

While an AI tool can help you write the code, it’s in the data gathering phase where human intellect is required. For these descriptive statistics to make sense, we must compare similar courses. And “similarity” can be complex. How finely should I categorize the data I collect? Should I look at all course prices in the same subject? Or should I look at all course prices on the same topic?
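Here is a small sketch of why the grouping matters. The course categories and prices are hypothetical, made up purely for illustration. It compares the mean price across all courses with the mean price within each category.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical course listings: (category, price in dollars). Made-up data.
courses = [
    ("technical", 200), ("technical", 250), ("technical", 300),
    ("soft skills", 40), ("soft skills", 50), ("soft skills", 60),
]

# Mean over everything, ignoring what kind of course it is
overall_mean = mean([price for _, price in courses])

# Mean within each category of "similar" courses
prices_by_category = defaultdict(list)
for category, price in courses:
    prices_by_category[category].append(price)

mean_by_category = {
    category: mean(prices) for category, prices in prices_by_category.items()
}

print(f"Overall mean: {overall_mean}")
for category, category_mean in mean_by_category.items():
    print(f"Mean for {category}: {category_mean}")
```

In this toy data the overall mean describes neither group well. A technical course priced near the overall mean would be well below its real peers, and a soft skills course priced there would be well above them.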

Conversely, if you already have a dataset, these statistics can be used to understand the quality of the data. Continuing with our online course prices example, if the median and mean differ massively, then we are probably looking at courses which are vastly different. If the variance or standard deviation of the course prices is too high, we are either looking at very different courses, or we have an error in our data, or we have an “outlier”. So these descriptive statistics can be used as a data quality measurement tool as well.
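As a rough sketch of such a data quality check, the snippet below uses made-up course prices with one deliberate data-entry error. It compares the mean with the median and then flags suspicious values using the common 1.5 x IQR rule of thumb. That rule is my assumption here; it is one of several ways to flag outliers and is not the only valid choice.

```python
import statistics

# Hypothetical course prices; the last value is a deliberate data-entry error
prices = [45, 50, 55, 60, 65, 70, 75, 5000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
std_dev = statistics.stdev(prices)

# A large gap between mean and median hints at skew or bad values
mean_median_gap = abs(mean_price - median_price) / median_price

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles
q1, _, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(f"Mean: {mean_price}, Median: {median_price}")
print(f"Standard deviation: {std_dev:.2f}")
print(f"Relative mean/median gap: {mean_median_gap:.2f}")
print(f"Possible outliers: {outliers}")
```

A flagged value is only a prompt to go and look. Whether it is an error or a genuinely expensive course is still a human call.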

Results

From a house prices perspective, these results look plausible. The mean and median are not “wildly different”, suggesting a fairly homogeneous dataset, perhaps house prices from zip codes not too far from each other.

The high value of the range suggests that there is a massive difference between the cheapest and the most expensive house, which can be true. But this is something to investigate.

The standard deviation is also high, suggesting that either the house prices differ widely across the board or there are some outliers. Further investigation is required.

What’s next?

I would probably plot a histogram of the prices to understand what the dataset looks like. I already did, and here are the results.

It looks like we have a significant number of houses priced above 200K. There is greater variability in prices above 200K, which could be the reason for the high standard deviation.
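For readers who want to see the idea behind a histogram without any plotting library, here is a minimal text-only sketch. The prices are hypothetical and are not the Kaggle data, and the 50K bin width is an arbitrary choice for illustration.

```python
from collections import Counter

# Hypothetical sale prices in dollars, NOT the Kaggle data
prices = [
    95_000, 120_000, 135_000, 140_000, 155_000, 160_000,
    175_000, 180_000, 210_000, 260_000, 340_000, 455_000,
]

bin_width = 50_000  # arbitrary bin width chosen for illustration

# Map each price to the lower edge of its bin, then count prices per bin
bin_counts = Counter((price // bin_width) * bin_width for price in prices)

# Print one row per non-empty bin, with one '#' per house
for lower_edge in sorted(bin_counts):
    upper_edge = lower_edge + bin_width
    bar = "#" * bin_counts[lower_edge]
    print(f"{lower_edge:>7,} - {upper_edge:>7,} | {bar}")
```

With pandas and matplotlib installed, data['SalePrice'].plot.hist(bins=30) is one way to draw the real chart from the dataset used earlier.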

Hopefully this was brief enough and useful enough! See you in the next one.

More Use Cases (Thanks ChatGPT!)

Finance:

Variance: In finance, variance is used to measure the volatility of a stock’s returns. A higher variance indicates a more volatile stock.
Standard Deviation: Standard deviation is used to gauge the risk associated with an investment. A higher standard deviation means more risk as the investment returns are more spread out from the mean.
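A quick sketch of the finance idea, using two made-up series of monthly returns (in percent). Both have the same average return, but the standard deviation shows how differently they behave.

```python
import statistics

# Hypothetical monthly returns in percent, made up for illustration
steady_stock = [1.0, 1.2, 0.8, 1.1, 0.9]
volatile_stock = [6.0, -4.0, 5.0, -3.0, 1.0]

steady_mean = statistics.mean(steady_stock)
volatile_mean = statistics.mean(volatile_stock)

# Sample standard deviation of returns, a common proxy for volatility
steady_std = statistics.stdev(steady_stock)
volatile_std = statistics.stdev(volatile_stock)

print(f"Steady stock:   mean {steady_mean:.2f}%, std dev {steady_std:.2f}%")
print(f"Volatile stock: mean {volatile_mean:.2f}%, std dev {volatile_std:.2f}%")
```

Looking at the mean alone, the two stocks are indistinguishable. The spread is what separates them.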

Quality Control:

Variance and Standard Deviation: In manufacturing, these measures help ensure product consistency. Low variance and standard deviation indicate that the product quality is consistent, with minimal deviation from the desired specifications.

Healthcare:

Variance and Standard Deviation: In medical research, these measures help analyze the effectiveness of treatments. They can indicate how varied patients’ responses are to a treatment.

Horror video for the middle class

This is not a video review, although it may sound like one at times. This video is about how automation is changing the world today. There is a comparison to historical times and an explanation of why things are different today. The examples are very limited in scope, but for a short video like this one, they are impactful.

This video presents an ominous picture of the future. However, I agree with it more than I disagree. I work in the technology industry as a data analyst/data engineer, and every month I am handed a project which involves the automation of some functionality. I am not even talking about automation as explained in this video (machine learning). The automation I develop is simple code, or tools which eliminate the need for a human being to press a button or even open a file. It’s been two years and I have automated numerous such processes, saving many hours of boring, monotonous work. It was after working on projects like these that I slowly started to realize that certain functions of my own work can be automated too. (Chill runs down the spine).

The truth is that there is no clear answer to the question: will we all lose our jobs to the robots?

People have ideas but they are untested. Tech companies do not even want to try to answer the question, as they will fall behind in the race to create the smartest machine in the world if they pause to ask it. Governments seem to be oblivious to the problem, and the alarming number of graduates taking up low-skilled jobs doesn’t bother them; at least they have a job.

After watching the video, I went into a micro panic; I typed the following words in the YouTube search bar: FIGHT AGAINST AUTOMATION. The only videos I found were mostly media house produced videos about the very question I have mentioned above, news articles and university videos about how to IMPROVE automation (WTF!). There were no independent videos of people seriously thinking about the question. There was one video:

177 views + 1 (mine)

This was particularly scary for me because I thought that I would find tons of videos of YouTubers ranting against automation and thinking of ways to fight back to protect professions, or at least thinking of ideas to help people stay employed in the future, but … nope. No angry YouTubers here. Only rich business owners talking about why their hands are tied:

“I don’t want to do it but I have to do…nothing personal”

All this is not immediately worrying for me. I am a male with a master’s degree and technical skills sitting in front of a laptop. I order my food online. I am a beneficiary of this wave of automation which is sweeping the low skill work market. I cannot empathize with a server in a fast food joint or a worker in a grocery store. But in my own industry I have seen jobs disappearing. There was a time when software testers used to be in demand. I myself had thought of learning extensive software testing (in depth) in the past. But today, I rarely see any jobs for software testers. I do see automation engineering positions a lot though.

Conclusion:-

My fingers are tired. “Hey Google, can you please take this down for me?”

We are benefiting from automation. But we are also blind to its pitfalls. The lack of voices out there screaming about the challenges automation poses to our collective future is alarming but this picture can be changed. And I think we are moving in the right direction. Although I am afraid our speed is slow.

First, we need to make the effects of automation known far and wide. The word is not out there yet. Also, we must realize that automation is not taking only a bottom-up approach. It is taking a bottom-up, top-down and middle-up-down approach, all at the same time. It is happening at all levels of the corporate hierarchy. So if you are reading this on your computer or mobile screen, start thinking about how this can impact you and your future.


Data Analyst Toolbox: Calculation and Use Cases of Variance and Standard Deviation