Editor’s note: You’re about to read an extract from our new book “Ship It 2.0”. You can get the full book here.
Machine Learning (ML) and Artificial Intelligence (AI) are often lumped together as the same thing. Rubén Lozano knows otherwise, and explains the key differences between them and how ML relates to statistics.
What is Machine Learning Anyway?
Machine Learning is a subset of Artificial Intelligence – when a machine is trying to mimic what a human is doing – and uses a lot of algorithms and statistics to do so. Deep Learning is a concept that often comes along with ML, and can be thought of as multiple layers of Machine Learning.
According to Arthur Samuel, a pioneer of AI research, Machine Learning is:
“The field of study that gives computers the ability to learn without being explicitly programmed.” – Arthur Samuel
There are many problems that can be, and still are, solved with classic programming. You take your data and rules, and use classic programming to get your answers.
For example, simple customization can be done by using classic programming to show one set of search results to Person A, and another set to Person B. ML comes into play when you simply have too many rules and the answers you want are more complicated. The key to ML is having the data and the answers, but wanting to figure out the rules.
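To make the contrast concrete, here is a minimal sketch in Python. It is not from the talk: the order data, the "large order" labels, and the function names are all made up for illustration. The first function is classic programming (a person writes the rule); the second takes data plus known answers and works out the rule itself.

```python
# Classic programming: a person writes the rule; the program produces answers.
def is_large_order_rule(units: int) -> bool:
    return units >= 10  # hand-picked threshold (hypothetical)


# Machine learning, in miniature: we supply data and known answers,
# and the program searches for the rule that fits them best.
def learn_threshold(units: list[int], labels: list[bool]) -> int:
    """Return the cut-off that classifies the labeled examples most accurately."""
    best_threshold, best_correct = 0, -1
    for candidate in sorted(set(units)):
        correct = sum((u >= candidate) == y for u, y in zip(units, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Toy data (made up): order sizes and whether a person tagged them as "large".
order_units = [1, 2, 3, 8, 9, 12, 15, 20]
tagged_large = [False, False, False, True, True, True, True, True]

learned = learn_threshold(order_units, tagged_large)
print(learned)  # the rule recovered from the examples
```

With one rule and one number this is trivial to do by hand; the case for ML grows as the number of inputs and rules becomes too large to write down.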
How Should Product Managers Think About ML?
ML is clearly complex. So to make it easier and add focus, Rubén breaks down what aspects of the process a PM should concern themselves with.
First, as a PM you should ask yourself: “What is the problem I need to solve?” Then make sure that there’s enough data, without which ML may not be a viable option. After that, you can start to think about the right algorithm to use to predict the answer you want. At this point the Data Scientists will join the conversation, working on the model and eventually the output.
When it comes to understanding the vocabulary, you can use statistics. Many of the concepts directly translate, and if you have a basic understanding of statistics, you’re already on your way to mastering ML. It’s important for a PM considering ML as a possible tool to get comfortable with the language surrounding it.
The main difference between Supervised and Unsupervised learning is whether you have labeled data or not. With Unsupervised learning, you are using ML to try to find the similarities between data clusters that you don’t fully understand.
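As a rough illustration of the difference (a toy sketch, not how a production model is built), the Python below uses made-up one-dimensional data. The supervised function relies on labels to predict a label for a new value; the unsupervised function has no labels and can only group values that sit close together. Both function names are hypothetical.

```python
# Supervised: each example comes with a label, so we can predict labels for new data.
labeled = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]


def predict_nearest(value: float) -> str:
    """1-nearest-neighbour: copy the label of the closest labeled example."""
    closest = min(labeled, key=lambda pair: abs(pair[0] - value))
    return closest[1]


# Unsupervised: no labels, so all we can do is look for structure in the data.
def split_into_two_groups(values: list[float]) -> tuple[list[float], list[float]]:
    """Cluster 1-D data by cutting at the widest gap between sorted neighbours."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


prediction = predict_nearest(2.0)
groups = split_into_two_groups([9.0, 1.0, 8.0, 1.5])
print(prediction, groups)
```

Note that the unsupervised result is just two groups; deciding what those groups mean is still up to you.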
When Should We Use ML?
ML is a cycle that should start with data. If you have no data, you need to find a different solution. Not only should you have enough data, it should also meet certain conditions.
Whether or not you can use ML is also based on what kind of problem you’re trying to solve. ML can be implemented when your problem:
- Handles very complex logic
- Scales up fast
- Adapts in real time
- Requires specialized personalization
- Has existing examples of actual answers
Here are some examples of problems that can be solved with ML:
- Ranking: Helping users find the most relevant thing
  - Example: Ranking algorithm within Amazon search
- Recommendation: Giving users what they may be most interested in
  - Example: Recommendations from Netflix
- Classification: Figuring out what kind of thing something is
  - Example: Product classification for Amazon catalog
- Regression: Predicting a numerical value of a thing
  - Example: Predicting sales for specific Amazon products
- Clustering: Putting similar things together
  - Example: Related news from Google search
- Anomaly: Finding uncommon things
  - Example: Fruit freshness
When is ML Not Needed?
There’s no point in using ML just for the sake of doing something cool, and a good PM knows when to step back and admit that ML isn’t necessary. It should not be implemented if the problem:
- Can be solved by simple rules
- Does not adapt to new data
- Requires full interpretability
- Requires 100% accuracy
It should also not be implemented if the data:
- Is unavailable/insufficient
- Is not readily accessible to you
- Has privacy concerns or is unsecure
- Is irrelevant, stale, biased, or otherwise low quality
So… To ML or Not to ML?
Here is a quick exercise, asking whether ML should be applied to answer the following questions:
- What apparel items should be protected by copyright laws? No, because this requires 100% accuracy.
- Which resumes should we prioritize to interview for our candidate pipeline? It has great qualities for an ML problem, but the data is biased.
- What products should be exclusively sold to Hispanics in the US? Tempting, as you might have all the data and customer profiles, but it’s discrimination and makes a lot of assumptions about people.
- Which sellers have the greatest revenue potential? Could be argued to be discrimination as well, but you cannot help every single seller and you need to find a way to use your limited resources to have the maximum impact. So no.
- Where should Amazon build HQ2? No, it’s not a repeatable problem, and you don’t already have the answers. You could use classic programming, but ML isn’t necessary.
- Which search queries should we scope for the Amazon Fresh store? Yes, you’ll need a combination of ML and classic programming.
Let’s Do ML!
Once a PM feels ready to take on ML, it’s time to take a look at the ML Lifecycle, starting with what you need to do it.
Get the right people:
There are many different roles in different organizations, but it’s important for PMs to differentiate between Science and Engineering.
The people who are working with the data and doing all the maths, like choosing the right model (ML Scientist, Research Scientist, Data Scientist, etc.), will not be doing the Engineering.
People with titles like Data Engineer, Software Engineer or Dev Manager will be doing things like collecting, cleaning, ranking, and processing the data.
Understand the process:
- Formulate the problem
  - What is the problem to solve?
  - What is the measurable goal?
  - What do you want to predict?
- Select and preprocess data
- Feature engineering
  - Feature: individual measurable property or characteristic of the phenomenon being observed
  - Goals: Use domain and data knowledge to develop relevant features from existing raw features of the data to increase the predictive power of ML
- Test and tune models
- Productionize: Integrating ML with existing software, and keeping it running successfully over time
  - Deployment environment
  - Data storage
  - Security and privacy
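The feature engineering step in the process above can be illustrated with a short Python sketch. Everything here is hypothetical: the raw order record, its field names, and the three derived features are chosen only to show what "developing features from raw data using domain knowledge" might look like.

```python
from datetime import date


def engineer_features(order: dict) -> dict:
    """Turn one raw order record into features a model could learn from."""
    order_date = date.fromisoformat(order["order_date"])
    return {
        # Domain knowledge: total spend often matters more than units or price alone.
        "order_value": order["units"] * order["unit_price"],
        # Domain knowledge: shopping behaviour may differ on weekends.
        "is_weekend": order_date.weekday() >= 5,  # Monday is 0, Saturday is 5
        # A coarse signal for seasonality.
        "month": order_date.month,
    }


raw_order = {"order_date": "2024-03-09", "units": 3, "unit_price": 4.0}
features = engineer_features(raw_order)
print(features)
```

Which features are worth building is exactly the kind of question where a PM's knowledge of the customer and the Data Scientist's knowledge of the model meet.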
Sometimes you can have everything you need, like great data and a great problem, but it’s just too costly to make sense. Part of being a Product Manager is figuring out the trade-off, and for this, you need a strong relationship with your data scientist.
A Product Manager’s Role in ML
Once you’ve decided to move forward with ML, and understand the rough outline of the process, it’s time to take a look at what part a Product Manager plays in it.
Firstly, a PM has to formulate the problem. Ask yourself what the problem is, what the measurable goal is, and figure out what you want to predict. Here’s an example of what this might look like:
|Question|Example|
|---|---|
|What is the problem?|Units per order from category X in the US has remained flat YoY and engagement has declined as measured by purchase-week frequency|
|What is the measurable goal?|Increase unit order rate for category X in the US by +X% within the next X months without affecting revenue|
|What do you want to predict?|Category X products that are more likely to be added to a customer cart based on items in the customer cart|
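To give a feel for what "predicting products likely to be added to a cart" could mean in practice, here is a deliberately simple counting baseline in Python. It is an assumption for illustration, not the model described in the talk: the function name, the cart data, and the scoring rule (count how often a product appeared alongside items already in the cart) are all hypothetical.

```python
from collections import Counter


def likely_additions(past_carts, current_cart, top_n=2):
    """Rank products by how often they appeared alongside items in the current cart."""
    scores = Counter()
    in_cart = set(current_cart)
    for cart in past_carts:
        items = set(cart)
        overlap = len(items & in_cart)
        if overlap == 0:
            continue  # this past cart shares nothing with the current one
        for item in items - in_cart:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically so the result is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Made-up purchase history.
past_carts = [
    ["pasta", "sauce", "cheese"],
    ["pasta", "sauce"],
    ["pasta", "cheese", "wine"],
    ["bread", "butter"],
]
suggestions = likely_additions(past_carts, ["pasta"])
print(suggestions)
```

A baseline like this is also useful as a yardstick: if an ML model cannot beat it on the measurable goal, the extra cost may not be justified.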
Once you understand the problem and your goal, the next step is to select and preprocess data. As Rubén says, at this point you have to be pretty in the weeds. If you don’t have the data right, everything else will be wrong. Choosing the right data sets and knowing that they’re being used for the right purposes is a critical PM task.
When it comes to formatting, a PM can expect to have a fairly low level of involvement, zooming out just to make sure that everything is working the way it’s supposed to.
What a PM can do, by working together with the Data Scientist, is to get involved with cleaning the data, namely by having incomplete, noisy, biased, or inconsistent data removed.
In your PM role, you should also get involved with sampling, by choosing representative data. You can choose random data (for which there are pros and cons) or you can use stratified data.
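The difference between the two sampling approaches can be sketched in a few lines of Python. The data set (100 customers, 90 in one region and 10 in another), the field names, and the helper function are made up for illustration. A simple random sample can miss the small group entirely by chance; a stratified sample takes the same fraction from every group.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group so small groups stay represented."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per group
        sample.extend(rng.sample(members, k))
    return sample


# Made-up customers: 90 in the US, 10 in MX.
rows = [{"id": i, "region": "US" if i < 90 else "MX"} for i in range(100)]

# Simple random sample: the share of MX customers varies from draw to draw.
random_sample = random.Random(0).sample(rows, 10)

# Stratified sample: every region keeps its share.
stratified = stratified_sample(rows, key="region", fraction=0.1)
region_counts = Counter(row["region"] for row in stratified)
print(region_counts)
```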
You’ll also need to check your data for seasonality, leakage, or biases. There’s also a danger of your data being collected based on a certain trend, which will affect your results.
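Two of these checks are easy to sketch in Python. The example below is an assumption rather than a method from the talk: it uses made-up order rows and two hypothetical helpers. Splitting by date is one common guard against leakage, because the model is never trained on information from after the period it is tested on. Listing the months present in the training data is a quick way to spot missing seasons.

```python
from datetime import date


def split_by_date(rows, cutoff):
    """Train on rows before the cutoff, test on rows from the cutoff onward."""
    train = [row for row in rows if row["day"] < cutoff]
    test = [row for row in rows if row["day"] >= cutoff]
    return train, test


def months_covered(rows):
    """Months present in the data; gaps hint that seasonality is not captured."""
    return sorted({row["day"].month for row in rows})


# Made-up daily order data.
rows = [
    {"day": date(2023, 11, 20), "units": 5},
    {"day": date(2023, 12, 18), "units": 9},
    {"day": date(2024, 1, 15), "units": 4},
    {"day": date(2024, 2, 12), "units": 3},
]

train, test = split_by_date(rows, cutoff=date(2024, 1, 1))
print(months_covered(train))  # only holiday-season months: a warning sign
```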
Treat Your Scientists Right
You’ll need a good working relationship with your Data Scientist for ML to work. Firstly, you should treat your ML project as a partnership. Make sure everyone knows why you’re making your decisions the way you are. You should have a clear problem, hypothesis and success metric. Start from there and let everything else come later.
Another key part of the PM<>Tech relationship is to be willing to make tradeoffs. Rubén gives us the examples of Time vs Quality, White Box vs Black Box, False Positives vs False Negatives, and Go vs No-Go Metrics.
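The False Positives vs False Negatives tradeoff is easier to discuss with numbers in front of you. The Python sketch below uses made-up model scores and true outcomes; the function name and the two thresholds are hypothetical. It shows how moving the decision threshold trades one kind of error for the other.

```python
def error_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given decision threshold."""
    false_positives = sum(s >= threshold and not y for s, y in zip(scores, labels))
    false_negatives = sum(s < threshold and y for s, y in zip(scores, labels))
    return false_positives, false_negatives


# Made-up model scores and the true outcomes for six cases.
scores = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]
labels = [False, False, True, False, True, True]

lenient = error_counts(scores, labels, threshold=0.4)   # flags more cases
strict = error_counts(scores, labels, threshold=0.65)   # flags fewer cases
print(lenient, strict)
```

Which error is more expensive is a product decision, not a modeling one, which is why the PM needs to be part of this conversation.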
Finally, be considerate of scientist time and momentum. When working with people who have different skills, it’s important not to expect them to work at your pace or presume to tell them how they should organize their time.
Be transparent about what you need from them and why you need it, without crossing the boundary of telling them how to do their jobs. Bringing what you do best to the table and working in tandem with other disciplines is a recipe for success.