1. Easy access to required data and a comprehensive data strategy
There is a saying in computer science: “garbage in, garbage out,” meaning that nonsense input data produces nonsense output. A machine learning model is therefore only as good as the data it’s trained on. If there are problems with the data, machine learning scientists end up spending their time on data cleanup and management. A strong data strategy makes efficient use of ML scientists’ time and talent.
What makes a strong data strategy?
- Data should be viewed as an organizational asset rather than the property of the individual department that created or collected it.
- Data should be available easily, securely and in compliance with legal and regulatory requirements.
- Data should be put to work through analytics and machine learning to make better decisions, create efficiencies and drive new innovations.
Data-related questions to ask before starting an ML project
- What data is available to me today?
- What data is not quite available, but with some effort could become available?
- What data don’t I have today but might have in the next few months or year? What steps can be taken to begin gathering that data?
- Is there any potential bias in data or data sources?
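As a rough illustration of how the questions above might be turned into a first data audit, the sketch below computes the share of missing values per column and the share of rows per group. The record layout, the column names (`age`, `region`, `income`) and the helper names are hypothetical, and a real audit would go well beyond these two checks.

```python
from collections import Counter

def missing_rate(rows, columns):
    """Fraction of rows where each column is missing (None or empty string)."""
    total = len(rows)
    return {
        col: sum(1 for r in rows if r.get(col) in (None, "")) / total
        for col in columns
    }

def group_shares(rows, column):
    """Share of rows per value of `column`, to spot under-represented groups."""
    counts = Counter(r.get(column) for r in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Hypothetical toy records; real data would come from your own sources.
records = [
    {"age": 34, "region": "north", "income": 52000},
    {"age": None, "region": "north", "income": 61000},
    {"age": 45, "region": "north", "income": None},
    {"age": 29, "region": "south", "income": 48000},
]

rates = missing_rate(records, ["age", "region", "income"])
shares = group_shares(records, "region")
print("Missing rate per column:", rates)
print("Share of rows per region:", shares)
```

A heavily skewed group share (here, three of four rows come from one region) does not prove bias, but it is a cue to ask how the data was collected.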
2. Selecting machine learning use cases and setting success metrics
We should aim to use machine learning where it is actually needed, not merely where it might be interesting. Sometimes simple analytics or rules get you 10-40% of the business impact. Things to keep in mind include data readiness, business impact and machine learning applicability.
- A high impact use case without data or machine learning applicability ❌
- A use case with lots of data and high machine learning applicability but low business impact ❌
Before working on a project, the team also needs to estimate its potential impact (opportunity sizing). Once we have defined a business problem that can be solved with machine learning and completed the opportunity sizing, the next step is to outline clear metrics to measure success.
A data science project needs a clear goal, which is typically a target value for a clearly defined metric. In real-world data science projects there is not just one but multiple metrics that the model will be evaluated against. Some of these evaluation metrics won’t even be related to how your predictions perform against the ground truth. Examples include:
- Overall memory usage
- Latency of the prediction process
- Complexity of the predictive model
Real world problems are indeed dominated by business and tech infrastructure concerns.
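To make the idea of evaluating a model on more than predictive quality concrete, here is a minimal sketch that reports accuracy alongside average prediction latency and peak memory, using only the Python standard library. The `predict` function is a hypothetical stand-in for a real model, and the numbers it produces are only meaningful relative to each other on the same machine.

```python
import time
import tracemalloc

def predict(x):
    # Hypothetical stand-in model: a single threshold rule.
    return 1 if x >= 0.5 else 0

def evaluate(model, inputs, labels):
    """Return accuracy, average latency per prediction, and peak traced memory."""
    # Note: tracing allocations adds overhead, so latency here is an upper bound.
    tracemalloc.start()
    start = time.perf_counter()
    predictions = [model(x) for x in inputs]
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "accuracy": correct / len(labels),
        "latency_ms": 1000 * elapsed / len(inputs),
        "peak_memory_kb": peak_bytes / 1024,
    }

inputs = [0.1, 0.4, 0.6, 0.9, 0.55]
labels = [0, 0, 1, 1, 0]
report = evaluate(predict, inputs, labels)
print(report)
```

In practice each of these values would be compared against a target agreed with the business and infrastructure teams, not just reported.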
3. Technical experts and domain experts should work together
We need to make sure that domain experts, technical experts and stakeholders work side by side. If the relevant stakeholders are part of the entire process, everyone is more likely to accept, adopt and implement the solution. If a data scientist works in a silo, it is very unlikely that their models will get implemented.
4. Exploratory data analysis
Before building the model, we need to interrogate the data to see whether there is any predictive power in the feature set. Read more about EDA here.
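One simple way to get a first read on predictive power is to check how strongly each feature correlates with the target. The sketch below uses Pearson correlation as an assumed choice (it only captures linear relationships), and the feature names and values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature set and target (e.g. an exam score).
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 44, 38, 42],
}
target = [52, 58, 61, 70, 74]

correlations = {name: pearson(values, target) for name, values in features.items()}
for name, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{name}: r = {r:+.2f}")
```

A feature with a correlation near zero may still be useful through non-linear effects or interactions, so this is a screening step rather than a verdict.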
5. A quick MVP
It’s good practice to build a minimum viable product (MVP), built quickly and cheaply, to validate the hypothesis before committing extensive time and resources.
6. Experiment Metrics
We should look at more than one metric when an experiment concludes.
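A small example of why a single metric can mislead: on imbalanced labels, a model that always predicts the majority class looks good on accuracy but fails on precision and recall. The sketch below computes all three from scratch; the labels and predictions are invented for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Define precision/recall as 0.0 when their denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical experiment: 8 negatives, 2 positives, model always predicts 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy alone would suggest a decent model, while recall shows that it never finds a single positive case.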
7. Regular Check-ins
Rather than meeting only at the start and end of a model build, it is better to check in frequently (e.g. once or twice a week) to discuss the latest findings and align on whether any course corrections are necessary.