Great question! The first main difference is data. In traditional software you’re mainly dealing with code, but ML engineers have to deal with data + code + potentially stochastic behavior. This makes the process considerably more challenging, and sometimes the results aren’t even deterministic. What we’ve found helpful is to follow 4 steps:
1. Identify problematic cohorts of data.
2. Diagnose “why” the model isn’t working properly on them (explainability helps a lot here!)
3. Fix it. Implement the necessary changes (adding more data, changing the algo, etc.)
4. Assert that it won’t happen again. Create data unit tests to make sure that cohort performance is stable!
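
To make steps 1 and 4 a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the record layout (`label`, `prediction`, and a `region` field used as the cohort key), the helper names, and the 0.7 accuracy threshold are all hypothetical, and in practice you would likely use your own metric and tooling.

```python
from collections import defaultdict


def accuracy_by_cohort(records, cohort_key):
    """Return {cohort: accuracy} for records grouped by `cohort_key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        cohort = record[cohort_key]
        totals[cohort] += 1
        correct[cohort] += int(record["label"] == record["prediction"])
    return {cohort: correct[cohort] / totals[cohort] for cohort in totals}


def problematic_cohorts(records, cohort_key, min_accuracy):
    """Step 1: list the cohorts whose accuracy falls below `min_accuracy`."""
    scores = accuracy_by_cohort(records, cohort_key)
    return sorted(c for c, acc in scores.items() if acc < min_accuracy)


def assert_cohort_accuracy(records, cohort_key, cohort, min_accuracy):
    """Step 4: a data unit test that fails if a cohort regresses."""
    acc = accuracy_by_cohort(records, cohort_key)[cohort]
    assert acc >= min_accuracy, (
        f"cohort {cohort!r} accuracy {acc:.2f} is below {min_accuracy:.2f}"
    )


# Hypothetical evaluation records: true label, model prediction, cohort field.
records = [
    {"region": "US", "label": 1, "prediction": 1},
    {"region": "US", "label": 0, "prediction": 0},
    {"region": "US", "label": 1, "prediction": 1},
    {"region": "US", "label": 0, "prediction": 1},
    {"region": "EU", "label": 1, "prediction": 0},
    {"region": "EU", "label": 0, "prediction": 1},
    {"region": "EU", "label": 1, "prediction": 1},
    {"region": "EU", "label": 0, "prediction": 1},
]

print(problematic_cohorts(records, "region", min_accuracy=0.7))  # ['EU']

# Passes today; would raise AssertionError if the US cohort dropped below 0.7.
assert_cohort_accuracy(records, "region", "US", min_accuracy=0.7)
```

Once a cohort is fixed (step 3), a check like `assert_cohort_accuracy` can run in CI on a held-out slice of that cohort, so a later data or model change that breaks it gets caught early.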