PROCESS: Machine Learning Engineering Tactics at TWOSENSE.AI
At TWOSENSE.AI, we build enterprise products that implement continuous biometric authentication. Our core technology relies on Machine Learning (ML) to learn user-specific behaviors and automate the user workload of authentication. Although this idea has been around for a while, the challenge is executing on it. An Agile approach was absolutely necessary both to execute at speed and to get our product into the hands of customers early. Our goal was to combine Agile Software Engineering (SWE) with Machine Learning Research (MLR) to create what we call Agile Machine Learning Engineering (MLE). To do this, we had to change our existing software engineering process on a tactical level (Process), on a strategic level (Paradigm), and in how we see ourselves in the process (People).
We group our tactical approach to day-to-day product development under the term ‘Process.’ We work in two-week sprints, with tickets capped at 3 days of work, and each ticket goes through a whiteboarding, an implementation, and a code review phase. The goal of the ML process is not to create a parallel set of tactics for MLE, but to find a unifying process that combines the research aspects of ML with the rigor of software engineering.
Simply throwing our SWE process at ML development highlighted some glaring problems.
We’re bad at estimating effort for SWE tickets in general. It’s not really our fault per se; something called the “planning fallacy” is at work, and it’s just human nature. However, when it comes to ML, we’re off by much more than we are for general SWE tickets. I think this is a function of both the research component and the nascency of the field, but whatever the reason, it’s bad enough to be disruptive. I don’t believe it’s attributable to individual underperformance or error. The destructive impact is compounded by our need to keep ML and product dev in lock-step, which means slowdowns impact the whole team. One way to mitigate this strategically is the general paradigm of focusing on infrastructure first, which allows everyone to work in parallel. Our solution for improving our estimates of ML tickets is to limit ticket or task scope topically with some bumper rails. Here are the scopes we’re working with as guidelines now:
While we wouldn’t claim this is an exhaustive set of all MLE activities, the limitations on scope seem to be effective for us as well as others. Combining these tactics with the strategy of starting with the infrastructure and pipeline helps us break down the process of MLE and apply SWE rigor to each task, with testing, pull requests, code review, and merging to dev, staging, and shipping quickly to production.
If you’re a software engineer, you’re probably feeling pretty good at this point. However, if you’re a Data Scientist or Machine Learning Researcher, you’re probably disgusted! For research, there’s a need to go WAY off the beaten path to try things and fail, find answers to questions, and find questions we weren’t even asking. The concept of being constrained by the restrictions of production code and scope bumpers feels contrary to the nature of exploration. Our solution to this is the research ticket (RT) or task. Here is where we break from standard Agile SWE. The largest differentiator is that the output of an RT is meant for human consumption, rather than being something that’s merged to dev on completion. The same time limits apply: we try to keep RTs under 3 days of work, and a whiteboarding session kicks them off. RTs are defined by a question or a set of related questions that we’re going to try to answer, and the general approach to finding answers is bounced around at whiteboarding.
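As a rough illustration only (not a description of any real tooling), the rules above could be encoded as a small readiness check that runs before whiteboarding. Everything named here (`Ticket`, `TicketKind`, `problems`, and the example ticket itself) is hypothetical; the only things taken from the text are the 3-day cap and the idea that an RT is defined by the questions it tries to answer.

```python
from dataclasses import dataclass, field
from enum import Enum

MAX_TICKET_DAYS = 3  # the cap described above, applied to both ticket kinds


class TicketKind(Enum):
    PRODUCTION = "production"  # output is code merged to dev
    RESEARCH = "research"      # output is meant for human consumption


@dataclass
class Ticket:
    title: str
    kind: TicketKind
    estimate_days: float
    # Only meaningful for research tickets: the questions the RT tries to answer.
    questions: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons this ticket is not ready for whiteboarding (empty if ready)."""
        found = []
        if self.estimate_days > MAX_TICKET_DAYS:
            found.append(f"estimate exceeds {MAX_TICKET_DAYS} days; split the ticket")
        if self.kind is TicketKind.RESEARCH and not self.questions:
            found.append("research ticket needs at least one question to answer")
        return found


# Hypothetical example ticket, for illustration only.
rt = Ticket(
    title="Compare candidate model types",
    kind=TicketKind.RESEARCH,
    estimate_days=2.5,
    questions=["Does the candidate model type outperform the incumbent?"],
)
print(rt.problems())  # an empty list means the ticket is ready for whiteboarding
```

A check like this only guards the mechanical rules; whether the questions are the right ones is still settled by people at the whiteboard.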
The output of an RT can be thought of as a guide forward. It can be a decision recommendation (e.g. these features should be implemented, or this model type outperforms the incumbent), which then becomes a new production ticket or adds context to one that already exists. It can also be a set of new questions that need answering, which become research tickets in the backlog. How these tickets are reviewed is also different. Thorough code review is not a good use of everyone’s time, but light code review focused on checking the validity of the assumptions, conclusions, and implementation makes sense. The output itself is a demonstration, demonstrator, or document, and the review is more of a discussion or demo where the author walks the reviewers through the work. These outputs are often in the form of Notebooks in the case of research tasks (the only time they should be used), or knowledge base documents for literature review tasks. This is similar to the Agile concept of a “spike,” where engineers can go off and investigate scale or feasibility on a branch, discarding the branch once an answer is found. However, with an RT, the results can have far-reaching impacts, and further RTs often iterate on previous ones, so the results need to be maintained in executable form. We have also found this approach of review demos valuable for non-MLE spikes.
One risk is that by reducing the research components to small iterative steps, one tends to think in small steps (Agile is a way of thinking more than anything else). While that clearly has its upsides, one downside is that we tend to get caught up in the little things and miss the big picture. Our solution to this issue is to allot time (1 hour a week) as a free and open space to bounce around big ideas. This time is intentionally left unstructured: everyone can bring a few bullets of discussion points, but in general, it’s just a time to bounce around unbaked ideas and debate things in a non-judgmental way. In addition, part of our general SWE process is giving everyone “Personal Project Time” on Friday afternoons to work on whatever they think will bring value to the company, and individuals are welcome to use it to dive into longer-term things as well.
Finally, because of this added research component, the path forward can change each sprint, or even each ticket. The problem is that if we blindly follow the sprint as laid out at planning, we will probably end up working on tickets that are no longer the most important things. Our solution is to reevaluate the sprint after every research ticket, reprioritizing tickets, adding tickets, or dropping tickets as needed. It is important to do this in the context of the sprint goals, and not to fall into the trap of becoming a production team where the sprint is only a shipment schedule rather than feature-focused.
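The mechanics of that reevaluation can be sketched in a few lines. This is a minimal sketch under stated assumptions, not our actual tooling: `SprintTicket`, `reevaluate_sprint`, and the numeric priority scale (lower means more important) are all hypothetical names and conventions chosen for illustration. The judgment about what the new priorities should be, in light of the sprint goals, stays with the team.

```python
from dataclasses import dataclass


@dataclass
class SprintTicket:
    title: str
    priority: int      # hypothetical scale: lower number = more important
    done: bool = False


def reevaluate_sprint(backlog, new_priorities=None, added=(), dropped=()):
    """Return the open tickets, reordered after a research ticket completes.

    new_priorities: mapping of ticket title -> updated priority
    added:          new SprintTicket objects suggested by the research result
    dropped:        titles of tickets that are no longer worth doing
    """
    new_priorities = new_priorities or {}
    dropped = set(dropped)
    updated = []
    for ticket in list(backlog) + list(added):
        if ticket.done or ticket.title in dropped:
            continue  # finished and dropped tickets leave the sprint
        priority = new_priorities.get(ticket.title, ticket.priority)
        updated.append(SprintTicket(ticket.title, priority))
    # sorted() is stable, so tickets with equal priority keep their relative order
    return sorted(updated, key=lambda t: t.priority)


# Hypothetical sprint, for illustration only.
backlog = [
    SprintTicket("A", priority=1),
    SprintTicket("B", priority=2),
    SprintTicket("C", priority=3),
]
result = reevaluate_sprint(
    backlog,
    new_priorities={"C": 0},          # the research result made C urgent
    added=[SprintTicket("D", priority=2)],
    dropped=["A"],                    # the research result made A unnecessary
)
print([t.title for t in result])
```

Returning a new list rather than mutating the backlog keeps the plan from sprint planning available for comparison at the retrospective.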
We’re always hiring. If this sounds like an environment you’d like to work in, please reach out!