Machine learning and software development - debugging

Machine learning is an impressive approach to creating software.  The universal approximation theorem is often cited to support the claim that deep learning - a branch of machine learning - is already expressive enough to approximate any numerical function.  Setting aside the impracticality of that claim, I would like to contrast how this way of creating software differs from the traditional approach carried out by human software developers.

There are many ways machine-learning-based software creation differs from the traditional approach: the requirements are specified differently; the creation process is different; the testing is done differently; the created software is debugged differently.  In this post I will focus on debugging.

The testing aspect will be discussed elsewhere, but let's say that you have found a bug that manifests in the following form:

There is an input x for which the output f(x) of the implementation f is not the expected output y.

The goal of debugging is to modify this implementation f of some desired function F (for which F(x) is equal to y) so that f(x) is equal to y.
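
In code, such a bug might surface as a single failing test.  Here is a minimal sketch in Python; the doubling function and its test below are made-up placeholders, not anything from a real project:

def f(x):
    # A (buggy) implementation of the desired function F, which here is "double the input".
    return x * 3

def test_known_bug():
    x, y = 2, 4        # the offending input and the expected output F(x)
    assert f(x) == y   # fails: f(2) is 6, not 4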

In the traditional approach one might work backwards and try to figure out why the incorrect output f(x) was returned, and fix the problem.  More systematically, one might factor the code as, say:

f(x) = w(v(u(x))),

where the factor v is determined to be the cause of the incorrect output (since it does not produce the correct output when given u(x) as input), whereas u and w are working correctly.  The human developer would then manually pull these apart, add tests for each factor, and fix the buggy piece v.  Next time something breaks, there is a smaller surface area for the developer to work through.  An important thing to note is that the implementation f being debugged may not admit such a simple factorization, and it might already come with internal structures that are not individually tested.  For example, we might have something like

f(x) = w(v_1(u_1(x)), v_2(u_2(x))).
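
Going back to the simpler factorization f(x) = w(v(u(x))): as a toy illustration (the three factors below are made-up stand-ins), pulling the implementation apart lets each piece carry its own test, so a failure points at a single factor:

def u(x):
    # First factor: shift the input.
    return x - 1

def v(x):
    # Second factor: square - suppose this is the buggy piece.
    return x * x

def w(x):
    # Third factor: rescale.
    return 2 * x

def f(x):
    # f(x) = w(v(u(x)))
    return w(v(u(x)))

def test_u():
    assert u(3) == 2

def test_v():
    assert v(2) == 4   # if v is the culprit, this is the test that fails

def test_w():
    assert w(4) == 8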

(It seems that some of this process of refactoring and testing could be automated, and probably has been.  I would be very interested in learning about existing libraries for this - they would do for code what Python's Hypothesis library does for test examples.)
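
For comparison, here is what Hypothesis does for test examples: you state a property and it generates the inputs.  A minimal sketch, reusing the hypothetical factor v from above:

from hypothesis import given, strategies as st

def v(x):
    # The hypothetical factor from the sketch above.
    return x * x

@given(st.integers())
def test_v_never_negative(x):
    # A property that should hold for every generated input: squares are never negative.
    assert v(x) >= 0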

As things are today, the situation is quite different in practice if the implementation f is created with machine learning.  In this case we refer to f as a model.

Let's say f is an ensemble of other models; then it might look like

f(x) = avg(u_1(x), u_2(x), ..., u_100(x)).

To fix ideas, let's pretend that the desired function F is "is cat", which returns 0 if x is not a cat and 1 if x is a cat, and that each u returns a number between 0 and 1, so f does too.  Now if f(Felix) is equal to 0.2 and far from F(Felix), which is 1, what would the developer do?  What can the developer do?  The inability to pin down exactly which submodel among the u's is most responsible for the error, and therefore should be fixed first, has driven a good portion of the explainable artificial intelligence movement.
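
A small sketch makes the difficulty concrete.  The scores below are invented for illustration: the average comes out near 0.2, and nothing in that single number says which of the hundred u's deserves the blame - one of them even answers 0.9 and is simply outvoted:

# One hundred hypothetical "is cat" submodels; real ones would be separately
# trained models, but here they are stubs returning fixed scores for "Felix".
felix_scores = [0.9, 0.1, 0.05, 0.3] + [0.2] * 96

def u(i, x):
    # Score of the i-th submodel on input x (stubbed: only "Felix" is known).
    return felix_scores[i] if x == "Felix" else 0.0

def f(x):
    # Ensemble by averaging: f(x) = avg(u_1(x), ..., u_100(x)).
    return sum(u(i, x) for i in range(100)) / 100

print(f("Felix"))   # about 0.2, far from the desired F("Felix") = 1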

At the other extreme, if f is a neural network - say a straightforward feedforward network - then it already comes with a factorization

f(x) = u_1(u_2(u_3(...(x)...))).

But the situation is not much better.  Although in this case every submodel u is quite simple - maybe a linear function followed by the truncation function ReLU - there simply isn't a way to know which one is responsible for the incorrect final output (or whether that question even makes sense!), nor a way to know the "correct" output for each submodel (or whether that question makes sense either!).  Part of the problem is that while there are many ways for f to make mistakes, there are also many ways for f to do the right thing as a whole even when the submodels individually do not obviously work toward the goal F.  For example, u_3 could do something numerically wacky that u_2 then undoes.  Any attempt to fix only one of the submodels without touching all/most/many of the others has little to no guarantee of working.  Transfer learning is one situation where something like this actually works.
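
A minimal sketch of this factorization with NumPy, using random (so untrained, purely illustrative) weights: every u is just a linear map followed by ReLU, intermediate outputs exist at every layer, yet nothing specifies what any of them "should" be:

import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical layers; each W_i defines a submodel u_i(h) = ReLU(W_i @ h).
W3 = rng.standard_normal((8, 4))   # innermost layer u_3: 4 inputs -> 8 hidden
W2 = rng.standard_normal((8, 8))   # middle layer u_2
W1 = rng.standard_normal((1, 8))   # outermost layer u_1: 8 hidden -> 1 output

def relu(h):
    # The truncation function: negative entries become zero.
    return np.maximum(h, 0.0)

def f(x):
    # f(x) = u_1(u_2(u_3(x))): the factorization is there, but there is no
    # independent notion of a "correct" output for u_3(x) or u_2(u_3(x)).
    return relu(W1 @ relu(W2 @ relu(W3 @ x)))

print(f(np.ones(4)))   # some score; if it is far from F(x), which layer is "wrong"?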

So, what is the developer to do in this situation?  The default answer is retraining - in other words, repeat the whole process of creating the model f, preferably with better data and/or better design and/or better parameters - and better luck.  This is an expensive process, and it stands in sharp contrast to the traditional approach to software debugging.
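
In code, this "fix" tends to look something like the sketch below, where build_model is a hypothetical function wrapping the entire training process: the failing example is folded into the data and f is rebuilt from scratch, rather than any part of the old f being edited:

def fix_by_retraining(training_data, x, y, build_model):
    # "Debugging" by retraining: add the failing example and redo everything.
    new_data = training_data + [(x, y)]
    new_f = build_model(new_data)    # the whole expensive creation process, again
    if new_f(x) != y:
        # Even after retraining there is no guarantee the bug is gone.
        print("Still wrong - better data, better design, or better luck needed.")
    return new_f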

One reasonable and notable counterargument is that machine learning models are not evaluated on individual outputs but statistically: if the model gets the answer right 99% of the time, the user should accept the rare incorrect outputs.  This is an important and very interesting point that I will discuss another day.
