Machine learning is an impressive approach to creating software. The universal approximation theorem is often cited to support the claim that deep learning - a branch of machine learning - is already expressive enough to approximate any numerical function. Setting aside the impracticality of this claim, I would like to contrast this way of creating software with the traditional approach, in which human developers write the code.
There are many ways machine-learning-based software creation differs from the traditional approach: the requirements are specified differently; the creation process is different; the testing is done differently; the created software is debugged differently. In this post I will focus on debugging.
The testing aspect will be discussed elsewhere; for now, let's say that you have found a bug that takes the following form:
There is an input x for which the output f(x) is not the expected output y. The goal of debugging is then to modify the implementation f of some desired function F (for which F(x) = y) so that f(x) = y.
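To make this concrete, here is a minimal sketch of such a bug report, with hypothetical stand-ins for f, F, x, and y (the real f could of course be arbitrarily complicated):

```python
def F(x):           # the desired function (the specification)
    return abs(x)

def f(x):           # the implementation being debugged
    return x if x > 0 else 0    # buggy: maps negative inputs to 0

x = -3
y = F(x)            # expected output: 3
assert f(x) == y    # fails: f(-3) == 0, not 3
```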
In the traditional approach one might work backwards and try to figure out why the incorrect output f(x) was returned, and fix the problem. More systematically, one might factor the code as, say:
f(x) = w(v(u(x))),
where the implementation v is determined to be the cause of the incorrect output (since it does not produce the correct output when given u(x) as input), whereas u and w are working correctly. The human developer would then manually pull these pieces apart, add tests for each factor, and fix the buggy piece v. The next time something breaks, there is a smaller surface area for the developer to work through. An important caveat is that the implementation f being debugged may not admit such a simple factorization, and it may already come with internal structure that is not individually tested. For example, we might have something like
f(x) = w(v_1(u_1(x)), v_2(u_2(x))).
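To illustrate the factored approach, here is a minimal sketch with hypothetical stand-ins for u, v, and w; the point is that testing each factor in isolation narrows down which piece is responsible for a bad final output:

```python
def u(x):
    return x.strip().lower()          # normalize the raw input

def v(s):
    return s.split(";")               # suspect: splits on the wrong delimiter

def w(parts):
    return len(parts)                 # aggregate into the final answer

def f(x):
    return w(v(u(x)))

# Per-factor tests: if only test_v fails, v is the piece to fix,
# and u and w can be left alone.
def test_u():
    assert u("  A,B ") == "a,b"

def test_v():
    assert v("a,b") == ["a", "b"]     # fails while v splits on ";"

def test_w():
    assert w(["a", "b"]) == 2

def test_f():
    assert f("  A,B ") == 2           # the original end-to-end bug report
```

Here only test_v and the end-to-end test fail, which points the developer directly at v.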
(It seems some of this process of refactoring and testing could be automated, and probably has been. I would be very interested in learning about existing libraries for this - they would do for code what Python's Hypothesis library does for test examples.)
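For the testing side of that analogy, a sketch of what Hypothesis does, assuming a trusted reference implementation F is available to compare against, looks something like this:

```python
from hypothesis import given, strategies as st

def F(xs):                      # the specification, e.g. a slow but trusted version
    return sorted(xs)

def f(xs):                      # a deliberately buggy implementation
    return list(reversed(sorted(xs)))

@given(st.lists(st.integers()))
def test_f_matches_F(xs):
    # Hypothesis searches for inputs where f and F disagree and
    # shrinks them to a minimal counterexample, e.g. [0, 1].
    assert f(xs) == F(xs)
```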
As things are today, the situation is quite different in practice if the implementation f is created with machine learning. In this case we refer to f as a model.
If f is an ensemble of other models, it might look like
f(x) = avg(u_1(x), u_2(x), ..., u_100(x)).
To fix ideas, let's pretend that the desired function F is "is cat", which returns 0 if x is not a cat and 1 if x is a cat, and that each u returns a number between 0 and 1, so f does too. Now if f(Felix) = 0.2, far from F(Felix) = 1, what would the developer do? What can the developer do? The inability to pin down exactly which submodel among the u's is most responsible for the error, and therefore should be the first to be fixed, has driven a good portion of the explainable artificial intelligence movement.
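Here is a sketch of that situation, with 100 hypothetical submodels standing in for trained models; each returns a score in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_submodel():
    # Hypothetical submodel: ignores its input and returns a fixed
    # "is cat" score in [0, 0.4] -- a stand-in for a trained model
    # that is somewhat pessimistic about cats.
    score = rng.uniform(0.0, 0.4)
    return lambda x: score

submodels = [make_submodel() for _ in range(100)]

def f(x):
    return float(np.mean([u(x) for u in submodels]))

print(f("Felix"))   # roughly 0.2, far from the desired F(Felix) = 1
# No single u_i is "the bug": the error is spread across all 100 of them.
```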
At the other extreme, if f is a neural network, say a straightforward feedforward network, then it already comes with a factorization
f(x) = u_1(u_2(u_3(...(x)...))).
But the situation is not much better. Although in this case every submodel u is fairly simple, perhaps a linear function followed by the truncation function ReLU, there simply isn't a way to know which one is responsible for the incorrect final output (or whether this question even makes sense!), nor a way to know the "correct" output of each submodel (or whether that question even makes sense!). Part of the problem is that while there are many ways for f to make mistakes, there are also many ways for f to do the right thing as a whole even if no individual submodel obviously works toward the goal F. For example, u_3 could do something numerically wacky and u_2 could undo it. Any attempt to fix only one of the submodels without touching all/most/many of the others has little to no guarantee of working. Transfer learning is one situation where something like this actually works.
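To illustrate the last point, here is a tiny sketch (all weights hypothetical) in which the innermost layer u_3 blows its input up by a factor of 1000, the next layer u_2 shrinks it back down, and the composite f is nevertheless perfectly reasonable, so neither intermediate output is "correct" or "incorrect" on its own:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

W3 = 1000.0 * np.eye(2)             # u_3: numerically "wacky" scaling up
W2 = 0.001 * np.eye(2)              # u_2: undoes u_3's scaling
W1 = np.array([[1.0, -1.0]])        # u_1: the layer doing the real work

def u3(x): return relu(W3 @ x)
def u2(h): return relu(W2 @ h)
def u1(h): return W1 @ h

def f(x):
    return u1(u2(u3(x)))

x = np.array([2.0, 1.0])
print(u3(x))    # [2000. 1000.]  -- extreme intermediate activations
print(f(x))     # [1.]           -- yet the final output is sensible
```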
So, what is the developer to do in this situation? The default answer is retraining - in other words, repeating the whole process of creating the model f, preferably with better data and/or better design and/or better parameters - and better luck. This is an expensive process, and it stands in sharp contrast to the traditional approach of debugging software.
One reasonable and notable counterargument is that machine learning models are not evaluated on individual outputs but statistically: if the model gets the answer right 99% of the time, the user should accept the rare incorrect outputs. This is an important and very interesting point that I will discuss another day.