Problems Of Modern Machine Learning And How Not To Face Them In Development


Ibad | Updated: 27-01-2023 14:25 IST | Created: 27-01-2023 14:25 IST

Many popular machine learning and deep learning courses will teach you how to classify dogs and cats, predict real estate prices, and show you dozens of other problems in which machine learning seems to work just fine. But you will be told a lot less (or nothing at all) about cases where ML models don't work as expected.

A common problem in machine learning is that models fail to work correctly on a wider variety of examples than those seen during training. This is not just about other examples from the same pool (for instance, a held-out test set), but about other kinds of examples. For instance, a network is trained on images of cows that are most often standing on grass, but at test time it is expected to recognize a cow against any background. Why ML models often fail at this task, and what to do about it, is what we consider below. Work on this problem is important not only for solving practical tasks but also for the further development of AI in general.

Of course, the problems of machine learning are not limited to this: there are also difficulties with model interpretability, problems of bias and ethics, the resource intensity of training, and others. In this review, we consider only generalization problems.

Data leakage

Data leakage is a situation where a feature carried more information about the target variable during training than it does when the model is later applied in practice. Data leakage can take a variety of forms.
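To make the definition concrete, here is a minimal synthetic sketch (all names and numbers are made up) in which a "leaky" feature mirrors the target in the data used for training and validation but carries no signal once the model is deployed:

```python
# Minimal synthetic illustration of data leakage (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                  # target variable
honest = y + rng.normal(0, 2.0, n)         # weak but genuinely predictive feature
leaky = y + rng.normal(0, 0.1, n)          # near-copy of the target: the leak
X = np.column_stack([honest, leaky])

model = LogisticRegression().fit(X[:1000], y[:1000])
print("validation accuracy:", accuracy_score(y[1000:], model.predict(X[1000:])))

# In production the "leaky" column no longer depends on the target,
# and the apparent accuracy collapses.
X_prod = np.column_stack([honest, rng.normal(0, 0.1, n)])
print("deployment accuracy:", accuracy_score(y, model.predict(X_prod)))
```

The validation score looks excellent, yet it says almost nothing about how the model will behave once the leak is gone.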

Example 1. When diagnosing diseases from X-rays, the model is trained and tested on data collected from different hospitals. X-ray machines in different hospitals produce images with slightly different tonality, and the differences may not even be noticeable to the human eye. The model can learn to determine the hospital from the tone of the image, and hence the likely diagnosis (different hospitals see patients with different diagnoses), without analyzing the medical content of the image at all. Such a model will show excellent accuracy in testing, since the test images are drawn from the same data sample.
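A simple probe along these lines (the setup is hypothetical: `images` is an array of grayscale X-rays and `hospital_id` records where each was taken) is to check whether a trivial statistic such as mean brightness already identifies the hospital:

```python
# Probe for the hospital-tonality leak described above (hypothetical inputs).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def hospital_probe(images: np.ndarray, hospital_id: np.ndarray) -> float:
    # One feature per image: its average pixel intensity ("tonality").
    brightness = images.reshape(len(images), -1).mean(axis=1, keepdims=True)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, brightness, hospital_id, cv=5).mean()
```

Accuracy well above chance means hospital identity leaks through the pixels themselves, and the data should be split by hospital rather than by individual image.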

Example 2. In the task of telling real vacancies from fake ones, almost all fake vacancies came from Europe. A model trained on these data may not consider vacancies from other regions suspicious at all. In practice, such a model becomes useless as soon as fake vacancies are posted in other regions. Of course, you can remove the region from the features, but then the model will try to infer the region from other features (say, from individual words in the description and requirements), and the problem will persist.

Example 3. In the same task there is a "company name" feature, and a company can have several vacancies. If the company is fake, then all of its vacancies are fake. If we split the data into training and test parts (or folds) in such a way that vacancies of the same company end up in both the training and the test part, then to predict the test set the model only needs to remember the names of fake companies, and as a result we get a greatly overestimated test accuracy.
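One common remedy is to split by company rather than by individual vacancy, so that all postings of a company land on the same side. A minimal sketch with scikit-learn, assuming a pandas DataFrame `df` with a hypothetical "company_name" column:

```python
# Group-aware split: no company appears in both the training and the test part.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_company(df: pd.DataFrame, test_size: float = 0.2):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    train_idx, test_idx = next(splitter.split(df, groups=df["company_name"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```

With this split the model can no longer score well on the test set just by memorizing the names of fake companies.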

None of the described problems can be solved simply by increasing the amount of training data without changing the approach to training. One could argue that data leakage is not a problem of the ML algorithms themselves, since an algorithm cannot know which features should be taken into account and which should not. But let's look at other, similar examples.

Shortcut learning: "right for the wrong reasons"

At the beginning of the 20th century there lived a horse named Clever Hans, who allegedly knew how to solve complex arithmetic problems by tapping the right number of times with his hoof. It eventually turned out that the horse determined when to stop tapping from the facial expression of the person asking the question. Since then, "Clever Hans" has become a byword (1, 2) for getting the answer in a roundabout way, without solving the problem itself in the sense we actually want. Such a workaround "solution" suddenly stops working when conditions change (for example, when the person asking the horse a question does not know the answer himself).

Shortcut learning is the phenomenon when models get the right answer using generally incorrect reasoning ("right for the wrong reasons"), which only works well for the training data distribution. Since the training and test samples are usually taken from the same distribution, such models can also provide good accuracy in testing.

More examples can be found in the excellent article Shortcut Learning in Deep Neural Networks (Geirhos et al., 2020), published in Nature Machine Intelligence and based on a synthesis of more than 140 sources.

Changing the object's rotation angle, placing it in an environment that is unusual from the point of view of the training sample, or adding unusual noise can completely break the model.
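A quick way to measure this kind of fragility is to compare a classifier's accuracy on clean images with its accuracy after a rotation and some added noise. The sketch below assumes a PyTorch `model` and a `loader` of (image, label) batches:

```python
# Robustness probe: clean accuracy vs. accuracy under rotation + noise.
import torch
import torchvision.transforms.functional as TF

def accuracy_under_perturbation(model, loader, angle=30.0, noise_std=0.1, device="cpu"):
    model.eval().to(device)
    correct_clean = correct_pert = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            perturbed = TF.rotate(images, angle) + noise_std * torch.randn_like(images)
            correct_clean += (model(images).argmax(1) == labels).sum().item()
            correct_pert += (model(perturbed).argmax(1) == labels).sum().item()
            total += labels.numel()
    return correct_clean / total, correct_pert / total
```

A large gap between the two numbers suggests the model leans on features that are not stable under such harmless changes.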

In general, researchers have concluded that many modern convolutional neural networks prefer to focus on local texture regions (Jo and Bengio, 2017; Geirhos et al., 2018), which may belong to either the object or the background, rather than classify an object by its overall shape (Baker et al., 2018).

Neural networks are now starting to solve very complex tasks, such as working with three-dimensional scenes, animating objects, and so on. It would seem that the problem of classifying dogs and cats was solved long ago? It turns out it wasn't. One of the most advanced object detection networks, YOLOv5, can easily be fooled by unusual details that have nothing to do with the object itself, or it can mistake several objects for one.
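You can try this yourself: YOLOv5 is published on torch.hub, so a pretrained model can be probed with unusual photos in a few lines (the image path below is a placeholder, and downloading the model requires internet access):

```python
# Probe a pretrained YOLOv5 model with an out-of-the-ordinary photo.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("cat_with_unusual_props.jpg")  # hypothetical test image
results.print()  # prints detected classes and per-class counts with confidences
```

Feeding in photos with odd props, clutter, or overlapping animals quickly reveals missed objects, merged boxes, or confidently wrong labels.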

Convolutional classification networks, even the most advanced ones, often classify black cats as the Schipperke dog breed (which also has a black coat), even when the muzzle and body shape show that it is a cat. This refers, of course, to general-purpose networks that have not been specially trained to recognize dog and cat breeds accurately, although even then there is no guarantee of accurate recognition.

Generalization levels of machine learning models

Almost any dataset has limited variety and does not cover all the situations in which we would like the model to work correctly. This is especially evident for complex data such as images, texts, and audio recordings. The data may contain "spurious correlations" that make it possible to predict the answer with good accuracy on the given sample alone, without a deeper understanding of the image or text.

These are precisely the predictions discussed above that rely on the background, on individual texture patches, or on individual adjacent words. Datasets in which spurious correlations are clearly present are called biased. It seems that virtually all datasets in CV and NLP tasks are biased to one degree or another. Depending on which features a model relies on, several levels of generalization can be distinguished:

Uninformative features. The network uses features that do not allow it to predict the answer effectively even on the training set; an example is a neural network with a randomly initialized output layer.

Overfitting features. The network uses features that allow it to predict the answer effectively on the training sample, but not on the entire distribution from which this sample was drawn. By "distribution" here we mean the joint probability distribution P(x, y) from which the data are taken (for more details see also this overview, section "data distribution").

Shortcut features. The network uses features that allow it to predict the answer effectively on the data distribution P(x, y) from which the training (and, as a rule, test) sample is drawn. Because a sample is a set of independent examples, the term independent and identically distributed (i.i.d.) is used, and the ability of an algorithm to work with good accuracy on some fixed data distribution is called i.i.d. generalization by the authors. As we saw above, this does not require solving the task itself in a general way; "workarounds" or data leaks will often do.

Intended features. The network uses features that allow it to predict the answer effectively in the general case. Such features keep working outside the training data distribution, once the "workarounds" are closed off (for example, the object is shown against an unusual background, in an unusual pose, with an unusual texture, and so on).
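In practice, a useful habit for telling shortcut features from intended ones is to report two numbers: accuracy on an i.i.d. test split and accuracy on a deliberately shifted (out-of-distribution) set, such as cows photographed on beaches instead of grass. A sketch, assuming `model`, `iid_test`, and `ood_test` already exist:

```python
# Compare in-distribution and out-of-distribution accuracy of a fitted model.
from sklearn.metrics import accuracy_score

def iid_vs_ood_report(model, iid_test, ood_test):
    X_iid, y_iid = iid_test
    X_ood, y_ood = ood_test
    return {
        "iid_accuracy": accuracy_score(y_iid, model.predict(X_iid)),
        "ood_accuracy": accuracy_score(y_ood, model.predict(X_ood)),
    }
```

A model relying on intended features keeps most of its accuracy on the shifted set; a shortcut learner usually does not.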

Ways to Solve Generalization Problems

Having finished the "who is to blame" part, let's move on to "what to do". The question is how to drive "Clever Hans" out of ML models and teach them to solve the task properly. Let us first consider the most straightforward methods and finish with research hypotheses and a review of work devoted to new architectures and training methods.

The first step, oddly enough, is delegation. Exchanging experience drives every business forward, so it is worth seeking advice. Contacting a machine learning consulting company can help you develop a product that replicates the success of well-known applications and other products, because a strong team is half the battle.

(Disclaimer: Devdiscourse's journalists were not involved in the production of this article. The facts and opinions appearing in the article do not reflect the views of Devdiscourse and Devdiscourse does not claim any responsibility for the same.)
