Top tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at four machine learning-related risks to watch out for.
Machine learning (ML) is truly mind-blowing tech. The very fact that we’ve been able to develop AI models that are capable of learning and improving over time is remarkable.
Thanks to its pattern recognition and decision-making capabilities, ML is taking on a central role in the global technological landscape, with companies across industry verticals already deriving, or expecting to derive, real benefits from implementing this tech.
But it isn’t all sunshine and rainbows. As with any form of tech, ML also comes with certain risks. Here are four of the most critical.
1. Poor or biased data
It’s becoming a cliché to say this, but an ML model is only as good as the data used to train it. The input data being fed into the model during the training phase determines how accurate its outputs will be in deployment.
So it goes without saying that the training data should be accurate, diverse, and free of noise (i.e., meaningless or corrupt data that the model cannot correctly interpret). Noisy, inaccurate, or misleading “dirty” data, especially during the training phase, can produce a model that is flawed at a fundamental level, to the point where it cannot fulfill its intended purpose.
Verifying the integrity of your training data before every training run goes a long way toward a model whose outputs are accurate and unbiased.
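A few simple checks can catch many data problems before training even begins. The sketch below uses pandas on a tiny, hypothetical dataset (the column names and values are made up purely for illustration) to flag missing values, duplicate rows, and a skewed label distribution.

```python
import pandas as pd

def basic_integrity_checks(df: pd.DataFrame, label_col: str) -> None:
    """Print a few quick sanity checks on a training dataset."""
    # Unexpectedly high missing-value counts often signal collection problems.
    print("Missing values per column:\n", df.isna().sum())

    # Exact duplicate rows silently over-weight certain examples.
    print("Duplicate rows:", df.duplicated().sum())

    # A heavily skewed label distribution is a common source of biased models.
    print("Label distribution:\n", df[label_col].value_counts(normalize=True))

# Tiny illustrative dataset (hypothetical columns and values).
df = pd.DataFrame({
    "age": [34, 45, None, 29, 45],
    "income": [52000, 61000, 48000, 39000, 61000],
    "approved": [1, 1, 0, 0, 1],
})
basic_integrity_checks(df, label_col="approved")
```

None of this replaces a proper data audit, but it makes the most common issues visible early, before they are baked into the model.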
2. Overfitting
Overfitting refers to an undesirable situation where the ML model performs extremely well on its training data but fails to provide accurate outputs when dealing with real-world data. It occurs when the model latches onto patterns that are specific to the training set, often noise or coincidental quirks, rather than patterns that generalize, which undermines its predictive capabilities on new data.
Let’s say an ML model is being trained to detect images of tables, but the training data includes a large number of images that also contain chairs. Because chairs appear so often alongside tables, the model may learn to treat the presence of a chair as part of what defines a table, and then fail to recognize a table in an image where no chair is present.
To avoid overfitting, make sure the data you’re using is varied and free of noise that could be mistaken for a meaningful pattern, and always check the model’s performance on data it never saw during training; a large gap between training and held-out accuracy is the classic warning sign.
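Here is a minimal sketch of how that gap shows up in practice, using scikit-learn on a synthetic dataset (the dataset and model choice are illustrative assumptions, not recommendations). An unconstrained decision tree is free to memorize the training set, noise included, so it scores far better there than on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) to give the model something to memorize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can fit the training set almost perfectly, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))   # typically close to 1.0
print("Held-out accuracy:", model.score(X_test, y_test))     # noticeably lower
```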
3. Adversarial machine learning
Adversarial machine learning refers to a class of attacks that aim to disrupt an ML model by manipulating its input data or even gaining unauthorized access to the model itself. The end goal is to degrade the model’s capabilities so that it produces faulty and inaccurate predictions. The three main types of adversarial machine learning attack are:
- Data poisoning: This is carried out during the training phase. The attacker adds faulty or misleading data to the training dataset so that the model learns the wrong lessons from the start.
- Evasion: Evasion attacks are carried out during the inference phase, once the ML model has been deployed and is working on real-world data. The attacker feeds the model inputs containing carefully crafted perturbations, often imperceptible to the human eye but picked up by the model, that cause it to misclassify them (see the sketch after this list).
- Inversion: Inversion attacks involve feeding the outputs of an ML model into a separate model in order to reconstruct the original input data. This is especially worrying considering that a lot of input data tends to be highly sensitive.
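To make the evasion idea concrete, here is a minimal sketch against a plain logistic regression classifier (an assumption purely for illustration; real attacks target far more complex models with gradient-based methods). For a linear model, nudging each feature slightly against the sign of its weight pushes the decision score toward the other class, so a sample near the decision boundary can be flipped with a perturbation too small to stand out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a simple classifier on synthetic data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Pick a sample near the decision boundary.
scores = clf.decision_function(X)
i = int(np.argmin(np.abs(scores)))
x = X[i]

# Perturb each feature by a small epsilon in the direction that moves the
# decision score toward the opposite class (a linear-model analogue of FGSM).
eps = 0.2
w = clf.coef_[0]
x_adv = x - np.sign(scores[i]) * eps * np.sign(w)

print("Original prediction: ", clf.predict(x.reshape(1, -1))[0])
print("Perturbed prediction:", clf.predict(x_adv.reshape(1, -1))[0])
print("Largest feature change:", np.max(np.abs(x_adv - x)))
```

The same principle, scaled up using the gradients of a neural network’s loss, is what makes imperceptible image perturbations possible.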
4. Data privacy
Some ML models are trained on incredibly sensitive personal data (e.g., financial or medical information), and organizations using such data are required to comply with data protection regulations like the GDPR and HIPAA.
Moreover, as we saw in the previous point, inversion attacks make it possible to approximately reconstruct the data an ML model was trained on. A common way of combating model inversion is to add noise to the data; unfortunately, as we know, noise can also make the model less accurate. There are, however, some positive developments in this regard: a team of researchers at MIT has developed a data protection framework known as Probably Approximately Correct (PAC) Privacy, which enables developers to determine the smallest amount of noise needed to protect the data while maintaining performance levels. The framework is still in its early stages, though, and how effective it proves to be remains to be seen.
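To get a feel for the underlying noise-versus-accuracy trade-off, here is a minimal sketch of the generic “add calibrated noise” idea, in the style of the classic Laplace mechanism from differential privacy. It is not an implementation of PAC Privacy, and the epsilon values, value range, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive records (e.g., incomes), generated only for illustration.
incomes = rng.normal(55000, 12000, size=1000)

def noisy_mean(values: np.ndarray, epsilon: float, value_range: float) -> float:
    """Return the mean with Laplace noise scaled to (sensitivity / epsilon)."""
    sensitivity = value_range / len(values)   # how much one record can shift the mean
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return float(values.mean() + noise)

print("True mean:            ", round(incomes.mean(), 2))
print("Noisy mean (eps=1.0): ", round(noisy_mean(incomes, 1.0, value_range=100000), 2))
print("Noisy mean (eps=0.05):", round(noisy_mean(incomes, 0.05, value_range=100000), 2))
# Smaller epsilon means more noise: stronger privacy, lower accuracy.
```

The smaller the privacy parameter, the more noise is added and the less accurate the released statistic becomes, which is the same tension the PAC Privacy work aims to ease.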
The future of ML comes with risks aplenty
ML is still at a relatively nascent stage, with organizations experimenting and exploring its possibilities. The risks mentioned above only scratch the surface; as the tech continues to grow, expect many more threats to emerge. Alongside building out core ML capabilities, now is the right time for organizations to invest in hardening their ML models against threats, both existing and future.