The amount of data generated and fed into AI systems has grown rapidly over the last few years. Attackers are taking advantage of this growth to contaminate training datasets, causing models to produce incorrect or malicious results. In fact, at a recent Shanghai conference, Nicholas Carlini, research scientist at Google Brain, stated that data poisoning can be accomplished efficiently by modifying only 0.1% of the dataset.
Findings like this make it imperative to safeguard data against manipulation and modification by threat actors. This blog explores strategies you can employ to prevent data poisoning.
What is data poisoning?
Data poisoning is an adversarial attack in which an attacker injects malicious, or poisoned, data into training datasets in order to manipulate the behavior of the trained machine learning (ML) model. In this way, the attacker can control the model, and any AI system built on a model trained with that data will deliver false results.
How are data poisoning attacks carried out?
If an AI tool is trained on an incorrect dataset, it learns the wrong things. The system treats the dataset as valid input and incorporates the data into its rules. This creates a path for attackers to pollute the data and compromise the whole system.
Let’s take a closer look at the stages in a data poisoning attack:
-
Ideally, an ML model that is trained by an authorized engineer would use authorized and trustworthy datasets. The goal of the attacker in this phase is to make sure that the model continues to work without errors, even if poisoned data is added. This is done to make it easier for attackers to introduce more lethal datasets later.
-
By analyzing how the model makes decisions and predictions, the attackers identify the weaknesses of the model. This will help them know the probable data points that, when manipulated, will lead the model to produce incorrect outputs.
-
After the attackers find the weak points, they create adversarial data samples that look similar to the original datasets. These data samples can lead the model to generate wrong predictions when they are included in the training datasets.
-
The attackers inject the poisoned data into the training dataset directly, or they compromise the data collection process to introduce it indirectly. Direct injection can be achieved by compromising databases and data servers.
-
After the poisoned data is injected, the model is retrained on the updated datasets, which include the malicious data samples. During training, the model adapts to the poisoned data, which in turn compromises its performance.
-
After the model has been successfully poisoned, it is deployed in real-world scenarios, where it interacts with new data. Attackers can then easily exploit the model’s biased behavior to achieve their malicious goals.
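To make the injection and retraining stages concrete, here is a minimal sketch in Python. It assumes a toy nearest-centroid classifier over a single numeric feature; the labels, feature values, and function names are all hypothetical and chosen only for illustration. It is not a reconstruction of any real attack. It only shows how a few plausible-looking but mislabeled samples can shift a decision boundary.

```python
from statistics import mean

def train_centroids(samples):
    """Return {label: mean feature value} for (value, label) pairs."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Hypothetical clean training set: one feature, two classes.
clean = [(1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
         (7.0, "malicious"), (8.0, "malicious"), (9.0, "malicious")]

# Poisoned samples: values close to the real data, but labeled "benign".
poison = [(6.0, "benign")] * 3

clean_model = train_centroids(clean)              # trained on trusted data
poisoned_model = train_centroids(clean + poison)  # retrained after injection

print(predict(clean_model, 5.5))     # malicious
print(predict(poisoned_model, 5.5))  # benign
```

In this toy setup, three injected samples pull the "benign" centroid from 2.0 to 4.0, so an input at 5.5 that the clean model flags now slips through.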
Mitigation strategies to avoid data poisoning
To mitigate data poisoning attacks, we must ensure that sensitive information is not leaked. Leaked data can serve as the entry point for attackers to poison the dataset, so it is important to protect this information at all vulnerable points. To keep sensitive data secure, the Department of Defense’s Cybersecurity Maturity Model Certification (CMMC) outlines four basic cyber principles: network protection, endpoint protection, facility protection, and people protection.
The following table lists the functions that need to be monitored to make sure that the sensitive information is protected:
Type of protection | Functions that need to be monitored
Network protection |
Facility protection |
Endpoint protection | Endpoints are physical devices, which include your desktop computers, virtual machines, mobile devices, and servers. Monitor all activity in these devices for any unusual activity. This includes (but is not limited to):
People protection |
Remember that data contamination is a major issue in ML and cybersecurity. Organizations that employ ML systems must be on the lookout for potential data poisoning attacks and put strong security measures in place to protect their data and ML models from such dangers. Model monitoring, routine data validation, and anomaly detection are some of the best practices for spotting and thwarting data poisoning attacks.
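As one illustration of routine data validation, the sketch below flags suspicious values in a single numeric training feature using the modified z-score, which is based on the median absolute deviation (MAD). The function name, the sample values, and the 3.5 cutoff (a commonly used default) are assumptions made for this example. A real pipeline would likely combine several checks.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; flag anything that differs.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical feature column with one injected extreme value at index 5.
feature = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]
print(mad_outliers(feature))  # [5]
```

The median-based score is used here rather than the mean and standard deviation because a handful of poisoned points can inflate the standard deviation enough to hide themselves.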
One way to prevent malicious input is by spotting anomalies. The security and integrity of computer systems, networks, and software applications depend on this. ManageEngine Log360 is a unified SIEM solution with anomaly detection capabilities. With Log360, security analysts can:
-
Spot deviant user and entity behavior, such as logons at an unusual hour, excessive logon failures, and file deletions from a host that is not generally used by a particular user.
-
Gain greater visibility into threats with its score-based risk assessment for users and entities.
-
Identify indicators of compromise (IoCs) and indicators of attack (IoAs), exposing major threats including insider threats, account compromise, logon anomalies, and data exfiltration.
-
Spot changes to the database through Data Definition Language and Data Manipulation Language auditing reports.
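The first item above, spotting logons at unusual hours and excessive logon failures, can be sketched in a few lines of Python. This is not Log360's API or detection logic. The event format `(user, timestamp, success)`, the function names, the business-hours range, and the thresholds are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def excessive_failures(events, max_failures=3, window=timedelta(minutes=5)):
    """Return users with more than max_failures failed logons inside one window."""
    failures = defaultdict(list)
    for user, when, success in events:
        if not success:
            failures[user].append(when)
    flagged = set()
    for user, times in failures.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans at most `window`.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > max_failures:
                flagged.add(user)
                break
    return flagged

def off_hours_logons(events, start_hour=8, end_hour=18):
    """Return successful logons outside the hours [start_hour, end_hour)."""
    return [(user, when) for user, when, success in events
            if success and not (start_hour <= when.hour < end_hour)]

day = datetime(2024, 1, 15)
events = [
    ("alice", day.replace(hour=9), True),
    ("carol", day.replace(hour=2, minute=30), True),
] + [("bob", day.replace(hour=10, minute=m), False) for m in range(4)]

print(excessive_failures(events))  # {'bob'}
print(off_hours_logons(events))    # carol's 02:30 logon
```

Fixed thresholds like these are only a starting point. Behavior-based tools typically learn a per-user baseline instead of applying one rule to everyone.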
It is also important to check the changes happening in operational data and performance. Many times, raw training data—including images, audio files, and text—is retained in cloud object stores because they offer more affordable, readily accessible, and scalable storage than on-premises storage solutions. With the help of a unified SIEM solution integrated with cloud access security broker (CASB) capabilities, security analysts can:
-
Gain enhanced visibility into cloud events.
-
Facilitate identity monitoring in the cloud.
-
Gain threat protection capabilities in the cloud.
-
Facilitate compliance management in the cloud.
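Alongside monitoring, one simple way to notice changes to stored raw training data is to keep a trusted manifest of file hashes and compare it against the current state. The sketch below assumes the files are reachable on a local path (for example, a synced copy of a bucket). The function names and sample files are hypothetical, and the demo runs in a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; hash in chunks for large files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest

def diff_manifests(trusted, current):
    """Return sorted names of files added, removed, or modified since the snapshot."""
    names = set(trusted) | set(current)
    return sorted(n for n in names if trusted.get(n) != current.get(n))

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "labels.csv").write_text("img1,cat\nimg2,dog\n")
    (data / "notes.txt").write_text("unchanged\n")
    trusted = build_manifest(data)                            # snapshot of known-good data
    (data / "labels.csv").write_text("img1,dog\nimg2,dog\n")  # simulated label tampering
    changed = diff_manifests(trusted, build_manifest(data))

print(changed)  # ['labels.csv']
```

The manifest itself must be stored somewhere the attacker cannot write to, otherwise it can be regenerated after tampering.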
Additionally, to carry out these attacks, attackers need to understand how the model functions. A strong access control mechanism is needed to keep them from doing so. It’s essential to restrict access to the model and its data and to keep a close eye on the access controls themselves. Log360 includes a sophisticated correlation engine that can combine various events happening in your network in real time and determine whether any of them are possible threats.
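The general idea behind event correlation can be sketched as a rule that looks for an ordered sequence of actions by the same user within a time window. This is an illustrative toy, not how Log360's correlation engine is implemented. The event format `(user, timestamp, action)`, the action names, and the rule itself are hypothetical.

```python
from datetime import datetime, timedelta

def correlate(events, sequence, window=timedelta(minutes=10)):
    """Return users whose events contain `sequence` (non-empty) in order within `window`."""
    by_user = {}
    for user, when, action in sorted(events, key=lambda e: e[1]):
        by_user.setdefault(user, []).append((when, action))
    hits = []
    for user, items in by_user.items():
        for i, (start, first) in enumerate(items):
            if first != sequence[0]:
                continue
            step = 0
            for when, action in items[i:]:
                if when - start > window:
                    break  # remaining events fall outside the window
                if action == sequence[step]:
                    step += 1
                    if step == len(sequence):
                        break
            if step == len(sequence):
                hits.append(user)
                break
    return hits

t0 = datetime(2024, 1, 15, 3, 0)
minutes = lambda m: t0 + timedelta(minutes=m)
events = [
    ("mallory", minutes(0), "logon_failure"),
    ("mallory", minutes(2), "logon_success"),
    ("mallory", minutes(4), "dataset_write"),
    ("alice", minutes(0), "logon_success"),
    ("alice", minutes(1), "dataset_write"),
    ("bob", minutes(0), "logon_failure"),
    ("bob", minutes(1), "logon_success"),
    ("bob", minutes(30), "dataset_write"),
]
rule = ["logon_failure", "logon_success", "dataset_write"]
print(correlate(events, rule))  # ['mallory']
```

Each event on its own looks routine. Only the combination of a failed logon, a successful logon, and a write to the training data within ten minutes triggers the rule.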
Security analysts can use the strategies outlined above to avoid attacks like these.
Are you looking for ways to protect your organization’s sensitive information from being misused? Sign up for a personalized demo of ManageEngine Log360, a comprehensive SIEM solution that can help you detect, prioritize, investigate, and respond to security threats.
You can also explore on your own with a free, fully functional, 30-day trial of Log360.