How data poisoning attacks can corrupt machine learning models


By VARINDIA - 2021-04-21

Bohitesh Misra
Member at Digital Futurists Angels Private Limited

 

Data poisoning can render machine learning models inaccurate, possibly resulting in poor decisions based on faulty outputs. With no easy fixes available, security professionals must focus on prevention and detection.

 

Machine learning adoption has exploded in recent years, driven in part by the rise of cloud computing, which has made high-performance computing, networking and storage accessible to businesses of all sizes. As data scientists and product companies integrate machine learning into products across industries, and users rely on the output of these algorithms in their decision making, security experts warn of adversarial attacks designed to abuse the technology.

 

Most social networking platforms, online video platforms, large e-commerce sites, search engines and other services have some sort of recommendation engine based on machine learning algorithms. The movies and shows that people like on Netflix, the content that people like or share on Facebook, the hashtags and likes on Twitter, the products consumers buy or view on Amazon, and the queries users type into Google Search are all fed back into these sites' machine learning models to make better and more accurate recommendations. These models are exposed to various attack methods, including fooling a model into incorrectly classifying an input and extracting information about the data that was used to train it.

 

Recommendation algorithms and other similar systems can be easily hijacked and manipulated by performing actions that pollute the input to the next model update. For instance, if you want to attack an online shopping site to recommend product B to a shopper who viewed or purchased product A, all you have to do is view A and then B multiple times or add both A and B to a wish list or shopping basket. If you want a hashtag to trend on a social network, simply post and/or retweet that hashtag a great deal. If you want that new fake political account to get noticed, simply have a bunch of other fake accounts follow it and continually engage with its content.
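
For the online-shop case above, a minimal sketch (with made-up session data and hypothetical helper names, not any real site's recommender) shows how little it takes to hijack a simple item-item co-occurrence recommender: a handful of injected sessions that view products A and B together is enough to change what gets recommended alongside A.

from collections import Counter, defaultdict

def build_co_views(sessions):
    """Count how often each pair of products is viewed in the same session."""
    co_views = defaultdict(Counter)
    for session in sessions:
        for item in session:
            for other in session:
                if other != item:
                    co_views[item][other] += 1
    return co_views

def recommend(co_views, item, k=1):
    """Recommend the k products most often co-viewed with `item`."""
    return [product for product, _ in co_views[item].most_common(k)]

# Organic traffic: product A is usually viewed alongside product C.
organic = [["A", "C"], ["A", "C"], ["C", "D"], ["A", "C", "D"]]
print(recommend(build_co_views(organic), "A"))   # ['C']

# Poisoned traffic: a few fake sessions viewing A and B together.
poisoned = organic + [["A", "B"]] * 10
print(recommend(build_co_views(poisoned), "A"))  # ['B']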

 

As for retweeting a hashtag “a great deal”: this could mean the attacker controls many Twitter accounts. However, Twitter also allows users to retweet, undo the retweet, and then retweet again; re-retweeting in this way serves to “bump” the content over and over. Some Twitter accounts even use this tactic on their own content to fish for more visibility.

 

Recommendation algorithms can be attacked in a variety of ways, depending on the motive of the attacker. Adversaries can use promotion attacks to trick a recommender system into promoting a product, piece of content, or user to as many people as possible. They can perform demotion attacks in order to cause a product, piece of content, or user to be promoted less than it should. Algorithmic manipulation can also be used for social engineering purposes. In theory, if an adversary has knowledge about how a specific user has interacted with a system, an attack can be crafted to target that user with a recommendation such as a YouTube video, malicious app, or imposter account to follow. As such, algorithmic manipulation can be used for a variety of purposes including disinformation, phishing, scams, altering of public opinion, promotion of unwanted content, and discrediting individuals or brands. You can even pay someone to manipulate Google’s search autocomplete functionality.

 

Numerous attacks are already being performed against recommenders, search engines, and other similar online services. In fact, an entire industry exists to support these attacks. With a simple web search, it is possible to find inexpensive purchasable services to manipulate app store ratings, post fake restaurant reviews, post comments on websites, inflate online polls, boost engagement of content or accounts on social networks, and much more. The prevalence and low cost of these services indicates that they are widely used.

 

It's not news that attackers try to influence and skew these recommendation systems by using fake accounts to upvote, downvote, share or promote certain products or content. Such manipulation services can be bought on the underground market, as can the "troll farms" used in disinformation campaigns to spread fake news.

 

What is data poisoning?

 

Data poisoning or model poisoning attacks involve polluting a machine learning model's training data. Data poisoning is considered an integrity attack because tampering with the training data impacts the model's ability to output correct predictions.

 

The difference between an attack that is meant to evade a model's prediction or classification and a poisoning attack is persistence: with poisoning, the attacker's goal is to get their inputs to be accepted as training data. The length of the attack also differs because it depends on the model's training cycle; it might take weeks for the attacker to achieve their poisoning goal.

 

Data poisoning can be achieved either in a black-box scenario against classifiers that rely on user feedback to update their learning, or in a white-box scenario where the attacker gains access to the model and its private training data, possibly somewhere in the supply chain if the training data is collected from multiple sources.

 

Data poisoning examples

 

In a cybersecurity context, the target could be a system that uses machine learning to detect network anomalies that could indicate suspicious activity. If an attacker understands that such a model is in place, they can attempt to slowly introduce data points that decrease the accuracy of that model, so that eventually the things that they want to do won't be flagged as anomalous anymore. This is also known as model skewing.
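
A minimal sketch of this kind of model skewing, using scikit-learn and synthetic two-feature data (the numbers, features and labels are all made up for illustration, not any real detection system): the attacker floods the feedback pipeline with near-copies of their own traffic pattern labelled as benign, and after the next training cycle that pattern is no longer flagged.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Legitimate feedback: benign traffic clusters around (0, 0), malicious around (3, 3).
benign = rng.normal(0.0, 1.0, size=(500, 2))
malicious = rng.normal(3.0, 1.0, size=(500, 2))
X = np.vstack([benign, malicious])
y = np.array([0] * 500 + [1] * 500)

attack_point = np.array([[3.0, 3.0]])  # the traffic the attacker wants left unflagged

model = LogisticRegression(max_iter=1000).fit(X, y)
print("before poisoning:", model.predict(attack_point))  # [1] -> flagged as malicious

# Before the next training cycle, the attacker submits many near-copies of
# their own traffic labelled as benign.
fake = attack_point + rng.normal(0.0, 0.2, size=(1500, 2))
model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X, fake]), np.concatenate([y, np.zeros(1500, dtype=int)]))
print("after poisoning: ", model.predict(attack_point))  # [0] -> no longer flagged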

 

A real-world example of this is attacks against the spam filters used by email providers. In practice, some of the most advanced spammer groups regularly try to throw the Gmail filter off-track by reporting massive amounts of spam emails as not spam. Between the end of November 2017 and early 2018, there were at least four malicious large-scale attempts to skew the classifier.

 

Another example involves Google’s VirusTotal scanning service, which many antivirus vendors use to augment their own data. While attackers have been known to test their malware against VirusTotal before deploying it in the wild, thereby evading detection, they can also use it to engage in more persistent poisoning. In fact, in 2015 there were reports that intentional sample poisoning attacks through VirusTotal were performed to cause antivirus vendors to detect benign files as malicious.

 

No easy fix

 

The main problem with data poisoning is that it's not easy to fix. Models are retrained with newly collected data at certain intervals, depending on their intended use and their owner's preference. Since poisoning usually happens over time, and over some number of training cycles, it can be hard to tell when prediction accuracy starts to shift.

 

Reverting the poisoning effects would require a time-consuming historical analysis of inputs for the affected class to identify all the bad data samples and remove them, and then retraining a version of the model from before the attack started. When dealing with large quantities of data and a large number of attacks, however, retraining in such a way is simply not feasible and the models never get fixed. Practical solutions for machine unlearning are still years away, so for now the only remedy is to retrain with known-good data, which can be very hard or expensive to accomplish.

 

Prevent and detect

 

Data scientists and developers need to focus on measures that could either block attack attempts or detect malicious inputs before the next training cycle happens, like input validity checking, rate limiting, regression testing, manual moderation and using various statistical techniques to detect anomalies.
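
The simplest of these measures to sketch is rate limiting the feedback that feeds training: cap how many labelled events from any single user are accepted into the next training batch. The event fields and the cap below are hypothetical.

from collections import defaultdict

def cap_per_user(feedback_events, max_per_user=10):
    """Keep at most `max_per_user` labelled events from each user ID."""
    accepted, counts = [], defaultdict(int)
    for event in feedback_events:
        user = event["user_id"]
        if counts[user] < max_per_user:
            counts[user] += 1
            accepted.append(event)
    return accepted

events = [{"user_id": "attacker", "label": "not_spam"}] * 500 \
    + [{"user_id": f"u{i}", "label": "spam"} for i in range(50)]
print(len(cap_per_user(events)))  # 60: 10 from the attacker, 50 organic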

 

Restrictions like the per-user cap sketched above can be placed on how many inputs provided by a single user are accepted into the training data, or with what weight. Newly trained classifiers can be compared against previous versions by rolling them out to only a small subset of users and comparing their outputs. Another recommendation is to build a golden dataset that any retrained model must predict accurately, which helps detect regressions.
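
A minimal sketch of such a golden-dataset gate, assuming a scikit-learn-style model with a predict method and NumPy arrays holding the trusted examples (the function name and threshold are placeholders):

import numpy as np

def passes_golden_check(candidate_model, golden_X, golden_y, min_accuracy=0.98):
    """Return True only if the retrained model still predicts the curated,
    manually verified golden examples correctly."""
    accuracy = np.mean(candidate_model.predict(golden_X) == golden_y)
    return accuracy >= min_accuracy

# In a retraining pipeline, a candidate that fails the gate is held back and
# the data collected since the last deployment is reviewed for poisoning:
#
#   if passes_golden_check(candidate_model, golden_X, golden_y):
#       deploy(candidate_model)
#   else:
#       investigate_recent_training_inputs()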

 

Data poisoning is just a special case of a larger issue known as data drift. Bad data finds its way into every system for a variety of reasons, and there is a lot of research on how to deal with data drift, as well as tooling to detect significant changes in operational data and model performance, including from the large cloud computing providers. Azure Monitor and Amazon SageMaker are examples of services that include such capabilities.
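
Without reproducing the cloud providers' specific mechanisms, a generic statistical drift check is easy to sketch: compare the distribution of each feature in newly collected data against the training-time baseline with a two-sample Kolmogorov-Smirnov test and flag features that have shifted. The data below is synthetic.

import numpy as np
from scipy.stats import ks_2samp

def drifted_features(baseline, incoming, alpha=0.01):
    """Return indices of features whose incoming distribution differs
    significantly from the training-time baseline."""
    flagged = []
    for j in range(baseline.shape[1]):
        _, p_value = ks_2samp(baseline[:, j], incoming[:, j])
        if p_value < alpha:
            flagged.append(j)
    return flagged

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=(5000, 3))  # data the model was trained on
incoming = rng.normal(0.0, 1.0, size=(1000, 3))  # newly collected operational data
incoming[:, 2] += 0.5                            # feature 2 has drifted
print(drifted_features(baseline, incoming))      # typically [2]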

 

To perform data poisoning, attackers also need to gain information about how the model works, so it's important to leak as little information as possible and to have strong access controls in place for both the model and the training data. A lot of security in AI and machine learning comes down to very basic read/write permissions on data and access controls on models, systems and servers.

 

Just as organizations run regular penetration tests against their networks and systems to discover weaknesses, they should extend this practice to the machine learning context and treat machine learning as part of the security of the larger system or application.

 

One thing developers should do when building a model is to attack it themselves to understand how it can be compromised; by understanding how it can be attacked, they can then attempt to build defenses against those attacks.
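
In the spirit of the toy recommender sketched earlier, a developer could probe their own model by measuring how many injected co-view sessions it takes before the top recommendation flips (build_co_views, recommend and the organic sessions are the illustrative ones defined above):

def sessions_needed_to_flip(organic_sessions, target="A", pushed="B", limit=1000):
    """Return how many injected [target, pushed] sessions it takes to make
    `pushed` the top recommendation for `target`, or None if `limit` is
    never enough."""
    for n in range(1, limit + 1):
        poisoned = organic_sessions + [[target, pushed]] * n
        if recommend(build_co_views(poisoned), target) == [pushed]:
            return n
    return None

print(sessions_needed_to_flip(organic))  # 4 for the toy sessions above

A low number means the model is cheap to manipulate and needs stronger input validation, per-user limits or moderation before the next training cycle.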

 

Why it’s so hard to fix a poisoned model

 

If the owner of an online shop notices that their site has started recommending product B alongside product A, and they’re suspicious that they’ve been the victim of an attack, the first thing they need to do is look through historical data to determine why the model started making this recommendation. To do this, they need to gather all instances of product B being viewed, liked, or purchased alongside product A. Then they need to determine whether the users that generated those interactions look like real users or fake users – something that is probably extremely difficult to do if the attacker knows how to make their fake accounts look and behave like real people.

 

Fixing a poisoned model, in most cases, involves retraining. You take an old version of the model, and train it against all accumulated data between that past date and the present day, but with the malicious entries removed. You then deploy the fixed model into production and resume business. If at some point in the future you discover a new attack, you’ll need to perform the same steps over again. Social networks and other large online sites are under attack on numerous fronts, on an almost constant basis.
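
A minimal sketch of that clean-up step, with hypothetical record fields and a hypothetical train_model function standing in for whatever training pipeline is actually in place:

def remove_poisoned(interactions, fake_account_ids):
    """Drop every logged interaction attributed to accounts identified as fake."""
    return [event for event in interactions
            if event["account_id"] not in fake_account_ids]

# interactions: everything logged since the last known-good model snapshot
# fake_account_ids: accounts flagged during the historical analysis
#
#   clean_history = remove_poisoned(interactions, fake_account_ids)
#   fixed_model = train_model(last_known_good_model, clean_history)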

 

When considering social networks, detecting poisoning attacks is only part of the problem that needs to be solved. In order to detect that users of a system are intentionally creating bad training data, a way of identifying accounts that are fake or specifically coordinating to manipulate the platform is also required.

 

I can conclude by reiterating that threats arising from the manipulation of recommenders, especially those used by social networks, hold broad societal implications. It is widely understood that algorithmic manipulation has led to entirely false stories, conspiracy theories, and genuine news pieces with altered figures, statistics, or online polls being circulated as real news.
