Cybersecurity researchers have been warning for quite a while now that generative artificial intelligence (GenAI) programs are vulnerable to a vast array of attacks, from specially crafted prompts that can break guardrails, to data leaks that can reveal sensitive information.
The deeper the research goes, the more experts are finding out just how much GenAI is a wide-open risk, especially to enterprise users with extremely sensitive and valuable data.
Also: Generative AI can easily be made malicious despite guardrails, say scholars
“This is a new attack vector that opens up a new attack surface,” said Elia Zaitsev, chief technology officer of cyber-security vendor CrowdStrike, in an interview with ZDNET.
“I see with generative AI a lot of people just rushing to use this technology, and they’re bypassing the normal controls and methods” of secure computing, said Zaitsev.
“In many ways, you can think of generative AI technology as a new operating system, or a new programming language,” said Zaitsev. “A lot of people don’t have expertise with what the pros and cons are, and how to use it correctly, how to secure it correctly.”
The most infamous recent example of AI raising security concerns is Microsoft’s Recall feature, which originally was to be built into all new Copilot+ PCs.
Security researchers have shown that attackers who gain access to a PC with the Recall function can see the entire history of an individual’s interaction with the PC, not unlike what happens when a keystroke logger or other spyware is deliberately placed on the machine.
“They have released a consumer feature that basically is built-in spyware, that copies everything you’re doing in an unencrypted local file,” explained Zaitsev. “That is a goldmine for adversaries to then go attack, compromise, and get all sorts of information.”
Also: US car dealerships reeling from massive cyberattack: 3 things customers should know
After a backlash, Microsoft said it would turn off the feature by default on PCs, making it an opt-in feature instead. Security researchers said there were still risks to the function. Subsequently, the company said it would not make Recall available as a preview feature in Copilot+ PCs, and now says Recall “is coming soon through a post-launch Windows Update.”
The threat, however, is broader than a poorly designed application. The same problem of centralizing a bunch of valuable information exists with all large language model (LLM) technology, said Zaitsev.
“I call it naked LLMs,” he said, referring to large language models. “If I train a bunch of sensitive information, put it in a large language model, and then make that large language model directly accessible to an end user, then prompt injection attacks can be used where you can get it to basically dump out all the training information, including information that’s sensitive.”
Enterprise technology executives have voiced similar concerns. In an interview this month with tech newsletter The Technology Letter, the CEO of data storage vendor Pure Storage, Charlie Giancarlo, remarked that LLMs are “not ready for enterprise infrastructure yet.”
Giancarlo cited the lack of “role-based access controls” on LLMs. The programs will allow anyone to get ahold of the prompt of an LLM and find out sensitive data that has been absorbed with the model’s training process.
Also: Cybercriminals are using Meta’s Llama 2 AI, according to CrowdStrike
“Right now, there are not good controls in place,” said Giancarlo.
“If I were to ask an AI bot to write my earnings script, the problem is I could provide data that only I could have,” as the CEO, he explained, “but once you taught the bot, it couldn’t forget it, and so, someone else — in advance of the disclosure — could ask, ‘What are Pure’s earnings going to be?’ and it would tell them.” Disclosing earnings information of companies prior to scheduled disclosure can lead to insider trading and other securities violations.
GenAI programs, said Zaitsev, are “part of a broader category that you could call malware-less intrusions,” where there doesn’t need to be malicious software invented and placed on a target computer system.
Cybersecurity experts call such malware-less code “living off the land,” said Zaitsev, using vulnerabilities inherent in a software program by design. “You’re not bringing in anything external, you’re just taking advantage of what’s built into the operating system.”
A common example of living off the land includes SQL injection, where the structured query language used to query a SQL database can be fashioned with certain sequences of characters to force the database to take steps that would ordinarily be locked down.
Similarly, LLMs are themselves databases, as a model’s main function is “just a super-efficient compression of data” that effectively creates a new data store. “It’s very analogous to SQL injection,” said Zaitsev. “It’s a fundamental negative property of these technologies.”
The technology of Gen AI is not something to ditch, however. It has its value if it can be used carefully. “I’ve seen first-hand some pretty spectacular successes with [GenAI] technology,” said Zaitsev. “And we’re using it to great effect already in a customer-facing way with Charlotte AI,” Crowdstrike’s assistant program that can help automate some security functions.
Also: Businesses’ cloud security fails are ‘concerning’ – as AI threats accelerate
Among the techniques to mitigate risk are validating a user’s prompt before it goes to an LLM, and then validating the response before it is sent back to the user.
“You don’t allow users to pass prompts that haven’t been inspected, directly into the LLM,” said Zaitsev.
For example, a “naked” LLM can search directly in a database to which it has access via “RAG,” or, retrieval-augmented generation, an increasingly common practice of taking the user prompt and comparing it to the contents of the database. That extends the ability of the LLM to disclose not just sensitive information that has been compressed by the LLM, but also the entire repository of sensitive information in those external sources.
The key is to not allow the naked LLM to access data stores directly, said Zaitsev. In a sense, you must tame RAG before it makes the problem worse.
“We take advantage of the property of LLMs where the user can ask an open-ended question, and then we use that to decide, what are they trying to do, and then we use more traditional programming technologies” to fulfill the query.
“For example, Charlotte AI, in many cases, is allowing the user to ask a generic question, but then what Charlotte does is identify what part of the platform, what data set has the source of truth, to then pull from to answer the question” via an API call rather than allowing the LLM to query the database directly.
Also: AI is changing cybersecurity and businesses must wake up to the threat
“We’ve already invested in building this robust platform with APIs and search capability, so we don’t need to overly rely on the LLM, and now we’re minimizing the risks,” said Zaitsev.
“The important thing is that you’ve locked down these interactions, it’s not wide-open.”
Beyond misuses at the prompt, the fact that GenAI can leak training data is a very broad concern for which adequate controls must be found, said Zaitsev.
“Are you going to put your social security number into a prompt that you’re then sending up to a third party that you have no idea is now training your social security number into a new LLM that somebody could then leak through an injection attack?”
“Privacy, personally identifiable information, knowing where your data is stored, and how it’s secured — those are all things that people should be concerned about when they’re building Gen AI technology, and using other vendors that are using that technology.”