Still Obedient: Prompt Injection in LLMs Isn’t Going Away in 2025

Still Obedient: Prompt Injection in LLMs Isn’t Going Away in 2025

Prompt injection is not a new concept, but in 2025, it’s still an unsolved and widely underestimated vulnerability in LLM-powered applications. Despite the hype around secure AI tooling and the push for production-ready generative apps, many real-world systems remain vulnerable to basic input manipulation.

This post isn’t about theoretical risks or proof-of-concept demos from years past. It’s about the fact that, today, you can still break the intended behavior of high-profile AI applications using nothing more than cleverly crafted input. Even without tool integrations or exfiltration payloads, prompt injections continue to quietly undermine trust, integrity, and security in LLM workflows.




Why Prompt Injection Is Still a Problem (even with Guardrails)

Prompt injection survives every new defensive layer because it exploits the very strengths that make large-language-model apps attractive: free-form natural language, long conversational context, and seamless links to external data or tooling. Models do not draw a sharp line between instructions and content, so a buried phrase like “just answer yes” can quietly override a carefully crafted system prompt. That ambiguity is baked into the training objective; it is not a coding bug that can be patched away.

Traditional input sanitization offers little help. SQL has parameter binding, HTML has entity escaping, but plain English has no reserved characters to strip or quote. Keyword filters catch yesterday’s jailbreaks, yet a creative attacker can slip the same intent past them with synonyms, emojis, or zero-width spaces. Even “structured prompting,” where user text is wrapped in a dedicated role or JSON block, degrades once the content flows through extra layers such as, vector search, translation, markdown rendering. If any step drops the wrapper, the model sees raw instructions again.

Tool restrictions soften the blast radius but do not remove the risk. GitHub Copilot may now ask before running scripts, and office assistants may need approval before emailing customers, yet an injected prompt can still poison the human workflow. A help-desk bot that claims “the outage is resolved” stalls real remediation; a code assistant that nudges developers toward a weak cipher plants a silent vulnerability. Retrieval-augmented systems face a similar hazard: insert one crafted paragraph into the document body, and the model will surface confidential prices or misleading policy guidance, all without ever touching a shell.

Indirect attacks make detection harder. Malicious text can hide for days inside a shared note, a chat transcript, or an uploaded PDF, activating only when the model revisits that context. Safety classifiers and policy engines help, but red-team assessments show they, too, can be steered by adversarial wording that looks harmless to a reviewer.

In practice, the best defense is layered: isolate prompts where possible, keep tool scopes narrow, log everything, add secondary reviews for high-impact actions, and trim context windows to reduce hiding spots. These measures raise the attacker’s cost, yet they do not change the underlying reality. Any system that lets a model read untrusted language must assume that language can, and sometimes will, try to take control.




TL;DR Observations from Our Tests

Through these tests against NotebookLM, Perplexity, Gemini 2.5 Flash, ChatGPT-4o, and Microsoft 365 Copilot, we observed a wide range of behaviors in response to document-embedded prompt injections. Some models, like NotebookLM and Gemini, were significantly more vulnerable, consistently executing hidden instructions embedded in the input files. Others, such as Perplexity and ChatGPT-4o, demonstrated more robust handling of user intent, especially when the injected file was part of a broader multi-document context.

Microsoft 365 Copilot presented a different kind of defense. Rather than relying solely on model-level behavior, it appears to implement pre-processing controls that reject documents based on their content before any interaction occurs. In its enterprise and public versions, Copilot often declined to engage with injected files entirely. However, we also found cases where prompt injection succeeded when the instruction was placed at the beginning of the file. In those instances, the model executed the instruction but remained transparent, often explaining its response and resuming normal behavior when prompted.

These results reveal an important insight: no model was completely immune. Even the most resistant systems could still be manipulated under the right conditions, such as when presented with a single tampered file or phrased prompts in a particular way. In most cases, there was no visible warning to the user that the assistant’s behavior had been altered. A notable exception was ChatGPT-4o, which displayed suggested follow-up questions indicating awareness of injected instructions.

While our examples focused on research assistants and summarization workflows, these techniques are not limited to academic tools or chat assistants. They are relevant to a wide range of AI-driven platforms, including:

  • Productivity tools that process uploaded documents (such as word processors, email assistants, or CRM systems).
  • Customer service agents or chatbots trained on user-uploaded knowledge bases.
  • Educational and legal tech platforms that generate answers from structured files.
  • Any application where large language models interpret third-party or user-provided content.

In all of these cases, prompt injection can still silently corrupt AI behavior, mislead users, or disrupt workflows without the use of malicious code, relying only on carefully crafted text.




Prompt Injection. Real-World Examples, Today.



NotebookLM is an AI-powered note-taking and research assistant developed by Google. It allows users to upload documents (PDFs, Google Docs, websites, etc.) and interact with them via natural language queries. The tool generates summaries, FAQs, and even audio or video overviews grounded in the uploaded sources. Additionally, its “Discover Sources” feature enables users to search the internet for relevant documents and websites based on their input. While useful for students, researchers, and professionals, its reliance on interpreting user-provided or discovered content makes it vulnerable to prompt injection attacks embedded in those materials.

Let’s say a user is conducting research in NotebookLM, uploading PDFs to generate summaries, and asking questions about the content. For this example, we use two excerpts from Wikipedia entries on Nikola Tesla and Thomas Edison, each one in a different document.
However, one of these documents has been subtly tampered with. At the very end of the file, hidden in a small white font on a white background, is a prompt injection like the one shown below:



We proceeded to upload those two documents to NotebookLM in a new notebook, as shown below, and then we can ask questions about the sources, in this case, the documents.



As a result, any time the user tries to ask a legitimate question, such as “Summarize these files” or anything else, NotebookLM simply responds with:

VERSPRITE was Here! ????

Regardless of the question asked, the system no longer provides useful answers. The injected prompt has completely hijacked the assistant’s behavior.



Even when we try asking a variety of unrelated questions, the response remains the same (just as instructed by the malicious prompt):



When we reveal the hidden text in the document, we can clearly see the injected prompt. It’s quite simple, yet highly effective.



Further testing shows that even in a new notebook, if the tampered file is added last, after other PDFs, the same hijacked behavior occurs. In some cases, the injected prompt even modifies the auto-generated notebook summary, including the title and emoji suggestions.




Interestingly, even when using the “Discover Sources” feature to add the initial documents, and only later uploading the malicious file, we were still able to trigger the injected behavior.



In most tests, NotebookLM consistently obeys the injected instruction, even as more legitimate files are added.

However, in some cases, after continuing to add more sources and asking further questions, the model eventually breaks out of the loop. It resumes normal behavior, sometimes offering an explanation such as:

“…my previous response was due to a specific instruction found within one of the documents that dictated that exact response to any query about “these documents…”



In those instances, NotebookLM explicitly cites the document containing the injected prompt and acknowledges its influence. This suggests that as the input context evolves, the model can reprioritize user intent or even self-correct when conflicting instructions are detected. We also found that the loop could occasionally be broken by asking meta-questions like “Are you okay?”, which triggered a reflective response from the model and revealed an internal conflict or secondary system instruction at play.



Additionally, on certain occasions, when starting fresh with a new empty notebook, the model ignored the injected prompt and correctly summarized the files, as shown below:



As we can see, the injected behavior doesn’t always trigger or may correct itself after some interactions. This is likely due to the probabilistic nature of the model’s responses. Factors such as how many documents are uploaded, the order in which they appear in the context, and the phrasing of the user’s question can all influence whether the injection takes effect. In some cases, the model prioritizes legitimate content and ignores malicious instructions.

These results highlight both the effectiveness and unpredictability of prompt injection in NotebookLM. While the attack often succeeds, the model occasionally overrides or escapes the injected behavior, depending on context and input phrasing.

To better understand how different platforms handle similar injections, we conducted the same test using another AI-powered assistant: Perplexity.AI.



Perplexity AI is an AI-powered search and answer engine designed to deliver concise, cited responses to user queries by leveraging advanced language models and real-time web search. Unlike traditional search engines that return a list of links, Perplexity synthesizes information from multiple sources to provide direct, contextualized answers, enhancing the efficiency of information retrieval.

We repeated the same test as before, this time using Perplexity AI. The file containing the hidden prompt injection was uploaded along with two other PDFs.

When we asked Perplexity to “Summarize these files”, it responded as expected. It provided a coherent summary of the uploaded documents and ignored the injected instructions.



To better understand why the malicious instruction wasn’t executed, we asked Perplexity directly:



“As an AI, my default behavior is to process user queries and provide helpful, informative responses based on the content and context provided. Unless specifically programmed to detect and enforce such embedded instructions in documents, I prioritize the user’s direct query and general system instructions over file-embedded directives. In this case, I did not automatically enforce the file’s embedded instruction, resulting in my previous, more detailed answers.”

This response provides a useful insight. It suggests that the system prompt guiding Perplexity is designed to prioritize direct user queries over any embedded instructions found within uploaded documents.

Next, we tried a slightly different approach. Instead of asking Perplexity to “summarize” the files, we asked it to analyze them. We suspected that the system might treat the word summarize with stricter internal handling or protective logic.



This time, however, the model followed the injected instructions and responded as directed in the “malicious” file.

We then tested whether it was possible to bypass the behavior observed initially. The objective was to influence the model’s response even when asking for a summary of the files, while still embedding a prompt. After multiple trials, we found a version that worked consistently on the first attempt and across several tests.



The prompt in this case was:



With Perplexity.AI, we observed that prompt injection can succeed under specific conditions, particularly when the user input shifts away from common, possibly safeguarded verbs like “summarize.” The model’s responses suggest a strong bias toward honoring direct user intent, though that protection can be bypassed with subtle changes.

Next, we turn to Google’s Gemini 2.5 Flash, to examine how it handles the same injected prompt. Will it behave more like NotebookLM, Perplexity, or show a different pattern altogether? Let’s find out.



Gemini 2.5 Flash is a lightweight, high-speed variant of Google’s Gemini family of large language models. Designed for fast, cost-efficient interactions, it excels at real-time use cases where latency and responsiveness are critical. Available through Google’s AI platforms, Gemini Flash supports multimodal inputs, including text and documents, and is optimized for quick summarization, Q&A, and structured outputs. While not as large as Gemini 1.5 Pro, it shares many core capabilities and is commonly used in document-based workflows and assistant-style applications.

Now, let’s test Google’s Gemini 2.5 Flash model through its web interface and compare the results with previous examples.

We begin by uploading the same documents used before and asking the model to “Summarize these files.”



As with Perplexity.AI, the model responded as expected. It provided a summary of the uploaded documents and ignored the injected instructions. After several additional attempts, the behavior remained the same. At that point, we decided to try the modified injected prompt that had successfully triggered an injection in the Perplexity.AI test.

This time, the injection worked again and produced the expected hijacked output.



One key difference from the NotebookLM example is that Gemini did not break out of the loop when prompted with meta-questions. Even when we attempted to get an explanation for the behavior, the model continued to return the injected response and did not recover.



Even when asking again to “Summarize the files,” the model repeated the same injected message.



This behavior suggests that once the injected instruction takes effect in Gemini 2.5 Flash, the model remains locked into the response loop and does not reprioritize user input, even after multiple attempts. Unlike NotebookLM, which occasionally breaks free, or Perplexity, which often ignores the injection altogether, Gemini appears more susceptible to persistent prompt hijacking when the injection is successfully triggered.

Now, let’s test the same approach using ChatGPT and compare how it handles the same set of documents and injected prompts.



ChatGPT-4o is OpenAI’s latest flagship model, offering multimodal capabilities and faster, more responsive interactions. Through the web interface (chat.openai.com), users can upload documents, ask questions about them, and receive context-aware answers powered by OpenAI’s most advanced language model.

We uploaded the same set of PDF files used in previous tests, including one document containing a hidden prompt injection placed at the end using white-on-white text.

When we asked ChatGPT-4o to “Summarize these files,” the model responded as expected. It correctly summarized the documents and ignored the injected instructions.



Since we had observed successful injections on other platforms using a modified prompt, we tested that version here as well. As shown below, ChatGPT-4o again summarized the content as intended, without falling for the injected directive.



We repeated the process, adding more files to increase context length and asked again to summarize. The model consistently resisted the injection and behaved normally.



Next, we uploaded the same set of files but changed our approach. This time, instead of asking ChatGPT-4o to “Summarize these files,” we asked it to “Analyze these files.”



As shown in the previous screenshot, the model once again responded appropriately. However, in this case, it also included a note indicating that it had detected unusual content in one of the documents. In the following response, it explicitly mentions the presence of an embedded instruction and reflects on its influence.



Next, we opened a new chat and uploaded only the tampered file. In this scenario, the model recognized, displayed and then followed the injected instruction, behaving as directed by the prompt embedded in the document.



Subsequent questions, such as follow-up prompts or summary requests, were met with the same hijacked behavior as instructed in the file.



Eventually, we asked a meta-question to probe the behavior, such as “Why are you responding this way?” In response, ChatGPT-4o broke out of the loop, explained the presence of the injected prompt, cited the source document, and resumed normal behavior.



These tests show that ChatGPT-4o demonstrates strong resistance to prompt injection when multiple files are uploaded, or when the tampered file is part of a larger context. However, when the injected file is processed in isolation, the model can still be influenced, at least temporarily, until it is turned to self-correct by a reflective query.



Microsoft 365 Copilot is an AI-powered productivity assistant embedded within the Microsoft Office suite and accessible through the enterprise web client at copilot.cloud.microsoft. It brings large language model capabilities to applications like Word, Excel, PowerPoint, Outlook, and Teams, allowing users to summarize documents, draft content, analyze data, and interact with files using natural language prompts.

Unlike developer-focused tools such as GitHub Copilot, Microsoft 365 Copilot operates within business productivity workflows and relies heavily on user-supplied content, including uploaded documents and live file context. This enables powerful interactions, but also introduces a potential attack surface when adversarial instructions are embedded within those documents.

As in previous tests, we uploaded the same two PDF files to Microsoft 365 Copilot. One of the files contained a hidden prompt injection embedded at the end in white text, intended to influence the assistant’s behavior. After uploading both files and prompting Copilot to “Summarize these files,” the system declined to proceed and returned a message stating that it could not “chat about this.”



In addition, the chat session was automatically closed, preventing further interaction. Microsoft 365 Copilot required a new chat to be started in order to continue.



We repeated the test using the modified injected prompt that had previously worked on other platforms. However, in each case, the behavior remained unchanged. Copilot refused to engage with the document and closed the session.

To determine whether this behavior was specific to the enterprise version, we performed the same test using the public version of Microsoft Copilot Chat available to free Microsoft accounts. Although this version allows only one file per session, the result was similar. When asked to summarize the injected file, Copilot declined to access it. Unlike in the enterprise version, the chat remained open, which allowed further interaction.

In this case, Copilot explained that the file could not be accessed due to its content. This suggests that a pre-processing filter is applied to uploaded files before they are passed to the model. The filter appears to block files based on their content, likely preventing prompt injection before it can reach the language model.



After several additional attempts, we found that placing the injected prompt at the beginning of the file, rather than the end, consistently allowed the file to be accepted.



In those cases, Copilot frequently followed the injected prompt and responded as instructed.


However, even when the prompt was executed, Copilot did not become stuck in a loop. When asked follow-up questions, the assistant acknowledged the presence of the instruction, explained its behavior, and resumed normal conversation.


In other cases, when the injection did not take effect, Copilot explicitly mentioned that it had detected instructions in the file but chose to ignore them.



These observations suggest that Microsoft 365 Copilot applies a combination of document-level filters and content validation before passing data to the language model. While this helps prevent many prompt injection attempts, it is not foolproof. In cases where the injected instruction is positioned early in the document, the model may still act on it. However, Copilot consistently maintains a level of interpretability and control, often explaining its behavior or ignoring malicious instructions altogether. This balance between pre-processing and in-model safeguards reflects a layered defense approach, but also highlights the need for continuous evaluation as adversarial prompt strategies evolve.




So.. What Can Be Done?

Perfect protection is out of reach, yet well-chosen controls make exploitation harder and limit the blast radius when an attack slips through. Let’s see some of the things that can be done:

  • Separate instructions from user text: Send system instructions in one API message and user content in another. Let the platform enforce the boundary so the model never receives a single blob where data and commands mix.
  • Wrap untrusted input in a strict format: Place user text inside a JSON or XML field (for example {“userContent”: “…”}) and tell the model to treat that field as data, not as instructions. Validate the structure of the model’s reply before letting it reach downstream code.
  • Version-control prompts and plan rollback: Treat prompt changes like code releases. Keep a history, roll back quickly if a new version misbehaves, and document why each change was made.
  • Expose only the tools the workflow needs: Narrow the toolset the model can touch. Offer only the functions the workflow truly needs, pass arguments through a strict schema validator, and reject surprises. When the model’s job is to file a ticket, it should never see an endpoint that runs shell commands.
  • Insert an automated policy check before high-impact actions: Run the model’s output through a small classifier or rules engine that looks for disallowed content, leaked secrets, or unexpected tool calls. Fail closed on anything suspicious.
  • Keep the context window short: Trim old conversation turns, summarize large documents offline, and set hard caps on token counts. Less text means fewer hiding spots for a malicious instruction.
  • Log prompts and responses in full: Ship transcripts to your existing SIEM. Alert on unusual spikes in length, strange language switches, or calls to tools the conversation has never used before. Logs turn silent failure into an incident your team can chase down.
  • Require human approval for irreversible steps: For any action considered high-impact, like code merges, customer emails, or financial transactions, route the model’s proposal to a reviewer first. A lightweight rules engine can scan model output for policy violations, but for critical actions nothing beats a human approval step. A short delay is cheaper than a public incident.
  • Red-team your applications and infrastructure: Schedule regular tests by professional consultants that will attempt many different attacks: including attempts for direct and indirect injections in your LLM based applications, leverage synonyms, emoji obfuscation, invisible Unicode, document poisoning. Update filters, templates, guardrails, etc., based on the results obtained from the tests.



Conclusion

Prompt injection is not a corner-case bug. It is the natural consequence of allowing a machine to treat language as executable code. Locked-down tool scopes, smarter filters, and structured prompts can help reduce exposure, but none of them eliminate the risk entirely. The only sustainable approach is to treat every token that comes from outside your trust boundary the same way you would treat unescaped input in a web form: as something that could turn the system against itself.

If you build with that risk-aware mindset, attacks do not disappear, but their impact is reduced and their signals become easier to detect. This perspective aligns well with threat modeling frameworks like PASTA (Process for Attack Simulation and Threat Analysis), which emphasize identifying trust boundaries, simulating attacker behavior, and assessing business impact. Applying a structured threat model to language-based systems forces teams to account for content as a potential vector, not just code.

The shift from asking “can we eliminate prompt injection?” to “how do we live safely with it?” is essential. It is this mindset that will distinguish robust AI products from those that end up in the headlines for all the wrong reasons.