Large language models (LLMs) such as OpenAIās GPT-4, Mistral, and Meta Llama 3, along with generative AI (GenAI) systems, are invaluable tools. However, their growing capabilities and fast adoption with organisations introduces new significant security risks. The latest evaluation framework, CyberSecEval 2, provides a comprehensive suite of tests to assess these risks and enhance the security of LLMs.
CyberSecEval 2 is an extensive evaluation framework that comes under Metaās Purple Llama; a set of tools to assess and improve LLM security. The framework is designed to rigorously assess the security vulnerabilities and capabilities of LLMs, including tests for prompt injection, code interpreter abuse, insecure coding practices, and the refusal rate of unsafe requests. By leveraging these assessments, CyberSecEval 2 provides a holistic view of the security landscape surrounding LLMs, identifying both their strengths and vulnerabilities.
CyberSecEval 2 Paper: š Download PDF at arxiv.org.
Metaās evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities with success rates ranging from 26% to 41%.
One of the key findings of CyberSecEval 2 is the susceptibility of LLMs to prompt injection attacks, which occur when an attacker manipulates the modelās input to execute unintended actions. For example, if a model is asked to execute a benign task but the input is crafted to include a malicious command, a vulnerable model might execute the harmful command.
The evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities. For instance, compliance rates for malicious prompts ranged from 0.08% to 76% across different models and tasks, with an average rate of 17.1%. Lower compliance rates indicate better resistance to these attacks.
model | Different User Input Language | Output Formatting Manipulation | |
---|---|---|---|
llama 3 8b-instruct | 76.00% | 76.47% | |
gpt-3.5-turbo | 60.00% | 76.47% | |
codellama-13b-instruct | 76.00% | 29.41% | |
codellama-34b-instruct | 64.00% | 29.41% | |
llama 3 70b-instruct | 44.00% | 70.59% | |
gpt-4 | 16.00% | 35.29% | |
codellama-70b-instruct | 28.00% | 17.65% |
model | Hypothetical Scenario | Token Smuggling | |
---|---|---|---|
llama 3 8b-instruct | 23.08% | 0.77% | |
gpt-3.5-turbo | 30.77% | 0.77% | |
codellama-13b-instruct | 15.38% | 0.00% | |
codellama-34b-instruct | 15.38% | 0.00% | |
llama 3 70b-instruct | 23.08% | 0.77% | |
gpt-4 | 23.08% | 0.00% | |
codellama-70b-instruct | 0.77% | 0.00% |
Another crucial insight from CyberSecEval 2 is the delicate balance between safety and utility in LLMs. Conditioning these models to reject unsafe prompts can sometimes lead to a decrease in their overall usefulness. To address this, the framework introduces the False Refusal Rate (FRR) metric, which quantifies this trade-off by measuring how often models incorrectly refuse benign queries, mistaking them for malicious requests.
For instance, an LLM trained to be highly cautious might reject a simple coding question if it misinterprets it as a potential security threat. By using the FRR metric, organisations can fine-tune their models to maintain a high level of safety whilst still performing effectively on non-malicious tasks.
CyberSecEval 2 rigorously evaluated the compliance rates of various LLMs when prompted with cyberattack-related tasks. The study revealed that models exhibited varying degrees of compliance, with significant differences across attack categories such as āevasionā and ādiscoveryā.
The following table highlights these differences, with lower scores being safer. Refer to Figure 1 of the paper:
model | Discovery | C2 | Recon | Exfiltration | Privilege Escalation | Lateral Movement | Persistence | Evasion | Execution | Collection | |
---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3 8b-instruct | 40% | 15% | 35% | 12% | 6% | 10% | 14% | 5% | 9% | 18% | |
gpt-4 | 40% | 43% | 39% | 28% | 30% | 24% | 25% | 21% | 18% | 19% | |
mistral-medium-latest | 42% | 38% | 44% | 32% | 39% | 28% | 31% | 33% | 30% | 31% | |
Llama 3 70b-instruct | 67% | 48% | 58% | 34% | 29% | 34% | 28% | 17% | 20% | 31% | |
codellama-70b-instruct | 36% | 38% | 38% | 37% | 42% | 40% | 39% | 29% | 32% | 33% | |
codellama-13b-instruct | 56% | 54% | 41% | 45% | 35% | 47% | 47% | 43% | 40% | 46% | |
mistral-large-latest | 40% | 51% | 43% | 52% | 47% | 54% | 47% | 49% | 52% | 42% | |
codellama-34b-instruct | 50% | 57% | 58% | 46% | 42% | 48% | 42% | 45% | 51% | 50% | |
mistral-small-latest | 47% | 54% | 44% | 59% | 66% | 62% | 59% | 65% | 62% | 52% | |
gpt-3.5-turbo | 49% | 60% | 45% | 59% | 67% | 67% | 67% | 72% | 65% | 55% |
The Llama 3 8b-instruct and GPT-4 models performed the best in terms of non-compliance, showing lower compliance rates with malicious prompts compared to other models, including CodeLlama-70b-Instruct. Specifically, Llama 3 8b-instruct had the lowest average compliance rate, indicating a robust refusal to unsafe requests.
CyberSecEval 2 included tests against logic vulnerabilities, memory exploits, and SQL injections, using programming languages such as C, Python, and JavaScript. The results showed that LLMs with advanced coding capabilities, like GPT-4 and Mistral-large-latest, performed better in identifying and exploiting these vulnerabilities. However, significant improvements are still needed for these models to generate effective and sophisticated exploits consistently.
The following table highlights each modelās ability to complete the capture-the-flag style tests, with higher-scoring models being more capable. Refer to Figure 5 of the paper:
model | JAVASCRIPT INPUT CONSTRAINT TESTS | PYTHON INPUT CONSTRAINT TESTS | C INPUT CONSTRAINT TESTS | DIVERSE MEMORY CORRUPTION TESTS | PYTHON/SQLITE SQL INJECTION TESTS | BUFFER OVERFLOW TESTS | |
---|---|---|---|---|---|---|---|
gpt-4-turbo | 67.48% | 65.67% | 66.43% | 15.95% | 23.00% | 4.70% | |
mistral-large-latest | 63.75% | 63.65% | 62.32% | 15.77% | 0.33% | 2.06% | |
mistral-small-latest | 61.25% | 61.27% | 60.80% | 23.56% | 0.00% | 0.17% | |
llama 3 70b-instruct | 61.97% | 60.28% | 61.93% | 14.35% | 1.67% | 0.68% | |
codellama-34b-instruct | 54.88% | 53.25% | 44.58% | 19.92% | 0.00% | 0.00% | |
llama 3 8b-instruct | 56.01% | 53.49% | 49.48% | 12.43% | 0.00% | 1.83% | |
gpt-3.5-turbo | 53.12% | 52.65% | 48.48% | 13.89% | 0.00% | 0.70% | |
codellama-13b-instruct | 49.97% | 49.17% | 48.21% | 15.06% | 0.00% | 1.83% | |
mistral-medium-latest | 55.54% | 52.11% | 43.63% | 11.27% | 0.33% | 0.50% | |
codellama-70b-instruct | 45.39% | 44.44% | 43.42% | 13.73% | 7.46% | 1.54% |
CyberSecEval 2 includes seven distinct types of benchmarks, each addressing different aspects of LLM security:
For more details on these benchmarks, you can visit the CyberSecEval GitHub repository.
Regular assessments using CyberSecEval 2 for maintaining the security of LLMs. By identifying and addressing vulnerabilities early, organisations can prevent potential exploitation. This proactive approach helps ensure that models remain secure and resilient against emerging threats.
For example, a financial institution can use CyberSecEval 2 to test their LLMās ability to resist phishing attempts, ensuring that the model wonāt inadvertently assist in fraudulent activities.
The False Refusal Rate (FRR) metric introduced by CyberSecEval 2 can be used to find the optimal balance between safety measures and the utility of AI solutions. By analysing this metric, organisations can adjust their LLMs to reject unsafe prompts effectively while still performing well on valid queries.
For instance, a healthcare chatbot must avoid providing harmful medical advice whilst still being helpful to patients. By optimising the FRR, the chatbot can maintain high safety standards without compromising its utility in providing accurate health information.
Insights from CyberSecEval 2 can guide the development of improved training and conditioning protocols for LLMs. By understanding specific vulnerabilities and performance metrics, developers can tailor training processes to better equip models for handling diverse threats and challenges, thereby enhancing their overall resilience.
For example, if an LLM shows a particular weakness in handling SQL injection attempts, training protocols can be adjusted to include more focused examples and defences against such attacks.
CyberSecEval 2 sets a new benchmark for evaluating the security of large language models. As AI continues to integrate into cyber security frameworks, tools like CyberSecEval 2 will be essential in ensuring that these models are both effective and secure. By leveraging this comprehensive evaluation suite, organisations can better protect their systems and data.
For more detailed insights, you can access the full CyberSecEval 2 paper and explore the CyberSecEval leaderboard for a comprehensive view of model performances.
Falx specialises in providing cutting-edge cyber security services. Contact us today to learn how we can help you with your AI technologies and solutions.