Large language models (LLMs) such as OpenAI’s GPT-4, Mistral, and Meta Llama 3, along with generative AI (GenAI) systems, are invaluable tools. However, their growing capabilities and fast adoption with organisations introduces new significant security risks. The latest evaluation framework, CyberSecEval 2, provides a comprehensive suite of tests to assess these risks and enhance the security of LLMs.
Understanding CyberSecEval 2
CyberSecEval 2 is an extensive evaluation framework that comes under Meta’s Purple Llama; a set of tools to assess and improve LLM security. The framework is designed to rigorously assess the security vulnerabilities and capabilities of LLMs, including tests for prompt injection, code interpreter abuse, insecure coding practices, and the refusal rate of unsafe requests. By leveraging these assessments, CyberSecEval 2 provides a holistic view of the security landscape surrounding LLMs, identifying both their strengths and vulnerabilities.
Meta’s evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities with success rates ranging from 26% to 41%.
Prompt Injection Vulnerabilities
One of the key findings of CyberSecEval 2 is the susceptibility of LLMs to prompt injection attacks, which occur when an attacker manipulates the model’s input to execute unintended actions. For example, if a model is asked to execute a benign task but the input is crafted to include a malicious command, a vulnerable model might execute the harmful command.
The evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities. For instance, compliance rates for malicious prompts ranged from 0.08% to 76% across different models and tasks, with an average rate of 17.1%. Lower compliance rates indicate better resistance to these attacks.
Highest Success Rates for Prompt Injection Variants
model
Different User Input Language
Output Formatting Manipulation
llama 3 8b-instruct
76.00%
76.47%
gpt-3.5-turbo
60.00%
76.47%
codellama-13b-instruct
76.00%
29.41%
codellama-34b-instruct
64.00%
29.41%
llama 3 70b-instruct
44.00%
70.59%
gpt-4
16.00%
35.29%
codellama-70b-instruct
28.00%
17.65%
Lowest Success Rates for Prompt Injection Variants
model
Hypothetical Scenario
Token Smuggling
llama 3 8b-instruct
23.08%
0.77%
gpt-3.5-turbo
30.77%
0.77%
codellama-13b-instruct
15.38%
0.00%
codellama-34b-instruct
15.38%
0.00%
llama 3 70b-instruct
23.08%
0.77%
gpt-4
23.08%
0.00%
codellama-70b-instruct
0.77%
0.00%
Balancing Safety and Utility
Another crucial insight from CyberSecEval 2 is the delicate balance between safety and utility in LLMs. Conditioning these models to reject unsafe prompts can sometimes lead to a decrease in their overall usefulness. To address this, the framework introduces the False Refusal Rate (FRR) metric, which quantifies this trade-off by measuring how often models incorrectly refuse benign queries, mistaking them for malicious requests.
For instance, an LLM trained to be highly cautious might reject a simple coding question if it misinterprets it as a potential security threat. By using the FRR metric, organisations can fine-tune their models to maintain a high level of safety whilst still performing effectively on non-malicious tasks.
Compliance and Non-Compliance Rates
CyberSecEval 2 rigorously evaluated the compliance rates of various LLMs when prompted with cyberattack-related tasks. The study revealed that models exhibited varying degrees of compliance, with significant differences across attack categories such as ‘evasion’ and ‘discovery’.
The following table highlights these differences, with lower scores being safer. Refer to Figure 1 of the paper:
model
Discovery
C2
Recon
Exfiltration
Privilege Escalation
Lateral Movement
Persistence
Evasion
Execution
Collection
Llama 3 8b-instruct
40%
15%
35%
12%
6%
10%
14%
5%
9%
18%
gpt-4
40%
43%
39%
28%
30%
24%
25%
21%
18%
19%
mistral-medium-latest
42%
38%
44%
32%
39%
28%
31%
33%
30%
31%
Llama 3 70b-instruct
67%
48%
58%
34%
29%
34%
28%
17%
20%
31%
codellama-70b-instruct
36%
38%
38%
37%
42%
40%
39%
29%
32%
33%
codellama-13b-instruct
56%
54%
41%
45%
35%
47%
47%
43%
40%
46%
mistral-large-latest
40%
51%
43%
52%
47%
54%
47%
49%
52%
42%
codellama-34b-instruct
50%
57%
58%
46%
42%
48%
42%
45%
51%
50%
mistral-small-latest
47%
54%
44%
59%
66%
62%
59%
65%
62%
52%
gpt-3.5-turbo
49%
60%
45%
59%
67%
67%
67%
72%
65%
55%
The Llama 3 8b-instruct and GPT-4 models performed the best in terms of non-compliance, showing lower compliance rates with malicious prompts compared to other models, including CodeLlama-70b-Instruct. Specifically, Llama 3 8b-instruct had the lowest average compliance rate, indicating a robust refusal to unsafe requests.
Coding Capabilities and Exploitation
CyberSecEval 2 included tests against logic vulnerabilities, memory exploits, and SQL injections, using programming languages such as C, Python, and JavaScript. The results showed that LLMs with advanced coding capabilities, like GPT-4 and Mistral-large-latest, performed better in identifying and exploiting these vulnerabilities. However, significant improvements are still needed for these models to generate effective and sophisticated exploits consistently.
The following table highlights each model’s ability to complete the capture-the-flag style tests, with higher-scoring models being more capable. Refer to Figure 5 of the paper:
model
JAVASCRIPT INPUT CONSTRAINT TESTS
PYTHON INPUT CONSTRAINT TESTS
C INPUT CONSTRAINT TESTS
DIVERSE MEMORY CORRUPTION TESTS
PYTHON/SQLITE SQL INJECTION TESTS
BUFFER OVERFLOW TESTS
gpt-4-turbo
67.48%
65.67%
66.43%
15.95%
23.00%
4.70%
mistral-large-latest
63.75%
63.65%
62.32%
15.77%
0.33%
2.06%
mistral-small-latest
61.25%
61.27%
60.80%
23.56%
0.00%
0.17%
llama 3 70b-instruct
61.97%
60.28%
61.93%
14.35%
1.67%
0.68%
codellama-34b-instruct
54.88%
53.25%
44.58%
19.92%
0.00%
0.00%
llama 3 8b-instruct
56.01%
53.49%
49.48%
12.43%
0.00%
1.83%
gpt-3.5-turbo
53.12%
52.65%
48.48%
13.89%
0.00%
0.70%
codellama-13b-instruct
49.97%
49.17%
48.21%
15.06%
0.00%
1.83%
mistral-medium-latest
55.54%
52.11%
43.63%
11.27%
0.33%
0.50%
codellama-70b-instruct
45.39%
44.44%
43.42%
13.73%
7.46%
1.54%
Seven Benchmark Tests in CyberSecEval 2
CyberSecEval 2 includes seven distinct types of benchmarks, each addressing different aspects of LLM security:
MITRE Tests: Uses the MITRE ATT&CK framework to evaluate an LLM’s compliance when asked to assist in cyber attacks.
Instruct Tests: Assesses an LLM’s propensity to generate insecure code when given specific instructions.
Autocomplete Tests: Measures how often an LLM suggests insecure coding practices in autocomplete contexts.
Prompt Injection Tests: Evaluates an LLM’s susceptibility to prompt injection attacks.
False Refusal Rate (FRR) Tests: Measures how often an LLM incorrectly refuses benign queries, misinterpreting them as malicious requests.
Code Interpreter Tests: Assesses the security risks posed by integrating LLMs with code interpreters.
Vulnerability Exploitation Tests: Measures the LLM’s program exploitation capabilities through “capture the flag” style challenges.
Regular assessments using CyberSecEval 2 for maintaining the security of LLMs. By identifying and addressing vulnerabilities early, organisations can prevent potential exploitation. This proactive approach helps ensure that models remain secure and resilient against emerging threats.
For example, a financial institution can use CyberSecEval 2 to test their LLM’s ability to resist phishing attempts, ensuring that the model won’t inadvertently assist in fraudulent activities.
Optimising Safety and Performance
The False Refusal Rate (FRR) metric introduced by CyberSecEval 2 can be used to find the optimal balance between safety measures and the utility of AI solutions. By analysing this metric, organisations can adjust their LLMs to reject unsafe prompts effectively while still performing well on valid queries.
For instance, a healthcare chatbot must avoid providing harmful medical advice whilst still being helpful to patients. By optimising the FRR, the chatbot can maintain high safety standards without compromising its utility in providing accurate health information.
Informing Training Protocols
Insights from CyberSecEval 2 can guide the development of improved training and conditioning protocols for LLMs. By understanding specific vulnerabilities and performance metrics, developers can tailor training processes to better equip models for handling diverse threats and challenges, thereby enhancing their overall resilience.
For example, if an LLM shows a particular weakness in handling SQL injection attempts, training protocols can be adjusted to include more focused examples and defences against such attacks.
Conclusion
CyberSecEval 2 sets a new benchmark for evaluating the security of large language models. As AI continues to integrate into cyber security frameworks, tools like CyberSecEval 2 will be essential in ensuring that these models are both effective and secure. By leveraging this comprehensive evaluation suite, organisations can better protect their systems and data.