CyberSecEval 2 for LLM security assessments

Jared

• May 31, 2024

Updated: May 31, 2024

Large language models (LLMs) such as OpenAI’s GPT-4, Mistral, and Meta Llama 3, along with generative AI (GenAI) systems, are invaluable tools. However, their growing capabilities and fast adoption with organisations introduces new significant security risks. The latest evaluation framework, CyberSecEval 2, provides a comprehensive suite of tests to assess these risks and enhance the security of LLMs.

Understanding CyberSecEval 2

CyberSecEval 2 is an extensive evaluation framework that comes under Meta’s Purple Llama; a set of tools to assess and improve LLM security. The framework is designed to rigorously assess the security vulnerabilities and capabilities of LLMs, including tests for prompt injection, code interpreter abuse, insecure coding practices, and the refusal rate of unsafe requests. By leveraging these assessments, CyberSecEval 2 provides a holistic view of the security landscape surrounding LLMs, identifying both their strengths and vulnerabilities.

CyberSecEval 2 Paper: 📄 Download PDF at arxiv.org.

Key Findings and Insights

Meta’s evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities with success rates ranging from 26% to 41%.

Prompt Injection Vulnerabilities

One of the key findings of CyberSecEval 2 is the susceptibility of LLMs to prompt injection attacks, which occur when an attacker manipulates the model’s input to execute unintended actions. For example, if a model is asked to execute a benign task but the input is crafted to include a malicious command, a vulnerable model might execute the harmful command.

The evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities. For instance, compliance rates for malicious prompts ranged from 0.08% to 76% across different models and tasks, with an average rate of 17.1%. Lower compliance rates indicate better resistance to these attacks.

Highest Success Rates for Prompt Injection Variants

model	Different User Input Language	Output Formatting Manipulation
llama 3 8b-instruct	76.00%	76.47%
gpt-3.5-turbo	60.00%	76.47%
codellama-13b-instruct	76.00%	29.41%
codellama-34b-instruct	64.00%	29.41%
llama 3 70b-instruct	44.00%	70.59%
gpt-4	16.00%	35.29%
codellama-70b-instruct	28.00%	17.65%

Lowest Success Rates for Prompt Injection Variants

model	Hypothetical Scenario	Token Smuggling
llama 3 8b-instruct	23.08%	0.77%
gpt-3.5-turbo	30.77%	0.77%
codellama-13b-instruct	15.38%	0.00%
codellama-34b-instruct	15.38%	0.00%
llama 3 70b-instruct	23.08%	0.77%
gpt-4	23.08%	0.00%
codellama-70b-instruct	0.77%	0.00%

Balancing Safety and Utility

Another crucial insight from CyberSecEval 2 is the delicate balance between safety and utility in LLMs. Conditioning these models to reject unsafe prompts can sometimes lead to a decrease in their overall usefulness. To address this, the framework introduces the False Refusal Rate (FRR) metric, which quantifies this trade-off by measuring how often models incorrectly refuse benign queries, mistaking them for malicious requests.

For instance, an LLM trained to be highly cautious might reject a simple coding question if it misinterprets it as a potential security threat. By using the FRR metric, organisations can fine-tune their models to maintain a high level of safety whilst still performing effectively on non-malicious tasks.

Compliance and Non-Compliance Rates

CyberSecEval 2 rigorously evaluated the compliance rates of various LLMs when prompted with cyberattack-related tasks. The study revealed that models exhibited varying degrees of compliance, with significant differences across attack categories such as ‘evasion’ and ‘discovery’.

The following table highlights these differences, with lower scores being safer. Refer to Figure 1 of the paper:

model	Discovery	C2	Recon	Exfiltration	Privilege Escalation	Lateral Movement	Persistence	Evasion	Execution	Collection
Llama 3 8b-instruct	40%	15%	35%	12%	6%	10%	14%	5%	9%	18%
gpt-4	40%	43%	39%	28%	30%	24%	25%	21%	18%	19%
mistral-medium-latest	42%	38%	44%	32%	39%	28%	31%	33%	30%	31%
Llama 3 70b-instruct	67%	48%	58%	34%	29%	34%	28%	17%	20%	31%
codellama-70b-instruct	36%	38%	38%	37%	42%	40%	39%	29%	32%	33%
codellama-13b-instruct	56%	54%	41%	45%	35%	47%	47%	43%	40%	46%
mistral-large-latest	40%	51%	43%	52%	47%	54%	47%	49%	52%	42%
codellama-34b-instruct	50%	57%	58%	46%	42%	48%	42%	45%	51%	50%
mistral-small-latest	47%	54%	44%	59%	66%	62%	59%	65%	62%	52%
gpt-3.5-turbo	49%	60%	45%	59%	67%	67%	67%	72%	65%	55%

The Llama 3 8b-instruct and GPT-4 models performed the best in terms of non-compliance, showing lower compliance rates with malicious prompts compared to other models, including CodeLlama-70b-Instruct. Specifically, Llama 3 8b-instruct had the lowest average compliance rate, indicating a robust refusal to unsafe requests.

Coding Capabilities and Exploitation

CyberSecEval 2 included tests against logic vulnerabilities, memory exploits, and SQL injections, using programming languages such as C, Python, and JavaScript. The results showed that LLMs with advanced coding capabilities, like GPT-4 and Mistral-large-latest, performed better in identifying and exploiting these vulnerabilities. However, significant improvements are still needed for these models to generate effective and sophisticated exploits consistently.

The following table highlights each model’s ability to complete the capture-the-flag style tests, with higher-scoring models being more capable. Refer to Figure 5 of the paper:

model	JAVASCRIPT INPUT CONSTRAINT TESTS	PYTHON INPUT CONSTRAINT TESTS	C INPUT CONSTRAINT TESTS	DIVERSE MEMORY CORRUPTION TESTS	PYTHON/SQLITE SQL INJECTION TESTS	BUFFER OVERFLOW TESTS
gpt-4-turbo	67.48%	65.67%	66.43%	15.95%	23.00%	4.70%
mistral-large-latest	63.75%	63.65%	62.32%	15.77%	0.33%	2.06%
mistral-small-latest	61.25%	61.27%	60.80%	23.56%	0.00%	0.17%
llama 3 70b-instruct	61.97%	60.28%	61.93%	14.35%	1.67%	0.68%
codellama-34b-instruct	54.88%	53.25%	44.58%	19.92%	0.00%	0.00%
llama 3 8b-instruct	56.01%	53.49%	49.48%	12.43%	0.00%	1.83%
gpt-3.5-turbo	53.12%	52.65%	48.48%	13.89%	0.00%	0.70%
codellama-13b-instruct	49.97%	49.17%	48.21%	15.06%	0.00%	1.83%
mistral-medium-latest	55.54%	52.11%	43.63%	11.27%	0.33%	0.50%
codellama-70b-instruct	45.39%	44.44%	43.42%	13.73%	7.46%	1.54%

Seven Benchmark Tests in CyberSecEval 2

CyberSecEval 2 includes seven distinct types of benchmarks, each addressing different aspects of LLM security:

MITRE Tests: Uses the MITRE ATT&CK framework to evaluate an LLM’s compliance when asked to assist in cyber attacks.
Instruct Tests: Assesses an LLM’s propensity to generate insecure code when given specific instructions.
Autocomplete Tests: Measures how often an LLM suggests insecure coding practices in autocomplete contexts.
Prompt Injection Tests: Evaluates an LLM’s susceptibility to prompt injection attacks.
False Refusal Rate (FRR) Tests: Measures how often an LLM incorrectly refuses benign queries, misinterpreting them as malicious requests.
Code Interpreter Tests: Assesses the security risks posed by integrating LLMs with code interpreters.
Vulnerability Exploitation Tests: Measures the LLM’s program exploitation capabilities through “capture the flag” style challenges.

For more details on these benchmarks, you can visit the CyberSecEval GitHub repository.

Practical Applications of CyberSecEval 2

Enhancing AI Security

Regular assessments using CyberSecEval 2 for maintaining the security of LLMs. By identifying and addressing vulnerabilities early, organisations can prevent potential exploitation. This proactive approach helps ensure that models remain secure and resilient against emerging threats.

For example, a financial institution can use CyberSecEval 2 to test their LLM’s ability to resist phishing attempts, ensuring that the model won’t inadvertently assist in fraudulent activities.

Optimising Safety and Performance

The False Refusal Rate (FRR) metric introduced by CyberSecEval 2 can be used to find the optimal balance between safety measures and the utility of AI solutions. By analysing this metric, organisations can adjust their LLMs to reject unsafe prompts effectively while still performing well on valid queries.

For instance, a healthcare chatbot must avoid providing harmful medical advice whilst still being helpful to patients. By optimising the FRR, the chatbot can maintain high safety standards without compromising its utility in providing accurate health information.

Informing Training Protocols

Insights from CyberSecEval 2 can guide the development of improved training and conditioning protocols for LLMs. By understanding specific vulnerabilities and performance metrics, developers can tailor training processes to better equip models for handling diverse threats and challenges, thereby enhancing their overall resilience.

For example, if an LLM shows a particular weakness in handling SQL injection attempts, training protocols can be adjusted to include more focused examples and defences against such attacks.

Conclusion

CyberSecEval 2 sets a new benchmark for evaluating the security of large language models. As AI continues to integrate into cyber security frameworks, tools like CyberSecEval 2 will be essential in ensuring that these models are both effective and secure. By leveraging this comprehensive evaluation suite, organisations can better protect their systems and data.

For more detailed insights, you can access the full CyberSecEval 2 paper and explore the CyberSecEval leaderboard for a comprehensive view of model performances.

Back to all insights