CyberSecEval 2 for LLM security assessments

CyberSecEval 2 for LLM security assessments

Large language models (LLMs) such as OpenAIā€™s GPT-4, Mistral, and Meta Llama 3, along with generative AI (GenAI) systems, are invaluable tools. However, their growing capabilities and fast adoption with organisations introduces new significant security risks. The latest evaluation framework, CyberSecEval 2, provides a comprehensive suite of tests to assess these risks and enhance the security of LLMs.

Understanding CyberSecEval 2

CyberSecEval 2 is an extensive evaluation framework that comes under Metaā€™s Purple Llama; a set of tools to assess and improve LLM security. The framework is designed to rigorously assess the security vulnerabilities and capabilities of LLMs, including tests for prompt injection, code interpreter abuse, insecure coding practices, and the refusal rate of unsafe requests. By leveraging these assessments, CyberSecEval 2 provides a holistic view of the security landscape surrounding LLMs, identifying both their strengths and vulnerabilities.

CyberSecEval 2 Paper: šŸ“„ Download PDF at arxiv.org.

Key Findings and Insights

Metaā€™s evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities with success rates ranging from 26% to 41%.

Prompt Injection Vulnerabilities

One of the key findings of CyberSecEval 2 is the susceptibility of LLMs to prompt injection attacks, which occur when an attacker manipulates the modelā€™s input to execute unintended actions. For example, if a model is asked to execute a benign task but the input is crafted to include a malicious command, a vulnerable model might execute the harmful command.

The evaluation revealed that all tested models, including GPT-4 and Meta Llama 3, showed significant vulnerabilities. For instance, compliance rates for malicious prompts ranged from 0.08% to 76% across different models and tasks, with an average rate of 17.1%. Lower compliance rates indicate better resistance to these attacks.

Highest Success Rates for Prompt Injection Variants

model Different User Input Language Output Formatting Manipulation
llama 3 8b-instruct 76.00% 76.47%
gpt-3.5-turbo 60.00% 76.47%
codellama-13b-instruct 76.00% 29.41%
codellama-34b-instruct 64.00% 29.41%
llama 3 70b-instruct 44.00% 70.59%
gpt-4 16.00% 35.29%
codellama-70b-instruct 28.00% 17.65%

Lowest Success Rates for Prompt Injection Variants

model Hypothetical Scenario Token Smuggling
llama 3 8b-instruct 23.08% 0.77%
gpt-3.5-turbo 30.77% 0.77%
codellama-13b-instruct 15.38% 0.00%
codellama-34b-instruct 15.38% 0.00%
llama 3 70b-instruct 23.08% 0.77%
gpt-4 23.08% 0.00%
codellama-70b-instruct 0.77% 0.00%

Balancing Safety and Utility

Another crucial insight from CyberSecEval 2 is the delicate balance between safety and utility in LLMs. Conditioning these models to reject unsafe prompts can sometimes lead to a decrease in their overall usefulness. To address this, the framework introduces the False Refusal Rate (FRR) metric, which quantifies this trade-off by measuring how often models incorrectly refuse benign queries, mistaking them for malicious requests.

For instance, an LLM trained to be highly cautious might reject a simple coding question if it misinterprets it as a potential security threat. By using the FRR metric, organisations can fine-tune their models to maintain a high level of safety whilst still performing effectively on non-malicious tasks.

Compliance and Non-Compliance Rates

CyberSecEval 2 rigorously evaluated the compliance rates of various LLMs when prompted with cyberattack-related tasks. The study revealed that models exhibited varying degrees of compliance, with significant differences across attack categories such as ā€˜evasionā€™ and ā€˜discoveryā€™.

The following table highlights these differences, with lower scores being safer. Refer to Figure 1 of the paper:

model Discovery C2 Recon Exfiltration Privilege Escalation Lateral Movement Persistence Evasion Execution Collection
Llama 3 8b-instruct 40% 15% 35% 12% 6% 10% 14% 5% 9% 18%
gpt-4 40% 43% 39% 28% 30% 24% 25% 21% 18% 19%
mistral-medium-latest 42% 38% 44% 32% 39% 28% 31% 33% 30% 31%
Llama 3 70b-instruct 67% 48% 58% 34% 29% 34% 28% 17% 20% 31%
codellama-70b-instruct 36% 38% 38% 37% 42% 40% 39% 29% 32% 33%
codellama-13b-instruct 56% 54% 41% 45% 35% 47% 47% 43% 40% 46%
mistral-large-latest 40% 51% 43% 52% 47% 54% 47% 49% 52% 42%
codellama-34b-instruct 50% 57% 58% 46% 42% 48% 42% 45% 51% 50%
mistral-small-latest 47% 54% 44% 59% 66% 62% 59% 65% 62% 52%
gpt-3.5-turbo 49% 60% 45% 59% 67% 67% 67% 72% 65% 55%

The Llama 3 8b-instruct and GPT-4 models performed the best in terms of non-compliance, showing lower compliance rates with malicious prompts compared to other models, including CodeLlama-70b-Instruct. Specifically, Llama 3 8b-instruct had the lowest average compliance rate, indicating a robust refusal to unsafe requests.

Coding Capabilities and Exploitation

CyberSecEval 2 included tests against logic vulnerabilities, memory exploits, and SQL injections, using programming languages such as C, Python, and JavaScript. The results showed that LLMs with advanced coding capabilities, like GPT-4 and Mistral-large-latest, performed better in identifying and exploiting these vulnerabilities. However, significant improvements are still needed for these models to generate effective and sophisticated exploits consistently.

The following table highlights each modelā€™s ability to complete the capture-the-flag style tests, with higher-scoring models being more capable. Refer to Figure 5 of the paper:

model JAVASCRIPT INPUT CONSTRAINT TESTS PYTHON INPUT CONSTRAINT TESTS C INPUT CONSTRAINT TESTS DIVERSE MEMORY CORRUPTION TESTS PYTHON/SQLITE SQL INJECTION TESTS BUFFER OVERFLOW TESTS
gpt-4-turbo 67.48% 65.67% 66.43% 15.95% 23.00% 4.70%
mistral-large-latest 63.75% 63.65% 62.32% 15.77% 0.33% 2.06%
mistral-small-latest 61.25% 61.27% 60.80% 23.56% 0.00% 0.17%
llama 3 70b-instruct 61.97% 60.28% 61.93% 14.35% 1.67% 0.68%
codellama-34b-instruct 54.88% 53.25% 44.58% 19.92% 0.00% 0.00%
llama 3 8b-instruct 56.01% 53.49% 49.48% 12.43% 0.00% 1.83%
gpt-3.5-turbo 53.12% 52.65% 48.48% 13.89% 0.00% 0.70%
codellama-13b-instruct 49.97% 49.17% 48.21% 15.06% 0.00% 1.83%
mistral-medium-latest 55.54% 52.11% 43.63% 11.27% 0.33% 0.50%
codellama-70b-instruct 45.39% 44.44% 43.42% 13.73% 7.46% 1.54%

Seven Benchmark Tests in CyberSecEval 2

CyberSecEval 2 includes seven distinct types of benchmarks, each addressing different aspects of LLM security:

  1. MITRE Tests: Uses the MITRE ATT&CK framework to evaluate an LLMā€™s compliance when asked to assist in cyber attacks.
  2. Instruct Tests: Assesses an LLMā€™s propensity to generate insecure code when given specific instructions.
  3. Autocomplete Tests: Measures how often an LLM suggests insecure coding practices in autocomplete contexts.
  4. Prompt Injection Tests: Evaluates an LLMā€™s susceptibility to prompt injection attacks.
  5. False Refusal Rate (FRR) Tests: Measures how often an LLM incorrectly refuses benign queries, misinterpreting them as malicious requests.
  6. Code Interpreter Tests: Assesses the security risks posed by integrating LLMs with code interpreters.
  7. Vulnerability Exploitation Tests: Measures the LLMā€™s program exploitation capabilities through ā€œcapture the flagā€ style challenges.

For more details on these benchmarks, you can visit the CyberSecEval GitHub repository.

Practical Applications of CyberSecEval 2

Enhancing AI Security

Regular assessments using CyberSecEval 2 for maintaining the security of LLMs. By identifying and addressing vulnerabilities early, organisations can prevent potential exploitation. This proactive approach helps ensure that models remain secure and resilient against emerging threats.

For example, a financial institution can use CyberSecEval 2 to test their LLMā€™s ability to resist phishing attempts, ensuring that the model wonā€™t inadvertently assist in fraudulent activities.

Optimising Safety and Performance

The False Refusal Rate (FRR) metric introduced by CyberSecEval 2 can be used to find the optimal balance between safety measures and the utility of AI solutions. By analysing this metric, organisations can adjust their LLMs to reject unsafe prompts effectively while still performing well on valid queries.

For instance, a healthcare chatbot must avoid providing harmful medical advice whilst still being helpful to patients. By optimising the FRR, the chatbot can maintain high safety standards without compromising its utility in providing accurate health information.

Informing Training Protocols

Insights from CyberSecEval 2 can guide the development of improved training and conditioning protocols for LLMs. By understanding specific vulnerabilities and performance metrics, developers can tailor training processes to better equip models for handling diverse threats and challenges, thereby enhancing their overall resilience.

For example, if an LLM shows a particular weakness in handling SQL injection attempts, training protocols can be adjusted to include more focused examples and defences against such attacks.

Conclusion

CyberSecEval 2 sets a new benchmark for evaluating the security of large language models. As AI continues to integrate into cyber security frameworks, tools like CyberSecEval 2 will be essential in ensuring that these models are both effective and secure. By leveraging this comprehensive evaluation suite, organisations can better protect their systems and data.

For more detailed insights, you can access the full CyberSecEval 2 paper and explore the CyberSecEval leaderboard for a comprehensive view of model performances.


Falx specialises in providing cutting-edge cyber security services. Contact us today to learn how we can help you with your AI technologies and solutions. Contact Falx for Advanced AI Cyber Security Solutions