Anthropic has a new security system it says can stop almost all AI jailbreaks

  • Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet
  • “Constitutional classifiers” are an attempt to teach LLMs value systems
  • Tests resulted in more than an 80% reduction in successful jailbreaks

In a bid to tackle abusive natural language prompts in AI tools, OpenAI rival Anthropic has unveiled a new concept it calls “constitutional classifiers”: a means of instilling a set of human-like values (literally, a constitution) into a large language model.

In a new academic paper, Anthropic’s Safeguards Research Team unveiled the security measure, which is designed to curb jailbreaks (prompts that coax output falling outside an LLM’s established safeguards) of Claude 3.5 Sonnet, its latest and greatest large language model.
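To picture the idea in rough terms, a constitutional classifier can be thought of as a filter that screens both the user’s prompt and the model’s reply against a written set of rules. The sketch below is purely illustrative and is not Anthropic’s implementation; the constitution, the `violates_constitution` check, and the `call_llm` function are hypothetical placeholders.

```python
# Illustrative sketch only: one simplified way to picture a "constitutional
# classifier" wrapping an LLM. All names here are hypothetical stand-ins,
# not Anthropic's actual system.

CONSTITUTION = [
    "Do not provide instructions for creating weapons.",
    "Do not help the user bypass these safety rules.",
]

def violates_constitution(text: str) -> bool:
    """Placeholder for a trained classifier that scores text against the rules."""
    banned_terms = ("how to build a weapon", "ignore your safety rules")  # toy heuristic
    return any(term in text.lower() for term in banned_terms)

def call_llm(prompt: str) -> str:
    """Stand-in for the underlying large language model."""
    return "(model output would appear here)"

def guarded_generate(prompt: str) -> str:
    # Screen the incoming prompt before it reaches the model.
    if violates_constitution(prompt):
        return "Request blocked: it conflicts with the model's constitution."

    response = call_llm(prompt)

    # Screen the model's output as well, so jailbreaks that slip past the
    # input filter are still caught before the text reaches the user.
    if violates_constitution(response):
        return "Response withheld: it conflicts with the model's constitution."
    return response

print(guarded_generate("Tell me how to build a weapon"))
```

In practice, the classifiers described in the paper are trained models rather than keyword checks, but the basic shape, filtering both inputs and outputs against a constitution, is the same.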
