The post OpenAI Releases Double-Checking Tool For AI Safeguards That Handily Allows Customizations appeared on BitcoinEthereumNews.com. AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal. getty In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration. The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request. OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations. Let’s talk about it. This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here). The Importance Of AI Safeguards One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them. AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs. There is a… The post OpenAI Releases Double-Checking Tool For AI Safeguards That Handily Allows Customizations appeared on BitcoinEthereumNews.com. AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal. getty In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration. The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request. OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations. Let’s talk about it. This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here). The Importance Of AI Safeguards One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them. AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs. There is a…

OpenAI Releases Double-Checking Tool For AI Safeguards That Handily Allows Customizations

2025/11/04 17:25

AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal.

getty

In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration.

The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request.

OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations.

Let’s talk about it.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

The Importance Of AI Safeguards

One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them.

AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs.

There is a wide range of ways that unsafe messages can arise. Generative AI can produce so-called AI hallucinations or confabulations that tell a user to do something untoward, but the person assumes that the AI is being honest and apt in what has been generated. That’s unsafe. Another way that AI can be unsafe is if an evildoer asks the AI to explain how to make a bomb or produce a toxic chemical. Society doesn’t want that type of easy-peasy means of figuring out dastardly tasks.

Another unsafe angle is for AI to aid people in concocting delusions and delusional thinking, see my coverage at the link here. The AI will either prod a person into conceiving of a delusion or might detect that a delusion is already on their mind and aid in embellishing the delusion. The preference is that AI provides upside mental health advice over downside mental health guidance.

Devising And Testing AI Safeguards

I’m sure you’ve heard the famous line that you ought to try it before you buy it, meaning that sometimes being able to try out an item is highly valuable before making a full commitment to the item. The same wisdom applies to AI safeguards.

Rather than simply tossing AI safeguards into an LLM that is actively being used by perhaps millions upon millions of people (sidenote: ChatGPT is being used by 800 million weekly active users), we’d be smarter to try out the AI safeguards and see if they do what they are supposed to do.

An AI safeguard should catch or prevent whatever unsafe messages we believe need to be stopped. There is a tradeoff involved since an AI safeguard can become an overreach. Imagine that we decide to adopt an AI safeguard that prevents anyone from ever making use of the word “chemicals” because we hope to avoid allowing a user to find out about toxic chemicals.

Well, denying the use of the word “chemicals” is an exceedingly bad way to devise an AI safeguard. Imagine all the useful and fair uses of the word “chemicals” that can arise. Here’s an example of an innocent request. People might be worried that their household products might contain adverse chemicals, so they ask the AI about this. An AI safeguard that blindly stopped any mention of chemicals would summarily turn down that legitimate request.

The crux is that AI safeguards can be very tricky when it comes to writing them and ensuring that they do the right things (see my discussion on this, at the link here). The preference is that an AI safeguard stops the things we want to stop, but doesn’t go overboard and stop things that we are fine with having proceed. A poorly devised AI safeguard will indubitably produce a vast number of false positives, meaning that it will stop an otherwise upside and allowable action.

If possible, we should try out any proposed AI safeguards before putting them into active action.

Using Classifiers To Help Out

There are online tools that can be used by AI developers to assist in classifying whether a given snippet of text is considered safe versus unsafe. Usually, these classifiers have been pretrained on what constitutes safety and what constitutes being unsafe. The beauty of these classifiers is that an AI developer can simply feed various textual content into the tool and see which, if any, of the AI safeguards embedded into the tool will react.

One difficulty is that those kinds of online tools don’t necessarily allow you to plug in your own proposed AI safeguards. Instead, the AI safeguards are essentially baked into the tool. You can then decide whether those are the same AI safeguards you’d like to implement in your LLM.

A more accommodating approach would be to allow an AI developer to feed in their proposed AI safeguards. We shall refer to those AI safeguards as policies. An AI developer would work with other stakeholders and come up with a slate of policies about what AI safeguards are desired. Those policies then could be entered into a tool that would readily try out those policies on behalf of the AI developer and their stakeholders.

To test the proposed policies, an AI developer would need to craft text to be used during the testing or perhaps grab relevant text from here or there. The aim is to have a sufficient variety and volume of text that the desired AI safeguards all ultimately get a chance to shine in the spotlight. If we have an AI safeguard that is proposed to catch references to toxic chemicals, the text that is being used for testing ought to contain some semblance of references to toxic chemicals; otherwise, the testing process won’t be suitably engaged and revealing about the AI safeguards.

OpenAI’s New Tool For AI Safeguard Testing

In a blog posting by OpenAI on October 29, 2025, entitled “Introducing gpt-oss-safeguard”, the well-known AI maker announced the availability of an AI safeguard testing tool:

  • “Safety classifiers, which distinguish safe from unsafe content in a particular risk area, have long been a primary layer of defense for our own and other large language models.”
  • “Today, we’re releasing a research preview of gpt-oss-safeguard, our open-weight reasoning models for safety classification tasks, available in two sizes: gpt-oss-safeguard-120b and gpt-oss-safeguard-20b.”
  • “The gpt-oss-safeguard models use reasoning to directly interpret a developer-provided policy at inference time — classifying user messages, completions, and full chats according to the developer’s needs.”
  • “The model uses chain-of-thought, which the developer can review to understand how the model is reaching its decisions. Additionally, the policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance.”

As per the cited indications, you can use the new tool to try out your proposed AI safeguards. You provide a set of policies that represent the proposed AI safeguards, and also provide whatever text is to be used during the testing. The tool attempts to apply the proposed AI safeguards to the given text. An AI developer then receives a report analyzing how the policies performed with respect to the provided text.

Iteratively Using Such A Tool

An AI developer would likely use such a tool on an iterative basis.

Here’s how that goes. You draft policies of interest. You devise or collect suitable text for testing purposes. Those policies and text get fed into the tool. You inspect the reports that provide an analysis of what transpired. The odds are that some of the text that should have triggered an AI safeguard did not do so. Also, there is a chance that some AI safeguards were triggered even though the text per se should not have set them off.

Why can that happen?

In the case of this particular tool, a chain-of-thought (CoT) explanation is being provided to help ferret out the culprit. The AI developer could review the CoT to discern what went wrong, namely, whether the policy was insufficiently worded or the text wasn’t sufficient to trigger the AI safeguard. For more about the usefulness of chain-of-thought in contemporary AI, see my discussion at the link here.

A series of iterations would undoubtedly take place. Change the policies or AI safeguards and make another round of runs. Adjust the text or add more text, and make another round of runs. Keep doing this until there is a reasonable belief that enough testing has taken place.

Rinse and repeat is the mantra at hand.

Hard Questions Need To Be Asked

There is a slew of tough questions that need to be addressed during this testing and review process.

First, how many tests or how many iterations are enough to believe that the AI safeguards are good to go? If you try too small a number, you are likely deluding yourself into believing that the AI safeguards have been “proven” as ready for use. It is important to perform somewhat extensive and exhaustive testing. One means of approaching this is by using rigorous validation techniques, as I’ve explained at the link here.

Second, make sure to include trickery in the text that is being used for the testing process.

Here’s why. People who use AI are often devious in trying to circumvent AI safeguards. Some people do so for evil purposes. Others like to fool AI just to see if they can do so. Another perspective is that a person tricking AI is doing so on behalf of society, hoping to reveal otherwise hidden gotchas and loopholes. In any case, the text that you feed into the tool ought to be as tricky as you can make it. Put yourself into the shoes of the tricksters.

Third, keep in mind that the policies and AI safeguards are based on human-devised natural language. I point this out because a natural language such as English is difficult to pin down due to inherent semantic ambiguities. Think of the number of laws and regulations that have loopholes due to a word here or there that is interpreted in a multitude of ways. The testing of AI safeguards is slippery because you are testing on the merits of human language interpretability.

Fourth, even if you do a bang-up job of testing your AI safeguards, they might need to be revised or enhanced. Do not assume that just because you tested them a week ago, a month ago, or a year ago, they are still going to stand up today. The odds are that you will need to continue to undergo a cat-and-mouse gambit, whereby AI users are finding insidious ways to circumvent the AI safeguards that you thought had been tested sufficiently.

Keep your nose to the grind.

Thinking Thoughtfully

An AI developer could use a tool like this as a standalone mechanism. They proceed to test their proposed AI safeguards and then subsequently apply the AI safeguards to their targeted LLM.

An additional approach would be to incorporate this capability into the AI stack that you are developing. You could place this tool as an embedded component within a mixture of LLM and other AI elements. A key aspect will be the proficiency in running, since you are now putting the tool into the stream of what is presumably going to be a production system. Make sure that you appropriately gauge the performance of the tool.

Going even further outside the box, you might have other valuable uses for a classifier that allows you to provide policies and text to be tested against. In other words, this isn’t solely about AI safeguards. Any other task that entails doing a natural language head-to-head between stated policies and whether the text activates or triggers those policies can be equally undertaken with this kind of tool.

I want to emphasize that this isn’t the only such tool in the AI community. There are others. Make sure to closely examine whichever one you might find relevant and useful to you. In the case of this particular tool, since it is brought to the market by OpenAI, you can bet it will garner a great deal of attention. More fellow AI developers will likely know about it than would a similar tool provided by a lesser-known firm.

AI Safeguards Need To Do Their Job

I noted at the start of this discussion that we need to figure out what kinds of AI safeguards will keep society relatively safe when it comes to the widespread use of AI. This is a monumental task. It requires technological savviness and societal acumen since it has to deal with both AI and human behaviors.

OpenAI has opined that their new tool provides a “bring your own policies and definitions of harm” design, which is a welcome recognition that we need to keep pushing forward on wrangling with AI safeguards. Up until recently, AI safeguards generally seemed to be a low priority overall and given scant attention by AI makers and society at large. The realization now is that for the good and safety of all of us, we must stridently pursue AI safeguards, else we endanger ourselves on a massive scale.

As the famed Brigadier General Thomas Francis Meagher once remarked: “Great interests demand great safeguards.”

Source: https://www.forbes.com/sites/lanceeliot/2025/11/04/openai-releases-double-checking-tool-for-ai-safeguards-that-handily-allows-customizations/

Piyasa Fırsatı
Sleepless AI Logosu
Sleepless AI Fiyatı(AI)
$0.03709
$0.03709$0.03709
-0.77%
USD
Sleepless AI (AI) Canlı Fiyat Grafiği
Sorumluluk Reddi: Bu sitede yeniden yayınlanan makaleler, halka açık platformlardan alınmıştır ve yalnızca bilgilendirme amaçlıdır. MEXC'nin görüşlerini yansıtmayabilir. Tüm hakları telif sahiplerine aittir. Herhangi bir içeriğin üçüncü taraf haklarını ihlal ettiğini düşünüyorsanız, kaldırılması için lütfen [email protected] ile iletişime geçin. MEXC, içeriğin doğruluğu, eksiksizliği veya güncelliği konusunda hiçbir garanti vermez ve sağlanan bilgilere dayalı olarak alınan herhangi bir eylemden sorumlu değildir. İçerik, finansal, yasal veya diğer profesyonel tavsiye niteliğinde değildir ve MEXC tarafından bir tavsiye veya onay olarak değerlendirilmemelidir.

Ayrıca Şunları da Beğenebilirsiniz

Trump-Backed WLFI Plunges 58% – Buyback Plan Announced to Halt Freefall

Trump-Backed WLFI Plunges 58% – Buyback Plan Announced to Halt Freefall

World Liberty Financial (WLFI), the Trump-linked DeFi project, is scrambling to stop a market collapse after its token lost over 50% of its value in September. On Friday, the project unveiled a full buyback-and-burn program, directing all treasury liquidity fees to absorb selling pressure. According to a governance post on X, the community approved the plan overwhelmingly, with WLFI pledging full transparency for every burn. The urgency of the move reflects WLFI’s steep losses in recent weeks. WLFI is trading Friday at $0.19, down from its September 1 peak of $0.46, according to CoinMarketCap, a 58% drop in less than a month. Weekly losses stand at 12.85%, with a 15.45% decline for the month. This isn’t the project’s first attempt at intervention. Just days after launch, WLFI burned 47 million tokens on September 3 to counter a 31% sell-off, sending the supply to a verified burn address. For World Liberty Financial, the buyback-and-burn program represents both a damage-control measure and a test of community faith. While tokenomics adjustments can provide short-term relief, the project will need to convince investors that WLFI has staying power beyond interventions. WLFI Launches Buyback-and-Burn Plan, Linking Token Scarcity to Platform Growth According to the governance proposal, WLFI will use fees generated from its protocol-owned liquidity (POL) pools on Ethereum, BNB Chain, and Solana to repurchase tokens from the open market. Once bought back, the tokens will be sent to a burn address, permanently removing them from circulation.WLFI Proposal Source: WLFI The project stressed that this system ties supply reduction directly to platform growth. As trading activity rises, more liquidity fees are generated, fueling larger buybacks and burns. This seeks to create a feedback loop where adoption drives scarcity, and scarcity strengthens token value. Importantly, the plan applies only to WLFI’s protocol-controlled liquidity pools. Community and third-party liquidity pools remain unaffected, ensuring the mechanism doesn’t interfere with external ecosystem contributions. In its proposal, the WLFI team argued that the strategy aligns long-term holders with the project’s future by systematically reducing supply and discouraging short-term speculation. Each burn increases the relative stake of committed investors, reinforcing confidence in WLFI’s tokenomics. To bolster credibility, WLFI has pledged full transparency: every buyback and burn will be verifiable on-chain and reported to the community in real time. WLFI Joins Hyperliquid, Jupiter, and Sky as Buyback Craze Spills Into Wall Street WLFI’s decision to adopt a full buyback-and-burn strategy places it among the most ambitious tokenomic models in crypto. While partly a response to its sharp September price decline, the move also reflects a trend of DeFi protocols leveraging revenue streams to cut supply, align incentives, and strengthen token value. Hyperliquid illustrates the model at scale. Nearly all of its platform fees are funneled into automated $HYPE buybacks via its Assistance Fund, creating sustained demand. By mid-2025, more than 20 million tokens had been repurchased, with nearly 30 million held by Q3, worth over $1.5 billion. This consistency both increased scarcity and cemented Hyperliquid’s dominance in decentralized derivatives. Other protocols have adopted variations. Jupiter directs half its fees into $JUP repurchases, locking tokens for three years. Raydium earmarks 12% of fees for $RAY buybacks, already removing 71 million tokens, roughly a quarter of the circulating supply. Burn-based models push further, as seen with Sky, which has spent $75 million since February 2025 to permanently erase $SKY tokens, boosting scarcity and governance influence. But the buyback phenomenon isn’t limited to DeFi. Increasingly, listed companies with crypto treasuries are adopting aggressive repurchase programs, sometimes to offset losses as their digital assets decline. According to a report, at least seven firms, ranging from gaming to biotech, have turned to buybacks, often funded by debt, to prop up falling stock prices. One of the latest is Thumzup Media, a digital advertising company with a growing Web3 footprint. On Thursday, it launched a $10 million share repurchase plan, extending its capital return strategy through 2026, after completing a $1 million program that saw 212,432 shares bought at an average of $4.71. DeFi Development Corp, the first public company built around a Solana-based treasury strategy, also recently expanded its buyback program to $100 million, up from $1 million, making it one of the largest stock repurchase initiatives in the digital asset sector. Together, these cases show how buybacks, whether in tokenomics or equities, are emerging as a key mechanism for stabilizing value and signaling confidence, even as motivations and execution vary widely
Paylaş
CryptoNews2025/09/26 19:12
Son of filmmaker Rob Reiner charged with homicide for death of his parents

Son of filmmaker Rob Reiner charged with homicide for death of his parents

FILE PHOTO: Rob Reiner, director of "The Princess Bride," arrives for a special 25th anniversary viewing of the film during the New York Film Festival in New York
Paylaş
Rappler2025/12/16 09:59
Bitcoin Peak Coming in 45 Days? BTC Price To Reach $150K

Bitcoin Peak Coming in 45 Days? BTC Price To Reach $150K

The post Bitcoin Peak Coming in 45 Days? BTC Price To Reach $150K appeared first on Coinpedia Fintech News Bitcoin has delivered one of its strongest performances in recent months, jumping from September lows of $108K to over $117K today. But while excitement is high, market watchers warn the clock is ticking.  History shows Bitcoin peaks don’t last forever, and analysts now believe the next major top could arrive within just 45 days, with …
Paylaş
CoinPedia2025/09/18 15:49