Anthropic advances AI with Claude 3.5 Haiku, improved Claude 3.5 Sonnet with computer control, and Palantir partnership for US Government Intelligence and Defense Operations
Anthropic has released new versions of Claude 3.5 with improved capabilities versus prior Claude models and competing AI models. A partnership with Palantir provides US government intelligence and defense operations with Claude AI accredited for use on SECRET-level documents.
Claude 3.5 Sonnet has been improved with direct computer use capabilities. This allows the AI to control and operate computers by moving the mouse, opening apps, interacting with windows, and using software tools like a human. The newly added capability was tested on the OSWorld benchmark for open-ended tasks, where it scored 14.9%, nearly twice as well as competing AI, but well behind humans at 72.36%. The reason for this gap is the lack of experience in teaching Claude how to operate computers. In other words, training AI to properly operate a computer remains difficult for tasks such as updating a spreadsheet with new data from several files.
A faster, smaller model, Claude 3.5 Haiku, has also been released without computer use capabilities. It is designed to respond quickly, rather than taking several seconds to think of an answer, while using far fewer compute resources. This results in a much lower cost when answering simpler questions. In a direct comparison with OpenAI GPT-4o, a competing mini AI model, Haiku consistently performs better.
Anthropic and Palantir have released a siloed Claude AI in partnership with Amazon Web Services (AWS) for US government use on classified documents. The Department of Defense (DoD) IL6-accredited service lets US agencies complete complex tasks faster and with less human workload, such as identifying and targeting key targets, while protecting America.
In addition to Claude apps for Android and Apple smartphones, Anthropic has released beta versions of Claude for Windows and Mac desktops. Readers with more pedestrian uses of AI can try out the Plaud AI voice recorder (sold on Amazon) to automatically transcribe and summarize hours of dull stand-ups.
Source(s)
Anthropic ====
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
Oct 22, 2024
Update (11/04/2024): We have revised the pricing for Claude 3.5 Haiku. The model is now priced at $1 per million input tokens (MTok) and $5 per million output tokens.
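To make the revised pricing concrete, here is a minimal sketch of the per-request arithmetic. The two prices come from the update note; the function name and the example token counts are hypothetical and only illustrate the calculation.

```python
# Prices from the update note: USD per million tokens (MTok).
INPUT_USD_PER_MTOK = 1.00
OUTPUT_USD_PER_MTOK = 5.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the revised Claude 3.5 Haiku prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical workload: 2,000 input tokens and 500 output tokens per request.
cost = request_cost_usd(2_000, 500)
print(f"${cost:.4f} per request")  # $0.0045 per request
```

Because output tokens cost five times as much as input tokens, the 500 output tokens in this example account for more of the bill than the 2,000 input tokens.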
Today, we’re announcing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet delivers across-the-board improvements over its predecessor, with particularly significant gains in coding—an area where it already led the field. Claude 3.5 Haiku matches the performance of Claude 3 Opus, our prior largest model, on many evaluations at a similar speed to the previous generation of Haiku.
We’re also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental—at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.
Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete. For example, Replit is using Claude 3.5 Sonnet's capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they’re being built for their Replit Agent product.
The upgraded Claude 3.5 Sonnet is now available for all users. Starting today, developers can build with the computer use beta on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. The new Claude 3.5 Haiku will be released later this month.
Claude 3.5 Sonnet: Industry-leading software engineering skills
The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.
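The benchmark figures above can be read either as gains in percentage points or as relative improvements over the previous score. The short sketch below computes both from the numbers quoted in the paragraph; the helper name gains is an illustrative choice, not something from the announcement.

```python
# (benchmark, previous score in %, upgraded score in %) as quoted above.
scores = [
    ("SWE-bench Verified", 33.4, 49.0),
    ("TAU-bench retail", 62.6, 69.2),
    ("TAU-bench airline", 36.0, 46.0),
]


def gains(old: float, new: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    return new - old, (new - old) / old * 100


for name, old, new in scores:
    abs_gain, rel_gain = gains(old, new)
    print(f"{name}: +{abs_gain:.1f} points ({rel_gain:.1f}% relative)")
```

The two views differ noticeably: the SWE-bench Verified change is 15.6 percentage points, which is a relative improvement of roughly 47% over the earlier score.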
Early customer feedback suggests the upgraded Claude 3.5 Sonnet represents a significant leap for AI-powered coding. GitLab, which tested the model for DevSecOps tasks, found it delivered stronger reasoning (up to 10% across use cases) with no added latency, making it an ideal choice to power multi-step software development processes. Cognition uses the new Claude 3.5 Sonnet for autonomous AI evaluations, and experienced substantial improvements in coding, planning, and problem-solving compared to the previous version. The Browser Company, in using the model for automating web-based workflows, noted Claude 3.5 Sonnet outperformed every model they’ve tested before.
As part of our continued effort to partner with external experts, joint pre-deployment testing of the new Claude 3.5 Sonnet model was conducted by the US AI Safety Institute (US AISI) and the UK AI Safety Institute (UK AISI).
We also evaluated the upgraded Claude 3.5 Sonnet for catastrophic risks and found that the ASL-2 Standard, as outlined in our Responsible Scaling Policy, remains appropriate for this model.
Claude 3.5 Haiku: State-of-the-art meets affordability and speed
Claude 3.5 Haiku is the next generation of our fastest model. For a similar speed to Claude 3 Haiku, Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus, the largest model in our previous generation, on many intelligence benchmarks. Claude 3.5 Haiku is particularly strong on coding tasks. For example, it scores 40.6% on SWE-bench Verified, outperforming many agents using publicly available state-of-the-art models—including the original Claude 3.5 Sonnet and GPT-4o.
With low latency, improved instruction following, and more accurate tool use, Claude 3.5 Haiku is well suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from huge volumes of data—like purchase history, pricing, or inventory records.
Claude 3.5 Haiku will be made available later this month across our first-party API, Amazon Bedrock, and Google Cloud’s Vertex AI—initially as a text-only model and with image input to follow.
Teaching Claude to navigate computers, responsibly
With computer use, we're trying something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people. Developers can use this nascent capability to automate repetitive processes, build and test software, and conduct open-ended tasks like research.
To make these general skills possible, we've built an API that allows Claude to perceive and interact with computer interfaces. Developers can integrate this API to enable Claude to translate instructions (e.g., “use data from my computer and online to fill out this form”) into computer commands (e.g. check a spreadsheet; move the cursor to open a web browser; navigate to the relevant web pages; fill out a form with the data from those pages; and so on). On OSWorld, which evaluates AI models' ability to use computers like people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category—notably better than the next-best AI system's score of 7.8%. When afforded more steps to complete the task, Claude scored 22.0%.
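The instruction-to-commands flow described above can be pictured as an observe, act, repeat loop with a step budget. The sketch below is a local simulation only: it does not call Anthropic's API, and FakeDesktop, scripted_model, run, and the action names are hypothetical stand-ins chosen for illustration, not the actual tool schema.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Stand-in for a real screen; it only records what a controller would do."""
    cursor: tuple = (0, 0)
    log: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real setup would return image data; this returns a text description.
        return f"cursor at {self.cursor}, {len(self.log)} actions so far"

    def apply(self, action: dict) -> None:
        kind = action["type"]
        if kind == "mouse_move":
            self.cursor = (action["x"], action["y"])
        elif kind not in ("left_click", "type"):
            raise ValueError(f"unsupported action: {kind}")
        self.log.append(action)


def scripted_model(instruction: str, observation: str, step: int):
    """Hypothetical placeholder for the model: next action, or None when done."""
    plan = [
        {"type": "mouse_move", "x": 120, "y": 340},
        {"type": "left_click"},
        {"type": "type", "text": "Jane Doe"},
    ]
    return plan[step] if step < len(plan) else None


def run(instruction: str, desktop: FakeDesktop, max_steps: int = 10) -> list:
    """Observe, ask for one action, apply it; stop when done or out of steps."""
    for step in range(max_steps):
        action = scripted_model(instruction, desktop.screenshot(), step)
        if action is None:
            break
        desktop.apply(action)
    return desktop.log


desktop = FakeDesktop()
actions = run("fill out this form", desktop)
print(len(actions), desktop.cursor)  # 3 (120, 340)
```

The max_steps parameter mirrors the step budget mentioned above: a task that needs more actions than the budget allows is cut off before it finishes, which is one way a larger step allowance can raise a benchmark score.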
While we expect this capability to improve rapidly in the coming months, Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks. Because computer use may provide a new vector for more familiar threats such as spam, misinformation, or fraud, we're taking a proactive approach to promote its safe deployment. We've developed new classifiers that can identify when computer use is being used and whether harm is occurring. You can read more about the research process behind this new skill, along with further discussion of safety measures, in our post on developing computer use.
Looking ahead
Learning from the initial deployments of this technology, which is still in its earliest stages, will help us better understand both the potential and the implications of increasingly capable AI systems.
We’re excited for you to explore our new models and the public beta of computer use—and welcome you to share your feedback with us. We believe these developments will open up new possibilities for how you work with Claude, and we look forward to seeing what you'll create.
Anthropic ====
Developing a computer use model
Oct 22, 2024
Claude can now use computers. The latest version of Claude 3.5 Sonnet can, when run through the appropriate software setup, follow a user’s commands to move a cursor around their computer’s screen, click on relevant locations, and input information via a virtual keyboard, emulating the way people interact with their own computer.
We think this skill—which is currently in public beta—represents a significant breakthrough in AI progress. Below, we share some insights from the research that went into developing computer use models—and into making them safer.
Why computer use?
Why is this new capability important? A vast amount of modern work happens via computers. Enabling AIs to interact directly with computer software in the same way people do will unlock a huge range of applications that simply aren’t possible for the current generation of AI assistants.
Over the last few years, many important milestones have been reached in the development of powerful AI—for example, the ability to perform complex logical reasoning and the ability to see and understand images. The next frontier is computer use: AI models that don’t have to interact via bespoke tools, but that instead are empowered to use essentially any piece of software as instructed.
The research process
Our previous work on tool use and multimodality provided the groundwork for these new computer use skills. Operating computers involves the ability to see and interpret images—in this case, images of a computer screen. It also requires reasoning about how and when to carry out specific operations in response to what’s on the screen. Combining these abilities, we trained Claude to interpret what’s happening on a screen and then use the software tools available to carry out tasks.
When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands—similar to how models often struggle with simple-seeming questions like “how many A’s in the word ‘banana’?”.
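The pixel counting described above amounts to working out a horizontal and vertical offset between the cursor and the target. The sketch below shows that arithmetic. The second helper, which maps a point from a downscaled screenshot back to real screen pixels, is an assumption added for illustration; the post does not say whether screenshots are scaled. All names and coordinates are hypothetical.

```python
def cursor_offset(current: tuple, target: tuple) -> tuple:
    """Return (dx, dy) in pixels from current to target; +x is right, +y is down."""
    return target[0] - current[0], target[1] - current[1]


def to_screen_coords(point: tuple, shot_size: tuple, screen_size: tuple) -> tuple:
    """Map a point in a (possibly downscaled) screenshot to real screen pixels."""
    scale_x = screen_size[0] / shot_size[0]
    scale_y = screen_size[1] / shot_size[1]
    return round(point[0] * scale_x), round(point[1] * scale_y)


# Hypothetical case: a button seen at (400, 300) in a 1024x768 screenshot
# of a 2048x1536 screen, with the cursor currently at (100, 100).
button = to_screen_coords((400, 300), (1024, 768), (2048, 1536))
print(button)                              # (800, 600)
print(cursor_offset((100, 100), button))   # (700, 500)
```

An error of a few pixels in either step can land the click on the wrong control, which is why accurate counting matters for issuing mouse commands.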
We were surprised by how rapidly Claude generalized from the computer-use training we gave it on just a few pieces of simple software, such as a calculator and a text editor (for safety reasons we did not allow the model to access the internet during training). In combination with Claude’s other skills, this training granted it the remarkable ability to turn a user’s written prompt into a sequence of logical steps and then take actions on the computer. We observed that the model would even self-correct and retry tasks when it encountered obstacles.
Although the subsequent advances came quickly once we made the initial breakthrough, it took a great deal of trial and error to get there. Some of our researchers noted that developing computer use was close to the “idealized” process of AI research they’d pictured when they first started in the field: constant iteration and repeated visits back to the drawing board until there was progress.
The research paid off. At present, Claude is state-of-the-art for models that use computers in the same way as a person does—that is, from looking at the screen and taking actions in response. On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.
Making computer use safe
Every advance in AI brings with it new safety challenges. Computer use is mainly a way of lowering the barrier to AI systems applying their existing cognitive skills, rather than fundamentally increasing those skills, so our chief concerns with computer use focus on present-day harms rather than future ones. We confirmed this by assessing whether computer use increases the risk of frontier threats as outlined in our Responsible Scaling Policy. We found that the updated Claude 3.5 Sonnet, including its new computer use skill, remains at AI Safety Level 2—that is, it doesn’t require a higher standard of safety and security measures than those we currently have in place.
When future models require AI Safety Level 3 or 4 safeguards because they present catastrophic risks, computer use might exacerbate those risks. We judge that it’s likely better to introduce computer use now, while models still only need AI Safety Level 2 safeguards. This means we can begin grappling with any safety issues before the stakes are too high, rather than adding computer use capabilities for the first time into a model with much more serious risks.
In this spirit, our Trust & Safety teams have conducted extensive analysis of our new computer-use models to identify potential vulnerabilities. One concern they've identified is “prompt injection”—a type of cyberattack where malicious instructions are fed to an AI model, causing it to either override its prior directions or perform unintended actions that deviate from the user's original intent. Since Claude can interpret screenshots from computers connected to the internet, it’s possible that it may be exposed to content that includes prompt injection attacks.
Those using the computer-use version of Claude in our public beta should take the relevant precautions to minimize these kinds of risks. As a resource for developers, we have provided further guidance in our reference implementation.
As with any AI capability, there’s also the potential for users to intentionally misuse Claude’s computer skills. Our teams have developed classifiers and other methods to flag and mitigate these kinds of abuses. Given the upcoming U.S. elections, we’re on high alert for attempted misuses that could be perceived as undermining public trust in electoral processes. While computer use is not sufficiently advanced or capable of operating at a scale that would present heightened risks relative to existing capabilities, we've put in place measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites. We will continuously evaluate and iterate on these safety measures to balance Claude's capabilities with responsible use during the public beta.
Consistent with our standard approach to data privacy, by default we don’t train our generative AI models on user-submitted data, including any of the screenshots Claude receives.
The future of computer use
Computer use is a completely different approach to AI development. Up until now, LLM developers have made tools fit the model, producing custom environments where AIs use specially-designed tools to complete various tasks. Now, we can make the model fit the tools—Claude can fit into the computer environments we all use every day. Our goal is for Claude to take pre-existing pieces of computer software and simply use them as a person would.
There’s still a lot to do. Even though it’s the current state of the art, Claude’s computer use remains slow and often error-prone. There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.
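The "flipbook" limitation can be illustrated with a small sampling example: anything that appears and disappears between two screenshots is never seen. The screenshot interval and event timings below are hypothetical, chosen only to show the effect.

```python
def captured(event_start: float, event_end: float, shot_times: list) -> bool:
    """True if at least one screenshot is taken while the event is visible."""
    return any(event_start <= t <= event_end for t in shot_times)


shot_times = [0.0, 2.0, 4.0, 6.0]  # hypothetical: one screenshot every 2 seconds
toast = (2.5, 3.5)   # notification visible for 1 s, entirely between screenshots
dialog = (3.0, 5.0)  # dialog visible for 2 s, overlapping the screenshot at 4.0

print(captured(*toast, shot_times))   # False
print(captured(*dialog, shot_times))  # True
```

Any event shorter than the gap between screenshots can be missed entirely, while an event at least as long as the gap is always caught by some screenshot.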
Even while we were recording demonstrations of computer use for today’s launch, we encountered some amusing errors. In one, Claude accidentally clicked to stop a long-running screen recording, causing all footage to be lost. In another, Claude suddenly took a break from our coding demo and began to peruse photos of Yellowstone National Park.
We expect that computer use will rapidly improve to become faster, more reliable, and more useful for the tasks our users want to complete. It’ll also become much easier to implement for those with less software-development experience. At every stage, our researchers will be working closely with our safety teams to ensure that Claude’s new capabilities are accompanied by the appropriate safety measures.
We invite developers who try computer use in our public beta to contact us with their feedback using this form, so that our researchers can continue to improve the usefulness and safety of this new capability.
PALANTIR ====
11 / 07 / 2024
Anthropic and Palantir Partner to Bring Claude AI Models to AWS for U.S. Government Intelligence and Defense Operations
DENVER--(BUSINESS WIRE)-- Anthropic and Palantir Technologies Inc. (NYSE: PLTR) today announced a partnership with Amazon Web Services (AWS) to provide U.S. intelligence and defense agencies access to the Claude 3 and 3.5 family of models on AWS. This partnership allows for an integrated suite of technology to operationalize the use of Claude within Palantir’s AI Platform (AIP) while leveraging the security, agility, flexibility, and sustainability benefits provided by AWS.
The partnership facilitates the responsible application of AI, enabling the use of Claude within Palantir’s products to support government operations such as processing vast amounts of complex data rapidly, elevating data driven insights, identifying patterns and trends more effectively, streamlining document review and preparation, and helping U.S. officials to make more informed decisions in time-sensitive situations while preserving their decision-making authorities. Claude became accessible within Palantir AIP on AWS earlier this month.
With Palantir's AIP, customers can now operationalize Claude using an integrated suite of technology, facilitated by Amazon SageMaker, an accredited fully managed service, and hosted on Palantir's Impact Level 6 (IL6) accredited environment, supported by AWS. Palantir and AWS are among a limited number of companies to receive the Defense Information Systems Agency (DISA) IL6 accreditation requiring some of the strictest security protocols.
“Our partnership with Anthropic and AWS provides U.S. defense and intelligence communities the tool chain they need to harness and deploy AI models securely, bringing the next generation of decision advantage to their most critical missions," said Shyam Sankar, Chief Technology Officer, Palantir. “Palantir is proud to be the first industry partner to bring Claude models to classified environments. We’ve already seen firsthand the impact of these models with AIP in the commercial sector: for example, one leading American insurer automated a significant portion of their underwriting process with 78 AI agents powered by AIP and Claude, transforming a process that once took two weeks into one that could be done in three hours. We are now providing this same asymmetric AI advantage to the U.S. government and its allies."
"We're proud to be at the forefront of bringing responsible AI solutions to U.S. classified environments, enhancing analytical capabilities and operational efficiencies in vital government operations. Access to Claude 3 and Claude 3.5 within Palantir AIP on AWS will equip U.S. defense and intelligence organizations with powerful AI tools that can rapidly process and analyze vast amounts of complex data. This will dramatically improve intelligence analysis and enable officials in their decision-making processes, streamline resource intensive tasks and boost operational efficiency across departments," said Kate Earle Jensen, Head of Sales and Partnerships, Anthropic.
“We are excited to partner with Anthropic and Palantir and offer new generative AI capabilities that will drive innovation across the public sector. At AWS, we are committed to providing public sector customers and partners with the most secure, innovative, and comprehensive set of cloud services,” said Dave Levy, VP, Worldwide Public Sector, AWS.
About Palantir Technologies Inc.
Foundational software of tomorrow. Delivered today. Additional information is available at www.palantir.com.
Source: Palantir Technologies Inc.