How Smart Does My AI Need to Be?
When ChatGPT was first released to the public, it ran on an earlier version of GPT-3.5, not the version of 3.5 we know and love today. For many people, that was their first experience with generative AI. And for most, it felt like magic.
As a regular user of ChatGPT and a host of other generative AI tools, I continue to be amazed when I ask GPT a question and it quickly churns out a well-written response, or I ask it to complete a task and, sure enough, it does as I instruct.
But for many other users, the initial burst of excitement around generative AI tools wore off rather quickly, either as they got hands-on experience or heard stories about where the tools *failed*. I’ve heard the complaints: ChatGPT didn’t know the person I was asking about. It wouldn’t follow my directions. It couldn’t complete the task. It made things up. It lied to me.
Pre-ChatGPT, we didn’t expect to be able to talk to a machine using natural language. But with a small amount of exposure, our expectations evolved and then skyrocketed. Now, only shortly after the release of these tools, most users seem to think AI should be able to know and do absolutely anything without fail.
Unfortunately, we’re not there yet.
Over the last 18 months, we’ve seen incremental improvements to the original ChatGPT, first with the upgrade to the GPT-3.5 Turbo model, then to GPT-4, and now to GPT-4o. Meanwhile, a handful of competitors entered the scene, including Google’s PaLM and then Gemini, Anthropic’s Claude 2 and Claude 3, and an incredible number of specialized language models. With each of these upgrades or alternative models, the same concerns remain: When a language model answers me, is it fact or fiction? Can I trust it to do what I asked?
That brings us back to our title question. Whether we’ve been burned by a GPT fail or have read about someone else’s experience, we all want to know… How smart does my AI need to be?
First, it’s important to recognize that the term ‘smart’ is relative, and the answer to the above question will depend on the task required.
The initial wave of AI releases had people thinking that one model would take the lead over others and handle everything perfectly, but the reality is that there are good reasons to use different AI products for different assignments. Knowledge and accuracy are certainly important, but, in most tasks, they must be weighed against cost and performance.
For the simple stuff…
If your task includes summarization, editing, or consuming existing text and providing feedback or review, you *probably* don’t need the biggest and best available language model. Small models do remarkably well when you provide them with the information they need to act on, along with simple, common directions about what to do. Your choice of model for this type of task may come down to what’s easily accessible. Free versions of the big chat tools are the obvious options, but less commonly used models like Mistral and Llama are well suited to these tasks too.
This is why RAG (Retrieval-Augmented Generation) systems, like Betty, are performing so well and becoming more common (more on RAG systems here). When language models are fed information to use, they are grounded in ways that make them significantly more accurate and reliable, even when the information doesn’t directly answer the question (we’ll explain that in a separate blog post!).
However, within a RAG system there is still a significant benefit to using a model at the top end of the AI market. A GPT-3.5 system can consume content and answer questions, but if you want to control exactly how those questions are answered, you should use a more advanced model. At Betty, we regularly test newer, smaller, faster models, and while they still produce exciting, almost magical results, they do *not* follow instructions nearly as well.
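To make the grounding idea concrete, here is a minimal sketch of the "feed the model the information it needs" step. It is an illustration only, not how Betty or any particular product works: the function names (`retrieve`, `build_grounded_prompt`) are hypothetical, and the simple word-overlap scoring stands in for the vector search a real RAG system would typically use. The resulting prompt would then be sent to whichever language model you choose.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Return the top_k documents sharing the most words with the question.

    Assumption: word overlap is a toy stand-in for real semantic search.
    """
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(question, documents, top_k=2):
    """Build a prompt that asks the model to answer from retrieved passages."""
    passages = retrieve(question, documents, top_k)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The instruction at the top of the prompt is where model quality shows up: a smaller model may read the passages just fine but be less consistent about sticking to rules like "say so if the answer is not there."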
For the more complex…
If your area of expertise is complex, deep, technical, or involves some ambiguity and nuance, you probably want your AI to keep that in mind when leveraging your content to provide assistance. You may want the AI to ask clarifying questions before answering or completing a task, or to provide specific in-line citations. You may ask the AI to format answers differently based on perceived rather than explicit intent. Lower-end models may be able to handle these instructions, but they are often unreliable, especially when you are feeding them a lot of information. For larger tasks with more complex instructions or context, you should look to the latest and greatest. In our opinion, that includes GPT-4o and Claude 3 Opus (Gemini 1.0 Ultra and 1.5 Pro are not quite there, but 1.5 is rapidly improving).
After we switched to GPT-4, Betty Bot’s intelligence improved remarkably. We have since upgraded to GPT-4-turbo, making her faster and cheaper, and most recently to GPT-4o, which brought major improvements to how she follows instructions. We’ve also tested Claude Opus and believe it to be a very close second. Claude Opus is often considered to be the ‘smartest’, and while it tends to provide extremely high-quality answers, we found that it also had a slightly higher tendency to ignore specific instructions as conversations progressed. In our experience, no other models keep up.
So back to the original question – How smart does YOUR AI need to be? The simple (and maybe generic) answer is that your AI needs to be smart enough to meet expectations for your task. Start by defining your goal and work backwards to determine which AI tool will help you accomplish that goal. If you need a tool that preserves your voice and maintains your reputation for high-quality, reliable answers, you should look to top-of-the-line products. If your task involves a simple body of knowledge or is more focused on creativity than accuracy, lower-tier models should work perfectly fine and minimize cost.
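For readers who like to see a rule of thumb written down, the "define the goal, then work backwards" advice above could be sketched roughly like this. Everything here is a hypothetical illustration: the `Task` fields and the two tier labels are assumptions made for the example, not an official checklist, and real decisions will also weigh cost and speed.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A hypothetical description of what you need the AI to do."""
    name: str
    strict_instructions: bool = False  # citations, formatting rules, clarifying questions
    nuanced_domain: bool = False       # complex, technical, or ambiguous subject matter
    accuracy_critical: bool = False    # your reputation depends on reliable answers


def choose_tier(task):
    """Suggest a model tier from the task's requirements.

    Assumption: any one demanding requirement is enough to justify a
    top-tier model; otherwise a smaller, cheaper model is the starting point.
    """
    if task.strict_instructions or task.nuanced_domain or task.accuracy_critical:
        return "top-tier"
    return "smaller"


if __name__ == "__main__":
    for task in (
        Task("summarize meeting notes"),
        Task("answer member questions with citations", strict_instructions=True),
    ):
        print(f"{task.name}: {choose_tier(task)}")
```

The point of writing it this way is that the task comes first and the model second: you list what the job demands, and only then pick the tier.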
Like all things AI, this is changing fast, but the permanent lesson is that the underlying models matter. If you want to avoid being the next story about AI presenting hallucinations as facts, make sure to leverage the model that will meet your threshold for reliability and intelligence based on the task at hand.
Not sure what’s right for your organization? Book some time with us and we’ll help you figure it out!