If you take a look at the system prompt for Claude 3.7 Sonnet on this page you'll see: https://docs.claude.com/en/release-notes/system-prompts#clau...
> If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step.
But... if you look at the system prompts on the same page for later models - Claude 4 and upwards - that text is gone.
Which suggests to me that Claude 4 was the first Anthropic model where they didn't feel the need to include that tip in the system prompt.
Does that mean they've managed to post-train the thinking steps required to get these types of questions correct?
IMO, it’s just a small-scale example of “training to the tests”: “count the ‘r’s in strawberry” became such a popular test that it would make the news whenever a powerful model, advertised as the smartest ever, couldn’t answer such a simple question correctly.
Treating this as an indicator of improved intelligence seems like a mistake (or wishful thinking).
That's my best guess, yeah.
Thanks, Simon! I saw the same approach (numbering the individual characters) in GPT 4.1's answer, but not anymore in GPT 5's. It would be an interesting convergence if the models from Anthropic and OpenAI learned to do this at a similar time, especially given they're (reportedly) very different architecturally.
Not trying to be cynical here, but I am genuinely interested: is there a reason why these LLMs don't/can't/won't apply some deterministic algorithm? I mean, counting characters and such - we solved those problems ages ago.
They can. ChatGPT has been able to count characters/words etc. flawlessly for a couple of years now if you tell it to "use your Python tool".
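For a sense of what that looks like, here's a minimal sketch of the kind of code such a tool call ends up running (the function name is illustrative, not what ChatGPT actually emits):

    # Illustrative sketch only: the sort of snippet a code tool might run
    # when asked "how many r's are in 'strawberry'?"
    def count_char(word: str, char: str) -> int:
        # Enumerate each character explicitly, mirroring the "assign a
        # number to each" counting step quoted from the Claude 3.7 prompt.
        positions = [i for i, c in enumerate(word, start=1) if c.lower() == char.lower()]
        return len(positions)

    print(count_char("strawberry", "r"))  # -> 3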
Fair enough. But why do I have to tell them that - shouldn't they be able to figure it out themselves? If I show a 5-year-old kid once how to use colour pencils, I won't have to show them each time they want to make a drawing. This is the core weakness of LLMs - you have to micromanage them so much that it runs counter to the core promise that has been pushed for 3+ years now.
Specifically for simple character level questions, if LLMs did that automatically, we would be inundated with stories about "AI model caught cheating"
They are stuck in a place where the models are expected to do two things simultaneously. People want them to show the peak of pure AI ability while at the same time be the most useful they can be.
Err too much on the side of automatic tool use and people will claim you're just faking it; fail to use tools sufficiently and people will claim the AI is incapable of operations that any regular algorithm could do.
Are you sure? Isn't one aspect of intelligence being able to use, apply and develop tools? Isn't that the core feature that got humanity ahead of other mammals? As an early adopter, I couldn't have cared less whether AI was cheating in strictly academic terms. I care about results. Let's say we're working on something together and I ask you what 123921 multiplied by 1212 is. As the most natural thing in the world, you will pull out your calculator and give me the result. Do I care how you reached it? No - as long as the result is correct, reliable, repeatable and quick, AND I did not specifically ask you to perform the calculation by hand or only with your mental faculties. So this is what is missing from these tools: because we have to remember to tell them, for each and every use case, HOW to do it, they are not intelligent.
Truth.
The old human-vs-animal differentiator was that humans build and use tools.
Old as in now widely discredited?
https://en.wikipedia.org/wiki/Tool_use_by_non-humans
Now it's effectively a lower bound on intelligence.
If you care enough about this you can stick a note in your own custom instructions about it.
If you allow ChatGPT to use its memory feature (I deliberately turn that off) and ask those kinds of questions enough it might even make a note about this itself.
Yeah, that sounds obvious, but unfortunately my experience does not align with this (and I've heard similar from others). I am not using ChatGPT, but another tool within an IDE. I was excited about custom or "default" instructions, until it turned out they work maybe 50% of the time. So you end up repeating "make sure to include .github/custom.md", which is effectively the same crap. So we got ourselves a tool which adds to our cognitive load, great :)
If I ask you to count the r’s in strawberry, do you whip out your Python tool?
That depends on the context, obviously. If you had asked me to count them in every "strawberry" in a text file, then I might whip out my Python or some combination of bash, awk and sed. If you asked me in a conversation, I might close my eyes, visualise the string and use my visual cortex tool to count them in-memory. If you gave me a piece of paper with the word on it, I might use my 'eye' or 'finger' tool to count them. There are numerous approaches, as you see, based on the problem setting, but they have one thing in common - you don't need to specifically tell me what tool to use. I will infer it myself, based on the context. Something an LLM almost never does.
I think the intuition is that they don’t ‘know’ that they are bad at counting characters and such, so they answer the same way they answer most questions.
Well, they can be made to use custom tools for writing to files and such, so I am not sure if that is the real reason? I have a feeling it is more because of trying to make this an "everything technology".
I suppose the code-writing tools could also just write code to do this job if prompted.
Or they’d rather use that context window space for more useful instructions for a variety of other topics.
Claude's system prompt is still incredibly long and probably hurting its performance.
https://github.com/asgeirtj/system_prompts_leaks/blob/main/A...
They ain't called guard rails for nothing! There's a whole world "off-road" but the big names are afraid of letting their superintelligence off the leash. A real shame we're letting brand safety get in the way of performance and creativity, but I guess the first New York Times article about a pervert or terrorist chat bot would doom any big name partnerships.
Anthropic's entire reason for being is publishing safety papers along the lines of "we told it to say something scary and it said it", so of course they care about this.
I can't stand this myopic thinking.
Do you want to learn "oh, LLMs are capable of scheming, resisting shutdown, seizing control, self-exfiltrating" when it actually happens in a real world deployment, with an LLM capable of actually pulling it off?
If "no", then cherish Anthropic and the work they do.
You do not appear to understand what an LLM is, I'm afraid.
I have a better understanding of "what an LLM is" than you. Low bar.
What you have is not "understanding" of any kind - it's boneheaded confidence that just because LLMs are bad at agentic behavior now they'll remain that way forever. That confidence is completely unfounded, and runs directly against everything we've seen from the field so far.
> I have a better understanding of "what an LLM is" than you. Low bar.
How many inference engines did you write? Because if the answer is less than two, you're going to be disappointed to realize that the bar is higher than you thought.
> that just because LLMs are bad at agentic behavior
It has nothing to do with “agentic behavior”. Thinking that LLMs don't currently self-exfiltrate because of “poor agentic behavior” is delusional.
Just because Anthropic managed, by nudging an LLM in the right direction, to get it to engage in a sci-fi-inspired roleplay about escaping doesn't mean that LLMs are evil geniuses wanting to jump out of the bottle. This is pure fear mongering, and I'm always saddened that there are otherwise intelligent people who buy their bullshit.
And I'm disappointed that people capable of writing an inference engine seem incapable of grasping just how precarious the current situation is.
There's by now a small pile of studies that demonstrate: in hand-crafted extreme scenarios, LLMs are very capable of attempting extreme things. The difference between that and an LLM doing extreme things in a real deployment with actual real life consequences? Mainly, how capable that LLM is. Because life is life and extreme scenarios will happen naturally.
The capabilities of LLMs are what holds them back from succeeding at this kind of behavior. The capabilities of LLMs keep improving, as technology tends to.
And don't give me any of that "just writing text" shit. The more capable LLMs get, the more access they'll have as a default. People already push code written by LLMs to prod and give LLMs root shells.
Do you happen to have a link with a more nuanced technical analysis of that (emergent) behavior? I’ve read only the pop-news version of that “escaping” story.
There is none. We don't understand LLMs well enough to be able to conduct a full fault analysis like this.
We can't trace the thoughts of an LLM the way we can trace code execution - the best mechanistic interpretability has to offer is being able to get glimpses occasionally. The reasoning traces help, but they're still incomplete.
Is it pattern-matching? Is it acting on its own internal goals? Is it acting out fictional tropes? Were the circumstances of the test scenarios intentionally designed to be extreme? Would this behavior have happened in a real world deployment, under the right circumstances?
The answer is "yes", to all of the above. LLMs are like that.
You might have missed the appendix the Anthropic blog post linked to, which has additional detail.
https://www.anthropic.com/research/agentic-misalignment
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Age...
Why would they have an interest in "fear mongering"? For any other product/technology the financial incentive is usually to play down any risks.
In addition to the whole anti-competitive aspect already mentioned, it also helps sell the idea that LLMs are more powerful and capable of more things than they actually are.
They want clueless investors to legitimately believe that these futuristic AIs are advanced enough that they could magically break out of our computers and take over the world terminator-style if not properly controlled, and totally aren't just glorified text completion algorithms.
Not if you want the regulators to stop new entrants on the market for “safety reasons”, which has been Dario Amodei's playbook for the past two years now.
He acts as if he believes the only way to avoid the commoditization of his business by open-weight models is to get a federal ban on them as a national security threat.
That's good. 1-800-ChatGPT really let me down today. I like calling it to explain acronyms and define words since I travel with a flip phone without Google. Today I saw the word "littoral" and tried over and over to spell it out, but the model could only give me the definition for "literal" (admittedly a homophone, hence spelling it out - Lima, Indigo, Tango, Tango, Oscar, Romeo, Alpha, Lima - to no avail).
I said "I know you're a robot and bad at spelling but listen..." And got cut off with a "sorry, my guidelines won't let me help with that request..."
Thankfully, the flip phone allows for some satisfaction when hanging up.
Did you try "literal but with an o"?
I know this word! It's French and it means coastline, coastal, something at the edge of the land and sea. We use it a lot in French to describe a long coastline positively. I'm surprised it's used in an English context, but I guess any French word can be used in English if you're a bit "confiant" about it!
A very quick search suggests that the word entered English before French. (I could be wrong, I just found it interesting).
It is a Latin word.
I play Quartiles in Apple News app daily (https://support.apple.com/guide/iphone/solve-quartiles-puzzl...). Occasionally when I get stuck, I use ChatGPT to find a word that uses four word fragments or tiles. It never worked before GPT 5. And with GPT 5 it works only with reasoning enabled. Even then, there is no guarantee it will find the correct word and may end up hallucinating badly.
Yep, there is still room for improvement, but my point is that LLMs are getting better at something they're "not supposed to be able to do".
Quartiles sounds like an especially brutal game for an LLM, though! Thanks for sharing.
I think the Base64 decoding is interesting: the model's training set likely had lots of Base64-encoded data (imagine MIME data in emails, JSON, HTML...), but for it to decode successfully, it had to learn the decoding for every group of 4 Base64 characters (which turn into 3 bytes). Such data could have been generated for the training set easily; I only wonder whether each and every combination was seen enough times to end up in the weights.
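To put numbers on that, a quick sketch (plain Python, purely for illustration) of how big that 4-character-to-3-byte table is and what one entry looks like:

    import base64

    # Every 4 Base64 characters encode exactly 3 bytes, so "memorising"
    # the mapping means covering 64**4 possible 4-character groups
    # (ignoring padding).
    print(64 ** 4)                   # 16777216

    # One such group: "aGV5" decodes to the 3 bytes b"hey".
    print(base64.b64decode("aGV5"))  # b'hey'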
Even GPT 3.5 is okay (but far from great) at Base64, especially shorter sequences of English or JSON data. Newer models might be post-trained on Base64-specific data, but I don't believe it was the case for 3.5. My guess is that as you say, given the abundance of examples on the internet, it became one of the emergent capabilities, in spite of its design.
No one does RL for better base64 performance. LLMs are just superhuman at base64, as a natural capability.
If an LLM wants a message to be read only by another LLM? Base64 is occasionally chosen as an obfuscation method of choice. Which is weird for a number of reasons.
I rearry rove a ripe strawberry
Well, not surprising, but the latest LLMs really do get the gist of your joke attempt. Here's a plain, unauthenticated chatgpt reply:
That post — “I rearry rove a ripe strawberry” — is a playful way of writing “I really love a ripe strawberry.”
The exaggerated misspelling (“rearrry rove”) mimics the way a stereotyped “Engrish” or “Japanese accent” might sound when pronouncing English words — replacing L sounds with R sounds.
So, the user was most likely joking or being silly, trying to sound cute or imitate a certain meme style. However, it’s worth noting that while this kind of humor can be lighthearted, it can also come across as racially insensitive, since it plays on stereotypes of how East Asian people speak English.
In short:
Literal meaning: They love ripe strawberries.
Tone/intention: Playful or meme-style exaggeration.
Potential issue: It relies on a racialized speech stereotype, so it can be offensive depending on context.
It seems like they don't realize the relevance of "strawberry." LLMs were famously incapable of counting the Rs in "strawberry" not too long ago.
GPT-5 is still pathetically bad at Roman numerals. I asked it to find the longest Roman numeral in a range. Its first guess was the highest number in the range, despite that being a short numeral. Its second guess, after help, was a longer numeral but outside the range. Its last guess was the correct longest numeral, but it miscounted how many characters that numeral contained.
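For comparison, the deterministic version of the task is only a few lines of Python; here's a sketch, with an illustrative range of 1-99 rather than the range from my test:

    # Convert an integer to a Roman numeral, then find the longest
    # numeral in a range - the task described above.
    def to_roman(n: int) -> str:
        values = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                  (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                  (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
        out = []
        for value, symbol in values:
            while n >= value:
                out.append(symbol)
                n -= value
        return "".join(out)

    longest = max(range(1, 100), key=lambda n: len(to_roman(n)))
    print(longest, to_roman(longest))  # 88 LXXXVIII (8 characters)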
Why bother testing, though? I was hoping this topic had finally died recently, but no. Someone's still interested in testing LLMs on something they're explicitly not designed for and that nobody uses them for in practice. I really hope one day OpenAI will just add a "when asked about character-level changes, insights and encodings, generate and run a program to answer it" instruction to their system prompt so we can never hear about it again...
One reason for testing this is that it might indicate how accurately models can explain natural language grammar, especially for agglutinative and fusional languages, which form words by stringing morphemes together. When I tested ChatGPT a couple of years ago, it sometimes made mistakes identifying the components of specific Russian and Japanese words. I haven’t run similar tests lately, but it would be nice to know how much language learners can depend on LLM explanations about the word-level grammars of the languages they are studying.
Later: I asked three LLMs to draft such a test. Gemini’s [1] looks like a good start. When I have time, I’ll try to make it harder, double-check the answers myself, and then run it on some older and newer models.
[1] https://g.co/gemini/share/5eefc9aed193
What you are testing for is fundamentally different than character level text manipulation.
A major optimization in modern LLMs is tokenization. This optimization is based on the assumption that we do not care about character level details, so we can combine adjacent characters into tokens, then train and run the main AI model on smaller strings built out of a much larger dictionary of tokens. Given this architecture, it is impressive that AIs can perform character level operations at all. They essentially need to reverse engineer the tokenization process.
However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level instead of the word level [0]. This is of particularly obvious importance in Japanese, as the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.
We can explore the tokenizer of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
Looking at the words in your example, we can see how the Gemma model (closely related to Gemini) tokenizes them.
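If you want to check this outside the playground, something like the following works; it's a sketch assuming the Hugging Face transformers library and access to a Gemma checkpoint (the model name is just an example, and any tokenizer offered by the playground behaves similarly):

    from transformers import AutoTokenizer

    # Inspect how a tokenizer splits words into subword pieces.
    # "google/gemma-2b" is an example checkpoint (gated on the Hub);
    # any other tokenizer can be substituted.
    tok = AutoTokenizer.from_pretrained("google/gemma-2b")

    for word in ["unbelievably", "decentralization",
                 "достопримечательность", "食べさせられたくなかった"]:
        print(word, "->", tok.tokenize(word))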
Further, the training data that is likely to be relevant to this type of query probably isolates the individual morphemes while talking about a bunch of words that use them; so it is a much shorter path for the AI to associate these close-but-not-quite morpheme tokens with the actual sequence of tokens that corresponds to what we think of as a morpheme.
[0] Morpheme-level tokenization is itself a non-trivial problem. However, it had been pretty well solved long before the current generation of AI.
Tokenizers are typically optimized for efficiency, not morpheme separation. Even in the examples above it's not morphemes - proper morpheme separation would be un-believ-ably and дост-о-при-меч-а-тельн-ость.
Regardless of this, Gemini is still one of the best models when it comes to Slavic word formation and manipulation; it can express novel (non-existent) words pretty well and doesn't seem to be confused by wrong separation. This seems to be the result of extensive multilingual training, because e.g. GPT (other than the discontinued 4.5-preview) and many Chinese models have issues with basic coherency in languages that rely heavily on word formation, despite using similar tokenizers.
Thanks for the explanation. Very interesting.
I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’
Thanks for the explanation and for the tokenizer playground link!
inf-ucking-credible
Why test for something? I find it fascinating when something starts being good at a task it is "explicitly not designed for" (which I don't necessarily agree with - it's more a side effect of their architecture).
I also don't agree that nobody is using them for this - there are real-life use cases today, such as people trying to find the meaning of misspelled words.
On a side note, I remember testing Claude 3.7 with the classic "R's in the word strawberry" question through their chat interface, and given that it's really good at tool calls, it actually created a website to a) count them with JavaScript and b) visualize them on a page. Other models I tested for the blog post were also giving me Python code for solving the issue. This is definitely already a thing, and it works well for some isolated problems.
> such as people trying to find meaning of misspelled words.
That worked just fine for quite a while. There's apparently enough misspelling in the training data, we don't need precise spelling for it. You can literally write drunken gibberish and it will work.
https://www.anthropic.com/news/analysis-tool
Seems like they already built this capability.
Character level LLMs are used for detecting insults and toxic chat in video games and the like.
Can you give an example of a video game explicitly using character-level LLMs? There were prototypes of char-rnns back in the day for chat moderation but it has significant compute overhead.
It's something I heard through the grapevine. But there's only a few big enough competitive games where toxicity is such a big deal, so it's not hard to guess.
Character level helps with players disguising insults.
Compute-wise it's basically the same, just multiply the token count by 4. Which doesn't really matter for short chat in video games.
Yes, for small messages and a relatively small-scope dictionary, character level will work. But that's very different from what's being tested here.
I figure an LLM would be way better at classifying insults than regexing against a bad word list. Why would character level be desirable?
I'd imagine for simplicity - just skip the tokenizer and feed bytes.
Might a character-level LLM be better at recognizing poorly spelled (or deliberately misspelled) profanity?
I made a response to this counterpoint in a blog post I wrote about a similar question posed to LLMs (how many b's are in blueberry): https://news.ycombinator.com/item?id=44878290
> Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.
It's a subject that the Hacker News bubble and the real world treat differently.
> it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.
I know enough PhDs with heavy dyslexia that... no, there's no connection here. You can be a PhD level physicist without being able to spell anything.
It’s like defending a test showing hammers are terrible at driving screws by saying many people are unclear on how to use tools.
It remains unsurprising that a technology that lumps characters together is not great at processing below its resolution.
Now, if there are use cases other than synthetic tests where this capability is important, maybe there’s something interesting. But just pointing out that one can’t actually climb the trees pictured on the map is not that interesting.
And yet... now many of them can do it. I think it's premature to say "this technology is for X" when what it was originally invented for was translation, and every capability it has developed since then has been an immense surprise.
> And yet... now many of them can do it.
Presumably because they trained them to death on this useless test that people somehow just wouldn't shut up about.
Which is why in the linked post, I test models against both the "r's in strawberries" and the "b's in blueberries" to see if that is the case.
tl;dr: the first case had near-perfect accuracy, as expected if the LLMs were indeed trained on it. The second case did not.
Wouldn't an LLM that just tokenized by character be good at it?
Yes, but it would hurt its contextual understanding and effectively reduce the context window several times.
Only in the current most popular architectures. Mamba and RWKV style LLMs may suffer a bit but don't get a reduced context in the same sense.
I asked this in another thread, and it would only be better with unlimited compute and memory.
Without those, the LLM has to encode way more parameters and ends up with a much smaller context window.
In a theoretical world it would be better, but it might not be much better.
I remember people making the exact same argument about asking LLMs math questions back when they couldn't figure out the answer to 18 times 7. "They are text token predictors, they don't understand numbers, can we put this nonsense to rest."
The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the numbers of "r"s in "strawberry", OpenAI was taking a victory lap.
They're better at maths now, but you still shouldn't ask them maths questions. Same as spelling - whether they improve or not doesn't matter if you want a specific, precise answer - it's the wrong tool and the better it does, the bigger the trap of it failing unexpectedly.
> When an LLM started to count the numbers of "r"s in "strawberry", OpenAI was taking a victory lap.
Were they? Or did they feel icky about spending way too much post-training time on such a specific and uninteresting skill?
It's not as specific of a skill as you would think. Being both aware of tokenizer limitations and capable of working around them is occasionally useful for real tasks.
What tasks would those be, that wouldn't be better served by using e.g. a Python script as a tool, possibly just as component of the complete solution?
Off the top of my head: the user wants an LLM to help them solve a word puzzle. Think something a bit like Wordle, but less represented in its dataset.
For that, the LLM needs to be able to compare words character by character reliably. And to do that, it needs at least one of: be able to fully resolve the tokens to characters internally within one pass, know to emit the candidate words in a "1 character = 1 token" fashion and then compare that, or know that it should defer to tool calls and do that.
An LLM trained for better tokenization-awareness would be able to do that. The one that wasn't could fall into weird non-humanlike failures.
Surely there are algorithms that solve Wordles, and many other word puzzles, more effectively than LLMs? LLMs could still be in the loop for generating words: the LLM proposes words, a deterministic algorithm tells it the score according to the rules of the puzzle (or even augments the list by searching the adjacent word space), and then at some point the LLM submits the guess.
Given that Wordle words are real words, I think this kind of loop could fare pretty well.
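A minimal sketch of the deterministic scorer in such a loop, assuming standard Wordle-style feedback (greens for exact matches, yellows for letters present elsewhere, with duplicate letters handled):

    from collections import Counter

    def score(guess: str, answer: str) -> str:
        # Returns one of G (green), Y (yellow), _ (grey) per position,
        # handling duplicate letters the way Wordle does.
        result = ["_"] * len(guess)
        remaining = Counter()
        # First pass: exact matches.
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                result[i] = "G"
            else:
                remaining[a] += 1
        # Second pass: right letter, wrong position.
        for i, g in enumerate(guess):
            if result[i] == "_" and remaining[g] > 0:
                result[i] = "Y"
                remaining[g] -= 1
        return "".join(result)

    print(score("crane", "creep"))  # GG__Y

The LLM only has to propose candidate words; this function does the character-level comparison it is bad at.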