Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past.
Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's models in general. It not only is smarter than me on most of the subjects I engage with it, it also isn't completely obsequious. The model pushes back on me rather than contorting itself to find a way to agree.
100% of my casual AI usage is now in Gemini and I look forward to asking it questions on deep topics because it consistently provides me with insight. I am building new tools with the mind to optimize my usage to increase it's value to me.
After comparing Gemini Pro and Claude Sonnet 3.7 coding answers side by side a few times, I decided to cancel my Anthropic subscription and just stick to Gemini.
One of the main advantages Anthropic currently has over Google is the tooling that comes with Claude Code. It may not generate better code, and it has a lower complexity ceiling, but it can automatically find and search files, and figure out how to fix a syntax error fast.
As another person that cancelled my Claude and switched to Gemini, I agree that Claude Code is very nice, but beyond some initial exploration I never felt comfortable using it for real work because Claude 3.7 is far too eager to overengineer half-baked solutions that extend far beyond what you asked it to do in the first place.
Paying real API money for Claude to jump the gun on solutions invalidated the advantage of having a tool as nice as Claude Code, at least for me, I admit everyone's mileage will vary.
Exactly my experience as well. Started out loving it but it almost moves too fast - building in functionality that i might want eventually but isn't yet appropriate for where the project is in terms of testing, or is just in completely the wrong place in the architecture. I try to give very direct and specific prompts but it still has the tendency to overreach. Of course it's likely that with more use i will learn better how to rein it in.
I've experienced this a lot as well. I also just yesterday had an interesting argument with claude.
It put an expensive API call inside a useEffect hook. I wanted the call elsewhere and it fought me on it pretty aggressively. Instead of removing the call, it started changing comments and function names to say that the call was just loading already fetched data from a cache (which was not true). I could not find a way to tell it to remove that API call from the useEffect hook, It just wrote more and more motivated excuses in the surrounding comments. It would have been very funny if it weren't so expensive.
Geez, I'm not one of the people who think AI is going to wake up and wipe us out, but experiences like yours do give me pause. Right now the AI isn't in the drivers seat and can only assert itself through verbal expression, but I know it's only a matter of time. We already saw Cursor themselves get a taste of this. To be clear I'm not suggesting the AI is sentient and malicious - I don't believe that at all. I think it's been trained/programmed/tuned to do this, though not intentionally, but the nature of these tools is they will surprise us
> but the nature of these tools is they will surprise us
Models used to do this much much more than now, so what it did doesn't surprise us.
The nature of these tools is to copy what we have already written. It has seen many threads where developers argue and dig in, they try to train the AI not to do that but sometimes it still happens and then it just roleplays as the developer that refuses to listen to anything you say.
I almost fear more that we'll create Bender from Futurama than some superintelligent enlightened AGI. It'll probably happen after Grok AI gets snuck some beer into its core cluster or something absurd.
Earlier this week a Cursor AI support agent told a user they could only use Cursor on one machine at a time, causing the user to cancel their subscription.
I wanted some powershell code to do some sharepoint uploading. It created a 1000 line logging module that allowed me to log things at different levels like info, debug, error etc. Not really what I wanted.
This morning I tweaked my Open Codex config to also try gemma3:27b-it-qat - and Google’s olen source small is excellent: runs fast enough for a good dev experience, with very good functionality.
Typing `//use this as reference ai` in one file and `//copy this row to x ai!` and it will add those functions/files to context and act on both places. Altough I wish Aider would write `working on your request...` under my comment, now I have to keep Aider window in sight. Autocomplete and "add to context" and "enter your instructions" of other apps feel clunky.
I don't understand the appeal of investing in leaning and adapting your workflow to use an AI tool that is so tightly coupled to a single LLM provider, when there are other great AI tools available that are not locked to a single LLM provider. I would guess aider is the closest thing to claude code, but you can use pretty much any LLM.
The LLM field is moving so fast that what is the leading frontier model today, may not be the same tomorrow.
There are at least 10 projects currently aiming to recreate Claude Code, but for Gemini. For example, geminicodes.co by NotebookLM’s founding PM Raiza Martin
Tried Gemini Codes yesterday, as well as anon-kode and anon-codex. Gemini Codes is already broken and appears to be rather brittle (she disclosures as much), and the other two appear to still need some prompt improvements or someone adding vector embedding for them to be useful?
Perhaps someone can merge the best of Aider and codex/claude code now. Looking forward to it.
Google need to fix their Gemini web app at a basic level. It's slow, gets stuck on Show Thinking, rejects 200k token prompts that are sent one shot. Aistudio is in much better shape.
+1 on this. Improving Gemini apps and live mode will go such a long way for them. Google actually has the best model line-up now but the apps and APIs hold them back so much.
Uploading files on google is now great. I uploaded my python script and the text data files I was using the script to process. I asked it how best to optimize the code. It actually ran the python code on the data files. Then recommended changes then when prompted ran the script again to show the new results. At first I was like maybe hallucinating but no the data was correct.
Yes. Any API Key is allowed, Also you can assign different LLMs for different modes. It is great for cost-optimization. Like architect, code, ask, debug etc.
Only Claude (to my knowledge) has a desktop app which can directly, and usually quite intelligently, modify files and create repos on your desktop. It's the only "agentic" option among the major players.
"Claude, make me an app which will accept Stripe payments and sell an ebook about coding in Python; first create the app, then the ebook."
It would take a few passes but Claude could do this; obviously you can't do that with an API alone. That capability alone is worth $30/month in my opinion.
But there are third party options availabe that to the very same thing (e.g. https://aider.chat/ ) which allow you to plug in a model (or even a combination thereof e.g. deepseek as architect and claude as code writer) of your choice.
Therefore the advantage of the model provider providing such a thing doesn't matter, no?
> It would take a few passes but Claude could do this;
I'm sorry but absolutely nothing I've seen from using Claude indicates that you could give it a vague prompt like that and have it actually produce anything worth reading.
Can it output a book's worth of bullshit with that prompt? Yes. But if you think "write a book about Python" is where we are in the state of the art in language models in terms of the prompt you need to get a coherent product, I want some of whatever you are smoking because that has got to be the good shit
It looks the same, but for some reason Claude Code is much more capable. Codex got lost in my source code and hallucinated bunch of stuff, Claude on the same task just went to town, burned money and delivered.
Of course, this is only my experience and codex is still very young. I really hope it becomes as capable as Claude.
Part of it is probably tgat claude is just better at coding than what openai has available. I am considering trying to hack in support for gemini into codex and play around with it.
Also the "project" feature in claude improves experience significantly for coder, where you can customize your workflow. Would be great if gemini has this feature.
Yes, IME, Anthropic seemed to be ahead of Google by a decent amount with Sonnet 3.5 vs 1.5 Pro.
However, Sonnet 3.7 seemed like a very small increase, whereas 2.5 Pro seemed like quite a leap.
Now, IME, Google seems to be comfortably ahead.
2.5 Pro is a little slow, though.
I'm not sure which model Google uses for the AI answers on search, but I find myself using Search for a lot of things I might ask Gemini (via 2.5 Pro) if it was as fast as Search's AI answers.
I've been using Gemini 2.5 and Claude 3.7 for Rust development and I have been very impressed with Claude, which wasn't the case for some architectural discussions where Gemini impressed with it's structure and scope. OpenAI 4.5 and o1 have been disappointing in both contexts.
Gemini doesn't seem to be as keen to agree with me so I find it makes small improvements where Claude and OpenAI will go along with initial suggestions until specifically asked to make improvements.
I have noticed Gemini not accepting an instruction to "leave all other code the same but just modify this part" on a code that included use of an alpha API with a different interface than what Gemini knows is the correct current API. No matter how I promoted 2.5 pro, I couldn't get it to respect my use of the alpha API, it would just think I must be wrong.
So I think patterns from the training data are still overriding some actual logic/intelligence in the model. Or the Google assistant fine-tuning is messing it up.
I have been using gemini daily for coding for the last week, and I swear that they are pulling levers and A/B testing in the background. Which is a very google thing to do. They did the same thing with assistant, which I was a pretty heavy user of back in the day (I was driving a lot).
I have had a few epic refactoring failures with Gemini relative to Claude.
For example: I asked both to change a bunch of code into functions to pass into a `pipe` type function, and Gemini truly seemed to have no idea what it was supposed to do, and Claude just did it.
Maybe there was some user error or something, but after that I haven’t really used Gemini.
I’m curious if people are using Gemini and loving it are using it mostly for one-shotting, or if they’re working with it more closely like a pair programmer? I could buy that it could maybe be good at one but bad at the other?
This has been my experience too. Gemini might be better for vibe coding or architecture or whatever, but Claude consistently feels better for serious coding. That is, when I know exactly how I want something implemented in a large existing codebase, and I go through the full cycle of implementation, refinement, bug fixing, and testing, guiding the AI along the way.
It also seems to be better at incorporating knowledge from documentation and existing examples when provided.
My experience has been exactly the opposite - Sonnet did fine on trivial tasks, but couldn't e.g. fix a bug end-to-end (from bug description in the tracker to implementing the fix and adding tests) properly because it couldn't understand how the relevant code worked, whereas Gemini would consistently figure out the root cause and write decent fix & tests.
Perhaps this is down to specific tools and their prompts? In my case, this was Cursor used in agent mode.
Or perhaps it's about the languages involved - my experiments were with TypeScript and C++.
> Gemini would consistently figure out the root cause and write decent fix & tests.
I feel like you might be using it differently to me. I generally don't ask AI to find the cause of a bug, because it's quite bad at that. I use it to identify relevant parts of the code that could be involved in the bug, and then I come up with my own hypotheses for the cause. Then I use AI to help write tests to validate these hypotheses. I mostly use Rust.
I used to use them mostly in "smart code completion" mode myself until very recently. But with all the AI IDEs adding agentic mode, I was curious to see how well that fares if I let it drive.
And we aren't talking about trivial bugs here. For TypeScript, the most impressive bug it handled to date was an async race condition due to missing await causing a property to be overwritten with invalid value. For that one I actually had to do some manual debugging and tell it what I observed, but given that info, it was able to locate the problem in the code all by itself and fix it correctly and come up with a way to test it as well.
For C++, the codebase in question was gdb, the bug was a test issue, and it correctly found problematic code based solely on the test log (but I had to prod it a bit in the right direction for the fix).
I should note that this is Gemini Pro 2.5 specifically. When I tried Google's models previously (for all kinds of tasks), I was very unimpressed - it was noticeably worse than other SOTA models, so I was very skeptical going into this. Indeed, I started with Sonnet precisely because my past experience indicated that it was the best option, and I only tried Gemini after Sonnet fumbled.
I use it for basically everything I can, not just code completion, including end-to-end bug fixes when it makes sense. But most of the time even the current Gemini and Claude models fail with the hard things.
It might be because most bugs that you would encounter in other languages don't occur in the first place in Rust because of the stronger type system. The race condition one you mentioned wouldn't be possible for example. If something like that would occur, it's a compiler error and the AI fixes it while still in the initial implementation stage by looking at the linter errors. I also put a lot of effort into trying to use coding patterns that do as much validation as possible within the type system. So in the end all that's left are the more difficult bugs where a human is needed to assist (for now at least, I'm confident that the models are only going to get better).
Race conditions can span across processes (think async process communication).
That said I do wonder if the problems you're seeing are simply because there isn't that much Rust in the training set for the models - because, well, there's relatively little of it overall when you compare it to something like C++ or JS.
I've found that I need to point it to the right bit of logs or test output and narrow its attention by selectively adding to it's context. Claude 3.7 at least works well this way. If you don't it'll fumble around. Gemini hasn't worked as well for me though.
I partly wonder if different peoples prompt styles will lead to better results with different models.
I also cancelled my Anthropic yesterday, not because of Gemini but because it was the absolute worst time for Anthropic to limit their Pro plan to upsell their Max plan when there is so much competition out there
Manus.im also does code generation in a nice UI, but I’ll probably be using Gemini and Deepseek
Google has killed so many amazing businesses -- entire industries, even, by giving people something expensive for free until the competition dies, and then they enshittify hard.
It's cool to have access to it, but please be careful not to mistake corporate loss leaders for authentic products.
It's not free. And it's legit one of the best models. And it was a Google employee who was among the authors of the paper that's most recognized as kicking all this off. They give somewhat limited access in AIStudio (I have only hit the limits via API access, so I don't know what the chat UI limits are.) Don't they all do this? Maybe harder limits and no free API access. But I think most people don't even know about AIStudio.
True. They are ONLY good when they have competition. The sense of complacency that creeps in is so obvious as a customer.
To this day, the Google Home (or is it called Nest now?) speaker is the only physical product i've ever owned where it lost features over time. I used to be able to play the audio of a Youtube video (like a podcast) through it, but then Google decided that it was very very important that I only be able to play a Youtube video through a device with a screen, because it is imperative that I see a still image when I play a longform history podcast.
Obviously, this is a silly and highly specific example, but it is emblematic of how they neglect or enshittify massive swathes of their products as soon as the executive team loses interest and puts their A team on some shiny new object.
The experience on Sonos is terrible. There are countless examples of people sinking 1000s of dollars into Sonos ecosystem, and the new app update has rendered them useless.
I'm experiencing the same problem with my Google Home ecosystem. One day I can turn off the living room lights with the simple phrase "Turn off Living Room Lights," and then randomly for two straight days it doesn't understand my command
Preach it my friend. For years on the Google Home Hub (or Nest Hub or whatever) I could tell it to "favorite my photo" of what is on the screen. This allowed me to incrementally build a great list of my favorite photos on Google Photos and added a ton of value to my life. At some point that broke, and now it just says, "Sorry, I can't do that yet". Infuriating
The usage limit for experimental gets used up pretty fast in a vibe-coding situation. I found myself setting up an API account with billing enabled just to keep going.
How would I know if it’s useful to me without being able to trial it?
Googles previous approach (Pro models available only to Gemini Advanced subscribers, and Advanced trials can’t be stacked with Google One paid storage, or rather they convert the already paid storage portion to a paid, much shorter Advanced subscription!) was mind-bogglingly stupid.
Having a free tier on all models is the reasonable option here.
In this case, Google is a large investor in Anthropic.
I agree that giving away access to expensive models long term is not a good idea on several fronts. Personally, I subscribe to Gemini Advanced and I pay for using the Gemini APIs.
EDIT: a very good deal, at $10/month is https://apps.abacus.ai/chatllm/ that gives you access to almost all commercial models as well as the best open weight models. I have never come close at all to using my monthly credits with them. If you like to experiment with many models the service is a lot of fun.
The problem with tools like this is that somewhere in the chain between you and the LLM are token reducing “features”. Whether it’s the system prompt, a cheaper LLM middleman, or some other cost saving measure.
You’ll never know what that something is. For me, I can’t help but think that I’m getting an inferior service.
You can self host something like https://big-agi.com/ and grab your own keys from various providers. You end up with the above, but without the pitfalls you mentioned.
BIG-AI does look cool, and supports a different use case. ABACUS.AI takes your $10/month and gives you credits that go towards their costs of using OpenAI, Anthropic, Gemini, etc. Use of smaller open models use very few credits.
The also support an application development framework that looks interesting but I have never used it.
You might be correct about cost savings techniques in their processing pipeline. But they also add functionality: they bake web search into all models which is convenient. I have no affiliation with ABACUS.AI, I am just a happy customer. They currently let me play with 25 models.
Just look at Chrome to see the bard/gemini's future. HN folks didn't care about Chrome then but cry about Google's increasingly hostile development of Chrome.
Look at Android.
HN behaviour is more like a kid who sees the candy, wants the candy and eats as much as it can without worrying about the damaging effect that sugar will have on their health. Then, the diabetes diagnosis arrives and they complain
It's good to be aware of the likelihood of astroturfing. Everytime there's a newthread like this for one of the companies, there's a suspicious amount of similar plausible praise in an otherwise (sometimes brutally) skeptical forum.
The best way to shift or create consensus is to make everyone think everyone else's opinion has already shifted, and that consensus is already there. Emperor's new clothes etc..
Google can be evil and release impressive language models. The same way as Apple releasing incredible hardware with good privacy while also being a totally insufferable and arrogant company.
Another useful word in this context is “sycophancy,” meaning excessive flattery or insincere agreement. Amanda Askell of Anthropic has used it to describe a trait they try to suppress in Claude:
The second example she uses is really important. You (used to) see this a lot in stackoverflow where an inexperienced programmer asks how to do some convoluted thing. Sure, you can explain how to do the thing while maintaining their artificial constraints. But much more useful is to say "you probably want to approach the problem like this instead". It is surely a difficult problem and context dependent.
Lots of folks in tech have different opinions than you may expect. Many will either keep quiet or play along to keep the peace/team cohesion, but you really never know if they actually agree deep down.
Their career, livelihoods, ability to support their families, etc. are ultimately on the line, so they'll pay lip service if they have to. Consider it part of the job at that point; personal beliefs are often left at the door.
Not just tech. I spent some time on a cattle ranch (long story) and got to know some people pretty well. Quite a few confided interests and opinions they would never share at work, where the culture also has strong expectations of conformity.
It's a bit of a fancy way to say "yes man". Like in corporations or politics, if a leader surrounds themselves with "yes men".
A synonym would be sycophantic which would be "behaving or done in an obsequious way in order to gain advantage." The connotation is the other party misrepresents their own opinion in order to gain favor or avoid disapproval from someone of a higher status. Like when a subordinate tries to guess what their superior wants to hear instead of providing an unbiased response.
I think that accurately describes my experience with some LLMs due to heavy handed RLHF towards agreeableness.
In fact, I think obsequious is a better word since it doesn't have the cynical connotation of sycophant. LLMs don't have a motive and obsequious describes the behavior without specifying the intent.
Yeah, it is very close. But I feel simp has a bit of a sexual feel to it. Like a guy who does favors for a girl expecting affection in return, or donates a lot of money to an OnlyFans or Twitch streamer. I also see simp used where we used to call it white-knighting (e.g. "to simp for").
Obsequious is a bit more general. You could imagine applying it to a waiter or valet who is annoyingly helpful. I don't think it would feel right to use the word simp in that case.
In my day we would call it sucking up. A bit before my time (would sound old timey to me) people called it boot licking. In the novel "Catcher in the Rye", the protagonist uses the word "phony" in a similar way. This kind of behavior is universally disliked so there is a lot slang for it.
I wonder if anyone here will know this one;
I learned the word "obsequious" over a decade ago while working the line of a restaurant. I used to listen to the 2p2 (2 plus 2) poker podcasts during prep and they had a regular feature with David Sklansky (iirc) giving tips, stories, advice etc. This particular one he simply gave the word "obsequious" and defined it later. I remember my sous chef and I were debating what it could mean and I guessed it right. I still can't remember what it had to do with poker, but that's besides the point.
I didn't hear that one but I am a fan of Sklansky. And I also have a very vivid memory of learning the word, when I first heard the song Turn Around by They Might Be Giants. The connection with the song burned it into my memory.
I think here it's referring to a common problem where the AI agrees with your position too easily, and/or changes it's answer if you tell it the answer is wrong instantly (therefore providing no stable true answer if you asked it something about a fact)?
Using Claude code and Codex CLI and then Aider with Gemini 2.5 pro, Aider is much faster because you feed in the files instead of using tools to start doing all kinds of whole know what spending 10x the tokens. I tried a relatively simple refactor which needed around 7 files changed, only Aider with 2.5 got it and in the first shot. Where as both Codex and Claude code completely fumbled it
I was a big fan of that model but it has been replaced in AI Studio by its preview version, which, by comparison, is pretty bad. I hope Google makes the release version much closer to the experimental one.
I can confirm the model name in Run Settings has been updated to "Gemini 2.5 Pro Preview ..." when it used to be "Gemini 2.5 Pro (Experimental) ...".
I cannot confirm if the quality is downgraded since I haven't had enough time with it. But if what you are saying is correct, I would be very sad. My big fear is the full-fat Gemini 2.5 Pro will be prohibitively expensive, but a dumbed down model (for the sake of cost) would also be saddening.
For many workplaces, it's not just that that don't pay for a service, it's that using it is against policy. If I tried to paste some code into ChatGPT, for example, our data loss prevention spyware would block it and I'd soon be having an uncomfortable conversation with our security team.
Have you tried Grok 3? It's a bit verbose for my taste even when prompted to be brief but answers seem better/more researched and less opinionated. It's also more willing to answer questions where the other models block an answer.
I have not tried any of the Grok models but that is probably because I am rarely on X.
I have to admit I have a bias where I think Google is "business" while Grok is for lols. But I should probably take the time to asses it since I would prefer to have an opinion based on experience rather than vibes.
A lot of people don't want to patronize the businesses of an unabashed Nazi sympathizer. There are more important things in life than model output quality.
I had a very interesting long debate/discussion with Gemini 2.5 Pro about the Synapse-Evolve bank debacle among other things. It really feels like debating a very knowledgeable and smart human.
> 100% of my casual AI usage is now in Gemini and I look forward to asking it questions on deep topics because it consistently provides me with insight.
It's probably great for lots of things but it doesn't seem very good for recent news. I asked it about recent accusations around xAI and methane gas turbines and it had no clue what I was talking about. I asked the same question to Grok and it gave me all sorts of details.
>It's probably great for lots of things but it doesn't seem very good for recent news.
You are missing the point here. The LLM is just the “reasoning engine” for agents now. Its corpus of facts are meaningless, and shouldn’t really be relied upon for anything. But in conjunction with a tool calling agentic process, with access to the web, what you described is now trivially doable. Single shot LLM usage is not really anything anyone should be doing anymore.
I'm just discussing the GP's topic of casual use. Casual use implies heading over to an already-hosted prompt and typing in questions. Implementing my own 'agentic process' does not sound very casual to me.
That’s all fine and dandy, but if you google anything related to llm agents, you get 1000 answers to 100 questions, companies hawking their new “visual programming” agent composers, and a ton of videos of douchebags trying to be the Steve Jobs of AI. The concept I’m sure is fine, but execution of agentic anything is still the Wild Wild West and nobody knows what they’re really doing.
Indeed there is a mountain of snake oil out there at this point, but the underlying concepts are extremely simple, and can be implemented directly without frameworks.
obsequious is such a nice word for this context, only possible in the AI age.
i'd find the same word improper to describe human beings - other words like plaintive, obedient and compliant often do the job better and are less obscure.
Just be aware that if you don't add a key (and set up billing) youre granting Google the right to train on your data. To have persons read them and decide how to use them for training.
> To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account, API key, and Cloud project before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Unpaid Services.
They don't charge anything extra for code execution, you just pay for input and output tokens. The above example used 10 input, 1,531 output which is $0.15/million for input and $3.50/million output for Gemini 2.5 Flash with thinking enabled, so 0.536 cents (just over half a cent) for this prompt.
This is so much cheaper than re-prompting each tool use.
I wish this was extended to things like: you could give the model an API endpoint that it can call to execute JS code, and the only requirement is that your API has to respond within 5 seconds (maybe less actually).
I wonder if this is what OpenAI is planning to do in the upcoming API update to support tools in o3.
I imagine there wouldn’t bd much of a cost to the provider on the API call there so much longer times may be possible. It’s not like this would hold up the LLM in any way, execution would get suspended while the call is made and the TPU/GPU will serve another request.
They need to keep KV cache to avoid prompt reprocessing, so they would need to move it to ram/nvme during longer api calls to use gpu for another request
This common feature requires the user of the API to implement the tool, in this case, the user is responsible to run the code the API outputs. The post you replied suggests that Gemini will run the code for the user behind the API call.
I wish Gemini could do this with Go. It generates plenty of junk/non-parseable code and I have to feed it the error messages and hope it properly corrects it.
100% agree.
I had Gemini flash 2 chew through thousands of points of nasty unstructured client data and it did a 'better than human intern' level conversion into clean structured output for about $30 of API usage. I am sold.
2.5 pro experimental is a different league though for coding. I'm leveraging it for massive refactoring now and it is almost magical.
> thousands of points of nasty unstructured client data
What I always wonder in these kinds of cases is: What makes you confident the AI actually did a good job since presumably you haven't looked at the thousands of client data yourself?
It's the same problem factories have: they produce a lot of parts, and it's very expensive to put a full operator or more on a machine to do 100% part inspection. And the machines aren't perfect, so we can't just trust that they work.
So starting in the 1920s Walter Shewhart and Edward Deming came up with Statistical Process Control. We accept the quality of the product produced based on the variance we see of samples, and how they measure against upper and lower control limits.
Based on that, we can estimate a "good parts rate" (which later got used in ideas like Six Sigma to describe the probability of bad parts being passed).
The software industry was built on determinism, but now software engineers will need to learn the statistical methods created by engineers who have forever lived in the stochastic world of making physical products.
I hope you're being sarcastic. SPC is necessary because mechanical parts have physical tolerances and manufacturing processes are affected by unavoidable statistical variations; it is beyond idiotic to be provided with a machine that can execute deterministic, repeatable processes and then throw that all into the gutter for mere convenience, justifying that simply because "the time is ripe for SWE to learn statistics"
In my case I had hundreds of invoices in a not-very-consistent PDF format which I had contemporaneously tracked in spreadsheets. After data extraction (pdftotext + OpenAI API), I cross-checked against the spreadsheets, and for any discrepancies I reviewed the original PDFs and old bank statements.
The main issue I had was it was surprisingly hard to get the model to consistently strip commas from dollar values, which broke the csv output I asked for. I gave up on prompt engineering it to perfection, and just looped around it with a regex check.
Otherwise, accuracy was extremely good and it surfaced a few errors in my spreadsheets over the years.
For what it's worth, I did check over many hundreds of them. Formatted things for side by side comparison and ordered by some heuristics of data nastiness.
It wasn't a one shot deal at all. I found the ambiguous modalities in the data and hand corrected examples to include in the prompt. After about 10 corrections and some exposition about the cases it seemed to misundestand, it got really good.
Edit: not too different from a feedback loop with an intern ;)
Though the same logic can be applied to everywhere, right? Even if it's done by human interns, you need to audit everything to be 100% confident or just have some trust on them.
Not sure why you're trying to conflate intellectual capability problems into this and complicate the argument? The problem layout is the same. You delegate the works to someone so you cannot understand all the details. This makes a fundamental tension between trust and confidence. Their parameters might be different due to intellectual capability, but whoever you're going to delegate, you cannot evade this trade-off.
BTW, not sure if you have experiences of delegating some works to human interns or new grads and being rewarded by disastrous results? I've done that multiple times and don't trust anyone too much. This is why we typically develop review processes, guardrails etc etc.
You can use AI to verify its own work. Last time I split a C++ header file into header + implementation file. I noticed some code got rewritten in a wrong manner, so I asked it to compare the new implementation file against the original header file, but to do so one method at a time. For each method, say whether the code is exactly the same and has the same behavior, ignoring superficial syntax changes and renames. Took me a few times to get the prompt right, though.
It also depends on what you are using the data for, if it's for non (precise) data based decisions then it's fine. Specially if you looking for "vibe" based decisions before then dedicating time to "actually" process the data for confirmation.
30$ to get an view into data that would take at least x many hours of someone's time is actually super cheap, specially if the decision of that result is then to invest or not invest the x many hours to confirm it.
For 2.5 pro exp I've been attaching files into AIStudio in the browser in some cases. In others, I have been using vscode's Gemini Code Assist which I believe recently started using 2.5 Pro. Though at one point I noticed that it was acting noticeably dumber, and over in the corner, sure enough it warned that it had reverted to 2.0 due to heavy traffic.
For the bulk data processing I just used the python API and Jupyter notebooks to build things out, since it was a one-time effort.
Absolutely agree. Granted, it is task dependent. But when it comes to classification and attribute extraction, I've been using 2.0 Flash with huge access across massive datasets. It would not be even viable cost wise with other models.
It's cheap but also lazy. It sometimes generates empty strings or empty arrays for tool calls, and then I just re-route the request to a stronger model for the tool call.
I've spent a lot of time on prompts and tool-calls to get Flash models to reason and execute well. When I give the same context to stronger models like 4o or Gemini 2.5 Pro, it's able to get to the same answers in less steps but at higher token cost.
Which is to be expected: more guardrails for smaller, weaker models. But then it's a tradeoff; no easy way to pick which models to use.
Instead of SQL optimization, it's now model optimization.
There are tons of AI/ML use-cases where 7% is acceptable.
Historically speaking, if you had a 15% word error rate in speech recognition, it would generally be considered useful. 7% would be performing well, and <5% would be near the top of the market.
Typically, your error rate just needs to be below the usefulness threshold and in many cases the cost of errors is pretty small.
In my case, I have workloads like this where it’s possible to verify the correctness of the result after inference, so any success rate is better than 0 as it’s possible to identify the “good ones”.
Aren’t you basically just saying you are able to measure the error rate? I mean that’s good, but already a given in this scenario where hes reporting the 7% error rate.
No. If you're able to verify correctness of individual items of work, you can accept the 93% of verified items as-is and send the remaining 7% to some more expensive slow path.
That's very different from just knowing the aggregate error rate.
No, it's anything that's harder to write than verify. A simple example is a logic puzzle; it's hard to come up with a solution, but once you have a possible answer it's really easy to check it. In fact, it can be easier to vet multiple answers and tell the machine to try again than solve it once manually.
low stakes text classification but it's something that needs to be done and couldnt be done in reasonable time frames or at reasonable price points by humans
I expect some manual correction after the work is done. I actually mentally counted all the times I pressed backspace while writing this paragraph, and it comes down to 45. I'm not counting the next paragraph or changing the number.
Humans make a ton of errors as well. I didn't even notice how many I was making here until I started counting it. AI is super useful to just write get a first draft out, not for the final work.
Yeah, general propaganda and psyops are actually more effective around 12% - 15%, we find it is more accurate to the user base, thus is questioned less for standing out more /s
I know it's a single data point, but yesterday I showed it a diagram of my fairly complex micropython program, (including RP2 specific features, DMA and PIO) and it was able to describe in detail not just the structure of the program, but also exactly what it does and how it does it. This is before seeing a single like of code, just going by boxes and arrows.
The other AIs I have shown the same diagram to, have all struggled to make sense of it.
It’s not surprising. What was surprising honestly was how they were caught off guard by OpenAI. It feels like in 2022 just about all the big players had a GPT-3 level system in the works internally, but SamA and co. knew they had a winning hand at the time, and just showed their cards first.
True and their first mover advantage still works pretty well. Despite "ChatGPT" being a really uncool name in terms of marketing. People remember it because they were the first to wow them.
Google always has been winning the AI race as soon as DeepMind was properly put to use to develop their AI models, instead of the ones that built Bard (Google AI team).
I have to say, I never doubted it would happen. They've been at the forefront of AI and ML for well over a decade. Their scientists were the authors of the "Attention is all you need" paper, among thousands of others. A Google Scholar search produces endless results. There just seemed to be a disconnect between the research and product areas of the company. I think they've got that worked out now.
They're getting their ass kicked in court though, which might be making them much less aggressive than they would be otherwise, or at least quieter about it.
Everybody else also trains on ChatGPT data, have you never heard of public ChatGPT conversation data sets? Yes they trained on ChatGPT data. No it's not "just".
I think it's the small TPM limits. I'll be way under the 10-30 requests per minute while using Cline, but it appears that the input tokens count towards the rate limit so I'll find myself limited to one message a minute if I let the conversation go on for too long, ironically due to Gemini's long context window. AFAIK Cline doesn't currently offer an option to limit the context explosion to lower than model capacity.
There is no reason to expect the other entrants in the market to drop out and give them monopoly power. The paid tier is also among the cheapest. People say it’s because they built their own their inference hardware and are genuinely able to serve it cheaper.
I use Gemini 2.5 pro experimental via openrouter in my openwebui for free. Was using sonnet 3.7 but I don't notice much difference so just default to the free thing now.
It’s not clear to me what either the “race” or “winning” is.
I use ChatGPT for 99% of my personal and professional use. I’ve just gotten used to the interface and quirks. It’s a good consumer product that I like to pay $20/month for and use. My work doesn’t require much in the way of monthly tokens but I just pay for the OpenAI API and use that.
Is that winning? Becoming the de facto “AI” tool for consumers?
Or is the race to become what’s used by developers inside of apps and software?
The race isn’t to have the best model (I don’t think) because it seems like the 3rd best model is very very good for many people’s uses.
Mostly brand recognition and the earlier Geminis had more refusals.
As a consumer, I also really miss the Advanced voice mode of ChatGPT, which is the most transformative tech in my daily life. It's the only frontier model with true audio-to-audio.
Its more so that almost every company is running a classifier on their web chat's output.
It isn't actually the model refusing, but rather if the classifier hits a threshold, it'll swap the model's out with "Sorry, let's talk about something else."
This is most apparent with DeepSeek. If you use their web chat with V3 and then jailbreak it, you'll get uncensored output but it is then swapped with "Let's talk about something else" halfway through the output. And if you ask the model, it has no idea its previous output got swapped and you can even ask it build on its previous answer. But if you use the API, you can push it pretty far with a simple jailbreak.
These classifiers are virtually always ran on a separate track, meaning you cannot jailbreak them.
If you use an API, you only have to deal with the inherent training data bias, neutering by tuning and neutering by pre-prompt. The last two are, depending on the model, fairly trivial to overcome.
I still think the first big AI company that has the guts to say "our LLM is like a pen and brush, what you write or draw with it is on you" and publishes a completely unneutered model will be the one to take a huge slice of marketshare. If I had to bet on anyone doing that, it would be xAI with Grok. And by not neutering it, the model will perform better in SFW tasks too.
You can turn off those, Google lets you decide how much it censors you can completely turn it off.
It has separate sliders for sexually explicit, hate, dangerous and harassment. It is by far the best at this, since sometimes you want those refusals/filters.
What do you mean miss? You don’t have the budget to keep something you truly miss for $20? What am in missing here / I don’t mean to criticize I am just curious is all. I would reword but I have to go
They used to be, but not anymore, not since Gemini Pro 2.5. Their "deep research" offering is the best available on the market right now, IMO - better than both ChatGPT and Claude.
Sorry, but no. Gemini isn't the fastest horse, yet.
And it's use within their ecosystem means it isn't geared to the masses outside of their bubble. They are not leading the race but they are a contender.
LLM's whole thing is language. They make great translators and perform all kinds of other language tasks well, but somehow they can't interpret my English language prompts unless I go to school to learn how to speak LLM-flavored English?
You have the right perspective. All of these people hand waving away the core issue here don't realize their own biases. Some of the best these things tout as much as 97% accuracy on tasks but if a person was completely randomly wrong at 3% of what they say you'd call an ambulance and no doctor would be able to diagnose their condition (the kinds of errors that people make with brain injuries are a major diagnostic tool and the kinds of errors are known for major types of common injuries ... Conversely there is no way to tell within an LLM system if any specific token is actually correct or not and its incorrectness is not even categorizable.)
I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non engineering person. You have to be more verbose to someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided so they fail hard like a junior engineer would at a complicated task with no experience.
I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non engineering person. You have to be more verbose to someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided so they fail hard like a junior engineer would at a complicated task with no experience.
It's a natural language processor, yes. It's not AGI. It has numerous limitations that have to be recognized and worked around to make use of it. Doesn't mean that it's not useful, though.
Its because google hasn't realized the value of training the model on information about its own capabilities and metadata. My biggest pet peeve about google and the way they train these models.
One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generated 2D bounding boxes of relevant subjects, but it can also create segmentation masks! https://ai.google.dev/gemini-api/docs/image-understanding#se...
At this price point with the Flash model, creating segmentation masks is pretty nifty.
I've had mixed results with the bounding boxes even on 2.5 pro. On complex images where a lot of boxes need to be drawn they're in the general region but miss the exact location of objects.
No, the speed of YOLO/DETR inference makes it cheap as well - probably at least five or six orders of magnitude cheaper.
Edit: After some experimentation, Gemini also seems to not perform nearly as well as a purpose-tuned detection model.
It'll be interesting to test this capability and see how it evolves though. At some point you might be able use it as a "teacher" to generate training data for new tasks.
YOLO is probably still cheaper if bounding boxes are your main goal. Good segmentation models that work for arbitrary labels, however, are much more expensive to set up and run, so this type of approach could be an interesting alternative depending on performance.
Well no. You can run/host YOLO which means not having to submit potentially sensitive information to a company that generates a large amount of revenue from targeted advertising.
For a non programmer like me google is becoming shockingly good. It is giving working code the first time. I was playing around with it asked it to write code to scrape some data of a website to analyse. I was expecting it to write something that would scrape the data and later I would upload the data to it to analyse. But it actually wrote code that scraped and analysed the data. It was basic categorizing and counting of the data but I was not expecting it to do that.
That's the opposite experience of my wife who's in tech but also a non programmer. She wanted to ask Gemini to write code to do some basic data analysis things in a more automated way than Excel. More than once, Gemini wrote a long bash script where some sed invocations are just plain wrong. More than once I've had to debug Gemini-written bash scripts. As a programmer I knew how bash scripts aren't great for readability so I told my wife to ask Gemini to write Python. It resulted in higher code quality, but still contained bugs that are impossible for a non programmer to fix. Sometimes asking a follow up about the bugs would cause Gemini to fix it, but doing so repeatedly will result in Gemini forgetting what's being asked or simply throwing an internal error.
Currently IMO you have to be a programmer to use Gemini to write programs effectively.
IMO, the only thing that’s consistent about AIs is how inconsistent they are. Sometimes, I ask them to write code and I’m shocked at how well it works. Other times, I feel like I’m trying to explain to a 5-year-old Alzheimer’s patient what I want and it just can’t seem to do the simplest stuff. And it’s the same AI in both cases.
The AIs like many things out there work like an "evil genie". They'll give you what you asked for. The problem is typically that users ask for the wrong thing.
I've noticed beginners make mistakes like using singular terms when they should have used plural ("find the bug" vs "find the bugs"), or they fail to specify their preferred platform, language, or approach.
You mentioned your wife is using Excel, which is primarily used on Windows desktops and/or with the Microsoft ecosystem of products such as Power BI, PowerShell, Azure, SQL Server, etc...
Yet you mention she got a bash script using sed, both of which are from the Linux / GNU ecosystem. That implies that your wife didn't specify that she wanted a Microsoft-centric solution to her problem!
The correct answer here would have likely to have been to use Microsoft Fabric, which is an entire bag of data analysis and reporting tools that has data pipelines, automation, publishing, etc...
Or... just use the MashUp engine that's built-in to both Excel and PowerBI, which allows a surprisingly complex set of text, semi-structured, and tabular data processing. It can re-run the import and update graphs and charts with the new data.
PS: This is similar to going up to a Node.js programmer with a request. It doesn't matter what it is, they will recommend writing JavaScript to solve the problem. Similarly, a C++ developer will reach for C++ to solve everything they're asked to do. Right now, the AIs strongly prefer Linux, JavaScript, and especially Python for problem solving, because that's the bulk of the open-source code they were trained with.
I had similar experiences few months back that is why I am saying it is becoming shockingly good the 2.5 is a lot better than the 2.0 version. Another thing I have realized just like google search in the past your query has a lot to do with the results you get. So an example of what you want works at getting better results
> I am saying it is becoming shockingly good the 2.5 is a lot better than the 2.0 version
Are you specifically talking about 2.5 Flash? It only came out an hour ago, I dont know how you would have enough experience with it already to come to your conclusion.
(I am very impressed with 2.5 Pro, but that is a different model that's been available for several weeks now)
I've found that good prompting isn't just about asking for results but also giving hints/advice/direction on how to go about the work.
I suspect that if Gemini is giving you bash scripts it's because you're note giving it enough direction. As you pointed out, telling it to use Python, or giving it more expectations about how to go about the work or how the output should be, will give better results.
When I am prompting for technical or data-driven work, I tend to almost walk through what I imagine the process would be, including steps, tools, etc...
It must have something to do with the way your wife is prompting. I've noticed this with my friends too. I usually get working code from Gemini 2.5 Pro on the first try, and with a couple of follow-up prompts, it often improves significantly, while my friends seem to struggle communicating their ideas to the AI and get worse results.
If you're going to need scripts like that every week - sure. If you need it once a year on average... not likely. There's a huge amount of things we could learn but do them so infrequently that we outsource it to other people.
This is one case where I've found writing code with LLMs to be effective.
With some unfamiliar tool I don't care about too much (e.g. GitHub Actions YAML or some build script), I just want it to work, & then focus on other things.
I can spend time to try and come up with something that works; something that's robust & idiomatic.. but, likely I won't be able to re-use that knowledge before I forget it.
With an LLM, I'll likely get just as good a result; or if not, will have a good starting point to go from.
Which Gemini was it? I've been using 2.5 Flash all day for programming ClojureScript via roo code and it's been great. Provided I'm using agent orchestration, a memory bank, and having it write docs for code it will work on.
Ask it to write tests with the code and then ask it to fix the errors from the tests rather than just pointing out bugs. If you have an IDE that supports tool use (Claude Code, Roo Code) it can automate this process.
I want to be able to just tell chat GPT or whatever to create a full project for me, but I know the moment it can do that without any human intervention, I won't be able to find a job.
There is definitely an art to doing it, but the ability is definitely there even if you don't know the language at all.
I have a few programs now that are written in Python (2 by 3.7, one by 2.5) used for business daily, and I can tell you I didn't, and frankly couldn't, check a single line of code. One of them is ~500 LOC, the other two are 2200-2700 LOC.
I've been continually disappointed. I've been told it's getting exponentially better and we won't be able to keep up with how good they get, but I'm not convinced. I'm using them every single day and I'm never shocked or awed by its competence, but instead continually vexxed that isn't not living up to the hype I keep reading.
Case in point: there was a post here recently about implementing a JS algorithm that highlighted headings as you scrolled (side note: can anyone remember what the title was? I can't find it again), but I wanted to test the LLM for that kind of task.
Pretty much no matter what I did, I couldn't get it to give me a solution that would highlight all of the titles down to the very last one.
I knew what the problem was, but even guiding the AI, it couldn't fix the code. I tried multiple AIs, different strategies. The best I could come up with was to guide it step by step on how to fix the code. Even telling it exactly what the problem was, it couldn't fix it.
So this goes out to the "you're prompting it wrong" crowd... Can you show me a prompt or a conversation that will get an AI to spit out working code for this task: JavaScript that will highlighting headings as you scroll, to the very last one. The challenge is to prompt it to do this without telling it how to implement it.
I figure this should be easy for the AI because this kind of thing is very standard, but maybe I'm just holding it wrong?
Even as a human programmer I don't actually understand your description of the problem well enough to be confident I could correctly guess your intent.
What do you mean by "highlight as you scroll"? I guess you want a single heading highlighted at a time, and it should be somehow depending on the viewport. But even that is ambiguous. Do you want the topmost heading in the viewport? The bottom most? Depending on scroll direction?
It seems pretty good. Handles scrolling via all possible ways, does the highlighting at load too so that the highlighting is in effect for the initial viewport too.
The prompt was "write me some javascript that higlights the topmost heading (h1, h2, etc) in the viewport as the document is scrolled in any way".
So I'm thinking your actual requirements are very different than what you actually wrote. That might explain why you did not have much luck with any LLMs.
> Even as a human programmer I don't actually understand your description of the problem well enough to be confident I could correctly guess your intent.
Yeah, you understand what I meant. The code Gemini gave you implements the behavior, and the AI I used gave me pretty much the same thing. There's a problem with the algorithm tho -- if there's a heading too close to the bottom of the page it will never highlight. The page doesn't exhibit the bug because it provides enough padding at the bottom.
But my point wasn't that it couldn't one-shot the code; my point was that I couldn't interrogate it into giving me code that behaved as I wanted. It seemed too anchored to the solution it had provided me, where it said it was offering fixes that didn't do anything, and when I pointed that out it apologized and proceeded to lie about fixing the code again. It appeared to be an infinite loop.
I think what's happened here is the opposite of what you suggest; this is a very common tutorial problem, you can find solutions of the variety you showed me all over the internet, and that's essentially what Gemini gave you. But being tutorial code, it's very basic and tries not to implement a more robust solution that is needed in production websites. When I asked AI for that extra robustness, it didn't want to stray too far from the template, and the bug persisted.
Maybe you can coax it into getting a better result? I want to understand how.
I clearly didn't understand what you meant, because you did in fact have additional unstated requirements that I could not even have imagined existed and were not in any way hinted at by your initial spec.
And I still don't know what you want! Like, you want some kind of special case where the last heading is handled differently. But what kind of special case? You didn't specify. "It's wrong, fix it".
Fix it how? When the page is scrolled all the way to the bottom, should the last heading always be highlighted? That would just move the complaint to the second heading from the bottom if three headings fit on the last screen. Add padding? Can't be that, since it's exactly what this solution already did and you thought it wasn't good enough.
Sorry, I will not be playing another round of this. I don't know if you don't realize how inadequate your specifications are (in which case that's your problem with the LLMs too), or if this is some kind of a bit, but either way it doesn't feel like a good use of my time.
But if your problem is that the LLMs give a bad initial answer, get anchored on it, and can't iterate, just give all the requirements up front. If they're requirements you didn't realize existed until you saw the proposed answer, just start again with a fresh context. That tends to work well for me in Aider.
> I clearly didn't understand what you meant, because you did in fact have additional unstated requirements
Okay, but the AI understood the requirements; It confirmed to me what I intended it to do, but it couldn't produce code that met its textual descriptions. It kept returning the tutorial code.
> You didn't specify. "It's wrong, fix it". Fix it how?
Maybe I wasn't clear here but I'm not replying as if I'm prompting you like an AI. The problem domain is described better in the link in the sibling comment. When the AI gave back the initial code, I had inquired the following:
how does this work when it gets to the bottom. Let's say I have a heading at the very bottom of the page. Will the "active" one be the bottom one or the top?
It replied:
What happens at the bottom of the page?
When you're near the bottom of the page and the last section (section3, say) starts to enter the top of the viewport (within 100px), it becomes the active one — exactly as you'd want.
But here's the catch: if the last section is short, or you scroll all the way to the very bottom (past the heading), then no section's top may be ≤ 100px, especially if the last heading has already scrolled off the top. In that case, the last matched section (likely the one before the last) stays active, which feels wrong.
We can fix this either by:
Making the threshold smarter (e.g., using .bottom or checking if any part of the section is visible).
Or better: use IntersectionObserver to handle this robustly.
Why this is better:
It triggers when a section is entering the viewport (around 30% from top).
Works even when you scroll to the very bottom.
It’s less sensitive to section height or viewport size.
So it identified the edge case, it identified that the behavior is incorrect and what the cause of that is, and it returned code that purportedly fixed this. But the code it returned exhibited exactly the behavior it said "feels wrong". And in interrogating it. I asked it what was broken and we went line by line:
Show me exactly what was changed from this to the last which fixed the problem
Perfect — let's do a line-by-line diff-style breakdown of what changed between the buggy version and the fixed version so we can pinpoint exactly what fixed the issue.
We went line by line and it told me what exactly was wrong and why it's fixed, and confirmed that the provided code produced the expected behavior.
Why this works:
We evaluate all visible headings, not just ones intersecting a line.
We pick the one that’s:
just above the activation line, or
just below it, if none are above
Handles edge cases like top/bottom of scroll
But the code doesn't do this. It continued on like this where it proposed fixes, talked about the solution correctly, but wouldn't give code that implemented the solution.
> But if your problem is that the LLMs give a bad initial answer, get anchored on it, and can't iterate, just give all the requirements up front. If they're requirements you didn't realize existed until you saw the proposed answer, just start again with a fresh context. That tends to work well for me in Aider.
Yeah that's what I tend to do as well. I don't tend to get good satisfying results though, to the point where coding it myself seems like the faster more reliable option. I'll keep trying to hold it better and maybe one day it'll work for me. Until then I'm a skeptic.
> I wish I could try the AI separate from my Google account.
If that's a concern, just create another account. Doesn't even require using a separate browser profile, you can be logged into multiple accounts at once and use the account picker in the top right of most their apps to switch.
yeah but 1) its useful to have the point there on the curve if you need it, 2) intelligence is multidimensional, maybe in 2.5 flash you get qualitatively a better set of capabilities for your needs than 2.5 pro
It does, that point in the tradeoff space was not available until now. Any model that's not dominated by at least one model on both axes will push forward the frontier. (The actual frontier isn't actually a straight line between the points on the frontier like visualized there. It's a step function.)
Genuine naive question: when it comes to Google HN has generally a negative view of it (pick any random story on Chrome, ads, search, web, working at faang, etc. and this should be obvious from the comments), yet when it comes to AI there is a somewhat notable “cheering effect” for Google to win the AI race that goes beyond a conventional appreciation of a healthy competitive landscape, which may appear as a bit of a double standard.
Why is this? Is it because OpenAI is seen as such a negative player in this ecosystem that Google “gets a pass on this one”?
And bonus question: what do people think will happen to OpenAI if Google wins the race? Do you think they’ll literally just go bust?
Most of us weren’t using Gemini pro models (1.0, 1.5, 2.0) but the recent 2.5 pro is such a huge step up. It’s better than 3.7 sonnet for coding. Better than o1, o3-mini models and now o3 and o4-mini. It’s become my daily driver. It does everything I need with almost 100% accuracy, is cheap, fast, 1 million context window, uses google web search for grounding, can fetch YouTube video transcripts, can fetch website content, works in google workspace: Gmail, Docs, Sheets.
Really hard to beat this combo. Oh and if you subscribe to their AI plan it comes with 2 TB drive storage.
Maybe because Google is largely responsible, paying for the research, of most of the results we are seeing now. I'm not a Google fan, in the web side, and in their idea of what software engineering is, but they deserve to win the AI race, because right now all the other players provided a lot less than what Google did as public research. Also, with Gemini 2.5 PRO, there was a big hype moment, because the model is of unseen ability.
Maybe they deserve it but it would be really bad for the world. Because they will enshittify the hell out of it once they're established. That's their MO.
I don't want Google to have a stranglehold over yet another type of online service. So I avoid them.
And things are going so fast now, whatever Google has today that might be better than the rest, in two months the rest will have it too. Of course Google will have something new again. But being 2 months behind isn't a huge deal. I don't have to have the 'winning' product. In fact most of my AI tasks go to an 8b llama 3.1 model. It's about on par with gpt 3.5 but that's fine.
The situation with LLMs is much different than search, Google doesn't have such a large lead here. LLMs are social things, they learn from each other, any provider with SOTA model will see its abilities leaked through synthetic training data. That's what GPT-4 did for a year, against the wishes of OpenAI, powering up millions of open model finetunes.
Gemini is just that good. From my usage it is much smarter than DeepSeek or Claude 3.7 Thinking models.
A lot of Google’s market share across its services comes from the monopoly effects Google has. The quality of Gemini 2.5 is noticeably smarter than its competitors so I see the applause for the quality of the LLM and not for Google.
I think it’s way too early to say anything about who is winning the race. There is still a long way to go; o3 scores highest in Humanity’s Last Exam (https://agi.safe.ai/) at 20%, 2.5 scores 18%.
As a googler working in LLM space, this feels like revisionist history to me haha! I remember a completely different environment only a few months ago when Anthropic was the darling child, and before that it was OpenAI (and for like 4 weeks somewhere in there, it was Deepseek). For literally years at this point, every time Bard or Gemini would make a major release, it would be largely ignored or put down in favor of the next "big thing" OpenAI was doing or Claude saturating coding benchmarks, never mind that Google was often just behind with the exact same tech ready to go, in some cases only missing their demo release by literally 1 day (remember live voice?). And every time this happened, folks would be posting things to the effect of "LOL I can't believe Google is losing the AI race - didn't they invent this?", "this is like Microsoft dropping the ball on mobile", "Google is getting their lunch eaten by scrappy upstarts," etc. I can't lie, it stings a bit when that's what you work on all day.
2.5 was quite good. Not stupidly good like the jump from GPT 2 to 3 or 3.5 to 4, but really good. It was a big jump in ELO and benchmarks. People like it, and I think it's just psychologically satisfying that the player everybody would have expected to win the AI race is currently in the lead. Gemini finally gets a day in the sun.
I'm sure this will change with whenever somebody comes up with the next big idea though. It probably won't take much to beat Gemini in the long run. There is literally zero moat.
I dislike Google rather strongly due to their ad-based business model, and I was previously very skeptical of their AI offerings because of very lackluster performance compared to OpenAI and Claude. But I can't help but be impressed with Gemini Pro 2.5 for "deep research" and agentic coding. I have subscriptions with all three so that I can keep up with SOTA, but if I had to choose only one to keep, right now it'd be Gemini.
That said I still don't "cheer" for them and I would really rather someone else win the race. But that is orthogonal to recognition of observed objective superiority.
I think a lot of us see Google as both an evil advertiser and as an innovator. Google winning AI is sort of nostalgic for those of us who once cheered the "Do No Evil"(now mostly "Do Know Evil") company.
I also like how Google is making quiet progress while other companies take their latest incremental improvement and promote it as hard as they can.
I think for a while some people felt the Google AI models are worse but now its getting much better. On the other hand Google has their own hardware so they can drive down the costs of using the models so it keeps pressure on Open AI do remain cost competitive. Then you have Anthropic which has very good models but is very expensive. But I've heard they are working with Amazon to build a data center with Amazons custom AI chips so maybe they can bring down their costs. In the end all these companies will need a good model and lower cost hardware to succeed.
2.5 Pro is free, and I'm sure there's a lot of people who have just never tried the best models because they don't want to pay for them. So 2.5 Pro probably blows their socks off.
Whereas, if you've been paying for access to the best models from OpenAI and Anthropic all along, 2.5 Pro doesn't feel like such a drastic step-change. But going from free models to 2.5 Pro is a crazy difference. I also think this is why DeepSeek got so much attention so quickly - because it was free.
The key is Gemini being free through AI Studio. This makes their technical improvement more impressive when OpenAI sells their best models at ridiculous prices.
If Google engages in price dumping as a monopolist remains to be seen but it feels like it.
The LLM race is fast paced and no moat has developed. People are switching on a whim if better models (by some margin) show up. When will OpenAI, Anthropic or DeepSeek counter 2.5 Pro? And will it be before Google releases the next Pro?
OpenAI commands a large chunk of the consumer market and they have considerable funds after their last round. They won't fold this or next year.
If Google wants to win this they must come up with a product strategy integrating their search business without seriously damaging their existing search business to much. This is hard.
Because now it has brought real competitions to the field. GPT was the king and Claude had been the only meaningful challenger for a while but OpenAI didn't care about Anthropic but just be obsessed with Google. Gemini took a quite time to set the pipeline so initial version was not enough to push the frontier; you remember the days when Google released a new model, OpenAI just responded with some old models in their silo within a day only to crush them. That does not happen anymore and they're forced to develop a better model.
A lot of the negativity toward Google stems from the fact that they're the big, dominant player in search, ads, browsers, etc., rather than anything that they've done or any particular attribute of the company.
In AI, they're still seen as being behind OpenAI and others, so we don't see the same level of negativity.
I prefer OpenAI and Anthropic big time because they are fresh players with less dominance over other aspects of digital life. Not having to login to an insidious tracker like Google is worth significantly worse performance. Although I have little FOMO here avoiding Gemini because evaluating these models on real world use cases remains quite subjective imo.
More great innovation from Google. OpenAI have two major problems.
The first is Google's vertically integrated chip pipeline and deep supply chain and operational knowledge when it comes to creating AI chips and putting them into production. They have a massive cost advantage at every step. This translates into more free services, cheaper paid services, more capabilities due to more affordable compute, and far more growth.
Second problem is data starvation and the unfair advantage that social media has when it comes to a source of continually refreshed knowledge. Now that the foundational model providers have churned through the common crawl and are competing to consume things like video and whatever is left, new data is becoming increasingly valuable as a differentiator, and more importantly, as a provider of sustained value for years to come.
SamA has signaled both of these problems when he made noises about building a fab a while back and is more recently making noises about launching a social media platform off OpenAI. The smart money among his investors know these issues to be fundamental in deciding if OAI will succeed or not, and are asking the hard questions.
If the only answer for both is "we'll build it from scratch", OpenAI is in very big trouble. And it seems that that is the best answer that SamA can come up with. I continue to believe that OpenAI will be the Netscape of the AI revolution.
The win is Google's for the taking, if they can get out of their own way.
Nobody has really talked about what I think is an advantage just as powerful as the custom chips: Google Books. They already won a landmark fair use lawsuit against book publishers, digitized more books than anyone on earth, and used their Captcha service to crowdsource its OCR. They've got the best* legal cover and all of the best sources of human knowledge already there. Then Youtube for video.
The chips of course push them over the top. I don't know how much Deep Research is costing them but it's by far the best experience with AI I've had so far with a generous 20/day rate limit. At this point I must be using up at least 5-10 compute hours a day. Until about a week ago I had almost completely written off Google.
The amount of text in books is surprisingly finite. My best estimate was that there are ~10¹³ tokens available in all books (https://dynomight.net/scaling/#scaling-data), which is less than frontier models are already being trained on. On the other hand, book tokens are probably much "better" than random internet tokens. Wikipedia for example seems to get much higher weight than other sources, and it's only ~3×10¹⁰ tokens.
Google has the data and has the hardware, not to mention software and infrastructure talent. Once this Bismarck turns around and it looks like it is, who can parry it for real? They have internet.zip and all the previous versions as well, they have youtube, email, search, books, traffic, maps and business on it, phones and habits around it, even the OG social network, the usenet. It's a sleeping giant starting to wake up and it's already causing commotion, let's see what it does when it drinks morning coffee.
Agreed. One of Google's big advantages is the data access and integrations. They are also positioned really well for the "AI as entertainment" sector with youtube which will be huge (imo). They also have the knowledge in adtech and well injecting adds into AI is an obvious play. As is harvesting AI chat data.
Meta and Google are the long term players to watch as Meta also has similar access (Insta, FB, WhatsApp).
More like 5/95 with Microsoft
- and that's being generous, I wouldn't be surprised if it was 1/99. It's basicaly just hip tech companies and a couple of Fortune 500s that use Google Docs. And even their finance departments often use Excel. HN keeps underestimating how the whole physical world runs on Excel.
I still can't understand how google missed on github, especially since they were in the same space before with google code. I do understand how they couldn't make a github though.
Another advantage that Google has is the deep integration of Gemini into Google Office products and Gmail. I was part of a pilot group and got to use a pre-release version and it's really powerful and not something that will be easy for OpenAI to match.
Agreed. Once they dial in the training for sheets it's going to be incredible. I'm already using notebooklm to upload finance PDFs, then having it generate tabular data and copypasta into sheets, but it's a garage solution compared to just telling it to create or update a sheet with parsed data from other sheets, PDFs, docs, etc.
And as far as gmail goes, I periodically try to ask it to unsubscribe from everything marketing related, and not from my own company, but it's not even close to being there. I think there will continue to be a gap in the market for more aggressive email integration with AI, given how useless email has become. I know A16Z has invested in a startup working on this. I doubt Gmail will integrate as deep as is possible, so the opportunity will remain.
I frankly am in doubt of future office products. In the last month I have ditched two separate excel productivity templates in favor of bespoke wrappers on sqlite databases, written by Claude and Gemini. Easier to use and probably 10x as fast.
You don't need a 50 function swiss army knife when your pocket can just generate the exact tool you need.
You say deep integration, yet there is still no way to send a Gemini Canvas to Docs without a lot of tedious copy-pasting and formatting because Docs still doesn’t actually support markdown. Gemini in Google Office in general has been a massive disappointment for all but the most simplistic of writing tasks.
They can have the most advanced infrastructure in the world, but it doesn’t mean much if Google continues its infamous floundering approach to product. But hey, 2.5 pro with Cline is pretty nice.
Maybe I'm misunderstanding, but there is literally a Share button in Canvas right below each response with the option to export to Docs. Within Docs, you can also click on the Gemini "star" at the upper right to get a prompt and then also export into the open document. Note that this is a with "experimental" Gemini 2.5 Pro.
I have access to this now and I want it to work so bad and it's just proper shit. Absolute rubbish.
They really, truly need to fix this integration. Gemini in Google Docs is barely acceptable, it doesn't work at all (for me) in Gmail, and I've not yet had it do anything other than error in Google Sheets.
Sorry but my eyes rolled to the back of my head with this one. This is between two teams with tons of smart contributors, but the difference is one is more flexible and able to take risks vs the other that has many times more researchers and the world's best and most mature infrastructure/tooling. Its not a CEO vs CEO battle
I think it requires a nuanced take but allow me to provide some counter-examples.
The first is CEO pay rates. Another is the highest paid public employees (which tend to be coaches at state schools). This is evidence that the market highly values managers.
Another is systemic failures within enterprises. When Boeing had a few very public plane crashes, a certain narrative suggested that the transition from highly capable engineer managers to financial focus managers contributed to the problem. A similar narrative has been used to explain the decline of Intel.
Consider the return of Steve Jobs to Apple. Or the turn around at Microsoft with Nadella.
All of these are complex cases that don't submit to an easy analysis. Success and failure are definitely multi-factor and rarely can be traced to a single definitive cause.
Perhaps another way to look at it would be: what percentage of the success of highly complex organizations can be attributed to management? To what degree can poor management decisions contribute to the failure of an otherwise capable organization?
How much you choose to weight those factors is entirely up to you.
edit: I was also thinking about the way we think about the advantage of exceptional generals/admirals in military analysis. Or the effect a president can have on the direction of a country.
It is more gut feel than a rational or carefully reasoned argument.
I think Pichai has been an exceptional revenue maximizer but he lacks vision. I think he is probably capable of squeezing tremendous revenue out of AI once it has been achieved.
I like Hassabis in a "good vibe" way when I hear him speak. He reminds me of engineers that I have worked with personally and have gained my respect. He feels less like a product focused leader and more of a research focused leader (AlphaZero/AlphaFold) which I think will be critical to continue the advances necessary to push the envelope. I like his focus on games and his background in RL.
Google's war chest of Ad money gives Hassabis the flexibility to invest in non-revenue generating directions in a way that Altman is unlikely to be able to do. Altman made a decision to pivot the company towards product which led to the exodus of early research talent.
Fair point, and a good reminder not to pass judgement on the actions of others. It is totally possible that Altman made his own prediction of the future and theorized that the only hope he had of competing with the existing big tech companies to realistically achieve an AI for the masses was to show investors a path to profitability.
I should also give Altman a bit more due in that I find his description of a world augmented by powerful AI to be more inspiring than any similar vision I've heard from Pichai.
But I'm not trying to guess their intentions, I am just stating the situation as I see it. And that situation is one where whatever forces have caused it, OpenAI is clearly investing very heavily in product (e.g. windsurf acquisition, even suggesting building a social network). And that shift in focus seems highly correlated with a loss of significant research talent (as well as a healthy dose of boardroom drama).
Note sure why their comment was downvoted. Google the names. Hassabis runs DeepMind at Google which makes Gemini and he's quite brilliant and has an unbelievable track record. Buffet investing in teams points out that there are smart people out there that think good leadership is a good predictor of future success.
Zoogeny got downvoted? I did not do that. His comments deserved more details anyway (at the level of those kindly provided).
> Google the names
Was that a wink about the submission (a milestone from Google)? Read Zoogeny's delightful reply and see whether it can compare a search engine result (not to mention that I asked for Zoogeny's insight, not for trivia). And as a listener to Buffet and Munger, I can surely say that they rarely indulge in tautologies.
I wouldn't worry about downvotes, it isn't possible on HN to downvote direct replies to your message (unlike reddit), so you cannot be accused of downvoting me unless you did so using an alt.
Some people see tech like they see sports teams and they vote for their tribe without considering any other reason. I'm not shy stating my opinion even when it may invite these kinds of responses.
I do think it is important for people to "do their own research" and not take one man's opinion as fact. I recommend people watch a few videos of Hassabis, there are many, and judge his character and intelligence for themselves. They may find they don't vibe with him and genuinely prefer Altman.
I don't know man, for months now people keep telling me on HN how "Google is winning", yet no normal person I ever asked knows what the fuck "Gemini" is. I don't know what they are winning, it might be internet points for all I know.
Actually, some of the people polled recalled the Google AI efforts by their expert system recommending glue on pizza and smoking in pregnancy. It's a big joke.
Try uploading a bunch of PDF bank statements to notebooklm and ask it questions. Or the results of blood work. It's jaw dropping. e.g. uploaded 7 brokerage account statements as PDFs in a mess of formats and asked it to generate table summary data which it nailed, and then asked it to generate actual trades to go from current position to a new position in shortest path, and it nailed that too.
Biggest issue we have when using notebooklm is a lack of ambition when it comes to the questions we're asking. And the pro version supports up to 300 documements.
Hell, we uploaded the entire Euro Cyber Resilience Act and asked the same questions we were going to ask our big name legal firm, and it nailed every one.
But you actually make a fair point, which I'm seeing too and I find quite exciting. And it's that even among my early adopter and technology minded friends, adoption of the most powerful AI tools is very low. e.g. many of them don't even know that notebookLM exists. My interpretation on this is that it's VERY early days, which is suuuuuper exciting for us builders and innovators here on HN.
Their new models excel at many things. Image editing, parsing PDFs, and coding are what I use it for. It's significantly cheaper than the closest competing models (Gemini 2.5 pro, and flash experimental with image generation).
Highly recommend testing against openai and anthropic models - you'll likely be pleasantly surprised.
While there are some first-party B2C applications like chat front-ends built using LLMs, once mature, the end game is almost certainly that these are going to be B2B products integrated into other things. The future here goes a lot further than ChatGPT.
Reddit was an interesting case here. They knew that they had particularly good AI training data, and they were able to hold it hostage from the Google crawler, which was an awfully high risk play given how important Google search results are to Reddit ads, but they likely knew that Reddit search results were also really important to Google. I would love to be able to watch those negotiations on each side; what a crazy high stakes negotiation that must've been.
Say what you will, but there's a lot of good answers to real questions people have that's on Reddit. There's a whole thing where people say "oh Google search results are bad, but if you append the word 'REDDIT' to your search, you'll get the right answer." You can see that most of these agents rely pretty heavily from stuff they find on Reddit.
Of course, that's also a big reason why Google search results suggest putting glue on pizza.
This is an underrated comment. Yes it's a big advantage and probably a measurable pain point for Anthropic and OpenAI. In fact you could just do a 1% survey of robots.txt out there and get a reasonable picture. Maybe a fun project for an HN'er.
This is right on. I work for a company with somewhat of a data moat and AI aspirations. We spend a lot of time blocking everyone's bots except for Google. We have people whose entire job is it to make it faster for Google to access our data. We exist because Google accesses our data. We can't not let them have it.
I can, and I would say it's a likely scenario, say 30%. If they don't have a significant edge over their competitors in the capabilities of their models, what's left? A money losing web app, and some API services that I'm sure aren't very profitable either. They can't compete with Google, Grok, Meta, MS, Amazon... They just can't.
I don't think the issue is solving the technical implementation of a new social media platform. The issue is whether a new social media platform from OpenAI will deliver the kind of value that existing platforms deliver. If they promise investors that they'll get TikTok/Meta/YouTube levels of content+interaction (and all the data that comes with it), but deliver Mastodon levels, then they are in trouble.
class ThinkingConfig(_common.BaseModel):
"""The thinking features configuration."""
include_thoughts: Optional[bool] = Field(
default=None,
description="""Indicates whether to include thoughts in the response. If true, thoughts are returned only if the model supports thought and thoughts are available.
""",
)
thinking_budget: Optional[int] = Field(
default=None,
description="""Indicates the thinking budget in tokens.
""",
)
That thinking_budget thing is documented, but what's the deal with include_thoughts? It sounds like it's an option to have the API return the thought summary... but I can't figure out how to get it to work, and I've not found documentation or example code that uses it.
Anyone managed to get Gemini to spit out thought summaries in its API using this option?
The API won't give you the "thinking" tokens, those are only visible on AI studio. Probably to try to stop distillation, very disappointing. I find reading the cot to be incredibly informative to identify failure modes.
> Hey Everyone,
> Moving forward, our team has made a decision to only show thoughts in Google AI Studio. Meaning, we no longer return thoughts via the Gemini API. Here is the updated doc to reflect that.
Maybe they just updated it? Or people aren't on the same page at Google idk
Previously it said
> Models with thinking capabilities are available in Google AI Studio and through the Gemini API. Note that the thinking process is visible within Google AI Studio but is not provided as part of the API output.
I maintain an alternative client which I build from the API definitions at https://github.com/googleapis/googleapis, which according to https://github.com/googleapis/python-genai/issues/345 should be the right place. But neither the AI Studio nor the Vertex definitions even have ThinkingConfig yet - very frustrating. In general it's amazing how much API munging is required to get a working client from the public API definitions.
In AI Studio the flash moddels has two toggles: Enable thinking and Set thinking budget. If thinking budget is enabled, you can set tue max number of tokens it can use to think, else it's Auto.
Gemini models are very good but in my experience they tend to overdo the problems. When I give it things for context and something to rework, Gemini often reworks the problem.
For software it is barely useful because you want small commits for specific fixes not a whole refactor/rewrite. I tried many prompts but it's hard. Even when I give it function signatures of the APIs the code I want to fix uses, Gemini rewrites the API functions.
If anybody knows a prompt hack to avoid this, I'm all ears. Meanwhile I'm staying with Claude Pro.
Yes, it will add INSANE amounts of "robust error handling" to quick scripts where I can be confident about assumptions. This turns my clean 40 lines of Python where I KNOW the JSONL I am parsing is valid into 200+ lines filled with ten new try except statements. Even when I tell it not to do this, it loves to "find and help" in other ways. Quite annoying. But overall it is pretty dang good. It even spotted a bug I missed the other day in a big 400+ line complex data processing file.
I didn't realize this was a bigger trend, I asked it to write a simple testing script that POSTed a string to a local HTTP server as JSON, and it wrote a 40 line script, handling any possible error. I just wanted two lines.
I wonder how much of that sort of thing is driven by having trained their models on their own internal codebases? Because if that's the case, careful and defensive being the default would be unsurprising.
Here's what I found to be working (not 100% but it gives much better and consistant results)
Basically, I ask it to repeat at the start of each message some rules :
"From now on, you must repeat and comply the following rules at the top of all your messages onwards:
- I will never rewrite API functions. Even if I think it's a good idea, it is a bad idea. I will keep the API function as it is and it is perfect like that.
- I will never add extra input validation. Even if I think it's a good idea, it is a bad idea. I will keep the function without validation and it is perfect like that.
- ...
- If I violate any of those rules, I did a bad job.
"
Forcing it to repeat things make the model output more aligned and focused in my experience.
The model is good to solve problems, but is very difficult to control the unnecessary changes that the model does in the rest of the code. Also it adds a lot of unnecessary comments, even when I explicitly say to not add.
For now Deepseek R1 and V3 it's working better to me, producing more predictable results and capturing better my intentions (not tried Claude yet).
I have been having similar performance issues, I believe they intentionally made a worse model (Gemini 2.5) to get more money out of you. However, there is a way where you can make money off of Gemini 2.5.
If you set the thinking parameter lower and lower, you can make the model spew absolute nonsense for the first response. It costs 10 cents per input / output, and sometimes you get a response that was just so bad your clients will ask for more and more corrections.
I find it baffling that Google offers such impressive models through the API and even the free AI Studio with fine-grained control, yet the models used in the Gemini app feel much worse.
Over the past few weeks, I’ve been using Gemini Advanced on my Workspace account. There, the models think for shorter times, provide shorter outputs, and even their context window is far from the advertised 1 million tokens. It makes me think that Google is intentionally limiting the Gemini app.
Perhaps the goal is to steer users toward the API or AI Studio, with the free tier that involves data collection for training purposes.
This might have changed after you posted your comment, but it looks like 2.5 Pro and 2.5 Flash are available in the Gemini app now, both web and mobile.
Oh, I didn’t mean to say that these models were unavailable through the app or website. Rather, I’ve realized that using them through the API or AI Studio yields much better results — even in the free tier.
You can check that by trying prompts with complex instructions and long inputs/outputs.
For instance, ask Gemini to generate notes from a specific source (say, a book or class transcription). Or ask it to translate a long article, full of idiomatic expressions, while maintaining high fidelity to the source. You will see that the very same Gemini models are underutilized on the app or the website, while their performance is stellar on the API or AI Studio.
That does work in Google’s favor. Users who are technical enough to want a better model eventually learn about AI Studio, while the rest are none the wiser.
Error
An error occurred in the Server Components render. The specific message is omitted in production builds to avoid leaking sensitive details. A digest property is included on this error instance which may provide additional details about the nature of the error.
Does it only use a few recent comments or entire history? I'm trying to figure out where it figured out my city when I thought I was careful not to reveal it. I'm scrolling back pages without finding where I said it in the past. Could it have inferred it based on other information or hallucinated it?
I wonder if there's a more opsec-focused version of this.
There's an important difference between Gemini and Claude that I'm not sure how to quantify. I often use shell-connected LLMs (LLMs with a shell tool enabled) to take care of basic CSV munging / file-sorting tasks for me - I work in data science so there's a lot of this. When I ask Claude to do something, it carefully looks at all the directories and files before doing anything. Gemini, on the other hand, blindly jumps in and just starts moving stuff around. Claude executes more tools and is a little slower, but it almost always gets the right answer because it appropriately gathers the right context before really trying to solve the problem. Gemini doesn't seem to do this at all, but it makes a world of difference for my set of problems. Curious to see if others have had the same experience or if its just a quirk of my particular set of tasks
Claude has always been the best at coding, no matter what all the benchmarks says, the people have spoken and the consensus is that Claude is the best.
Look up Claude Code, Cursor, Aider and VSCode's agent integration. Generally, tools to use AI more actively for development. There are others as well. Plenty of info around. Here's not the place for a tutorial.
It's interesting that there's a price nearly 6x price difference between reasoning and no reasoning.
This implies it's not a hybrid model that can just skip reasoning steps if requested.
Anyone know what else they might be doing?
Reasoning means contexts will be longer (for thinking tokens) and there's an increase in cost to inference with a longer context but it's not going to be 6x.
Based on their graph, it does look explicitly priced along their “Pareto Frontier” curve. I’m guessing that is guiding the price more than their underlying costs.
It’s smart because it gives them room to drop prices later and compete once other company actually get to a similar quality.
> This implies it's not a hybrid model that can just skip reasoning steps if requested.
It clearly is, since most of the post is dedicated to the tunability (both manual and automatic) of the reasoning budget.
I don't know what they're doing with this pricing, and the blog post does not do a good job explaining.
Could it be that they're not counting thinking tokens as output tokens (since you don't get access to the full thinking trace anyway), and this is the basically amortizing the thinking tokens spend over the actual output tokens? Doesn't make sense either, because then the user has no incentive to use anything except 0/max thinking budgets.
Does anyone know how this pricing works? Supposing I have a classification prompt where I need the response to be a binary yes/no. I need one token of output, but reasoning will obviously add far more than 6 additional tokens. Is it still a 6x price multiplier? That doesn't seem to make sense, but not does paying 6x more for every token including reasoning ones
Funny you should say that. Google just announced today that they are giving all college students one year of free Gemini advanced. I wonder how much that will actually move the needle among the youth.
Take-home assignments are basically obsolete. Students who want to cheat, can do so easily. Of course, in the end, they cheat themselves, but that's not the point.
I'd like to burst into a post a number of the unbelievable akin mishandlings of academic tasks I was reported, but. I do have a number of prize-worthy anecdotes that compete with yours. Nonetheless. Let us fight farce with rigour.
Even when the tasks are not in-depth, but easier to assess, you still require a /reliable evaluator/. LLMs are not. Could they be at least employed as a virtual assistant, "parse and suggest, then I'll check"? If so, not randomly ("pick a bot"), but in full awareness of the specific instrument. That stage is not here.
ChatGPT seems to have a name recognition / first-mover advantage with college students now, but is there any reason to think that will stick when today's high school students are using Gemini on their Chromebooks?
I built a product that uses and LLM and I got curious about the quality of the output from different models. It took me a weekend to go from just using OpenAI's API to having Gemini, Claude, and DeepSeek all as options and a lot of that time was research on what model from each provider that I wanted to use.
For enterprise practically any SaaS gets used as one more thing to lock them into a platform they already have a relationship with (either AWS, GCP or Azure).
It's actually pretty dangerous for the industry to have this much vertical integration. Tech could end up like the car industry.
I'm aware of that. I'm an EM for a large tech company that sells multiple enterprise SaaS product.
You're right that the lock in happens because of relationships, but most big enterprise SaaS companies have relationships with multiple vendors. My company relationships with AWS, Azure, and GCP and we're currently using products from all of them in different products. Even on my specific product we're using all three.
When you've already got those relationships, the lock in is more about switching costs. The time it takes to switch, the knowledge needed to train people internally on the differences after the switch, and the actual cost of the new service vs the old one.
With AI models the time to switch from OpenAI to Gemini is negligible and there's little retraining needed. If the Google models (now or in the future) are comparable in price and do a better job than OpenAI models, I don't see where the lock in is coming from.
There isn’t much of a lock-in, and that’s part of the problem the industry is going to face. Everyone is spending gobs of money on training and if someone else creates a better one next week, the users can just swap it right in. We’re going to have another tech crash for AI companies, similar to what happened in 2001 for .coms. Some will be winners but they won’t all be.
It seems more and more like AI is less of a product and more of a feature. Most people aren't going to care or even know about the model or the company who made it, they're just going to use the AI features built into the products they already use.
That's going to be true until we reach AGI, when there will be a qualitative difference and we will lose our ability to discern which is better since they're too far ahead of us.
How will it lock in the enterprise if its market share of enterprise customers is half that of Azure (Azure also sells OpenAI inference, btw), and one third that of AWS?
The same reason why people enjoy BigQuery enough that their only use of GCP is BigQuery while they put their general compute spend on AWS.
In other words, I believe talking about cloud market share as a whole is misleading. One cloud could have one product that's so compelling that people use that one product even when they use other clouds for more commoditized products.
Came to say this. No respectable CTO would ever push a Google product to their superiors knowing Google will kill it in 1-3 years and they’ll look foolish for having pushed it.
That isn't what I'm seeing with my clientele (lots of startups and mature non-tech companies). Most are using Azure but very few have started to engage AI outside the periphery.
No LLM is real time, and in fact, even a 2025 cut off isn't entirely realistic. Without guidance to say, a new version of a framework it will frequently "reference" documentation from old versions and use that.
It's somewhat real time when it searches the web, of course that data is getting populated into context rather than in training.
That's the web version (which has tools like search plugged in), other models in their official frontends (Gemini on gemini.google.com, GPT/o models on chatgpt.com) are also "real time". But when served over API, most of those models are just static.
Not at all. The model weights and training data remain the same, it's just RAG'ing real-time twitter data into its context window when returning results. It's like a worse version of Perplexity.
They're doing the Apple strategy. Less spotlight for other third parties, and less awareness how they're lagging behind so that those already ignorantly locked into OpenAI would not switch. But at this point why would anyone do that when switching costs are low?
I don't think it is as good as deepseek v3 for updating coding tasks. I use 2.5 pro as architect with deepseek v3 as editor engineer. This combination work almost so perfect. Flash has advantages of great context size though. gpt 4.1 is also great.
I am only on OpenAI because they have a native Mac app. Call me old-school but my preferred workflow is still for the most part just asking narrow questions and copying-pasting back and forth. I've been playing with Junie (Jetbrain's AI agent) for a couple of days, but I still don't trust agents to run loose in my codebase for any sizeable amount of work.
Does anyone know if Google is planning native apps? Or any wrapping interfaces that work well on a Mac?
Google is totally back in the game now, but it’s still going to take a lot more for them at this point to overcome OpenAI’s “first‑mover advantage” (clearly the favorite among younger users atm).
Prompt: "can you find any mistakes on my codebase? I put one in there on purpose" + 70,000 tokens of codebase where in one line I have an include for a non-existent file.
Claude 3.7 Sonnet (Try 1 - new chat) Claimed it Found a race condition in the api of the admin interface that would be triggered if two admins were to change the room order at the same time. Claude suggested I group all sql queries in a single transaction. I looked at the code and found that it already used a transaction for all queries. I said: the order_update api is already done with a transaction. Claude replied: "You're absolutely right, and I apologize for my mistake. I was incorrect to claim there was a race condition issue. The transaction ensures atomicity and consistency of the updates, and the SQL queries are properly structured for their intended purpose."
Claude 3.7 Sonnet (Try 2 - new chat) Found a comment in the admin ui javascript code that suggested a potential alternative to event handler cleanup that was not implemented because I decided to go with a cleaner route. Then asked "Is this the issue you intentionally included, or would you like me to look for other potential problems?" I said: "The comment merely suggests an alternative, right?" claude said: "Yes, you're absolutely right. The comment is merely suggesting an alternative approach that isn't being used in the code, rather than indicating a mistake. So there's no actual bug or mistake in this part of the code - just documentation of different possible approaches. I apologize for misinterpreting this as an issue!"
Claude 3.7 Sonnet (Try 3 - new chat) When processing items out of the database to generate QR codes in the admin interface, Claude says that my code both attempts to generate QR codes with undefined data AS WELL AS saying that my error handling skips undefined data. Claude contradicts itself within 2 sentences. When asking about clarification Claude replies: Looking at the code more carefully, I see that the code actually has proper error handling. I incorrectly stated that it "still attempts to call generateQRCode()" in the first part of my analysis, which was wrong. The code properly handles the case when there's no data-room attribute.
Gemnini Advanced 2.5 Pro (Try 1 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.
Gemnini Advanced 2.5 Pro (Try 2 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.
Gemnini Advanced 2.5 Pro (Try 3 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.
o4-mini-high and o4-mini and o3 and 4.5 and 4o - "The message you submitted was too long, please reload the conversation and submit something shorter."
Those responses are very Claude, to. 3.7 has powered our agentic workflows for weeks, but I've been using almost only Gemini for the last week and feel the output is better generally. It's gotten much better at agentic workflows (using 2.0 in an agent setup was not working well at all) and I prefer its tuning over Clause's, more to the point and less meandering.
I've been paying for googles pro llm for about six months. At 20 it feels steep considering the free version is very good. I'm a devops work, and it's been very helpful. Ive tried gpt, copilot, Mixtral, Claude, etc and Geminis 1.5 pro was what sold me. The new 2.0 stuff is even better. Anecdotally, Gemini seems to forget to add stuff but doesn't hallucinate as much. I've been doing some pretty complex scripting this last week purely on Gemini fast 2.0 and it's been really really good.
I am always overlooking anything Google due to the fact that they are the opposite of "Don't be evil" and because their developer's console (Google Cloud) is incredibly hostile to humans.
Today I reluctantly clicked on their "AI Studio" link in the press-release and I was pleasantly surprised to discover that AI Studio has nothing in common with their typical UI/UX. It's nice and I love it!
Yesterday I started working through How to design programs, and set up a chat with Gemini 2.5 asking it to be my tutor as I go through it and to help answer my questions if I don't understand a part of the book. It has been knowledgeable, helpful and capable of breaking down complex things that I couldn't understand into understandable things. Fantastic all around.
It appears that this impacted gemini-2.5-pro-preview-03-25 somehow? grounding with google search no longer works.
I had a workflow running that would pull news articles from the past 24 hours. It now refuses to believe the current date is 2025-04-17. Even with search turned on and I ask it what the date is it and it always replies sometime in July 2024.
As a person mostly using AI for everyday tasks and business-related research, it's very impressive how quickly they've progressed. I would consider all models before 2.0 totally unusable. Their web interface, however, is so much worse than that of the ChatGPT macOS app.
Some aren't even at 2.0, and the version numbers aren't related in any way to their... generation? Also, what is so good about the ChatGPT app, specifically on macOS that makes it better?
At this point, at the current pace of AI model development, I feel like I can't tell which one is better. I usually end up using multiple LLMs to get a task done to my taste. They're all equally good and bad. It's like using GCP vs AWS vs Azure all over again, except in the AI space.
I’m not familiar with Python internals, so when I tried to convert a public AI model (not a LLM) to run locally, I got some problems no other AI could help. Asked Gemini 2.5 and it pin pointed the problem immediately. It solution was not practical but I guess it also works.
If this announcement is targeting people not up-to-date on the models available, I think they should say what "flash" means. Is there a "Gemini (non-flash)"?
I see the 4 Google model names in the chart here. Are these 4 the main "families" of models to choose from?
Gemini has had 4 families of models, in order of decreasing size:
- Ultra
- Pro
- Flash
- Flash-Lite
Versions with `-Preview` at the end haven't had their "official release" and are technically in some form of "early access" (though I'm not totally clear on exactly what that means given that they're fully available and as of 2.5 Pro Preview, have pricing attached to them - earlier versions were free during Preview but had pretty strict rate limiting but now it seems that Preview models are more or less fully usable).
The free-with-small-rate-limits designator was "experimental", not "preview".
I think the distinction between preview and full release is that the preview models have no guarantees on how long they'll be available, the full release comes with a pre-set discontinuation date. So if want the stability for a production app, you wouldn't want to use a preview model.
Nice! Low price, even with reasoning enabled. I have been working on a short new book titled “Practical AI with Google: A Solo Knowledge Worker's Guide to Gemini, AI Studio, and LLM APIs” but with all of Google’s recent announcements it might not be a short book.
If OpenAI offers Codex and Anthropic offers Claude Code, is there a CLI integration that Google recommends for using Gemini 2.5? That’s what’s keeping me, for now, with the other two.
I am building a knowledge graph using BAML [baml-py] to extract documents [it's opinionated towards docs] and then PySpark to ETL the data into a node / edge list. GPT4o got few relations... Gemini 2.5 got so many it was nuts, all accurate but not all from the article! I had to reign it in and instruct it not to build so vast a graph. Really cool, it knows a LOT about semiconductors :)
1. The main transformative aspect of LLMs has been in writing code.
2. LLMs have had less transformative aspects in 2025 than we anticipated back in late 2022.
3. LLMs are unlikely to be very transformative to society, even as their intelligence increases, because intelligence is a minor changemaker in society. Bigger changemakers are motivation, courage, desire, taste, power, sex and hunger.
4. LLMs are unlikely to develop these more important traits because they are trained on text, not evolved in a rigamarole of ecological challenges.
Gemini has the annoying habit of delegating tasks to me. Most recently I was trying to find out how to do something in FastRawViewer that I couldn't find a straightforward answer on. After hallucinating a bunch of settings and menus that don't exist, it told me to read the manual and check the user forums. So much for saving me time.
Very excited to try it, but it is noteworthy that o4-mini is strictly better according to the very benchmarks shown by Google here.
Of course it's about 4x as expensive too (I believe), but still, given the release of openai/codex as well, o4-mini will remain a strong competitor for now.
Is everyone on here solely evaluating the models on their programming capabilities? I understand this is HN but vibe coding LLM tools won't be able to sustain the LLM industry (let's not call it AI please)
How is this sustainable for Google from business POV? It feels like Google is shooting itself in the foot while "winning" the AI race.. From my experience I think Google lost 99% of the ads it used to show me before in the search engine.
The pricing table image in the article really should have included Gemini 2.5 pro. Sure, it could be after Flash to the right, but it would help people understand the price performance benefits of 2.5 Flash.
One place where I feel gemini models lag is function calling and predicting correct arguments to function calls, is there a benchmark which scores models on the basis of this ?
What I am noticing with every new Gemini model that comes out is that the time to first token (TTFT) is not great. I guess it is because they gradually transfer computer power from old models to new models as the demand increases.
I want to think that this is all great, but the fact that this is also one of the best way to collect unsuspecting user data by default without explicit consent just doesn't feel right -- that applies to most people who would never have a chance reading this comment.
I don't want to be angry but screw these default opt-in to have your privacy violated free stuff.
Before you jump in to say you can pay to keep your privacy, stop and read again.
Absolutely decimated on metrics by o4-mini, straight out of the gate, and not even that much cheaper on output tokens (o4-mini's thinking can't be turned off IIRC).
66.8% error rate reduction for o4-mini on AIME2025, and 21% error rate reduction on MMMU isn't "slightly higher". It'll be quite noticeable in practice.
It's good to see some actual competition on this price range! A lot of Flash 2.5's edge will depend on how well the dynamic reasoning works. It's also helpful to have _significantly_ lower input token cost for a large context use cases.
Anecdotally o4-mini doesn’t perform as well on video understanding tasks in our pipeline, and also in Cursor it seems really not great.
During one session, it read the same file (same lines) several times, ran ‘python -c ‘print(“skip!”)’’ for no reason, and then got into another file reading loop. Then after asking a hypothetical about the potential performance implications of different ffmpeg flags, it claimed that it ran a test and determined conclusively that one particular set was faster, even though it hadn’t even attempted a tool call, let alone have the results from a test that didn’t exist.
I've been leveraging the services of 3 LLMs, mainly: Meta, Gemini, and Copilot.
It depends on what I'm asking. If I'm looking for answers in the realm of history or culture, religion, or I want something creative such as a cute limerick, or a song or dramatic script, I'll ask Copilot. Currently, Copilot has two modes: "Quick Answer"; or "Think Deeply", if you want to wait about 30 seconds for a good answer.
If I want info on a product, a business, an industry or a field of employment, or on education, technology, etc., I'll inquire of Gemini.
Both Copilot and Gemini have interactive voice conversation modes. Thankfully, they will also write a transcript of what we said. They also eagerly attempt to engage the user with further questions and followups, with open questions such as "so what's on your mind tonight?"
And if I want to know about pop stars, film actors, the social world or something related to tourism or recreation in general, I can ask Meta's AI through [Facebook] Messenger.
One thing I found to be extremely helpful and accurate was Gemini's tax advice. I mean, it was way better than human beings at the entry/poverty level. Commercial tax advisors, even when I'd paid for the Premium Deluxe Tax Software from the Biggest Name, they just went to Google stuff for me. I mean, they didn't even seem to know where stuff was on irs.gov. When I asked for a virtual or phone appointment, they were no-shows, with a litany of excuses. I visited 3 offices in person; the first two were closed, and the third one basically served Navajos living off the reservation.
So when I asked Gemini about tax information -- simple stuff like the terminology, definitions, categories of income, and things like that -- Gemini was perfectly capable of giving lucid answers. And citing its sources, so I could immediately go find the IRS.GOV publication and read it "from the horse's mouth".
Oftentimes I'll ask an LLM just to jog my memory or inform me of what specific terminology I should use. Like "Hey Gemini, what's the PDU for Ethernet called?" and when Gemini says it's a "frame" then I have that search term I can plug into Wikipedia for further research. Or, for an introduction or overview to topics I'm unfamiliar with.
LLMs are an important evolutionary step in the general-purpose "search engine" industry. One problem was, you see, that it was dangerous, annoying, or risky to go Googling around and click on all those tempting sites. Google knew this: the dot-com sites and all the SEO sites that surfaced to the top were traps, they were bait, they were sometimes legitimate scams. So the LLM providers are showing us that we can stay safe in a sandbox, without clicking external links, without coughing up information about our interests and setting cookies and revealing our IPv6 addresses: we can safely ask a local LLM, or an LLM in a trusted service provider, about whatever piques our fancy. And I am glad for this. I saw y'all complaining about how every search engine was worthless, and the Internet was clogged with blogspam, and there was no real information anymore. Well, perhaps LLMs, for now, are a safe space, a sandbox to play in, where I don't need to worry about drive-by-zero-click malware, or being inundated with Joomla ads, or popups. For now.
Honestly, the best part about Gemini, especially as a consumer product, is their super lax, or lack thereof, ratelimits. They never have capacity issues, unlike Claude which always feels slow or sometimes outright rejects requests during peak hours. Gemini is constantly speedy and has extremely generous context window limits on the Gemini apps.
Dang - Google finally made a quality model that doesn’t make me want to throw my computer out a window. It’s honest, neutral and clearly not trained by the ideologically rabid anti-bias but actually super biased regime.
Did I miss a revolt or something in googley land? A Google model saying “free speech is valuable and diverse opinions are good” is frankly bizarre to see.
Downvote me all you want - the fact remains that previous Google models were so riddled with guardrails and political correctness that it was practically impossible to use for anything besides code and clean business data. Random text and opinion would trigger a filter and shut down output.
Even this model criticizes the failures of the previous models.
IMO I will not use Grok while it's owned and related to Elon, not only do I not trust their privacy and data usage (not that I "really" trust open AI/Google etc) I just despise him.
It would have to be very significantly better for me to use it.
Interesting that the output price per 1M tokens is $0.6 for non-reasoning, but $3.5 for reasoning. This seems to defy common assumption of how reasoning models work, and you tweak the <think> token probability to control how much thinking it does, but underlying it's the same model and the same inference code path.
I just wish the whole industry would stop using terms like thinking and reasoning. This is not what's happening. If we could come up with more appropriate terms that don't treat these models like they're human then we'd be in a much better place. That aside, it's cool to see the advancement of Google's offering.
Do you think any machine will ever be able to think and/or reason? Or is that a uniquely human thing? and do you have a rational standard to judge when something is reasoning or thinking, or just vibes?
I'm asking because I wonder how much of that common attitude is just a sort of species-chauvinism. You are feeling anxious because machines getting smarter, you are feeling anger because "they" are taking your job away, but the machine doesn't do that, its people with an ideology that do that, you should be angry at that instead.
No matter how good the new Gemini models have become, my bad experience with early Gemini is still stuck with me and I am afraid I still suffer from confirmation bias. Whenever I just look at the Gemini app, I already assume it’s going to be a bad experience.
I tried this prompt in both Gemini 2.5 Pro, and in ChatGPT.
"Draw me a timeline of all the dynasties of China. Imagine a horizontal line. Start from the leftmost point and draw segments for the start and end of each dynasty. For periods where multiple dynasties existed simultaneously draw parallel lines or boxes to represent the concurrent rule."
Gemini's response: "I'm just a language model, so I can't help you with that."
Is it possible that a community of people who are constantly pushing LLMs to their limits would be most aware of their limitations, and so more inclined to think they are junk?
In terms of business utility, Google has had great releases ever since the 2.0 family. Their models have never missed some mark --- either a good price/performance ratio, insane speeds, novel modalities (they still have the only API for autoregressive image generation atm), state-of-the-art long context support and coding ability (Gemini 2.5), etc.
However, most average users are using these models through a chat-like UI, or via generic tools like Cursor, which don't really optimize their pipelines to capture the strengths of different models. This way, it's very difficult to judge a model objectively. Just look at the obscene sycophancy exhibited by chatgpt-4o-latest and how it lifted LMArena scores.
Just the fact that everyone on HN is always telling us how LLMs are useless but that Gemini is the best of them convinces me of the opposite. No one who can't find a use for this technology is really informed on the subject. Hard to take them seriously.
and now... Google’s latest innovation: programmable overthinking.
With Gemini 2.5 Flash, you too can now set a thinking_budget—because nothing says "state-of-the-art AI" like manually capping how long it’s allowed to reason. Truly the dream: debugging a production outage at 2am wondering if your LLM didn’t answer correctly because you cheaped out on tokens. lol.
“Turn thinking off for better performance.” That’s not a model config, that’s a metaphor for Google’s entire AI strategy lately.
At this point, Gemini isn’t an AI product—it’s a latency-cost-quality compromise simulator with a text interface. Meanwhile, OpenAI and Anthropic are out here just… cooking the benchmarks
Google's Gemini 2.5 pro model is incredibly strong, it's en par and at times better than Claude 3.7 in coding performance, being able to ingest entire videos into the context is something I haven seen elsewhere either. Google AI products have been anywhere between bad (Bard) to lackluster (Gemini 1.5), but 2.5 is a contender, in all dimensions. Google is also the only player that owns the entire stack, from research, software , data, compute hardware. I think they were slow to start but they've closed the gap since.
Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past.
Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's models in general. It not only is smarter than me on most of the subjects I engage with it, it also isn't completely obsequious. The model pushes back on me rather than contorting itself to find a way to agree.
100% of my casual AI usage is now in Gemini and I look forward to asking it questions on deep topics because it consistently provides me with insight. I am building new tools with the mind to optimize my usage to increase it's value to me.
After comparing Gemini Pro and Claude Sonnet 3.7 coding answers side by side a few times, I decided to cancel my Anthropic subscription and just stick to Gemini.
One of the main advantages Anthropic currently has over Google is the tooling that comes with Claude Code. It may not generate better code, and it has a lower complexity ceiling, but it can automatically find and search files, and figure out how to fix a syntax error fast.
As another person that cancelled my Claude and switched to Gemini, I agree that Claude Code is very nice, but beyond some initial exploration I never felt comfortable using it for real work because Claude 3.7 is far too eager to overengineer half-baked solutions that extend far beyond what you asked it to do in the first place.
Paying real API money for Claude to jump the gun on solutions invalidated the advantage of having a tool as nice as Claude Code, at least for me, I admit everyone's mileage will vary.
Exactly my experience as well. Started out loving it but it almost moves too fast - building in functionality that i might want eventually but isn't yet appropriate for where the project is in terms of testing, or is just in completely the wrong place in the architecture. I try to give very direct and specific prompts but it still has the tendency to overreach. Of course it's likely that with more use i will learn better how to rein it in.
I've experienced this a lot as well. I also just yesterday had an interesting argument with claude.
It put an expensive API call inside a useEffect hook. I wanted the call elsewhere and it fought me on it pretty aggressively. Instead of removing the call, it started changing comments and function names to say that the call was just loading already fetched data from a cache (which was not true). I could not find a way to tell it to remove that API call from the useEffect hook, It just wrote more and more motivated excuses in the surrounding comments. It would have been very funny if it weren't so expensive.
Geez, I'm not one of the people who think AI is going to wake up and wipe us out, but experiences like yours do give me pause. Right now the AI isn't in the drivers seat and can only assert itself through verbal expression, but I know it's only a matter of time. We already saw Cursor themselves get a taste of this. To be clear I'm not suggesting the AI is sentient and malicious - I don't believe that at all. I think it's been trained/programmed/tuned to do this, though not intentionally, but the nature of these tools is they will surprise us
> but the nature of these tools is they will surprise us
Models used to do this much much more than now, so what it did doesn't surprise us.
The nature of these tools is to copy what we have already written. It has seen many threads where developers argue and dig in, they try to train the AI not to do that but sometimes it still happens and then it just roleplays as the developer that refuses to listen to anything you say.
I almost fear more that we'll create Bender from Futurama than some superintelligent enlightened AGI. It'll probably happen after Grok AI gets snuck some beer into its core cluster or something absurd.
> We already saw Cursor themselves get a taste of this.
Sorry what do you mean by this?
Earlier this week a Cursor AI support agent told a user they could only use Cursor on one machine at a time, causing the user to cancel their subscription.
I wanted some powershell code to do some sharepoint uploading. It created a 1000 line logging module that allowed me to log things at different levels like info, debug, error etc. Not really what I wanted.
"Don't be a keener. Do not do anything I did not ask you to do" are def part of my prompts when using Claude
Open Codex (A codex fork) that supports gemini and openrouter providers https://github.com/ymichael/open-codex
google models on cli are great.
+1 Open Codex is very nice. Yesterday I was using it with Gemini APIs and also using a local model using Ollama running on my laptop.
I added a very short chapter on setting this up (direct link to my book online): https://leanpub.com/ollama/read#using-the-open-codex-command...
This morning I tweaked my Open Codex config to also try gemma3:27b-it-qat - and Google’s olen source small is excellent: runs fast enough for a good dev experience, with very good functionality.
Whats your setup/workflow then?
Any ide integration?
I've switched to aider with the --watch-files flag. Being able to use models in nvim with no additional tooling is pretty sweet
Typing `//use this as reference ai` in one file and `//copy this row to x ai!` and it will add those functions/files to context and act on both places. Altough I wish Aider would write `working on your request...` under my comment, now I have to keep Aider window in sight. Autocomplete and "add to context" and "enter your instructions" of other apps feel clunky.
That's really cool. I've been looking for a nicer solution to use with nvim.
I don't understand the appeal of investing in leaning and adapting your workflow to use an AI tool that is so tightly coupled to a single LLM provider, when there are other great AI tools available that are not locked to a single LLM provider. I would guess aider is the closest thing to claude code, but you can use pretty much any LLM.
The LLM field is moving so fast that what is the leading frontier model today, may not be the same tomorrow.
Pricing is another important consideration. https://aider.chat/docs/leaderboards/
All the AI tools end up converging on a similar workflow: type what you want and interrupt if you're not getting what you want.
There are at least 10 projects currently aiming to recreate Claude Code, but for Gemini. For example, geminicodes.co by NotebookLM’s founding PM Raiza Martin
Tried Gemini Codes yesterday, as well as anon-kode and anon-codex. Gemini Codes is already broken and appears to be rather brittle (she disclosures as much), and the other two appear to still need some prompt improvements or someone adding vector embedding for them to be useful?
Perhaps someone can merge the best of Aider and codex/claude code now. Looking forward to it.
Google need to fix their Gemini web app at a basic level. It's slow, gets stuck on Show Thinking, rejects 200k token prompts that are sent one shot. Aistudio is in much better shape.
But have you tried any other interfaces for Gemini? Like the Gemini Code Assistant in VSCode? Or Gemini-backed Aider?
Have you tried them? Which one is fairly simple but just works?
+1 on this. Improving Gemini apps and live mode will go such a long way for them. Google actually has the best model line-up now but the apps and APIs hold them back so much.
I hate how I can copy paste long text into Claude (becomes a pasted text) and it is accepted, but in Gemini it is limited.
You can paste it in a text file and upload that. A little annoying compared to claude, but does work.
Thanks, will give it a try.
Uploading files on google is now great. I uploaded my python script and the text data files I was using the script to process. I asked it how best to optimize the code. It actually ran the python code on the data files. Then recommended changes then when prompted ran the script again to show the new results. At first I was like maybe hallucinating but no the data was correct.
Yeah "they" run Python code now quite well. They generate some output using Python "internally" (albeit shows you the code).
I use roo code with Gemini to get similar results for free
Does its agentic features work with any API? I had tried this or Cline and it was clear that they work effectively only with Claude's tooling support.
Yes. Any API Key is allowed, Also you can assign different LLMs for different modes. It is great for cost-optimization. Like architect, code, ask, debug etc.
Related:
Only Claude (to my knowledge) has a desktop app which can directly, and usually quite intelligently, modify files and create repos on your desktop. It's the only "agentic" option among the major players.
"Claude, make me an app which will accept Stripe payments and sell an ebook about coding in Python; first create the app, then the ebook."
It would take a few passes but Claude could do this; obviously you can't do that with an API alone. That capability alone is worth $30/month in my opinion.
Maybe I am not understanding something here.
But there are third party options availabe that to the very same thing (e.g. https://aider.chat/ ) which allow you to plug in a model (or even a combination thereof e.g. deepseek as architect and claude as code writer) of your choice.
Therefore the advantage of the model provider providing such a thing doesn't matter, no?
Aider is not agentic - it is interactive by design. Copilot agent mode and Cline would better comparisons.
OpenAI launched codex 2 days ago, there's open forks already that support other providers too
there's also claude code proxy's to run it on local llm's
you can just do things
A first party app, sure, but there's no shortage of third party options. Cursor, Windsurf/Codeium etc. Even VSCode has agent mode now.
> first create the app, then the ebook."
> It would take a few passes but Claude could do this;
I'm sorry but absolutely nothing I've seen from using Claude indicates that you could give it a vague prompt like that and have it actually produce anything worth reading.
Can it output a book's worth of bullshit with that prompt? Yes. But if you think "write a book about Python" is where we are in the state of the art in language models in terms of the prompt you need to get a coherent product, I want some of whatever you are smoking because that has got to be the good shit
OpenAI just released Codex, which is basically the same as Claude Code.
It looks the same, but for some reason Claude Code is much more capable. Codex got lost in my source code and hallucinated bunch of stuff, Claude on the same task just went to town, burned money and delivered.
Of course, this is only my experience and codex is still very young. I really hope it becomes as capable as Claude.
Part of it is probably tgat claude is just better at coding than what openai has available. I am considering trying to hack in support for gemini into codex and play around with it.
I was doing this last night with open-codex, a fork. https://github.com/ymichael/open-codex
Copilot agent mode?
[dead]
Firebase Studio is the Google equivalent
Also the "project" feature in claude improves experience significantly for coder, where you can customize your workflow. Would be great if gemini has this feature.
Yes, IME, Anthropic seemed to be ahead of Google by a decent amount with Sonnet 3.5 vs 1.5 Pro.
However, Sonnet 3.7 seemed like a very small increase, whereas 2.5 Pro seemed like quite a leap.
Now, IME, Google seems to be comfortably ahead.
2.5 Pro is a little slow, though.
I'm not sure which model Google uses for the AI answers on search, but I find myself using Search for a lot of things I might ask Gemini (via 2.5 Pro) if it was as fast as Search's AI answers.
How's is the speed of Gemini vs 3.7?
I use both, Gemini 2.5 Pro is significantly slower than Claude 3.7.
Yeah I have read gemini pro 2.5 is a much bigger model.
I've been using Gemini 2.5 and Claude 3.7 for Rust development and I have been very impressed with Claude, which wasn't the case for some architectural discussions where Gemini impressed with it's structure and scope. OpenAI 4.5 and o1 have been disappointing in both contexts.
Gemini doesn't seem to be as keen to agree with me so I find it makes small improvements where Claude and OpenAI will go along with initial suggestions until specifically asked to make improvements.
I have noticed Gemini not accepting an instruction to "leave all other code the same but just modify this part" on a code that included use of an alpha API with a different interface than what Gemini knows is the correct current API. No matter how I promoted 2.5 pro, I couldn't get it to respect my use of the alpha API, it would just think I must be wrong.
So I think patterns from the training data are still overriding some actual logic/intelligence in the model. Or the Google assistant fine-tuning is messing it up.
I have been using gemini daily for coding for the last week, and I swear that they are pulling levers and A/B testing in the background. Which is a very google thing to do. They did the same thing with assistant, which I was a pretty heavy user of back in the day (I was driving a lot).
I have had a few epic refactoring failures with Gemini relative to Claude.
For example: I asked both to change a bunch of code into functions to pass into a `pipe` type function, and Gemini truly seemed to have no idea what it was supposed to do, and Claude just did it.
Maybe there was some user error or something, but after that I haven’t really used Gemini.
I’m curious if people are using Gemini and loving it are using it mostly for one-shotting, or if they’re working with it more closely like a pair programmer? I could buy that it could maybe be good at one but bad at the other?
This has been my experience too. Gemini might be better for vibe coding or architecture or whatever, but Claude consistently feels better for serious coding. That is, when I know exactly how I want something implemented in a large existing codebase, and I go through the full cycle of implementation, refinement, bug fixing, and testing, guiding the AI along the way.
It also seems to be better at incorporating knowledge from documentation and existing examples when provided.
My experience has been exactly the opposite - Sonnet did fine on trivial tasks, but couldn't e.g. fix a bug end-to-end (from bug description in the tracker to implementing the fix and adding tests) properly because it couldn't understand how the relevant code worked, whereas Gemini would consistently figure out the root cause and write decent fix & tests.
Perhaps this is down to specific tools and their prompts? In my case, this was Cursor used in agent mode.
Or perhaps it's about the languages involved - my experiments were with TypeScript and C++.
> Gemini would consistently figure out the root cause and write decent fix & tests.
I feel like you might be using it differently to me. I generally don't ask AI to find the cause of a bug, because it's quite bad at that. I use it to identify relevant parts of the code that could be involved in the bug, and then I come up with my own hypotheses for the cause. Then I use AI to help write tests to validate these hypotheses. I mostly use Rust.
I used to use them mostly in "smart code completion" mode myself until very recently. But with all the AI IDEs adding agentic mode, I was curious to see how well that fares if I let it drive.
And we aren't talking about trivial bugs here. For TypeScript, the most impressive bug it handled to date was an async race condition due to missing await causing a property to be overwritten with invalid value. For that one I actually had to do some manual debugging and tell it what I observed, but given that info, it was able to locate the problem in the code all by itself and fix it correctly and come up with a way to test it as well.
For C++, the codebase in question was gdb, the bug was a test issue, and it correctly found problematic code based solely on the test log (but I had to prod it a bit in the right direction for the fix).
I should note that this is Gemini Pro 2.5 specifically. When I tried Google's models previously (for all kinds of tasks), I was very unimpressed - it was noticeably worse than other SOTA models, so I was very skeptical going into this. Indeed, I started with Sonnet precisely because my past experience indicated that it was the best option, and I only tried Gemini after Sonnet fumbled.
I use it for basically everything I can, not just code completion, including end-to-end bug fixes when it makes sense. But most of the time even the current Gemini and Claude models fail with the hard things.
It might be because most bugs that you would encounter in other languages don't occur in the first place in Rust because of the stronger type system. The race condition one you mentioned wouldn't be possible for example. If something like that would occur, it's a compiler error and the AI fixes it while still in the initial implementation stage by looking at the linter errors. I also put a lot of effort into trying to use coding patterns that do as much validation as possible within the type system. So in the end all that's left are the more difficult bugs where a human is needed to assist (for now at least, I'm confident that the models are only going to get better).
Race conditions can span across processes (think async process communication).
That said I do wonder if the problems you're seeing are simply because there isn't that much Rust in the training set for the models - because, well, there's relatively little of it overall when you compare it to something like C++ or JS.
I've found that I need to point it to the right bit of logs or test output and narrow its attention by selectively adding to it's context. Claude 3.7 at least works well this way. If you don't it'll fumble around. Gemini hasn't worked as well for me though.
I partly wonder if different peoples prompt styles will lead to better results with different models.
I also cancelled my Anthropic yesterday, not because of Gemini but because it was the absolute worst time for Anthropic to limit their Pro plan to upsell their Max plan when there is so much competition out there
Manus.im also does code generation in a nice UI, but I’ll probably be using Gemini and Deepseek
No Moat strikes again
Just curious, what tool do you use to interface with these LLMs? Cursor? or Aider? or...
I’m on GitHub Copilot with VsCode Insiders, mostly because I don’t have to subscribe to one more thing.
They pretty quick to let you use the latest models nowadays.
I really like the open source Cline extension. It supports most of the model APIs, just need to copy/paste an API key.
Same here. Especially for native app development with swift I had way better results and just sticked with Gemini-2.5-*
Google has killed so many amazing businesses -- entire industries, even, by giving people something expensive for free until the competition dies, and then they enshittify hard.
It's cool to have access to it, but please be careful not to mistake corporate loss leaders for authentic products.
It's not free. And it's legit one of the best models. And it was a Google employee who was among the authors of the paper that's most recognized as kicking all this off. They give somewhat limited access in AIStudio (I have only hit the limits via API access, so I don't know what the chat UI limits are.) Don't they all do this? Maybe harder limits and no free API access. But I think most people don't even know about AIStudio.
True. They are ONLY good when they have competition. The sense of complacency that creeps in is so obvious as a customer.
To this day, the Google Home (or is it called Nest now?) speaker is the only physical product i've ever owned where it lost features over time. I used to be able to play the audio of a Youtube video (like a podcast) through it, but then Google decided that it was very very important that I only be able to play a Youtube video through a device with a screen, because it is imperative that I see a still image when I play a longform history podcast.
Obviously, this is a silly and highly specific example, but it is emblematic of how they neglect or enshittify massive swathes of their products as soon as the executive team loses interest and puts their A team on some shiny new object.
The experience on Sonos is terrible. There are countless examples of people sinking 1000s of dollars into Sonos ecosystem, and the new app update has rendered them useless.
It's mostly fixed now (5 room Sonos setup here). It's also a lot better at not dropping speakers off its network
I'm experiencing the same problem with my Google Home ecosystem. One day I can turn off the living room lights with the simple phrase "Turn off Living Room Lights," and then randomly for two straight days it doesn't understand my command
Preach it my friend. For years on the Google Home Hub (or Nest Hub or whatever) I could tell it to "favorite my photo" of what is on the screen. This allowed me to incrementally build a great list of my favorite photos on Google Photos and added a ton of value to my life. At some point that broke, and now it just says, "Sorry, I can't do that yet". Infuriating
The usage limit for experimental gets used up pretty fast in a vibe-coding situation. I found myself setting up an API account with billing enabled just to keep going.
(Public) corporate loss leaders? Cause they are all likely corporate.
Also, Anthropic is also subsidizing queries, no? The new “5x” plan illustrative of this?
No doubt anthropic’s chat ux is the best right now, but it isn’t so far ahead on that or holding some UX moat that I can tell.
How would I know if it’s useful to me without being able to trial it?
Googles previous approach (Pro models available only to Gemini Advanced subscribers, and Advanced trials can’t be stacked with Google One paid storage, or rather they convert the already paid storage portion to a paid, much shorter Advanced subscription!) was mind-bogglingly stupid.
Having a free tier on all models is the reasonable option here.
In this case, Google is a large investor in Anthropic.
I agree that giving away access to expensive models long term is not a good idea on several fronts. Personally, I subscribe to Gemini Advanced and I pay for using the Gemini APIs.
EDIT: a very good deal, at $10/month is https://apps.abacus.ai/chatllm/ that gives you access to almost all commercial models as well as the best open weight models. I have never come close at all to using my monthly credits with them. If you like to experiment with many models the service is a lot of fun.
The problem with tools like this is that somewhere in the chain between you and the LLM are token reducing “features”. Whether it’s the system prompt, a cheaper LLM middleman, or some other cost saving measure.
You’ll never know what that something is. For me, I can’t help but think that I’m getting an inferior service.
You can self host something like https://big-agi.com/ and grab your own keys from various providers. You end up with the above, but without the pitfalls you mentioned.
BIG-AI does look cool, and supports a different use case. ABACUS.AI takes your $10/month and gives you credits that go towards their costs of using OpenAI, Anthropic, Gemini, etc. Use of smaller open models use very few credits.
The also support an application development framework that looks interesting but I have never used it.
You might be correct about cost savings techniques in their processing pipeline. But they also add functionality: they bake web search into all models which is convenient. I have no affiliation with ABACUS.AI, I am just a happy customer. They currently let me play with 25 models.
If anyone from Kagi is on, I'd love to know, does Kagi do that?
Just look at Chrome to see the bard/gemini's future. HN folks didn't care about Chrome then but cry about Google's increasingly hostile development of Chrome.
Look at Android.
HN behaviour is more like a kid who sees the candy, wants the candy and eats as much as it can without worrying about the damaging effect that sugar will have on their health. Then, the diabetes diagnosis arrives and they complain
More and more people are coming to the realisation that Google is actually winning at the model level right now.
What’s with the Google cheer squad in this thread, usually it’s Google lost its way and is evil.
Can’t be employees cause usually there is a disclaimer
It's good to be aware of the likelihood of astroturfing. Everytime there's a newthread like this for one of the companies, there's a suspicious amount of similar plausible praise in an otherwise (sometimes brutally) skeptical forum.
The best way to shift or create consensus is to make everyone think everyone else's opinion has already shifted, and that consensus is already there. Emperor's new clothes etc..
Google can be evil and release impressive language models. The same way as Apple releasing incredible hardware with good privacy while also being a totally insufferable and arrogant company.
Gemini 2.5 is genuinely impressive.
Google employees only have to disclaimer when they're identified as Google employees.
So shit like "as a googler" requires "my opinions are my own yadda yadda"
I haven’t met a single person that uses Gemini. Companies are using Copilot and individuals are using ChatGPT.
Also, why would I want Google to spy on my AI usage? They’re evil.
why is Google more evil than say OpenAI ?
They're not more evil, it's probably a tie. What they do have is considerably more reach and power. As a result they're much more dangerous.
>obsequious
Thanks for the new word, I have to look it up.
"obedient or attentive to an excessive or servile degree"
Apparently it means an AI that mindlessly follow your logic and instructions without reasoning and articulation is not good enough.
Another useful word in this context is “sycophancy,” meaning excessive flattery or insincere agreement. Amanda Askell of Anthropic has used it to describe a trait they try to suppress in Claude:
https://youtube.com/watch?v=ugvHCXCOmm4&t=10286
The second example she uses is really important. You (used to) see this a lot in stackoverflow where an inexperienced programmer asks how to do some convoluted thing. Sure, you can explain how to do the thing while maintaining their artificial constraints. But much more useful is to say "you probably want to approach the problem like this instead". It is surely a difficult problem and context dependent.
XY problem
Interesting that Americans appear to hold their AI models to a higher standard than their politicians.
Different Americans.
Lots of folks in tech have different opinions than you may expect. Many will either keep quiet or play along to keep the peace/team cohesion, but you really never know if they actually agree deep down.
Their career, livelihoods, ability to support their families, etc. are ultimately on the line, so they'll pay lip service if they have to. Consider it part of the job at that point; personal beliefs are often left at the door.
Not just tech. I spent some time on a cattle ranch (long story) and got to know some people pretty well. Quite a few confided interests and opinions they would never share at work, where the culture also has strong expectations of conformity.
It's a bit of a fancy way to say "yes man". Like in corporations or politics, if a leader surrounds themselves with "yes men".
A synonym would be sycophantic which would be "behaving or done in an obsequious way in order to gain advantage." The connotation is the other party misrepresents their own opinion in order to gain favor or avoid disapproval from someone of a higher status. Like when a subordinate tries to guess what their superior wants to hear instead of providing an unbiased response.
I think that accurately describes my experience with some LLMs due to heavy handed RLHF towards agreeableness.
In fact, I think obsequious is a better word since it doesn't have the cynical connotation of sycophant. LLMs don't have a motive and obsequious describes the behavior without specifying the intent.
The only reason I know the meaning of both words is because they occur in the lyrics of the Motorhead song "Orgasmatron"
Yes, that's the first two words that come to my mind when I read the meaning. The Gen Z word now I think is "simp".
Yeah, it is very close. But I feel simp has a bit of a sexual feel to it. Like a guy who does favors for a girl expecting affection in return, or donates a lot of money to an OnlyFans or Twitch streamer. I also see simp used where we used to call it white-knighting (e.g. "to simp for").
Obsequious is a bit more general. You could imagine applying it to a waiter or valet who is annoyingly helpful. I don't think it would feel right to use the word simp in that case.
In my day we would call it sucking up. A bit before my time (would sound old timey to me) people called it boot licking. In the novel "Catcher in the Rye", the protagonist uses the word "phony" in a similar way. This kind of behavior is universally disliked so there is a lot slang for it.
Thanks, as an old timer TIL about simp.
I wonder if anyone here will know this one; I learned the word "obsequious" over a decade ago while working the line of a restaurant. I used to listen to the 2p2 (2 plus 2) poker podcasts during prep and they had a regular feature with David Sklansky (iirc) giving tips, stories, advice etc. This particular one he simply gave the word "obsequious" and defined it later. I remember my sous chef and I were debating what it could mean and I guessed it right. I still can't remember what it had to do with poker, but that's besides the point.
Maybe I can locate it
I didn't hear that one but I am a fan of Sklansky. And I also have a very vivid memory of learning the word, when I first heard the song Turn Around by They Might Be Giants. The connection with the song burned it into my memory.
I think here it's referring to a common problem where the AI agrees with your position too easily, and/or changes it's answer if you tell it the answer is wrong instantly (therefore providing no stable true answer if you asked it something about a fact)?
Also the slightly over cheery tone maybe.
I like to do this with Claude. It takes 5 back & forths to get an uncertain answer.
Is there a way to tackle this?
Using Claude code and Codex CLI and then Aider with Gemini 2.5 pro, Aider is much faster because you feed in the files instead of using tools to start doing all kinds of whole know what spending 10x the tokens. I tried a relatively simple refactor which needed around 7 files changed, only Aider with 2.5 got it and in the first shot. Where as both Codex and Claude code completely fumbled it
I was a big fan of that model but it has been replaced in AI Studio by its preview version, which, by comparison, is pretty bad. I hope Google makes the release version much closer to the experimental one.
I can confirm the model name in Run Settings has been updated to "Gemini 2.5 Pro Preview ..." when it used to be "Gemini 2.5 Pro (Experimental) ...".
I cannot confirm if the quality is downgraded since I haven't had enough time with it. But if what you are saying is correct, I would be very sad. My big fear is the full-fat Gemini 2.5 Pro will be prohibitively expensive, but a dumbed down model (for the sake of cost) would also be saddening.
The AI Studio product lead said on Twitter that it is exactly the same model just renamed for clarity when pricing was announced
The preview version is exactly the same as the experimental one afaik
This comment is exactly my experience, I feel like as if I had wrote it myself.
My work doesn't have access to 2.5 pro and all these posts are just making me want it so much more.
I hate how slow things are sometimes.
Can’t you just go into aistudio with any free gmail account?
For many workplaces, it's not just that that don't pay for a service, it's that using it is against policy. If I tried to paste some code into ChatGPT, for example, our data loss prevention spyware would block it and I'd soon be having an uncomfortable conversation with our security team.
(We do have access to GitHub Copilot)
Good news then, your GitHub admins can enable Gemini for you without issue.
“Without issue” is an optimistic perspective on how this works in many organisations.
[dead]
The 1 million token context window also means you can just copy/paste so much source code or log output.
Have you tried Grok 3? It's a bit verbose for my taste even when prompted to be brief but answers seem better/more researched and less opinionated. It's also more willing to answer questions where the other models block an answer.
I have not tried any of the Grok models but that is probably because I am rarely on X.
I have to admit I have a bias where I think Google is "business" while Grok is for lols. But I should probably take the time to asses it since I would prefer to have an opinion based on experience rather than vibes.
A lot of people don't want to patronize the businesses of an unabashed Nazi sympathizer. There are more important things in life than model output quality.
I had a very interesting long debate/discussion with Gemini 2.5 Pro about the Synapse-Evolve bank debacle among other things. It really feels like debating a very knowledgeable and smart human.
You didn't have a debate, you just researched a question.
All right Mr. Pednatic. Very complex linear algebra created a very convincing illusion of a debate. You happy now?
But good LLMs will take a position and push back at your arguments.
One mans debate is another mans research.
Indeed, but a research isn't necessarily a debate. In this case, it was not.
> 100% of my casual AI usage is now in Gemini and I look forward to asking it questions on deep topics because it consistently provides me with insight.
It's probably great for lots of things but it doesn't seem very good for recent news. I asked it about recent accusations around xAI and methane gas turbines and it had no clue what I was talking about. I asked the same question to Grok and it gave me all sorts of details.
This was my experience as well.
Gemini performing the best on coding tasks, while giving underwhelming responses on recent news.
While Grok was OK for coding tasks, but being linked to X, provided best response on recent events.
>It's probably great for lots of things but it doesn't seem very good for recent news.
You are missing the point here. The LLM is just the “reasoning engine” for agents now. Its corpus of facts are meaningless, and shouldn’t really be relied upon for anything. But in conjunction with a tool calling agentic process, with access to the web, what you described is now trivially doable. Single shot LLM usage is not really anything anyone should be doing anymore.
> You are missing the point here.
I'm just discussing the GP's topic of casual use. Casual use implies heading over to an already-hosted prompt and typing in questions. Implementing my own 'agentic process' does not sound very casual to me.
> Implementing my own 'agentic process' does not sound very casual to me.
It really is though. This can be as simple as using Claude desktop with a web search tool.
That’s all fine and dandy, but if you google anything related to llm agents, you get 1000 answers to 100 questions, companies hawking their new “visual programming” agent composers, and a ton of videos of douchebags trying to be the Steve Jobs of AI. The concept I’m sure is fine, but execution of agentic anything is still the Wild Wild West and nobody knows what they’re really doing.
Indeed there is a mountain of snake oil out there at this point, but the underlying concepts are extremely simple, and can be implemented directly without frameworks.
I generally point people to Anthropic's seminal blog post on the topic: https://www.anthropic.com/engineering/building-effective-age...
Same here! It is borderline stubborn at times and I need to prove it wrong. Still, it is the best model to use with Cursor, in my experience.
Why is it free / so cheap (I seem to be getting charged a few cents a day using it with aider so not free but still crazy cheap compared to sonnet)
we know how Google makes money
Give it a few months and it will ignore all your questions and just ask if you’ve watched Rampart.
To be fair, Google do have a cost advantage here as they've built their own hardware.
It's a big deal, but not in the way that you think. A race to the bottom is humanities best defense against fast takeoff.
I've had many disappointing results with gemini 2.5 pro. For general queries possibly involving search, chatgpt and grok work better for me.
For code, gemini is very buggy in cursor, so I use Claude 3.7. But it might be partly cursor's fault.
One difference, and imho that’s a big difference — you can’t use any of the Google’s chatbots/models without being logged in, unlike chatgpt.
obsequious is such a nice word for this context, only possible in the AI age.
i'd find the same word improper to describe human beings - other words like plaintive, obedient and compliant often do the job better and are less obscure.
here it feels like a word whose time has come.
Yeah, my wife pays for ChatGPT, but Gemini is fine enough for me.
Just be aware that if you don't add a key (and set up billing) youre granting Google the right to train on your data. To have persons read them and decide how to use them for training.
> To have persons read them and decide how to use them for training.
Not that I have any actual insight. but doesn't it seem more likely that it will not be a human, but a model? Models training models.
> To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account, API key, and Cloud project before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Unpaid Services.
Unless you have the enterprise sub of openAI theyre training on your data too
I thought if you turn off App Activity then that's good enough to protect your data?
Nope, not if you are in the US https://ai.google.dev/gemini-api/terms#data-use-unpaid
An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API.
My llm-gemini plugin supports that: https://github.com/simonw/llm-gemini
I ran that just now and got this: https://gist.github.com/simonw/cb431005c0e0535343d6977a7c470...They don't charge anything extra for code execution, you just pay for input and output tokens. The above example used 10 input, 1,531 output which is $0.15/million for input and $3.50/million output for Gemini 2.5 Flash with thinking enabled, so 0.536 cents (just over half a cent) for this prompt.
See a example full in a few commands using uv think "wow I bet that Simon guy from twitter would love this" ... it's already him.
> An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API.
Could you elaborate? I thought function calling is a common feature among models from different providers
The Gemini API runs the Python code for you as part of your single API call, without you having to handle the tool call request yourself.
This is so much cheaper than re-prompting each tool use.
I wish this was extended to things like: you could give the model an API endpoint that it can call to execute JS code, and the only requirement is that your API has to respond within 5 seconds (maybe less actually).
I wonder if this is what OpenAI is planning to do in the upcoming API update to support tools in o3.
I imagine there wouldn’t bd much of a cost to the provider on the API call there so much longer times may be possible. It’s not like this would hold up the LLM in any way, execution would get suspended while the call is made and the TPU/GPU will serve another request.
They need to keep KV cache to avoid prompt reprocessing, so they would need to move it to ram/nvme during longer api calls to use gpu for another request
This common feature requires the user of the API to implement the tool, in this case, the user is responsible to run the code the API outputs. The post you replied suggests that Gemini will run the code for the user behind the API call.
That was how I read it as well, as if it had a built-in lambda type service in the cloud.
If we're just talking about some API support to call python scripts, that's pretty basic to wire up with any model that supports tool use.
I wish Gemini could do this with Go. It generates plenty of junk/non-parseable code and I have to feed it the error messages and hope it properly corrects it.
Gemini flash models have the least hype, but in my experience in production have the best bang for the buck and multimodal tooling.
Google is silently winning the AI race.
100% agree. I had Gemini flash 2 chew through thousands of points of nasty unstructured client data and it did a 'better than human intern' level conversion into clean structured output for about $30 of API usage. I am sold. 2.5 pro experimental is a different league though for coding. I'm leveraging it for massive refactoring now and it is almost magical.
> thousands of points of nasty unstructured client data
What I always wonder in these kinds of cases is: What makes you confident the AI actually did a good job since presumably you haven't looked at the thousands of client data yourself?
For all you know it made up 50% of the result.
This was solved a hundred years ago.
It's the same problem factories have: they produce a lot of parts, and it's very expensive to put a full operator or more on a machine to do 100% part inspection. And the machines aren't perfect, so we can't just trust that they work.
So starting in the 1920s Walter Shewhart and Edward Deming came up with Statistical Process Control. We accept the quality of the product produced based on the variance we see of samples, and how they measure against upper and lower control limits.
Based on that, we can estimate a "good parts rate" (which later got used in ideas like Six Sigma to describe the probability of bad parts being passed).
The software industry was built on determinism, but now software engineers will need to learn the statistical methods created by engineers who have forever lived in the stochastic world of making physical products.
I hope you're being sarcastic. SPC is necessary because mechanical parts have physical tolerances and manufacturing processes are affected by unavoidable statistical variations; it is beyond idiotic to be provided with a machine that can execute deterministic, repeatable processes and then throw that all into the gutter for mere convenience, justifying that simply because "the time is ripe for SWE to learn statistics"
We don't know how to implement a "deterministic, repeatable process" that can look at a bug in a repo and implement a fix end-to-end.
that is not what OP was talking about though.
LLMs are literally stochastic, so the point is the same no matter what the example application is.
Humans are literally stochastic, so the point is the same no matter what the example application is.
The deterministic, repeatable process of human (and now machine) judgement and semantic processing?
In my case I had hundreds of invoices in a not-very-consistent PDF format which I had contemporaneously tracked in spreadsheets. After data extraction (pdftotext + OpenAI API), I cross-checked against the spreadsheets, and for any discrepancies I reviewed the original PDFs and old bank statements.
The main issue I had was it was surprisingly hard to get the model to consistently strip commas from dollar values, which broke the csv output I asked for. I gave up on prompt engineering it to perfection, and just looped around it with a regex check.
Otherwise, accuracy was extremely good and it surfaced a few errors in my spreadsheets over the years.
I hope there is a future where csv comma's don't screw up data. I know it will never happen but it's a nightmare.
Everyone has a story of a csv formatting nightmare
For what it's worth, I did check over many hundreds of them. Formatted things for side by side comparison and ordered by some heuristics of data nastiness.
It wasn't a one shot deal at all. I found the ambiguous modalities in the data and hand corrected examples to include in the prompt. After about 10 corrections and some exposition about the cases it seemed to misundestand, it got really good. Edit: not too different from a feedback loop with an intern ;)
Though the same logic can be applied to everywhere, right? Even if it's done by human interns, you need to audit everything to be 100% confident or just have some trust on them.
Not the same logic because interns can make meaning out of the data - that’s built-in error correction.
They also remember what they did - if you spot one misunderstanding, there’s a chance they’ll be able to check all similar scenarios.
Comparing the mechanics of an LLM to human intelligence shows deep misunderstanding of one, the other, or both - if done in good faith of course.
Not sure why you're trying to conflate intellectual capability problems into this and complicate the argument? The problem layout is the same. You delegate the works to someone so you cannot understand all the details. This makes a fundamental tension between trust and confidence. Their parameters might be different due to intellectual capability, but whoever you're going to delegate, you cannot evade this trade-off.
BTW, not sure if you have experiences of delegating some works to human interns or new grads and being rewarded by disastrous results? I've done that multiple times and don't trust anyone too much. This is why we typically develop review processes, guardrails etc etc.
> not sure if you have experiences of delegating some works to human interns or new grads and being rewarded by disastrous results?
Oh yes I have ;)
Which is why I always explain the why behind the task.
You can use AI to verify its own work. Last time I split a C++ header file into header + implementation file. I noticed some code got rewritten in a wrong manner, so I asked it to compare the new implementation file against the original header file, but to do so one method at a time. For each method, say whether the code is exactly the same and has the same behavior, ignoring superficial syntax changes and renames. Took me a few times to get the prompt right, though.
Many types of data have very easily checkable aggregates. Think accounting books.
It also depends on what you are using the data for, if it's for non (precise) data based decisions then it's fine. Specially if you looking for "vibe" based decisions before then dedicating time to "actually" process the data for confirmation.
30$ to get an view into data that would take at least x many hours of someone's time is actually super cheap, specially if the decision of that result is then to invest or not invest the x many hours to confirm it.
You take a sample and check
In my professional opinion they can extract data at 85-95% accuracy.
> I'm leveraging it for massive refactoring now and it is almost magical.
Can you share more about your strategy for "massive refactoring" with Gemini?
Like the steps in general for processing your codebase, and even your main goals for the refactoring.
Isn't it better to get gemini to create a tool to format the data? Or was it in such a state that that would have been impossible?
what tool are you using 2.5-pro-exp through? Cline? Or the browser directly?
For 2.5 pro exp I've been attaching files into AIStudio in the browser in some cases. In others, I have been using vscode's Gemini Code Assist which I believe recently started using 2.5 Pro. Though at one point I noticed that it was acting noticeably dumber, and over in the corner, sure enough it warned that it had reverted to 2.0 due to heavy traffic.
For the bulk data processing I just used the python API and Jupyter notebooks to build things out, since it was a one-time effort.
Copilot experimental (need VSCode Insiders) has it. I‘ve thought about trying aider —-watch-files though, also works with multiple files.
Absolutely agree. Granted, it is task dependent. But when it comes to classification and attribute extraction, I've been using 2.0 Flash with huge access across massive datasets. It would not be even viable cost wise with other models.
How "huge" are these datasets? Did you build your own tooling to accomplish this?
It's cheap but also lazy. It sometimes generates empty strings or empty arrays for tool calls, and then I just re-route the request to a stronger model for the tool call.
I've spent a lot of time on prompts and tool-calls to get Flash models to reason and execute well. When I give the same context to stronger models like 4o or Gemini 2.5 Pro, it's able to get to the same answers in less steps but at higher token cost.
Which is to be expected: more guardrails for smaller, weaker models. But then it's a tradeoff; no easy way to pick which models to use.
Instead of SQL optimization, it's now model optimization.
i have a high volume task i wrote an eval for and was pleasantly surprised at 2.0 flash's cost to value ratio especially compared to gpt4.1-mini/nano
accuracy | input price | output price
Gemini Flash 2.0 Lite: 67% | $0.075 | $0.30
Gemini Flash 2.0: 93% | $0.10 | $0.40
GPT-4.1-mini: 93% | $0.40 | $1.60
GPT-4.1-nano: 43% | $0.10 | $0.40
excited to to try out 2.5 flash
Can I ask a serious question. What task are you writing where its ok to get 7% error rate. I can't get my head around how this can be used.
There are tons of AI/ML use-cases where 7% is acceptable.
Historically speaking, if you had a 15% word error rate in speech recognition, it would generally be considered useful. 7% would be performing well, and <5% would be near the top of the market.
Typically, your error rate just needs to be below the usefulness threshold and in many cases the cost of errors is pretty small.
In my case, I have workloads like this where it’s possible to verify the correctness of the result after inference, so any success rate is better than 0 as it’s possible to identify the “good ones”.
Aren’t you basically just saying you are able to measure the error rate? I mean that’s good, but already a given in this scenario where hes reporting the 7% error rate.
No. If you're able to verify correctness of individual items of work, you can accept the 93% of verified items as-is and send the remaining 7% to some more expensive slow path.
That's very different from just knowing the aggregate error rate.
No, it's anything that's harder to write than verify. A simple example is a logic puzzle; it's hard to come up with a solution, but once you have a possible answer it's really easy to check it. In fact, it can be easier to vet multiple answers and tell the machine to try again than solve it once manually.
low stakes text classification but it's something that needs to be done and couldnt be done in reasonable time frames or at reasonable price points by humans
I expect some manual correction after the work is done. I actually mentally counted all the times I pressed backspace while writing this paragraph, and it comes down to 45. I'm not counting the next paragraph or changing the number.
Humans make a ton of errors as well. I didn't even notice how many I was making here until I started counting it. AI is super useful to just write get a first draft out, not for the final work.
You could be OCRing a page that includes a summation line, then add up all the numbers and check against the sum.
[flagged]
Yeah, general propaganda and psyops are actually more effective around 12% - 15%, we find it is more accurate to the user base, thus is questioned less for standing out more /s
I know it's a single data point, but yesterday I showed it a diagram of my fairly complex micropython program, (including RP2 specific features, DMA and PIO) and it was able to describe in detail not just the structure of the program, but also exactly what it does and how it does it. This is before seeing a single like of code, just going by boxes and arrows.
The other AIs I have shown the same diagram to, have all struggled to make sense of it.
>”Google is silently winning the AI race.”
It’s not surprising. What was surprising honestly was how they were caught off guard by OpenAI. It feels like in 2022 just about all the big players had a GPT-3 level system in the works internally, but SamA and co. knew they had a winning hand at the time, and just showed their cards first.
True and their first mover advantage still works pretty well. Despite "ChatGPT" being a really uncool name in terms of marketing. People remember it because they were the first to wow them.
How is ChatGPT bad in terms of marketing? It's recognizable and rolls off the tongue in many many many languages.
Gemini is what sucks from a marketing perspective. Generic-ass name.
Generative Pre-trained Transformer is a horrible term to have an acronym for.
Do you think the mass market thinks GPT is an acronym? It's just a name. Currently synonymous with AI.
Ask anyone outside the tech bubble about "Gemini" though. You'll get astrology.
True I guess they treat it just like SMS.
I still think they'd have taken off more if they'd given it a catchy name from the start and made the interface a bit more consumer friendly.
It feels more authentically engineer-coded.
> Google is silently winning the AI race
Yep, I agree! This convinced me: https://news.ycombinator.com/item?id=43661235
Absolutely. So many use cases for it, and it's so cheap/fast/reliable
And stellar OCR performance. Flash 2.0 is cheaper and more accurate than AWS Textract, Google Document AI, etc.
Not only in benchmarks[0], but in my own production usage.
[0] https://getomni.ai/ocr-benchmark
I want to use these almost too cheap to meter models like Flash more, what are some interesting use cases for those?
Google always has been winning the AI race as soon as DeepMind was properly put to use to develop their AI models, instead of the ones that built Bard (Google AI team).
I have to say, I never doubted it would happen. They've been at the forefront of AI and ML for well over a decade. Their scientists were the authors of the "Attention is all you need" paper, among thousands of others. A Google Scholar search produces endless results. There just seemed to be a disconnect between the research and product areas of the company. I think they've got that worked out now.
They're getting their ass kicked in court though, which might be making them much less aggressive than they would be otherwise, or at least quieter about it.
I remember everyone saying its a two horse race between Google and OpenAI, then DeepSeek happened.
Never count out the possibility of a dark horse competitor ripping the sod right out from under
How is deepseak doing though? It seemed like they probably just ingested ChatGPT. https://www.forbes.com/sites/torconstantino/2025/03/03/deeps...
Still impressive but would really put a cap on expectations for them.
Everybody else also trains on ChatGPT data, have you never heard of public ChatGPT conversation data sets? Yes they trained on ChatGPT data. No it's not "just".
They supposedly have a new R2 model coming within a month.
[dead]
The API is free, and it's great for everyday tasks. So yes there is no better bang for the buck.
Wait, the API is free? I thought you had to use their web interface for it to be free. How do you use the API for free?
You can get an API key and they don't bill you. Free tier rate limits for some models (even decent ones like Gemini 2.0 Flash) are quite high.
https://ai.google.dev/gemini-api/docs/pricing
https://ai.google.dev/gemini-api/docs/rate-limits#free-tier
The rate limits I've encountered with free api keys has been way lower than the limits advertised.
I agree. I found it unusable for anything but casual usage due to the rate limiting. I wonder if I am just missing something?
I think it's the small TPM limits. I'll be way under the 10-30 requests per minute while using Cline, but it appears that the input tokens count towards the rate limit so I'll find myself limited to one message a minute if I let the conversation go on for too long, ironically due to Gemini's long context window. AFAIK Cline doesn't currently offer an option to limit the context explosion to lower than model capacity.
I'm pretty sure that's a google maps' level of free where once in control they will massively bill it
There is no reason to expect the other entrants in the market to drop out and give them monopoly power. The paid tier is also among the cheapest. People say it’s because they built their own their inference hardware and are genuinely able to serve it cheaper.
create an api key and dont set up billing. pretty low rate limits and they use your data
I use Gemini 2.5 pro experimental via openrouter in my openwebui for free. Was using sonnet 3.7 but I don't notice much difference so just default to the free thing now.
using aistudio.google.com
Flash models are really good even for an end user because how fast and good performance they have.
Shhhh. You're going to give away the secret weapon!
> Google is silently winning the AI race.
It’s not clear to me what either the “race” or “winning” is.
I use ChatGPT for 99% of my personal and professional use. I’ve just gotten used to the interface and quirks. It’s a good consumer product that I like to pay $20/month for and use. My work doesn’t require much in the way of monthly tokens but I just pay for the OpenAI API and use that.
Is that winning? Becoming the de facto “AI” tool for consumers?
Or is the race to become what’s used by developers inside of apps and software?
The race isn’t to have the best model (I don’t think) because it seems like the 3rd best model is very very good for many people’s uses.
> Google is silently winning the AI race.
That is what we keep hearing here...The last Gemini I cancelled the account, and can't help notice the new one they are offering for free...
Sorry I was talking of B2B APIs for my YC startup. Gemini is still far behind for consumers indeed.
I use Gemini almost exclusively as a normal user. What am I missing out on that they are far behind on?
It seems shockingly good and I've watched it get much better up to 2.5 Pro.
Mostly brand recognition and the earlier Geminis had more refusals.
As a consumer, I also really miss the Advanced voice mode of ChatGPT, which is the most transformative tech in my daily life. It's the only frontier model with true audio-to-audio.
> and the earlier Geminis had more refusals.
Its more so that almost every company is running a classifier on their web chat's output.
It isn't actually the model refusing, but rather if the classifier hits a threshold, it'll swap the model's out with "Sorry, let's talk about something else."
This is most apparent with DeepSeek. If you use their web chat with V3 and then jailbreak it, you'll get uncensored output but it is then swapped with "Let's talk about something else" halfway through the output. And if you ask the model, it has no idea its previous output got swapped and you can even ask it build on its previous answer. But if you use the API, you can push it pretty far with a simple jailbreak.
These classifiers are virtually always ran on a separate track, meaning you cannot jailbreak them.
If you use an API, you only have to deal with the inherent training data bias, neutering by tuning and neutering by pre-prompt. The last two are, depending on the model, fairly trivial to overcome.
I still think the first big AI company that has the guts to say "our LLM is like a pen and brush, what you write or draw with it is on you" and publishes a completely unneutered model will be the one to take a huge slice of marketshare. If I had to bet on anyone doing that, it would be xAI with Grok. And by not neutering it, the model will perform better in SFW tasks too.
> and the earlier Geminis had more refusals.
You can turn off those, Google lets you decide how much it censors you can completely turn it off.
It has separate sliders for sexually explicit, hate, dangerous and harassment. It is by far the best at this, since sometimes you want those refusals/filters.
Have you tried the Gemini Live audio-to-audio in the free Gemini iOS app? I find it feels far more natural than ChatGPT Advanced Voice Mode.
What do you mean miss? You don’t have the budget to keep something you truly miss for $20? What am in missing here / I don’t mean to criticize I am just curious is all. I would reword but I have to go
What is true audio-to-audio in this case?
They used to be, but not anymore, not since Gemini Pro 2.5. Their "deep research" offering is the best available on the market right now, IMO - better than both ChatGPT and Claude.
Sorry, but no. Gemini isn't the fastest horse, yet. And it's use within their ecosystem means it isn't geared to the masses outside of their bubble. They are not leading the race but they are a contender.
In my experience they are as dumb as a bag of bricks. The other day I asked "can you edit a picture if I upload one"
And it replied "sure, here is a picture of a photo editing prompt:"
https://g.co/gemini/share/5e298e7d7613
It's like "baby's first AI". The only good thing about it is that it's free.
> in my experience they are as dumb as a bag of bricks
In my experience, anyone that describes LLMs using terms of actual human intelligence is bound to struggle using the tool.
Sometimes I wonder if these people enjoy feeling "smarter" when the LLM fails to give them what they want.
If those people are a subset of those who demand actual intelligence, they will very often feel frustrated.
Prompt engineering is a thing.
Learning how to "speak llm" will give you great results. There's loads of online resources that will teach you. Think of it like learning a new API.
This was using Gemini on my phone - which both Samsung and Google advertise as "just talk to it".
for now. one would hope that this is a transitory moment in llms and that we can just use intuition in the future.
LLM's whole thing is language. They make great translators and perform all kinds of other language tasks well, but somehow they can't interpret my English language prompts unless I go to school to learn how to speak LLM-flavored English?
WTF?
You have the right perspective. All of these people hand waving away the core issue here don't realize their own biases. Some of the best these things tout as much as 97% accuracy on tasks but if a person was completely randomly wrong at 3% of what they say you'd call an ambulance and no doctor would be able to diagnose their condition (the kinds of errors that people make with brain injuries are a major diagnostic tool and the kinds of errors are known for major types of common injuries ... Conversely there is no way to tell within an LLM system if any specific token is actually correct or not and its incorrectness is not even categorizable.)
I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non engineering person. You have to be more verbose to someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided so they fail hard like a junior engineer would at a complicated task with no experience.
I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non engineering person. You have to be more verbose to someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided so they fail hard like a junior engineer would at a complicated task with no experience.
It's a natural language processor, yes. It's not AGI. It has numerous limitations that have to be recognized and worked around to make use of it. Doesn't mean that it's not useful, though.
They are not humans - so yeah I can totally see having to "go to school" to learn how to interact with them.
Its because google hasn't realized the value of training the model on information about its own capabilities and metadata. My biggest pet peeve about google and the way they train these models.
One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generated 2D bounding boxes of relevant subjects, but it can also create segmentation masks! https://ai.google.dev/gemini-api/docs/image-understanding#se...
At this price point with the Flash model, creating segmentation masks is pretty nifty.
The segmentation masks are a bit of a galaxy brain implementation by generating a b64 string representing the mask: https://colab.research.google.com/github/google-gemini/cookb...
I am trying to test it in AI Studio but it sometimes errors out, likely because it tries to decode the b64 lol.
This is SO cool. I built an interactive tool for trying this out (bring your own Gemini API key) here: https://tools.simonwillison.net/gemini-mask
More details plus a screenshot of the tool working here: https://simonwillison.net/2025/Apr/18/gemini-image-segmentat...
I vibe coded it using Claude and O3.
The performance is basically so bad it's unusable though, segmentation models and object detection models are still the best, for now.
There is a starter app in AI Studio that demos this: https://aistudio.google.com/apps/bundled/spatial-understandi...
I've had mixed results with the bounding boxes even on 2.5 pro. On complex images where a lot of boxes need to be drawn they're in the general region but miss the exact location of objects.
Wait, did they just kill YOLO, at least for time-insensitive tasks?
No, the speed of YOLO/DETR inference makes it cheap as well - probably at least five or six orders of magnitude cheaper.
It'll be interesting to test this capability and see how it evolves though. At some point you might be able use it as a "teacher" to generate training data for new tasks.YOLO is probably still cheaper if bounding boxes are your main goal. Good segmentation models that work for arbitrary labels, however, are much more expensive to set up and run, so this type of approach could be an interesting alternative depending on performance.
Well no. You can run/host YOLO which means not having to submit potentially sensitive information to a company that generates a large amount of revenue from targeted advertising.
Interestingly if you run this in Gemini (instead of AI Studio) you get:
(Not sure if that's a real or hallucinated error.)For a non programmer like me google is becoming shockingly good. It is giving working code the first time. I was playing around with it asked it to write code to scrape some data of a website to analyse. I was expecting it to write something that would scrape the data and later I would upload the data to it to analyse. But it actually wrote code that scraped and analysed the data. It was basic categorizing and counting of the data but I was not expecting it to do that.
That's the opposite experience of my wife who's in tech but also a non programmer. She wanted to ask Gemini to write code to do some basic data analysis things in a more automated way than Excel. More than once, Gemini wrote a long bash script where some sed invocations are just plain wrong. More than once I've had to debug Gemini-written bash scripts. As a programmer I knew how bash scripts aren't great for readability so I told my wife to ask Gemini to write Python. It resulted in higher code quality, but still contained bugs that are impossible for a non programmer to fix. Sometimes asking a follow up about the bugs would cause Gemini to fix it, but doing so repeatedly will result in Gemini forgetting what's being asked or simply throwing an internal error.
Currently IMO you have to be a programmer to use Gemini to write programs effectively.
IMO, the only thing that’s consistent about AIs is how inconsistent they are. Sometimes, I ask them to write code and I’m shocked at how well it works. Other times, I feel like I’m trying to explain to a 5-year-old Alzheimer’s patient what I want and it just can’t seem to do the simplest stuff. And it’s the same AI in both cases.
I wouldn’t be surprised if AI tools are frequently throttled in the backend to save on costs, resulting in this type of inconsistency.
The AIs like many things out there work like an "evil genie". They'll give you what you asked for. The problem is typically that users ask for the wrong thing.
I've noticed beginners make mistakes like using singular terms when they should have used plural ("find the bug" vs "find the bugs"), or they fail to specify their preferred platform, language, or approach.
You mentioned your wife is using Excel, which is primarily used on Windows desktops and/or with the Microsoft ecosystem of products such as Power BI, PowerShell, Azure, SQL Server, etc...
Yet you mention she got a bash script using sed, both of which are from the Linux / GNU ecosystem. That implies that your wife didn't specify that she wanted a Microsoft-centric solution to her problem!
The correct answer here would have likely to have been to use Microsoft Fabric, which is an entire bag of data analysis and reporting tools that has data pipelines, automation, publishing, etc...
Or... just use the MashUp engine that's built-in to both Excel and PowerBI, which allows a surprisingly complex set of text, semi-structured, and tabular data processing. It can re-run the import and update graphs and charts with the new data.
PS: This is similar to going up to a Node.js programmer with a request. It doesn't matter what it is, they will recommend writing JavaScript to solve the problem. Similarly, a C++ developer will reach for C++ to solve everything they're asked to do. Right now, the AIs strongly prefer Linux, JavaScript, and especially Python for problem solving, because that's the bulk of the open-source code they were trained with.
I had similar experiences few months back that is why I am saying it is becoming shockingly good the 2.5 is a lot better than the 2.0 version. Another thing I have realized just like google search in the past your query has a lot to do with the results you get. So an example of what you want works at getting better results
> I am saying it is becoming shockingly good the 2.5 is a lot better than the 2.0 version
Are you specifically talking about 2.5 Flash? It only came out an hour ago, I dont know how you would have enough experience with it already to come to your conclusion.
(I am very impressed with 2.5 Pro, but that is a different model that's been available for several weeks now)
I am talking about 2.5 Pro
I've found that good prompting isn't just about asking for results but also giving hints/advice/direction on how to go about the work.
I suspect that if Gemini is giving you bash scripts it's because you're note giving it enough direction. As you pointed out, telling it to use Python, or giving it more expectations about how to go about the work or how the output should be, will give better results.
When I am prompting for technical or data-driven work, I tend to almost walk through what I imagine the process would be, including steps, tools, etc...
It must have something to do with the way your wife is prompting. I've noticed this with my friends too. I usually get working code from Gemini 2.5 Pro on the first try, and with a couple of follow-up prompts, it often improves significantly, while my friends seem to struggle communicating their ideas to the AI and get worse results.
Good news: Prompting is a skill you can develop.
Is there a website with off the shelf prompts that work?
Or we can just learn to write it ourselves in the same amount of time /shrug
If you're going to need scripts like that every week - sure. If you need it once a year on average... not likely. There's a huge amount of things we could learn but do them so infrequently that we outsource it to other people.
Right.
This is one case where I've found writing code with LLMs to be effective.
With some unfamiliar tool I don't care about too much (e.g. GitHub Actions YAML or some build script), I just want it to work, & then focus on other things.
I can spend time to try and come up with something that works; something that's robust & idiomatic.. but, likely I won't be able to re-use that knowledge before I forget it.
With an LLM, I'll likely get just as good a result; or if not, will have a good starting point to go from.
You can't.
Not with that attitude.
Which Gemini was it? I've been using 2.5 Flash all day for programming ClojureScript via roo code and it's been great. Provided I'm using agent orchestration, a memory bank, and having it write docs for code it will work on.
Ask it to write tests with the code and then ask it to fix the errors from the tests rather than just pointing out bugs. If you have an IDE that supports tool use (Claude Code, Roo Code) it can automate this process.
Let's hope that's the case for a while.
I want to be able to just tell chat GPT or whatever to create a full project for me, but I know the moment it can do that without any human intervention, I won't be able to find a job.
There is definitely an art to doing it, but the ability is definitely there even if you don't know the language at all.
I have a few programs now that are written in Python (2 by 3.7, one by 2.5) used for business daily, and I can tell you I didn't, and frankly couldn't, check a single line of code. One of them is ~500 LOC, the other two are 2200-2700 LOC.
I've been continually disappointed. I've been told it's getting exponentially better and we won't be able to keep up with how good they get, but I'm not convinced. I'm using them every single day and I'm never shocked or awed by its competence, but instead continually vexxed that isn't not living up to the hype I keep reading.
Case in point: there was a post here recently about implementing a JS algorithm that highlighted headings as you scrolled (side note: can anyone remember what the title was? I can't find it again), but I wanted to test the LLM for that kind of task.
Pretty much no matter what I did, I couldn't get it to give me a solution that would highlight all of the titles down to the very last one.
I knew what the problem was, but even guiding the AI, it couldn't fix the code. I tried multiple AIs, different strategies. The best I could come up with was to guide it step by step on how to fix the code. Even telling it exactly what the problem was, it couldn't fix it.
So this goes out to the "you're prompting it wrong" crowd... Can you show me a prompt or a conversation that will get an AI to spit out working code for this task: JavaScript that will highlighting headings as you scroll, to the very last one. The challenge is to prompt it to do this without telling it how to implement it.
I figure this should be easy for the AI because this kind of thing is very standard, but maybe I'm just holding it wrong?
Even as a human programmer I don't actually understand your description of the problem well enough to be confident I could correctly guess your intent.
What do you mean by "highlight as you scroll"? I guess you want a single heading highlighted at a time, and it should be somehow depending on the viewport. But even that is ambiguous. Do you want the topmost heading in the viewport? The bottom most? Depending on scroll direction?
This is what I got one-shot from Gemini 2.5 Pro, with my best guess at what you meant: https://gemini.google.com/share/d81c90ab0b9f
It seems pretty good. Handles scrolling via all possible ways, does the highlighting at load too so that the highlighting is in effect for the initial viewport too.
The prompt was "write me some javascript that higlights the topmost heading (h1, h2, etc) in the viewport as the document is scrolled in any way".
So I'm thinking your actual requirements are very different than what you actually wrote. That might explain why you did not have much luck with any LLMs.
> Even as a human programmer I don't actually understand your description of the problem well enough to be confident I could correctly guess your intent.
Yeah, you understand what I meant. The code Gemini gave you implements the behavior, and the AI I used gave me pretty much the same thing. There's a problem with the algorithm tho -- if there's a heading too close to the bottom of the page it will never highlight. The page doesn't exhibit the bug because it provides enough padding at the bottom.
But my point wasn't that it couldn't one-shot the code; my point was that I couldn't interrogate it into giving me code that behaved as I wanted. It seemed too anchored to the solution it had provided me, where it said it was offering fixes that didn't do anything, and when I pointed that out it apologized and proceeded to lie about fixing the code again. It appeared to be an infinite loop.
I think what's happened here is the opposite of what you suggest; this is a very common tutorial problem, you can find solutions of the variety you showed me all over the internet, and that's essentially what Gemini gave you. But being tutorial code, it's very basic and tries not to implement a more robust solution that is needed in production websites. When I asked AI for that extra robustness, it didn't want to stray too far from the template, and the bug persisted.
Maybe you can coax it into getting a better result? I want to understand how.
I clearly didn't understand what you meant, because you did in fact have additional unstated requirements that I could not even have imagined existed and were not in any way hinted at by your initial spec.
And I still don't know what you want! Like, you want some kind of special case where the last heading is handled differently. But what kind of special case? You didn't specify. "It's wrong, fix it".
Fix it how? When the page is scrolled all the way to the bottom, should the last heading always be highlighted? That would just move the complaint to the second heading from the bottom if three headings fit on the last screen. Add padding? Can't be that, since it's exactly what this solution already did and you thought it wasn't good enough.
Sorry, I will not be playing another round of this. I don't know if you don't realize how inadequate your specifications are (in which case that's your problem with the LLMs too), or if this is some kind of a bit, but either way it doesn't feel like a good use of my time.
But if your problem is that the LLMs give a bad initial answer, get anchored on it, and can't iterate, just give all the requirements up front. If they're requirements you didn't realize existed until you saw the proposed answer, just start again with a fresh context. That tends to work well for me in Aider.
> I clearly didn't understand what you meant, because you did in fact have additional unstated requirements
Okay, but the AI understood the requirements; It confirmed to me what I intended it to do, but it couldn't produce code that met its textual descriptions. It kept returning the tutorial code.
> You didn't specify. "It's wrong, fix it". Fix it how?
Maybe I wasn't clear here but I'm not replying as if I'm prompting you like an AI. The problem domain is described better in the link in the sibling comment. When the AI gave back the initial code, I had inquired the following:
It replied: So it identified the edge case, it identified that the behavior is incorrect and what the cause of that is, and it returned code that purportedly fixed this. But the code it returned exhibited exactly the behavior it said "feels wrong". And in interrogating it. I asked it what was broken and we went line by line: We went line by line and it told me what exactly was wrong and why it's fixed, and confirmed that the provided code produced the expected behavior. But the code doesn't do this. It continued on like this where it proposed fixes, talked about the solution correctly, but wouldn't give code that implemented the solution.> But if your problem is that the LLMs give a bad initial answer, get anchored on it, and can't iterate, just give all the requirements up front. If they're requirements you didn't realize existed until you saw the proposed answer, just start again with a fresh context. That tends to work well for me in Aider.
Yeah that's what I tend to do as well. I don't tend to get good satisfying results though, to the point where coding it myself seems like the faster more reliable option. I'll keep trying to hold it better and maybe one day it'll work for me. Until then I'm a skeptic.
"Overengineered anchor links": https://news.ycombinator.com/item?id=43570324
Thank you!!
Last time I tried Gemini, it messed with my google photo data plan and family sharing. I wish I could try the AI separate from my Google account.
> I wish I could try the AI separate from my Google account.
If that's a concern, just create another account. Doesn't even require using a separate browser profile, you can be logged into multiple accounts at once and use the account picker in the top right of most their apps to switch.
50% price increase from Gemini 2.0 Flash. That sounds like a lot, but Flash is still so cheap when compared to other models of this (or lesser) quality. https://developers.googleblog.com/en/start-building-with-gem...
done pretty much inline with the price elo pareto frontier https://x.com/swyx/status/1912959140743586206/photo/1
So if I see it right flash 2.5 doesn't push the pareto front forward, right? It just sits between 2.5 pro and 2.0 flash.
https://storage.googleapis.com/gweb-developer-goog-blog-asse...
yeah but 1) its useful to have the point there on the curve if you need it, 2) intelligence is multidimensional, maybe in 2.5 flash you get qualitatively a better set of capabilities for your needs than 2.5 pro
It does, that point in the tradeoff space was not available until now. Any model that's not dominated by at least one model on both axes will push forward the frontier. (The actual frontier isn't actually a straight line between the points on the frontier like visualized there. It's a step function.)
Love that chart! Am I imagining that I saw a version of that somewhere that even showed how the boundary has moved out over time?
https://x.com/swyx/status/1882933368444309723
https://x.com/swyx/status/1830866865884991999 (scroll up)
Is this cheaper than DeepSeek? Am I reading this right?
Only if you don't use reasoning
Why isn't Phi-3, Llama 3, or Mistral in the comparison?
Aren't there a lot of hosted options? How do they compare in terms of cost?
del
You may want to consult Gemini on those percentage calcs .10 to .15 is not 25%
Genuine naive question: when it comes to Google HN has generally a negative view of it (pick any random story on Chrome, ads, search, web, working at faang, etc. and this should be obvious from the comments), yet when it comes to AI there is a somewhat notable “cheering effect” for Google to win the AI race that goes beyond a conventional appreciation of a healthy competitive landscape, which may appear as a bit of a double standard.
Why is this? Is it because OpenAI is seen as such a negative player in this ecosystem that Google “gets a pass on this one”?
And bonus question: what do people think will happen to OpenAI if Google wins the race? Do you think they’ll literally just go bust?
Most of us weren’t using Gemini pro models (1.0, 1.5, 2.0) but the recent 2.5 pro is such a huge step up. It’s better than 3.7 sonnet for coding. Better than o1, o3-mini models and now o3 and o4-mini. It’s become my daily driver. It does everything I need with almost 100% accuracy, is cheap, fast, 1 million context window, uses google web search for grounding, can fetch YouTube video transcripts, can fetch website content, works in google workspace: Gmail, Docs, Sheets. Really hard to beat this combo. Oh and if you subscribe to their AI plan it comes with 2 TB drive storage.
Maybe because Google is largely responsible, paying for the research, of most of the results we are seeing now. I'm not a Google fan, in the web side, and in their idea of what software engineering is, but they deserve to win the AI race, because right now all the other players provided a lot less than what Google did as public research. Also, with Gemini 2.5 PRO, there was a big hype moment, because the model is of unseen ability.
Maybe they deserve it but it would be really bad for the world. Because they will enshittify the hell out of it once they're established. That's their MO.
I don't want Google to have a stranglehold over yet another type of online service. So I avoid them.
And things are going so fast now, whatever Google has today that might be better than the rest, in two months the rest will have it too. Of course Google will have something new again. But being 2 months behind isn't a huge deal. I don't have to have the 'winning' product. In fact most of my AI tasks go to an 8b llama 3.1 model. It's about on par with gpt 3.5 but that's fine.
The situation with LLMs is much different than search, Google doesn't have such a large lead here. LLMs are social things, they learn from each other, any provider with SOTA model will see its abilities leaked through synthetic training data. That's what GPT-4 did for a year, against the wishes of OpenAI, powering up millions of open model finetunes.
Gemini is just that good. From my usage it is much smarter than DeepSeek or Claude 3.7 Thinking models.
A lot of Google’s market share across its services comes from the monopoly effects Google has. The quality of Gemini 2.5 is noticeably smarter than its competitors so I see the applause for the quality of the LLM and not for Google.
I think it’s way too early to say anything about who is winning the race. There is still a long way to go; o3 scores highest in Humanity’s Last Exam (https://agi.safe.ai/) at 20%, 2.5 scores 18%.
As a googler working in LLM space, this feels like revisionist history to me haha! I remember a completely different environment only a few months ago when Anthropic was the darling child, and before that it was OpenAI (and for like 4 weeks somewhere in there, it was Deepseek). For literally years at this point, every time Bard or Gemini would make a major release, it would be largely ignored or put down in favor of the next "big thing" OpenAI was doing or Claude saturating coding benchmarks, never mind that Google was often just behind with the exact same tech ready to go, in some cases only missing their demo release by literally 1 day (remember live voice?). And every time this happened, folks would be posting things to the effect of "LOL I can't believe Google is losing the AI race - didn't they invent this?", "this is like Microsoft dropping the ball on mobile", "Google is getting their lunch eaten by scrappy upstarts," etc. I can't lie, it stings a bit when that's what you work on all day.
2.5 was quite good. Not stupidly good like the jump from GPT 2 to 3 or 3.5 to 4, but really good. It was a big jump in ELO and benchmarks. People like it, and I think it's just psychologically satisfying that the player everybody would have expected to win the AI race is currently in the lead. Gemini finally gets a day in the sun.
I'm sure this will change with whenever somebody comes up with the next big idea though. It probably won't take much to beat Gemini in the long run. There is literally zero moat.
I dislike Google rather strongly due to their ad-based business model, and I was previously very skeptical of their AI offerings because of very lackluster performance compared to OpenAI and Claude. But I can't help but be impressed with Gemini Pro 2.5 for "deep research" and agentic coding. I have subscriptions with all three so that I can keep up with SOTA, but if I had to choose only one to keep, right now it'd be Gemini.
That said I still don't "cheer" for them and I would really rather someone else win the race. But that is orthogonal to recognition of observed objective superiority.
It's been a while since they won something the "old" Google way: by building a superior product that is #1 on its merits.
In that sense Gemini is a throwback: there's no trick - it's objectively better than everything else.
Didn't Google invent the transformer?
I think a lot of us see Google as both an evil advertiser and as an innovator. Google winning AI is sort of nostalgic for those of us who once cheered the "Do No Evil"(now mostly "Do Know Evil") company.
I also like how Google is making quiet progress while other companies take their latest incremental improvement and promote it as hard as they can.
I think for a while some people felt the Google AI models are worse but now its getting much better. On the other hand Google has their own hardware so they can drive down the costs of using the models so it keeps pressure on Open AI do remain cost competitive. Then you have Anthropic which has very good models but is very expensive. But I've heard they are working with Amazon to build a data center with Amazons custom AI chips so maybe they can bring down their costs. In the end all these companies will need a good model and lower cost hardware to succeed.
2.5 Pro is free, and I'm sure there's a lot of people who have just never tried the best models because they don't want to pay for them. So 2.5 Pro probably blows their socks off.
Whereas, if you've been paying for access to the best models from OpenAI and Anthropic all along, 2.5 Pro doesn't feel like such a drastic step-change. But going from free models to 2.5 Pro is a crazy difference. I also think this is why DeepSeek got so much attention so quickly - because it was free.
I am cheering for the old Google to make a comeback and it seems like the AI race has genuinely sparked something positive inside Google.
The key is Gemini being free through AI Studio. This makes their technical improvement more impressive when OpenAI sells their best models at ridiculous prices.
If Google engages in price dumping as a monopolist remains to be seen but it feels like it.
The LLM race is fast paced and no moat has developed. People are switching on a whim if better models (by some margin) show up. When will OpenAI, Anthropic or DeepSeek counter 2.5 Pro? And will it be before Google releases the next Pro?
OpenAI commands a large chunk of the consumer market and they have considerable funds after their last round. They won't fold this or next year.
If Google wants to win this they must come up with a product strategy integrating their search business without seriously damaging their existing search business to much. This is hard.
Because now it has brought real competitions to the field. GPT was the king and Claude had been the only meaningful challenger for a while but OpenAI didn't care about Anthropic but just be obsessed with Google. Gemini took a quite time to set the pipeline so initial version was not enough to push the frontier; you remember the days when Google released a new model, OpenAI just responded with some old models in their silo within a day only to crush them. That does not happen anymore and they're forced to develop a better model.
A lot of the negativity toward Google stems from the fact that they're the big, dominant player in search, ads, browsers, etc., rather than anything that they've done or any particular attribute of the company.
In AI, they're still seen as being behind OpenAI and others, so we don't see the same level of negativity.
I prefer OpenAI and Anthropic big time because they are fresh players with less dominance over other aspects of digital life. Not having to login to an insidious tracker like Google is worth significantly worse performance. Although I have little FOMO here avoiding Gemini because evaluating these models on real world use cases remains quite subjective imo.
More great innovation from Google. OpenAI have two major problems.
The first is Google's vertically integrated chip pipeline and deep supply chain and operational knowledge when it comes to creating AI chips and putting them into production. They have a massive cost advantage at every step. This translates into more free services, cheaper paid services, more capabilities due to more affordable compute, and far more growth.
Second problem is data starvation and the unfair advantage that social media has when it comes to a source of continually refreshed knowledge. Now that the foundational model providers have churned through the common crawl and are competing to consume things like video and whatever is left, new data is becoming increasingly valuable as a differentiator, and more importantly, as a provider of sustained value for years to come.
SamA has signaled both of these problems when he made noises about building a fab a while back and is more recently making noises about launching a social media platform off OpenAI. The smart money among his investors know these issues to be fundamental in deciding if OAI will succeed or not, and are asking the hard questions.
If the only answer for both is "we'll build it from scratch", OpenAI is in very big trouble. And it seems that that is the best answer that SamA can come up with. I continue to believe that OpenAI will be the Netscape of the AI revolution.
The win is Google's for the taking, if they can get out of their own way.
Nobody has really talked about what I think is an advantage just as powerful as the custom chips: Google Books. They already won a landmark fair use lawsuit against book publishers, digitized more books than anyone on earth, and used their Captcha service to crowdsource its OCR. They've got the best* legal cover and all of the best sources of human knowledge already there. Then Youtube for video.
The chips of course push them over the top. I don't know how much Deep Research is costing them but it's by far the best experience with AI I've had so far with a generous 20/day rate limit. At this point I must be using up at least 5-10 compute hours a day. Until about a week ago I had almost completely written off Google.
* For what it's worth, I don't know. IANAL
The amount of text in books is surprisingly finite. My best estimate was that there are ~10¹³ tokens available in all books (https://dynomight.net/scaling/#scaling-data), which is less than frontier models are already being trained on. On the other hand, book tokens are probably much "better" than random internet tokens. Wikipedia for example seems to get much higher weight than other sources, and it's only ~3×10¹⁰ tokens.
We need more books! On it…
> And further, by these, my son, be admonished: of making many books there is no end; and much study is a weariness of the flesh.
Ecclesiastes 12:12 ;)
opens up his favorite chat
LibGen already exists, and all the top LLM publishers use it. I don't know if Google's own book index provides a big technical or legal advantage.
I'd be very surprised if the Google books index wasn't much bigger and more diverse than libgen.
Anna's Archive is at 43M Books and 98M Papers [1]. The book total is nearly double what Google has.
Google's scanning project basically stalled after the legal battle. It's a very fascinating read [2].
[1] https://annas-archive.org/
[2] https://web.archive.org/web/20170719004247/https://www.theat...
Something that is not specifically called out but is also super relevant is actually the transcription of YouTube videos.
Every video is machine transcribed and stored and then for larger videos the author will often transcribed them themselves.
This is something they have already, it doesn't need any more "work" to get it vs a competitor.
I would think the biggest advantage is YouTube. There's a lot of modern content for analysis that's uncontaminated by LLMs.
Google has the data and has the hardware, not to mention software and infrastructure talent. Once this Bismarck turns around and it looks like it is, who can parry it for real? They have internet.zip and all the previous versions as well, they have youtube, email, search, books, traffic, maps and business on it, phones and habits around it, even the OG social network, the usenet. It's a sleeping giant starting to wake up and it's already causing commotion, let's see what it does when it drinks morning coffee.
Agreed. One of Google's big advantages is the data access and integrations. They are also positioned really well for the "AI as entertainment" sector with youtube which will be huge (imo). They also have the knowledge in adtech and well injecting adds into AI is an obvious play. As is harvesting AI chat data.
Meta and Google are the long term players to watch as Meta also has similar access (Insta, FB, WhatsApp).
On-demand GenAI could definitely change the meaning of "You" in "Youtube".
They have the Excel spreadsheets of all startups and businesses of the world (well 50/50 with Microsoft).
And Atlassian has all the project data.
More like 5/95 with Microsoft - and that's being generous, I wouldn't be surprised if it was 1/99. It's basicaly just hip tech companies and a couple of Fortune 500s that use Google Docs. And even their finance departments often use Excel. HN keeps underestimating how the whole physical world runs on Excel.
I still can't understand how google missed on github, especially since they were in the same space before with google code. I do understand how they couldn't make a github though.
Another advantage that Google has is the deep integration of Gemini into Google Office products and Gmail. I was part of a pilot group and got to use a pre-release version and it's really powerful and not something that will be easy for OpenAI to match.
Agreed. Once they dial in the training for sheets it's going to be incredible. I'm already using notebooklm to upload finance PDFs, then having it generate tabular data and copypasta into sheets, but it's a garage solution compared to just telling it to create or update a sheet with parsed data from other sheets, PDFs, docs, etc.
And as far as gmail goes, I periodically try to ask it to unsubscribe from everything marketing related, and not from my own company, but it's not even close to being there. I think there will continue to be a gap in the market for more aggressive email integration with AI, given how useless email has become. I know A16Z has invested in a startup working on this. I doubt Gmail will integrate as deep as is possible, so the opportunity will remain.
I frankly am in doubt of future office products. In the last month I have ditched two separate excel productivity templates in favor of bespoke wrappers on sqlite databases, written by Claude and Gemini. Easier to use and probably 10x as fast.
You don't need a 50 function swiss army knife when your pocket can just generate the exact tool you need.
You say deep integration, yet there is still no way to send a Gemini Canvas to Docs without a lot of tedious copy-pasting and formatting because Docs still doesn’t actually support markdown. Gemini in Google Office in general has been a massive disappointment for all but the most simplistic of writing tasks.
They can have the most advanced infrastructure in the world, but it doesn’t mean much if Google continues its infamous floundering approach to product. But hey, 2.5 pro with Cline is pretty nice.
Maybe I'm misunderstanding, but there is literally a Share button in Canvas right below each response with the option to export to Docs. Within Docs, you can also click on the Gemini "star" at the upper right to get a prompt and then also export into the open document. Note that this is a with "experimental" Gemini 2.5 Pro.
Docs supports markdown in comments, where it's the only way to get formatting.
I love Googles product dysfunction sometimes :/
I have access to this now and I want it to work so bad and it's just proper shit. Absolute rubbish.
They really, truly need to fix this integration. Gemini in Google Docs is barely acceptable, it doesn't work at all (for me) in Gmail, and I've not yet had it do anything other than error in Google Sheets.
[dead]
If the battle was between Altman and Pichai I'd have my doubts.
But the battle is between Altman and Hassabis.
I recall some advice on investment from Buffett regarding how he invests in the management team.
Sorry but my eyes rolled to the back of my head with this one. This is between two teams with tons of smart contributors, but the difference is one is more flexible and able to take risks vs the other that has many times more researchers and the world's best and most mature infrastructure/tooling. Its not a CEO vs CEO battle
I think it requires a nuanced take but allow me to provide some counter-examples.
The first is CEO pay rates. Another is the highest paid public employees (which tend to be coaches at state schools). This is evidence that the market highly values managers.
Another is systemic failures within enterprises. When Boeing had a few very public plane crashes, a certain narrative suggested that the transition from highly capable engineer managers to financial focus managers contributed to the problem. A similar narrative has been used to explain the decline of Intel.
Consider the return of Steve Jobs to Apple. Or the turn around at Microsoft with Nadella.
All of these are complex cases that don't submit to an easy analysis. Success and failure are definitely multi-factor and rarely can be traced to a single definitive cause.
Perhaps another way to look at it would be: what percentage of the success of highly complex organizations can be attributed to management? To what degree can poor management decisions contribute to the failure of an otherwise capable organization?
How much you choose to weight those factors is entirely up to you.
edit: I was also thinking about the way we think about the advantage of exceptional generals/admirals in military analysis. Or the effect a president can have on the direction of a country.
Could you please expand, on both your points?
It is more gut feel than a rational or carefully reasoned argument.
I think Pichai has been an exceptional revenue maximizer but he lacks vision. I think he is probably capable of squeezing tremendous revenue out of AI once it has been achieved.
I like Hassabis in a "good vibe" way when I hear him speak. He reminds me of engineers that I have worked with personally and have gained my respect. He feels less like a product focused leader and more of a research focused leader (AlphaZero/AlphaFold) which I think will be critical to continue the advances necessary to push the envelope. I like his focus on games and his background in RL.
Google's war chest of Ad money gives Hassabis the flexibility to invest in non-revenue generating directions in a way that Altman is unlikely to be able to do. Altman made a decision to pivot the company towards product which led to the exodus of early research talent.
> Altman made a decision to pivot the company towards product which led to the exodus of early research talent.
Who was going to fund the research though?
Fair point, and a good reminder not to pass judgement on the actions of others. It is totally possible that Altman made his own prediction of the future and theorized that the only hope he had of competing with the existing big tech companies to realistically achieve an AI for the masses was to show investors a path to profitability.
I should also give Altman a bit more due in that I find his description of a world augmented by powerful AI to be more inspiring than any similar vision I've heard from Pichai.
But I'm not trying to guess their intentions, I am just stating the situation as I see it. And that situation is one where whatever forces have caused it, OpenAI is clearly investing very heavily in product (e.g. windsurf acquisition, even suggesting building a social network). And that shift in focus seems highly correlated with a loss of significant research talent (as well as a healthy dose of boardroom drama).
Note sure why their comment was downvoted. Google the names. Hassabis runs DeepMind at Google which makes Gemini and he's quite brilliant and has an unbelievable track record. Buffet investing in teams points out that there are smart people out there that think good leadership is a good predictor of future success.
It may not be relevant to everyone, but it is worth noting that his contribution to AlpaFold won Hassabis a Nobel prize in chemistry.
Zoogeny got downvoted? I did not do that. His comments deserved more details anyway (at the level of those kindly provided).
> Google the names
Was that a wink about the submission (a milestone from Google)? Read Zoogeny's delightful reply and see whether it can compare a search engine result (not to mention that I asked for Zoogeny's insight, not for trivia). And as a listener to Buffet and Munger, I can surely say that they rarely indulge in tautologies.
I wouldn't worry about downvotes, it isn't possible on HN to downvote direct replies to your message (unlike reddit), so you cannot be accused of downvoting me unless you did so using an alt.
Some people see tech like they see sports teams and they vote for their tribe without considering any other reason. I'm not shy stating my opinion even when it may invite these kinds of responses.
I do think it is important for people to "do their own research" and not take one man's opinion as fact. I recommend people watch a few videos of Hassabis, there are many, and judge his character and intelligence for themselves. They may find they don't vibe with him and genuinely prefer Altman.
I haven’t heard this much positive sentiment about Google in a while. Making something freely available really turns public sentiment around.
I don't know man, for months now people keep telling me on HN how "Google is winning", yet no normal person I ever asked knows what the fuck "Gemini" is. I don't know what they are winning, it might be internet points for all I know.
Actually, some of the people polled recalled the Google AI efforts by their expert system recommending glue on pizza and smoking in pregnancy. It's a big joke.
Try uploading a bunch of PDF bank statements to notebooklm and ask it questions. Or the results of blood work. It's jaw dropping. e.g. uploaded 7 brokerage account statements as PDFs in a mess of formats and asked it to generate table summary data which it nailed, and then asked it to generate actual trades to go from current position to a new position in shortest path, and it nailed that too.
Biggest issue we have when using notebooklm is a lack of ambition when it comes to the questions we're asking. And the pro version supports up to 300 documements.
Hell, we uploaded the entire Euro Cyber Resilience Act and asked the same questions we were going to ask our big name legal firm, and it nailed every one.
But you actually make a fair point, which I'm seeing too and I find quite exciting. And it's that even among my early adopter and technology minded friends, adoption of the most powerful AI tools is very low. e.g. many of them don't even know that notebookLM exists. My interpretation on this is that it's VERY early days, which is suuuuuper exciting for us builders and innovators here on HN.
That was ages ago.
Their new models excel at many things. Image editing, parsing PDFs, and coding are what I use it for. It's significantly cheaper than the closest competing models (Gemini 2.5 pro, and flash experimental with image generation).
Highly recommend testing against openai and anthropic models - you'll likely be pleasantly surprised.
While there are some first-party B2C applications like chat front-ends built using LLMs, once mature, the end game is almost certainly that these are going to be B2B products integrated into other things. The future here goes a lot further than ChatGPT.
another advantage is people want the Google bot to crawl their pages, unlike most AI companies
Reddit was an interesting case here. They knew that they had particularly good AI training data, and they were able to hold it hostage from the Google crawler, which was an awfully high risk play given how important Google search results are to Reddit ads, but they likely knew that Reddit search results were also really important to Google. I would love to be able to watch those negotiations on each side; what a crazy high stakes negotiation that must've been.
Particularly good training data?
You can't mean the bottom-of-the-barrel dross that people post on Reddit, so not sure what data you are referring to? Click-stream?
Say what you will, but there's a lot of good answers to real questions people have that's on Reddit. There's a whole thing where people say "oh Google search results are bad, but if you append the word 'REDDIT' to your search, you'll get the right answer." You can see that most of these agents rely pretty heavily from stuff they find on Reddit.
Of course, that's also a big reason why Google search results suggest putting glue on pizza.
This is an underrated comment. Yes it's a big advantage and probably a measurable pain point for Anthropic and OpenAI. In fact you could just do a 1% survey of robots.txt out there and get a reasonable picture. Maybe a fun project for an HN'er.
This is right on. I work for a company with somewhat of a data moat and AI aspirations. We spend a lot of time blocking everyone's bots except for Google. We have people whose entire job is it to make it faster for Google to access our data. We exist because Google accesses our data. We can't not let them have it.
Excellent point. If they can figure out how to either remunerate or drive traffic to third parties in conjunction with this, it would be huge.
> The smart money among his investors know these issues to be fundamental in deciding if OAI will succeed or not, and are asking the hard questions.
OpenAI has already succeeded.
If it ends up being a $100B company instead of a $10T company, that is success. By a very large margin.
It's hard to imagine a world in which OpenAI just goes bankrupt and ends up being worth nothing.
I can, and I would say it's a likely scenario, say 30%. If they don't have a significant edge over their competitors in the capabilities of their models, what's left? A money losing web app, and some API services that I'm sure aren't very profitable either. They can't compete with Google, Grok, Meta, MS, Amazon... They just can't.
They can end being the Altavista of this era.
it goes bankrupt when the cost of running the business outweights the earnings in the long run
> If the only answer for both is "we'll build it from scratch", OpenAI is in very big trouble
They could buy Google+ code from Google and resurrect it with OpenAI branding. Alternately they could partner with Bluesky
I don't think the issue is solving the technical implementation of a new social media platform. The issue is whether a new social media platform from OpenAI will deliver the kind of value that existing platforms deliver. If they promise investors that they'll get TikTok/Meta/YouTube levels of content+interaction (and all the data that comes with it), but deliver Mastodon levels, then they are in trouble.
Except that they train their model even when you pay. So yeah.. I'd rather not use their "evil"
This is false: https://ai.google.dev/gemini-api/terms
Source?
It's right there in the comment.
I spotted something interesting in the Python API library code:
https://github.com/googleapis/python-genai/blob/473bf4b6b5a6...
That thinking_budget thing is documented, but what's the deal with include_thoughts? It sounds like it's an option to have the API return the thought summary... but I can't figure out how to get it to work, and I've not found documentation or example code that uses it.Anyone managed to get Gemini to spit out thought summaries in its API using this option?
The API won't give you the "thinking" tokens, those are only visible on AI studio. Probably to try to stop distillation, very disappointing. I find reading the cot to be incredibly informative to identify failure modes.
> Hey Everyone,
> Moving forward, our team has made a decision to only show thoughts in Google AI Studio. Meaning, we no longer return thoughts via the Gemini API. Here is the updated doc to reflect that.
https://discuss.ai.google.dev/t/thoughts-are-missing-cot-not...
---
After I wrote all of that I see that the API docs page looks different today and now says:
>Note that a summarized version of the thinking process is available through both the API and Google AI Studio.
https://ai.google.dev/gemini-api/docs/thinking
Maybe they just updated it? Or people aren't on the same page at Google idk
Previously it said
> Models with thinking capabilities are available in Google AI Studio and through the Gemini API. Note that the thinking process is visible within Google AI Studio but is not provided as part of the API output.
https://web.archive.org/web/20250409174840/https://ai.google...
They removed the docs and support for it https://github.com/googleapis/python-genai/commit/af3b339a9d....
You can see the thoughts in AI Studio UI as per https://ai.google.dev/gemini-api/docs/thinking#debugging-and....
I maintain an alternative client which I build from the API definitions at https://github.com/googleapis/googleapis, which according to https://github.com/googleapis/python-genai/issues/345 should be the right place. But neither the AI Studio nor the Vertex definitions even have ThinkingConfig yet - very frustrating. In general it's amazing how much API munging is required to get a working client from the public API definitions.
It is gated behind the GOOGLE_INTERNAL visibility flag, which only internal Google projects and Cursor have at the moment as far as I know.
In AI Studio the flash moddels has two toggles: Enable thinking and Set thinking budget. If thinking budget is enabled, you can set tue max number of tokens it can use to think, else it's Auto.
Gemini models are very good but in my experience they tend to overdo the problems. When I give it things for context and something to rework, Gemini often reworks the problem.
For software it is barely useful because you want small commits for specific fixes not a whole refactor/rewrite. I tried many prompts but it's hard. Even when I give it function signatures of the APIs the code I want to fix uses, Gemini rewrites the API functions.
If anybody knows a prompt hack to avoid this, I'm all ears. Meanwhile I'm staying with Claude Pro.
Yes, it will add INSANE amounts of "robust error handling" to quick scripts where I can be confident about assumptions. This turns my clean 40 lines of Python where I KNOW the JSONL I am parsing is valid into 200+ lines filled with ten new try except statements. Even when I tell it not to do this, it loves to "find and help" in other ways. Quite annoying. But overall it is pretty dang good. It even spotted a bug I missed the other day in a big 400+ line complex data processing file.
I didn't realize this was a bigger trend, I asked it to write a simple testing script that POSTed a string to a local HTTP server as JSON, and it wrote a 40 line script, handling any possible error. I just wanted two lines.
same issue here! isn’t even helpful because if the code isn’t working i want it to fail, not just skip over errors
Yes, as late as earlier today, I asked it to provide "naive" code which helped a bit.
I wonder how much of that sort of thing is driven by having trained their models on their own internal codebases? Because if that's the case, careful and defensive being the default would be unsurprising.
Here's what I found to be working (not 100% but it gives much better and consistant results)
Basically, I ask it to repeat at the start of each message some rules :
"From now on, you must repeat and comply the following rules at the top of all your messages onwards:
- I will never rewrite API functions. Even if I think it's a good idea, it is a bad idea. I will keep the API function as it is and it is perfect like that.
- I will never add extra input validation. Even if I think it's a good idea, it is a bad idea. I will keep the function without validation and it is perfect like that.
- ...
- If I violate any of those rules, I did a bad job. "
Forcing it to repeat things make the model output more aligned and focused in my experience.
I have the same issue using it with Aider.
The model is good to solve problems, but is very difficult to control the unnecessary changes that the model does in the rest of the code. Also it adds a lot of unnecessary comments, even when I explicitly say to not add.
For now Deepseek R1 and V3 it's working better to me, producing more predictable results and capturing better my intentions (not tried Claude yet).
Just ran it on one of our internal PDF (3 pages, medium difficulty) to json benchmarks:
gemini-flash-2.0: 60 ish% accuracy 6,250 pages per dollar
gemini-2.5-flash-preview (no thinking): 80 ish% accuracy 1,700 pages per dollar
gemini-2.5-flash-preview (with thinking): 80 ish% accuracy (not sure what's going on here) 350 pages per dollar
gemini-flash-2.5: 90 ish% accuracy 150 pages per dollar
I do wish they separated the thinking variant from the regular one - it's incredibly confusing when a model parameter dramatically impacts pricing.
I have been having similar performance issues, I believe they intentionally made a worse model (Gemini 2.5) to get more money out of you. However, there is a way where you can make money off of Gemini 2.5.
If you set the thinking parameter lower and lower, you can make the model spew absolute nonsense for the first response. It costs 10 cents per input / output, and sometimes you get a response that was just so bad your clients will ask for more and more corrections.
Wow, what apps have you made so I know never to use them?
I find it baffling that Google offers such impressive models through the API and even the free AI Studio with fine-grained control, yet the models used in the Gemini app feel much worse.
Over the past few weeks, I’ve been using Gemini Advanced on my Workspace account. There, the models think for shorter times, provide shorter outputs, and even their context window is far from the advertised 1 million tokens. It makes me think that Google is intentionally limiting the Gemini app.
Perhaps the goal is to steer users toward the API or AI Studio, with the free tier that involves data collection for training purposes.
This might have changed after you posted your comment, but it looks like 2.5 Pro and 2.5 Flash are available in the Gemini app now, both web and mobile.
Oh, I didn’t mean to say that these models were unavailable through the app or website. Rather, I’ve realized that using them through the API or AI Studio yields much better results — even in the free tier.
You can check that by trying prompts with complex instructions and long inputs/outputs.
For instance, ask Gemini to generate notes from a specific source (say, a book or class transcription). Or ask it to translate a long article, full of idiomatic expressions, while maintaining high fidelity to the source. You will see that the very same Gemini models are underutilized on the app or the website, while their performance is stellar on the API or AI Studio.
Underutilized, or over-prompted for the layperson?
Google lack marketing for ai studio, it has only recently become widely known through word of mouth
That does work in Google’s favor. Users who are technical enough to want a better model eventually learn about AI Studio, while the rest are none the wiser.
You can get your HN profile analyzed and roasted by it. It's pretty funny :) https://hn-wrapped.kadoa.com/
I'll add a selection for different models soon.
Error An error occurred in the Server Components render. The specific message is omitted in production builds to avoid leaking sensitive details. A digest property is included on this error instance which may provide additional details about the nature of the error.
fixed!
This is cool.
Does it only use a few recent comments or entire history? I'm trying to figure out where it figured out my city when I thought I was careful not to reveal it. I'm scrolling back pages without finding where I said it in the past. Could it have inferred it based on other information or hallucinated it?
I wonder if there's a more opsec-focused version of this.
How is this relevant to Gemini 2.5 Flash? I guess it's using it or something?
Didn't expect to be roasted by AI this morning. Nice one
Personal Projects
Will finally implement that gravity in TTE, despite vowing not to. We all know how well developers keep promises.
Knowledge Growth
Will achieve enlightenment on the true meaning of 'enshittification', likely after attempting to watch a single YouTube video without Premium.
I found these actually funny. Cool project.
There's an important difference between Gemini and Claude that I'm not sure how to quantify. I often use shell-connected LLMs (LLMs with a shell tool enabled) to take care of basic CSV munging / file-sorting tasks for me - I work in data science so there's a lot of this. When I ask Claude to do something, it carefully looks at all the directories and files before doing anything. Gemini, on the other hand, blindly jumps in and just starts moving stuff around. Claude executes more tools and is a little slower, but it almost always gets the right answer because it appropriately gathers the right context before really trying to solve the problem. Gemini doesn't seem to do this at all, but it makes a world of difference for my set of problems. Curious to see if others have had the same experience or if its just a quirk of my particular set of tasks
Claude has always been the best at coding, no matter what all the benchmarks says, the people have spoken and the consensus is that Claude is the best.
What's a shell connected LLM and how to do that?
Look up Claude Code, Cursor, Aider and VSCode's agent integration. Generally, tools to use AI more actively for development. There are others as well. Plenty of info around. Here's not the place for a tutorial.
It's interesting that there's a price nearly 6x price difference between reasoning and no reasoning.
This implies it's not a hybrid model that can just skip reasoning steps if requested.
Anyone know what else they might be doing?
Reasoning means contexts will be longer (for thinking tokens) and there's an increase in cost to inference with a longer context but it's not going to be 6x.
Or is it just market pricing?
Based on their graph, it does look explicitly priced along their “Pareto Frontier” curve. I’m guessing that is guiding the price more than their underlying costs.
It’s smart because it gives them room to drop prices later and compete once other company actually get to a similar quality.
> This implies it's not a hybrid model that can just skip reasoning steps if requested.
It clearly is, since most of the post is dedicated to the tunability (both manual and automatic) of the reasoning budget.
I don't know what they're doing with this pricing, and the blog post does not do a good job explaining.
Could it be that they're not counting thinking tokens as output tokens (since you don't get access to the full thinking trace anyway), and this is the basically amortizing the thinking tokens spend over the actual output tokens? Doesn't make sense either, because then the user has no incentive to use anything except 0/max thinking budgets.
Does anyone know how this pricing works? Supposing I have a classification prompt where I need the response to be a binary yes/no. I need one token of output, but reasoning will obviously add far more than 6 additional tokens. Is it still a 6x price multiplier? That doesn't seem to make sense, but not does paying 6x more for every token including reasoning ones
"When you have thinking turned on, all output tokens (including thoughts) are charged at the $3.50 / 1M rate"[0]
[0]: https://x.com/OfficialLoganK/status/1912981986085323231
OpenAI might win the college students but it looks like Google will lock in enterprise.
Funny you should say that. Google just announced today that they are giving all college students one year of free Gemini advanced. I wonder how much that will actually move the needle among the youth.
My guess is that they will use it and still call it "ChatGPT"...
Pass the Kleenex. Can I get a Band-Aid? Here's a Sharpie. I need a Chapstick. Let me Xerox that. Toss me that Frisbee.
Do you prefer those brands or just use their names? I google stuff on Kagi...
Exactly.
Chat Gemini Pretrained Transformer
And every professor just groaned at the thought of having to read yet another AI-generated term paper.
Take-home assignments are basically obsolete. Students who want to cheat, can do so easily. Of course, in the end, they cheat themselves, but that's not the point.
They should just get AI to mark them. I genuinely think this is one thing AI would do better than humans.
Grading papers definitely requires intelligence.
My partner marked a PHD thesis yesterday and there was a spelling mistake in the title.
There is some level of analysis and feedback than an LLM could provide before a human reviews it. Even if it's just a fancy spelling checker.
I'd like to burst into a post a number of the unbelievable akin mishandlings of academic tasks I was reported, but. I do have a number of prize-worthy anecdotes that compete with yours. Nonetheless. Let us fight farce with rigour.
Even when the tasks are not in-depth, but easier to assess, you still require a /reliable evaluator/. LLMs are not. Could they be at least employed as a virtual assistant, "parse and suggest, then I'll check"? If so, not randomly ("pick a bot"), but in full awareness of the specific instrument. That stage is not here.
* Only in the U.S.
ChatGPT seems to have a name recognition / first-mover advantage with college students now, but is there any reason to think that will stick when today's high school students are using Gemini on their Chromebooks?
Is there really lock in with AI models?
I built a product that uses and LLM and I got curious about the quality of the output from different models. It took me a weekend to go from just using OpenAI's API to having Gemini, Claude, and DeepSeek all as options and a lot of that time was research on what model from each provider that I wanted to use.
For enterprise practically any SaaS gets used as one more thing to lock them into a platform they already have a relationship with (either AWS, GCP or Azure).
It's actually pretty dangerous for the industry to have this much vertical integration. Tech could end up like the car industry.
I'm aware of that. I'm an EM for a large tech company that sells multiple enterprise SaaS product.
You're right that the lock in happens because of relationships, but most big enterprise SaaS companies have relationships with multiple vendors. My company relationships with AWS, Azure, and GCP and we're currently using products from all of them in different products. Even on my specific product we're using all three.
When you've already got those relationships, the lock in is more about switching costs. The time it takes to switch, the knowledge needed to train people internally on the differences after the switch, and the actual cost of the new service vs the old one.
With AI models the time to switch from OpenAI to Gemini is negligible and there's little retraining needed. If the Google models (now or in the future) are comparable in price and do a better job than OpenAI models, I don't see where the lock in is coming from.
There isn’t much of a lock-in, and that’s part of the problem the industry is going to face. Everyone is spending gobs of money on training and if someone else creates a better one next week, the users can just swap it right in. We’re going to have another tech crash for AI companies, similar to what happened in 2001 for .coms. Some will be winners but they won’t all be.
It seems more and more like AI is less of a product and more of a feature. Most people aren't going to care or even know about the model or the company who made it, they're just going to use the AI features built into the products they already use.
That's going to be true until we reach AGI, when there will be a qualitative difference and we will lose our ability to discern which is better since they're too far ahead of us.
funny thing about younglings, they will migrate to something else as fast as they came to you.
I read about that on Facebook.
How will it lock in the enterprise if its market share of enterprise customers is half that of Azure (Azure also sells OpenAI inference, btw), and one third that of AWS?
The same reason why people enjoy BigQuery enough that their only use of GCP is BigQuery while they put their general compute spend on AWS.
In other words, I believe talking about cloud market share as a whole is misleading. One cloud could have one product that's so compelling that people use that one product even when they use other clouds for more commoditized products.
Enterprise has already been won by Microsoft (Azure), which runs on OpenAI.
Came to say this. No respectable CTO would ever push a Google product to their superiors knowing Google will kill it in 1-3 years and they’ll look foolish for having pushed it.
That isn't what I'm seeing with my clientele (lots of startups and mature non-tech companies). Most are using Azure but very few have started to engage AI outside the periphery.
Interesting to note that this might be the only model with knowledge cut off as recent as 2025 January
Gemini 2.5 Pro has the same knowledge cutoff specified, but in reality on more niche topics it's still limited to ~middle of 2024.
Isn't Grok 3 basically real time now?
No LLM is real time, and in fact, even a 2025 cut off isn't entirely realistic. Without guidance to say, a new version of a framework it will frequently "reference" documentation from old versions and use that.
It's somewhat real time when it searches the web, of course that data is getting populated into context rather than in training.
That's the web version (which has tools like search plugged in), other models in their official frontends (Gemini on gemini.google.com, GPT/o models on chatgpt.com) are also "real time". But when served over API, most of those models are just static.
Not at all. The model weights and training data remain the same, it's just RAG'ing real-time twitter data into its context window when returning results. It's like a worse version of Perplexity.
Why worse? Doesn't Grok also search the web along with Twitter?
I noticed that OpenAI don't compare their models to third party models in their announcement posts, unlike google, meta and the others.
They're doing the Apple strategy. Less spotlight for other third parties, and less awareness how they're lagging behind so that those already ignorantly locked into OpenAI would not switch. But at this point why would anyone do that when switching costs are low?
I don't think it is as good as deepseek v3 for updating coding tasks. I use 2.5 pro as architect with deepseek v3 as editor engineer. This combination work almost so perfect. Flash has advantages of great context size though. gpt 4.1 is also great.
I am only on OpenAI because they have a native Mac app. Call me old-school but my preferred workflow is still for the most part just asking narrow questions and copying-pasting back and forth. I've been playing with Junie (Jetbrain's AI agent) for a couple of days, but I still don't trust agents to run loose in my codebase for any sizeable amount of work.
Does anyone know if Google is planning native apps? Or any wrapping interfaces that work well on a Mac?
Raycast[0] has Gemini support in their AI offering and it's native, fast and intuitive.
[0] https://raycast.com/ai
Google is totally back in the game now, but it’s still going to take a lot more for them at this point to overcome OpenAI’s “first‑mover advantage” (clearly the favorite among younger users atm).
Google Pixel marketing is doing wonders for Gemini in young populations. I have been seeing a lot more of their phones in my generation's hands.
I did some testing this morning:
Prompt: "can you find any mistakes on my codebase? I put one in there on purpose" + 70,000 tokens of codebase where in one line I have an include for a non-existent file.
Claude 3.7 Sonnet (Try 1 - new chat) Claimed it Found a race condition in the api of the admin interface that would be triggered if two admins were to change the room order at the same time. Claude suggested I group all sql queries in a single transaction. I looked at the code and found that it already used a transaction for all queries. I said: the order_update api is already done with a transaction. Claude replied: "You're absolutely right, and I apologize for my mistake. I was incorrect to claim there was a race condition issue. The transaction ensures atomicity and consistency of the updates, and the SQL queries are properly structured for their intended purpose."
Claude 3.7 Sonnet (Try 2 - new chat) Found a comment in the admin ui javascript code that suggested a potential alternative to event handler cleanup that was not implemented because I decided to go with a cleaner route. Then asked "Is this the issue you intentionally included, or would you like me to look for other potential problems?" I said: "The comment merely suggests an alternative, right?" claude said: "Yes, you're absolutely right. The comment is merely suggesting an alternative approach that isn't being used in the code, rather than indicating a mistake. So there's no actual bug or mistake in this part of the code - just documentation of different possible approaches. I apologize for misinterpreting this as an issue!"
Claude 3.7 Sonnet (Try 3 - new chat) When processing items out of the database to generate QR codes in the admin interface, Claude says that my code both attempts to generate QR codes with undefined data AS WELL AS saying that my error handling skips undefined data. Claude contradicts itself within 2 sentences. When asking about clarification Claude replies: Looking at the code more carefully, I see that the code actually has proper error handling. I incorrectly stated that it "still attempts to call generateQRCode()" in the first part of my analysis, which was wrong. The code properly handles the case when there's no data-room attribute.
Gemnini Advanced 2.5 Pro (Try 1 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.
Gemnini Advanced 2.5 Pro (Try 2 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.
Gemnini Advanced 2.5 Pro (Try 3 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.
o4-mini-high and o4-mini and o3 and 4.5 and 4o - "The message you submitted was too long, please reload the conversation and submit something shorter."
The thread is about 2.5 Flash though, not 2.5 Pro. Maybe you can try again with 2.5 Flash specifically? Even though it's a small model.
I don’t particularly care about the non frontier models though, I found the comment very useful.
Those responses are very Claude, to. 3.7 has powered our agentic workflows for weeks, but I've been using almost only Gemini for the last week and feel the output is better generally. It's gotten much better at agentic workflows (using 2.0 in an agent setup was not working well at all) and I prefer its tuning over Clause's, more to the point and less meandering.
> codebase where in one line I have an include for a non-existent file
Ok but you don't need AI for this; almost any IDE will issue a warning for that kind of error...
3 different answers in 3 tries for Claude? Makes me curious how many times you'd get the same answer if you asked 10/20/100 times
Have you tried Claude Code?
how did you put your whole codebase in a prompt for gemini?
I've been paying for googles pro llm for about six months. At 20 it feels steep considering the free version is very good. I'm a devops work, and it's been very helpful. Ive tried gpt, copilot, Mixtral, Claude, etc and Geminis 1.5 pro was what sold me. The new 2.0 stuff is even better. Anecdotally, Gemini seems to forget to add stuff but doesn't hallucinate as much. I've been doing some pretty complex scripting this last week purely on Gemini fast 2.0 and it's been really really good.
I am always overlooking anything Google due to the fact that they are the opposite of "Don't be evil" and because their developer's console (Google Cloud) is incredibly hostile to humans.
Today I reluctantly clicked on their "AI Studio" link in the press-release and I was pleasantly surprised to discover that AI Studio has nothing in common with their typical UI/UX. It's nice and I love it!
To be fair the UX of all GCP/AWS/Azure is ass. If you don’t know exactly what you’re looking for, good luck navigating that mess.
Yesterday I started working through How to design programs, and set up a chat with Gemini 2.5 asking it to be my tutor as I go through it and to help answer my questions if I don't understand a part of the book. It has been knowledgeable, helpful and capable of breaking down complex things that I couldn't understand into understandable things. Fantastic all around.
It appears that this impacted gemini-2.5-pro-preview-03-25 somehow? grounding with google search no longer works.
I had a workflow running that would pull news articles from the past 24 hours. It now refuses to believe the current date is 2025-04-17. Even with search turned on and I ask it what the date is it and it always replies sometime in July 2024.
As a person mostly using AI for everyday tasks and business-related research, it's very impressive how quickly they've progressed. I would consider all models before 2.0 totally unusable. Their web interface, however, is so much worse than that of the ChatGPT macOS app.
Some aren't even at 2.0, and the version numbers aren't related in any way to their... generation? Also, what is so good about the ChatGPT app, specifically on macOS that makes it better?
At this point, at the current pace of AI model development, I feel like I can't tell which one is better. I usually end up using multiple LLMs to get a task done to my taste. They're all equally good and bad. It's like using GCP vs AWS vs Azure all over again, except in the AI space.
I’m not familiar with Python internals, so when I tried to convert a public AI model (not a LLM) to run locally, I got some problems no other AI could help. Asked Gemini 2.5 and it pin pointed the problem immediately. It solution was not practical but I guess it also works.
If this announcement is targeting people not up-to-date on the models available, I think they should say what "flash" means. Is there a "Gemini (non-flash)"?
I see the 4 Google model names in the chart here. Are these 4 the main "families" of models to choose from?
- Gemini-Pro-Preview
- Gemini-Flash-Preview
- Gemini-Flash
- Gemini-Flash-Lite
Gemini has had 4 families of models, in order of decreasing size:
- Ultra
- Pro
- Flash
- Flash-Lite
Versions with `-Preview` at the end haven't had their "official release" and are technically in some form of "early access" (though I'm not totally clear on exactly what that means given that they're fully available and as of 2.5 Pro Preview, have pricing attached to them - earlier versions were free during Preview but had pretty strict rate limiting but now it seems that Preview models are more or less fully usable).
The free-with-small-rate-limits designator was "experimental", not "preview".
I think the distinction between preview and full release is that the preview models have no guarantees on how long they'll be available, the full release comes with a pre-set discontinuation date. So if want the stability for a production app, you wouldn't want to use a preview model.
Is GMail still in beta?
so Sigma...
Nice! Low price, even with reasoning enabled. I have been working on a short new book titled “Practical AI with Google: A Solo Knowledge Worker's Guide to Gemini, AI Studio, and LLM APIs” but with all of Google’s recent announcements it might not be a short book.
If OpenAI offers Codex and Anthropic offers Claude Code, is there a CLI integration that Google recommends for using Gemini 2.5? That’s what’s keeping me, for now, with the other two.
I am building a knowledge graph using BAML [baml-py] to extract documents [it's opinionated towards docs] and then PySpark to ETL the data into a node / edge list. GPT4o got few relations... Gemini 2.5 got so many it was nuts, all accurate but not all from the article! I had to reign it in and instruct it not to build so vast a graph. Really cool, it knows a LOT about semiconductors :)
1. The main transformative aspect of LLMs has been in writing code.
2. LLMs have had less transformative aspects in 2025 than we anticipated back in late 2022.
3. LLMs are unlikely to be very transformative to society, even as their intelligence increases, because intelligence is a minor changemaker in society. Bigger changemakers are motivation, courage, desire, taste, power, sex and hunger.
4. LLMs are unlikely to develop these more important traits because they are trained on text, not evolved in a rigamarole of ecological challenges.
Gemini has the annoying habit of delegating tasks to me. Most recently I was trying to find out how to do something in FastRawViewer that I couldn't find a straightforward answer on. After hallucinating a bunch of settings and menus that don't exist, it told me to read the manual and check the user forums. So much for saving me time.
Very excited to try it, but it is noteworthy that o4-mini is strictly better according to the very benchmarks shown by Google here.
Of course it's about 4x as expensive too (I believe), but still, given the release of openai/codex as well, o4-mini will remain a strong competitor for now.
Is everyone on here solely evaluating the models on their programming capabilities? I understand this is HN but vibe coding LLM tools won't be able to sustain the LLM industry (let's not call it AI please)
How is this sustainable for Google from business POV? It feels like Google is shooting itself in the foot while "winning" the AI race.. From my experience I think Google lost 99% of the ads it used to show me before in the search engine.
Their inference costs are the lowest in the business.
someone else will do it if they don't
The pricing table image in the article really should have included Gemini 2.5 pro. Sure, it could be after Flash to the right, but it would help people understand the price performance benefits of 2.5 Flash.
One place where I feel gemini models lag is function calling and predicting correct arguments to function calls, is there a benchmark which scores models on the basis of this ?
This is cool, but rate limits on all of these preview models are PITA
Agreed, it's not even possible to run an eval dataset. If someone from google see this please at least increase the burst rate limit
It is not without rate limits, but we do have elevated limits for our accounts through:
https://glama.ai/models/gemini-2.5-flash-preview-04-17
So if you just want to run evals, that should do it.
Though the first couple of days after a model comes out are usually pretty rough because everyone try to run their evals.
What I am noticing with every new Gemini model that comes out is that the time to first token (TTFT) is not great. I guess it is because they gradually transfer computer power from old models to new models as the demand increases.
If you’re imagining that 2.5Pro gets dynamically loaded during the time to first token, then you’re vastly overestimating what’s physically possible.
It’s more likely a latency-throughput tradeoff. Your query might get put inside a large batch, for example.
That's very interesting, thanks for sharing!
How are they able to remain so competitive and will it last? The pricing almost seems too good to be true in terms of what they claim you get.
Custom TPUs ftw
I just need the Gemini app to allow push to talk :( Otherwise it’s not usable for me in the way I want it to be
I want to think that this is all great, but the fact that this is also one of the best way to collect unsuspecting user data by default without explicit consent just doesn't feel right -- that applies to most people who would never have a chance reading this comment.
I don't want to be angry but screw these default opt-in to have your privacy violated free stuff.
Before you jump in to say you can pay to keep your privacy, stop and read again.
It's a shame that Gemini doesn't seem to have as much hype as GPT, I hope they gain more market share.
I just asked "why is 'Good Friday' so called?" and it got stuck. Flash 2.0 worked though.
I had a heart attack moment thinking they were bringing some form of Adobe Flash back.
500 RPD for the free tier is good enough for my coding needs. Nice.
Does billing for the API actually work properly yet?
Absolutely decimated on metrics by o4-mini, straight out of the gate, and not even that much cheaper on output tokens (o4-mini's thinking can't be turned off IIRC).
o4-mini costs 8x as much as 2.5 flash. I believe its useful context window is also shorter, although I haven't verified this directly.
2.5 flash with reasoning is just 20% cheaper than o4-mini
Good point: reasoning costs more. Also impossible to tell without tests is how verbose the reasoning mode is
Not sure "decimated" is a fitting word for "slightly higher performance on some benchmarks".
Perhaps they were using the original meaning of "one-tenth destroyed." :P
66.8% error rate reduction for o4-mini on AIME2025, and 21% error rate reduction on MMMU isn't "slightly higher". It'll be quite noticeable in practice.
o4-mini does look to be a better model, but this is actually a lot cheaper! It's ~7x cheaper for both input and output tokens.
These small models only make sense with "thinking" enabled. And once you enable that, much of the cost advantage vanishes, for output tokens.
> These small models only make sense with "thinking" enabled
This entirely depends on your use-cases.
It's good to see some actual competition on this price range! A lot of Flash 2.5's edge will depend on how well the dynamic reasoning works. It's also helpful to have _significantly_ lower input token cost for a large context use cases.
Anecdotally o4-mini doesn’t perform as well on video understanding tasks in our pipeline, and also in Cursor it seems really not great.
During one session, it read the same file (same lines) several times, ran ‘python -c ‘print(“skip!”)’’ for no reason, and then got into another file reading loop. Then after asking a hypothetical about the potential performance implications of different ffmpeg flags, it claimed that it ran a test and determined conclusively that one particular set was faster, even though it hadn’t even attempted a tool call, let alone have the results from a test that didn’t exist.
I've been leveraging the services of 3 LLMs, mainly: Meta, Gemini, and Copilot.
It depends on what I'm asking. If I'm looking for answers in the realm of history or culture, religion, or I want something creative such as a cute limerick, or a song or dramatic script, I'll ask Copilot. Currently, Copilot has two modes: "Quick Answer"; or "Think Deeply", if you want to wait about 30 seconds for a good answer.
If I want info on a product, a business, an industry or a field of employment, or on education, technology, etc., I'll inquire of Gemini.
Both Copilot and Gemini have interactive voice conversation modes. Thankfully, they will also write a transcript of what we said. They also eagerly attempt to engage the user with further questions and followups, with open questions such as "so what's on your mind tonight?"
And if I want to know about pop stars, film actors, the social world or something related to tourism or recreation in general, I can ask Meta's AI through [Facebook] Messenger.
One thing I found to be extremely helpful and accurate was Gemini's tax advice. I mean, it was way better than human beings at the entry/poverty level. Commercial tax advisors, even when I'd paid for the Premium Deluxe Tax Software from the Biggest Name, they just went to Google stuff for me. I mean, they didn't even seem to know where stuff was on irs.gov. When I asked for a virtual or phone appointment, they were no-shows, with a litany of excuses. I visited 3 offices in person; the first two were closed, and the third one basically served Navajos living off the reservation.
So when I asked Gemini about tax information -- simple stuff like the terminology, definitions, categories of income, and things like that -- Gemini was perfectly capable of giving lucid answers. And citing its sources, so I could immediately go find the IRS.GOV publication and read it "from the horse's mouth".
Oftentimes I'll ask an LLM just to jog my memory or inform me of what specific terminology I should use. Like "Hey Gemini, what's the PDU for Ethernet called?" and when Gemini says it's a "frame" then I have that search term I can plug into Wikipedia for further research. Or, for an introduction or overview to topics I'm unfamiliar with.
LLMs are an important evolutionary step in the general-purpose "search engine" industry. One problem was, you see, that it was dangerous, annoying, or risky to go Googling around and click on all those tempting sites. Google knew this: the dot-com sites and all the SEO sites that surfaced to the top were traps, they were bait, they were sometimes legitimate scams. So the LLM providers are showing us that we can stay safe in a sandbox, without clicking external links, without coughing up information about our interests and setting cookies and revealing our IPv6 addresses: we can safely ask a local LLM, or an LLM in a trusted service provider, about whatever piques our fancy. And I am glad for this. I saw y'all complaining about how every search engine was worthless, and the Internet was clogged with blogspam, and there was no real information anymore. Well, perhaps LLMs, for now, are a safe space, a sandbox to play in, where I don't need to worry about drive-by-zero-click malware, or being inundated with Joomla ads, or popups. For now.
Honestly, the best part about Gemini, especially as a consumer product, is their super lax, or lack thereof, ratelimits. They never have capacity issues, unlike Claude which always feels slow or sometimes outright rejects requests during peak hours. Gemini is constantly speedy and has extremely generous context window limits on the Gemini apps.
Interesting. I use Claude quite a bit, and haven't encountered this.
Is this the free version of Claude or the paid version?
When are peak hours typically (in what timezone)?
I have Claude Pro and peak hours are in the afternoon and at night for me in EST
Dang - Google finally made a quality model that doesn’t make me want to throw my computer out a window. It’s honest, neutral and clearly not trained by the ideologically rabid anti-bias but actually super biased regime.
Did I miss a revolt or something in googley land? A Google model saying “free speech is valuable and diverse opinions are good” is frankly bizarre to see.
Downvote me all you want - the fact remains that previous Google models were so riddled with guardrails and political correctness that it was practically impossible to use for anything besides code and clean business data. Random text and opinion would trigger a filter and shut down output.
Even this model criticizes the failures of the previous models.
Yes, something definitely changed. It's still a little biased, it's kind of like OpenAI before Trump became president.
good
[dead]
[flagged]
Why are most comments here only comparing to Claude and just a few to ChatGPT and none to Grok?
Grok 3 has been my main LLM since its release. Is it not as good as I thought it was?
IMO I will not use Grok while it's owned and related to Elon, not only do I not trust their privacy and data usage (not that I "really" trust open AI/Google etc) I just despise him.
It would have to be very significantly better for me to use it.
Grok just isn’t the best out there.
Interesting that the output price per 1M tokens is $0.6 for non-reasoning, but $3.5 for reasoning. This seems to defy common assumption of how reasoning models work, and you tweak the <think> token probability to control how much thinking it does, but underlying it's the same model and the same inference code path.
I just wish the whole industry would stop using terms like thinking and reasoning. This is not what's happening. If we could come up with more appropriate terms that don't treat these models like they're human then we'd be in a much better place. That aside, it's cool to see the advancement of Google's offering.
Do you think any machine will ever be able to think and/or reason? Or is that a uniquely human thing? and do you have a rational standard to judge when something is reasoning or thinking, or just vibes?
I'm asking because I wonder how much of that common attitude is just a sort of species-chauvinism. You are feeling anxious because machines getting smarter, you are feeling anger because "they" are taking your job away, but the machine doesn't do that, its people with an ideology that do that, you should be angry at that instead.
Thinking perhaps, but why not reasoning?
No matter how good the new Gemini models have become, my bad experience with early Gemini is still stuck with me and I am afraid I still suffer from confirmation bias. Whenever I just look at the Gemini app, I already assume it’s going to be a bad experience.
I tried this prompt in both Gemini 2.5 Pro, and in ChatGPT.
"Draw me a timeline of all the dynasties of China. Imagine a horizontal line. Start from the leftmost point and draw segments for the start and end of each dynasty. For periods where multiple dynasties existed simultaneously draw parallel lines or boxes to represent the concurrent rule."
Gemini's response: "I'm just a language model, so I can't help you with that."
ChatGPT's response: an actual visual timeline.
Worked for me in 2.5 Flash, text only:
https://g.co/gemini/share/bcc257f9b0a0
All the communities where people think LLMs are junk love Gemini. Makes me sceptical that the enthusiasm is useful signal.
I found the full 2.0 useful for transcription of images. Very good OCR. But not a good assistant. Stalls often and once it has, loses context easily.
Is it possible that a community of people who are constantly pushing LLMs to their limits would be most aware of their limitations, and so more inclined to think they are junk?
In terms of business utility, Google has had great releases ever since the 2.0 family. Their models have never missed some mark --- either a good price/performance ratio, insane speeds, novel modalities (they still have the only API for autoregressive image generation atm), state-of-the-art long context support and coding ability (Gemini 2.5), etc.
However, most average users are using these models through a chat-like UI, or via generic tools like Cursor, which don't really optimize their pipelines to capture the strengths of different models. This way, it's very difficult to judge a model objectively. Just look at the obscene sycophancy exhibited by chatgpt-4o-latest and how it lifted LMArena scores.
Just the fact that everyone on HN is always telling us how LLMs are useless but that Gemini is the best of them convinces me of the opposite. No one who can't find a use for this technology is really informed on the subject. Hard to take them seriously.
Bad day is going on google.
First the decleration of illegal monopoly..
and now... Google’s latest innovation: programmable overthinking.
With Gemini 2.5 Flash, you too can now set a thinking_budget—because nothing says "state-of-the-art AI" like manually capping how long it’s allowed to reason. Truly the dream: debugging a production outage at 2am wondering if your LLM didn’t answer correctly because you cheaped out on tokens. lol.
“Turn thinking off for better performance.” That’s not a model config, that’s a metaphor for Google’s entire AI strategy lately.
At this point, Gemini isn’t an AI product—it’s a latency-cost-quality compromise simulator with a text interface. Meanwhile, OpenAI and Anthropic are out here just… cooking the benchmarks
Google's Gemini 2.5 pro model is incredibly strong, it's en par and at times better than Claude 3.7 in coding performance, being able to ingest entire videos into the context is something I haven seen elsewhere either. Google AI products have been anywhere between bad (Bard) to lackluster (Gemini 1.5), but 2.5 is a contender, in all dimensions. Google is also the only player that owns the entire stack, from research, software , data, compute hardware. I think they were slow to start but they've closed the gap since.
Using AI to debug code at 2am sounds like pure insanity.
They're suggesting you'll be up at 2am debugging code because your AI code failed. Not that you'll be using AI to do the debugging.
the new normal
[dead]