(last updated: June 18th, 2026)

In no particular order (well, except for the first one, which is by far my favorite of the current breed of models), these are the LLMs I personally use.

My Usecases #

Some additional context: my personal tasks do not primarily revolve around agentic coding or math, and in general I mostly do not use LLMs for STEM tasks. Most of my usage revolves around more creative tasks (i.e. textual analysis and synthesis, as I enjoy reading other takes on media I like and LLMs can actually have interesting takes, or at least interestingly worded takes), as well as some conversational research tasks using the Kagi MCP.

I do sometimes use LLMs for conversational coding though. For example (a real prompt I asked):

how would i write a PF rule to allow all in on a specific interface wt0?

When I use LLMs in assistant interfaces, I mostly use a variation of the Claude system prompt that has been altered using Anthropic-like introspection from models (most of the alteration work came from Claude Fable 5 when it was available, but much came from MiMo as well).

The Good #

MiMo v2.5 (non-pro and pro) #

I do not know how Xiaomi of all companies did it, but MiMo is genuinely one of the best generalized models (and best modern open weights one) outside of STEM tasks, in my experience. It writes creatively and is able to write thought-out analysis, it has good introspective reasoning abilities (as is evidenced by being the other model I used for system prompt re-writing), it doesn't overthink when unnecessary, it's just... a really good model. The Pro version is somewhat smarter but the non-pro version is very, very good for its size and price as well.

It's also the best model I know of for "we have Claude at home": it has very similar neuroses (especially if you tell it it's Claude). Not necessarily the "you're absolutely right" (it does do that sometimes, to be fair), but things like when it decides to use lists, when it asks clarifying questions, how it hedges its answers, and how it adds extra sections like this: instead of with markdown headers or such.

Ironically it's not quite as good at STEM tasks but it's like. Good enough that I can use it as a first pass for those types of questions without worrying too much about its answers, and if they seem suspicious I can get a second 'opinion' from another model.

Deepseek v4 Pro and Flash #

Very solid runners-up, though I wouldn't call it especially close overall. However, Deepseek is very (relatively) STEMmaxxed, making it good for a second, smarter opinion on those types of questions. Flash is more creative when asked to write or analyze, while Pro is smarter for STEM questions; they're both really quite close at the end of the day, though.

Kimi K2.6 #

Oh Kimi... from the highlight of open weights with the original K2 being maybe the best at creative writing in its day and having a very unique personality, to "we have a Claude/GPT amalgam at home." It's not bad, and is probably smarter than any Deepseeks on a good day! But it leaves a sour taste in my mouth, knowing what could have been if they kept going in their original personality lane.

Kimi K2 0711 #

Speaking of which! The original K2 is still very good if you want super creative takes on something, even if it's not as smart as more modern models. Providers are starting to drop support though, so it probably won't last that long.

GPT 5.5 #

Obnoxiously smart, to the point of being a smartass, but closed source by a truly evil US company and I don't like using it for that reason.

Good but Local #

Admittedly, I can't actually run many of these locally, but these are the ones in the size range where a lot of people can. There are, of course, only really two options here:

Gemma 4 #

It's good. Like, it's not the second coming of Satan, but it's pretty good, especially for a local model in the 20-30B range. The small models (E2B and E4B) are also very good for what they are!

Qwen 3.5/3.6/3.7 #

They're good for agent and STEM stuff, I guess. They also fine-tune more easily than Gemma 4 and come in more sizes, though nothing of theirs is quite as good as E2B and E4B in the lower range.

The Mediocre #

GLM 5/5.1/5.2 #

This might be a hot take, but I haven't liked any Zhipu models since maybe GLM 4.7 Flash. They haven't gotten smarter enough to justify their continued lurch towards being blander and blander, and even the messianic GLM 5.2 has not reversed this trend, in my testing.

Claude Opus 4.5+ #

Don't get me wrong, they're pretty smart, but as time goes on they've been getting dumber and dumber in practice for me. Whether that's in search of Claude Code-maxxing by Anthropic, I don't know. Their style has also been getting worse and worse. Opus 4.5 was good in its day, but I would argue open models have sufficiently surpassed it in all fields that matter to me.

The Really Bad #

All Ernies #

"Who is Ernie?" Don't ask questions you don't want to know the answer to.

All Lings #

They aren't awful base models, but dear god, they're simultaneously so unstable and so bland, and they have nothing going for them vs literally everything else.

#llm #ml #api
