AI has a clear vendetta against PyRight
A few months ago I swapped the main repository at my job from MyPy to PyRight. I did this mostly because MyPy was insanely slow, but then I also just liked PyRight a lot more. It worked better and found more useful errors.
However, when I made this switch, I basically ignored most of the test files to avoid having to fix a billion errors at once. I was planning to slowly go back through and fix them as I edited them. That mostly worked, but I have now been saddled with 61 files that are still being ignored.
When I'm actually working I generally don't have time to vibe code in any real way. The models tend to process things slower than I can do them myself, and I can't code while they run because they panic if the files change. But it's a Saturday and I was knitting, so I thought this was a great day to ask Kilo Code to just fix those tests for me.
I'm going to be really honest here. I didn't start it as a competition. I was actually considering if I should attempt to build my own MCP for PyRight so I wanted to check if Kilo Code could just do it for me. But then, a lot of things happened and it became a competition.
We started with 559 errors. This doesn't mean 559 edit locations, though, because in my experience PyRight is a little bit overdramatic: the number it reports is usually higher than the number of places that actually need an edit.
Now that I know it went really poorly, it would be nice to see why.
I ran PyRight and figured out how many of every error there actually was:
reportUnknownVariableType: 197 occurrences
reportUnusedCallResult: 105 occurrences
reportUnknownMemberType: 66 occurrences
reportUnknownArgumentType: 48 occurrences
reportMissingTypeArgument: 36 occurrences
reportUnnecessaryTypeIgnoreComment: 30 occurrences
reportPrivateUsage: 16 occurrences
reportUnknownParameterType: 15 occurrences
reportAttributeAccessIssue: 13 occurrences
reportAssignmentType: 9 occurrences
reportCallInDefaultInitializer: 7 occurrences
reportSelfClsParameterName: 5 occurrences
reportMissingParameterType: 4 occurrences
reportArgumentType: 3 occurrences
reportIncompatibleMethodOverride: 2 occurrences
reportOperatorIssue: 2 occurrences
reportMissingTypeStubs: 1 occurrences
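A tally like the one above can be produced from Pyright's JSON output. This is a minimal sketch, not the script I actually used: it assumes the report was written with `pyright --outputjson`, and the `count_rules` helper and the sample data are made up for illustration.

```python
import json
from collections import Counter
from typing import Any


def count_rules(report: dict[str, Any]) -> Counter[str]:
    """Count diagnostics per rule in a parsed `pyright --outputjson` report."""
    counts: Counter[str] = Counter()
    for diagnostic in report.get("generalDiagnostics", []):
        # Some diagnostics (for example syntax errors) carry no rule name.
        counts[diagnostic.get("rule", "<no rule>")] += 1
    return counts


# Tiny hand-written sample in the shape of Pyright's JSON output (hypothetical files).
sample = json.loads("""
{"generalDiagnostics": [
  {"file": "tests/test_a.py", "severity": "error", "rule": "reportUnusedCallResult"},
  {"file": "tests/test_a.py", "severity": "error", "rule": "reportUnknownVariableType"},
  {"file": "tests/test_b.py", "severity": "error", "rule": "reportUnusedCallResult"}
]}
""")

# Print the most frequent rules first.
for rule, n in count_rules(sample).most_common():
    print(f"{rule}: {n} occurrences")
```

In practice you would replace `sample` with `json.load(...)` on a file produced by something like `pyright --outputjson tests > report.json`.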
A lot of these are really simple.
- reportUnusedCallResult (105): this means you called something and didn't use the return value. The solution is just to assign it, as in _ = X(). In non-test files I would want to look closer and see if the return value matters, but here I don't really care.
- reportPrivateUsage (16): people make things private that they shouldn't. It happens.
- reportSelfClsParameterName (5): this means the first parameter of a method has the wrong name, for example cls where it should be self. The fix is renaming it in the function definition.
- reportUnknownVariableType (197): this usually means you defined something as dict or list instead of dict[str, X] or list[X].
- reportCallInDefaultInitializer (7): this means you set the default of a kwarg to a function call. The solution is to change the default to None and then set the value at the top of the function.
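To make those fixes concrete, here is a minimal sketch of what each one tends to look like. All of the names (`save`, `default_tags`, `make_case`, `Builder`) are hypothetical stand-ins, not code from my repository.

```python
from typing import Any, Optional


def save(record: dict[str, Any]) -> bool:
    """Hypothetical helper whose return value callers often ignore."""
    return bool(record)


def default_tags() -> list[str]:
    """Hypothetical function that used to be called in a default argument."""
    return ["smoke"]


# reportUnusedCallResult: assign the ignored return value to `_`.
_ = save({"id": 1})

# reportUnknownVariableType: spell out the element types instead of
# leaving a bare list or dict for the checker to guess at.
rows: list[dict[str, Any]] = []


# reportCallInDefaultInitializer: default to None, then fill the value in
# at the top of the function instead of calling default_tags() in the signature.
def make_case(name: str, tags: Optional[list[str]] = None) -> dict[str, Any]:
    if tags is None:
        tags = default_tags()
    return {"name": name, "tags": tags}


class Builder:
    # reportSelfClsParameterName: an instance method's first parameter
    # should be named `self` (and a classmethod's should be `cls`).
    def build(self) -> dict[str, Any]:
        return make_case("built")
```

None of these change behavior, which is why they felt like good candidates to hand off to an agent.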
So that seems great; that seems like an agent should be able to fix a ton of that. After Kilo finished we had 427 errors, so we are down by 23%. There are 30 relatively easy fixes around unnecessary ignore statements that it didn't fix because I told it not to, so it shouldn't be judged for those.
reportUnknownVariableType: 160 occurrences
reportUnusedCallResult: 81 occurrences
reportUnknownMemberType: 56 occurrences
reportUnnecessaryTypeIgnoreComment: 30 occurrences
reportMissingTypeArgument: 17 occurrences
reportPrivateUsage: 16 occurrences
reportUnknownArgumentType: 15 occurrences
reportAttributeAccessIssue: 13 occurrences
reportAssignmentType: 9 occurrences
reportUnknownParameterType: 7 occurrences
reportCallInDefaultInitializer: 7 occurrences
reportSelfClsParameterName: 5 occurrences
reportMissingParameterType: 4 occurrences
reportIncompatibleMethodOverride: 2 occurrences
reportOperatorIssue: 2 occurrences
reportArgumentType: 2 occurrences
reportMissingTypeStubs: 1 occurrences
But more importantly, what did it actually do?
Deltas (% change is relative to the 559 starting errors, not to each rule's own count):
reportArgumentType || initial: 3 || absolute: -1 || % change: -0.18%
reportUnknownParameterType || initial: 15 || absolute: -8 || % change: -1.43%
reportUnknownMemberType || initial: 66 || absolute: -10 || % change: -1.79%
reportMissingTypeArgument || initial: 36 || absolute: -19 || % change: -3.40%
reportUnusedCallResult || initial: 105 || absolute: -24 || % change: -4.29%
reportUnknownArgumentType || initial: 48 || absolute: -33 || % change: -5.90%
reportUnknownVariableType || initial: 197 || absolute: -37 || % change: -6.62%
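For reference, the percentages in this table are each rule's change divided by the 559 total starting errors. A minimal sketch of that arithmetic, using the counts from the two runs above (the `deltas` helper is just for illustration):

```python
# Counts before and after Kilo, for the rules whose counts changed.
before = {
    "reportArgumentType": 3,
    "reportUnknownParameterType": 15,
    "reportUnknownMemberType": 66,
    "reportMissingTypeArgument": 36,
    "reportUnusedCallResult": 105,
    "reportUnknownArgumentType": 48,
    "reportUnknownVariableType": 197,
}
after = {
    "reportArgumentType": 2,
    "reportUnknownParameterType": 7,
    "reportUnknownMemberType": 56,
    "reportMissingTypeArgument": 17,
    "reportUnusedCallResult": 81,
    "reportUnknownArgumentType": 15,
    "reportUnknownVariableType": 160,
}
TOTAL_BEFORE = 559  # all errors across every rule at the start


def deltas(
    before: dict[str, int], after: dict[str, int], total: int
) -> list[tuple[str, int, int, float]]:
    """Return (rule, initial, change, percent of total) rows, smallest reduction first."""
    rows = []
    for rule, initial in before.items():
        change = after.get(rule, 0) - initial
        rows.append((rule, initial, change, 100 * change / total))
    return sorted(rows, key=lambda row: row[2], reverse=True)


for rule, initial, change, pct in deltas(before, after, TOTAL_BEFORE):
    print(f"{rule} || initial: {initial} || absolute: {change} || % change: {pct:.2f}%")
```

The changes sum to 132 fixed errors, which is the drop from 559 to 427.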
Well, that's weird. Maybe it's a skill issue? What did we prompt it with?
The initial prompt was very simple:
<task>
Can you please fix the pyright errors in test_storage_client
</task>
This was on purpose because I just wanted to check if it could do it at all. It was using google/gemini-2.5-pro-preview; I didn't realize this until later, but it's fine.
It does some really normal stuff:
- it reads the file
- it reads the pyright config
- it decides to just hypothesize about what might be wrong with the file, just a little rubber ducking
- it runs pyright to check what is actually wrong with the file; its hypothesis was fully wrong.
- it has a fun internal discussion about how it was actually right: it had thought pyright was angry that the mocks were mistyped, and that's kind of like being angry that the results of the mock calls weren't being saved.
- it proposes adding _ = before the mocks. This is the correct answer and I tell it to go for it.
- it runs pyright again to check.
This takes it ~3 minutes and costs me $0.10. It fully resolves this file.
I give it what is honestly not a great prompt, where I say:
Okay that went great can you fix the other pyright errors so we can have pyright running on all the test files?
Do not change the pyright config or add any ignores if you can't reasonably fix them skip them.
I added the last bit because the agent's favorite thing, across all the agents I have used, is to just turn off the thing that's failing.
- This time it doesn't read and hypothesize first; it just runs PyRight on the directory, so that's a win.
- It decides to fix the easiest errors first, which is a win; that's what I would do.
- In the first file it edits, it mistypes something and I have to tell it the correct type of that object.
- It reminds itself that it should read multiple files at the same time instead of processing one by one. It then continues to process them one by one.
- Instead of trusting that I gave it the right type when I corrected it, it goes through about 6 steps to confirm it.
- It moves to 3 different files. I repeatedly try to help it fix one of the files and it is unable to do it, so I ask it to skip that file. This is because at least one of the fixes is not in this file; it's a missing annotation somewhere else, and it doesn't seem to be able to figure that out.
- It tries to fix another file and basically gives up because it thinks the typing is impossible to fix since the file uses factoryboy. It's wrong about this: the fix is going from Factory() -> Factory.create(). I tell it to ignore any future issues with factories because it's getting so turned around.
- It tries to ignore a bunch of errors. I tell it not to, because I need to manually review them to confirm they should be ignored.
- It tries to build an entire typing stub instead of setting an object to dict[str, Any]. I repeatedly ask it to not do that, then tell it to skip it.
- The task crashes and I ask it to write a prompt for me to start a new task. It's able to do this:
Continue fixing the pyright errors in the `tests` directory.
**Work Completed:**
* ...
* ...
* ...
* ...
* ...
* ...
* ...
* ...
**Remaining Files to Fix:**
* ...
* ...
* ...
* And all other files in the `tests` directory with pyright errors.
**Constraints:**
* Do not change the `pyrightconfig.json` file.
* Do not add any `pyright: ignore` comments.
* Skip all files in the `tests/factories` directory.
* If you can't reasonably fix the errors in a file, skip it.
This first task takes us 22 minutes and costs $1.38. I knit around 10 rows while this is happening.
New task
- I ask this task to also not make new type stubs (this was my error; it should have been in the prompt).
- I am not in the best mood at this point: "What you are doing is literally insane you have two options type description as dict[str, Any] or skip it"
- It does manage to actually fix this one.
- I tell it again that there is nothing wrong with the factories and ask it to stop trying to cast them. This is the same thing as before: it just doesn't understand how they work.
- I tell it to stop adding type ignores. I guess technically I only told it not to use `pyright: ignore`, but do I have to specify you also shouldn't add `type: ignore`?
- About 4 instances of "stop modifying the type ignores and adding new ones" later it finally stops trying to add type ignores.
- I point out that it fundamentally changed how a test works and tell it what to do to fix it. Like before, it does a ton of work to decide if it agrees with me.
- I tell it that it can't make up functions to get around using private functions.
- It tells me it's done and it did a great job.
At this point in total this has taken 45 minutes, cost me $3.13, and I have knit around 14 rows.
To be fair, $3 to fix 23% of my errors is a really good price. But then I wondered how long it would have taken me to fix the errors.
Let's remember where we are:
Deltas:
reportUnknownVariableType || initial: 197 || change: -37 || % change: -6.62%
reportUnknownArgumentType || initial: 48 || change: -33 || % change: -5.90%
reportUnusedCallResult || initial: 105 || change: -24 || % change: -4.29%
reportMissingTypeArgument || initial: 36 || change: -19 || % change: -3.40%
reportUnknownMemberType || initial: 66 || change: -10 || % change: -1.79%
reportUnknownParameterType || initial: 15 || change: -8 || % change: -1.43%
reportArgumentType || initial: 3 || change: -1 || % change: -0.18%
reportAssignmentType || initial: 9 || change: 0 || % change: 0.00%
reportIncompatibleMethodO- || initial: 2 || change: 0 || % change: 0.00%
reportPrivateUsage || initial: 16 || change: 0 || % change: 0.00%
reportAttributeAccessIssue || initial: 13 || change: 0 || % change: 0.00%
reportOperatorIssue || initial: 2 || change: 0 || % change: 0.00%
reportUnnecessaryTypeIgno- || initial: 30 || change: 0 || % change: 0.00%
reportMissingParameterType || initial: 4 || change: 0 || % change: 0.00%
reportMissingTypeStubs || initial: 1 || change: 0 || % change: 0.00%
reportSelfClsParameterName || initial: 5 || change: 0 || % change: 0.00%
reportCallInDefaultInitia- || initial: 7 || change: 0 || % change: 0.00%
Advantages I have over Kilo:
- I am allowed to use type ignore when reasonable
- I know the solution to the factory issue
- I know how to use find and replace
- I know more about PyRight in practice than it seems to.
I removed all of the unused type ignores up front. I didn't remember to time this, but since both Kilo and I knew how to do this, we can assume it's even.
This put me at 397 errors, 0 warnings, 0 informations
- In 8 minutes I had it down from 397 -> 326.
- I then spent 7 minutes not fixing tests but looking at how the factory_boy typing library worked and fixing the typings for Faker. I was able to find a solution for this by casting it in a single location and sharing it.
- In the 10 minutes between 2:17 and 2:27 I managed to get it down to 148 errors.
This puts us at more than 60% fewer errors than we had at the end of Kilo. And we are fixing more types of errors:
Deltas from kilo:
reportUnknownVariableType || start 197 || kilo fixed: 37 || I fixed 145
reportUnknownArgumentType || start 48 || kilo fixed: 33 || I fixed 9
reportUnusedCallResult || start 105 || kilo fixed: 24 || I fixed 32
reportMissingTypeArgument || start 36 || kilo fixed: 19 || I fixed 12
reportUnknownMemberType || start 66 || kilo fixed: 10 || I fixed 20
reportUnknownParameterType || start 15 || kilo fixed: 8 || I fixed 2
reportArgumentType || start 3 || kilo fixed: 1 || I fixed 0
reportAssignmentType || start 9 || kilo fixed: 0 || I fixed 6
reportIncompatibleMethodO- || start 2 || kilo fixed: 0 || I fixed 0
reportPrivateUsage || start 16 || kilo fixed: 0 || I fixed 6
reportAttributeAccessIssue || start 13 || kilo fixed: 0 || I fixed 3
reportOperatorIssue || start 2 || kilo fixed: 0 || I fixed 0
reportUnnecessaryTypeIgno- || start 30 || kilo fixed: 0 || I fixed 0
reportMissingParameterType || start 4 || kilo fixed: 0 || I fixed 0
reportMissingTypeStubs || start 1 || kilo fixed: 0 || I fixed 0
reportSelfClsParameterName || start 5 || kilo fixed: 0 || I fixed 0
reportCallInDefaultInitia- || start 7 || kilo fixed: 0 || I fixed 0
This was in 27 minutes, and included me researching and fixing an error that required adding an additional library.
I had all the errors fixed by 3:19. So in 1:19 I had fixed 397 errors. And that included taking a break to get some cookies.
What I think is probably worth noting here is that, pricewise, it was much cheaper for Kilo to fix it. Once I was fixing it I couldn't knit anymore, and I'm expensive.
On the other hand, during my work day it really seems better for me to just be coding and not asking Kilo for this kind of thing, or you are paying me to watch Kilo be slightly worse than I am.
On the third hand, when I was fixing the typing I found 3 tests that weren't running before because someone forgot to prepend the function names with test. Kilo didn't see that.
Some of this is certainly a skill disparity. I am very good at fixing PyRight errors, especially in this codebase. I am, as stated and shown earlier, not always great at prompting LLMs, because I lack a lot of patience and don't really want to spend my time trying to explain to a machine how to do a thing I could do in 6 keystrokes. But this feels like the thing that's built for the LLM: I have a thing that's failing, make it work. Apparently, though, telling it that turning the check off is not an option really messes with things.
The main takeaway I have here is that it did a great job when I told it to fix a single file that had simple errors. So maybe it can do something where I give it the really stupid files and I do the other files. At least until I run out of free credits.