Hopefully, small stat changes wouldn't matter that much, because the actions you take aren't really different if the ship has slightly more hull or a slightly different loadout.
An actual example: a neural net trained to tell apart American and Russian tanks actually learned to tell apart low- and high-quality photos, since the photos of Russian tanks were all low-quality and taken under less-than-ideal conditions. Point being, it may learn to do something, but the *why* is extremely iffy. That's why I'm saying something completely unexpected could turn out to be critical in practice. Like, for example, a ship having exactly 3000 hull, or whatever other random bit of info that's obviously non-critical to a human but a neural net might fixate on for reasons.
The ML is still finding real patterns in this case, even if they aren't the intended patterns. There is a knowable 'why' based on the training data, it's just not the 'why' you wanted. The performance can only ever be as good as the data (in my field we say 'garbage in, garbage out'). For image recognition specifically, there are going to be a lot of issues with pixel-level information/patterns/noise that just aren't an issue in video games, where the ML has direct access to the game state and there's pretty much no variance. For instance, if the tank-identification ML were acting directly on the dimensions and properties of the tanks rather than on pixel-level information representing those things, most of those failure modes simply couldn't happen. It's just hard to compare results from applications with such different data and methods (supervised vs. reinforcement learning, etc.).
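To make that contrast concrete, here's a toy sketch (hypothetical feature names, not from any real game): when the model reads the game state directly, identical states produce bit-for-bit identical inputs, so there's no pixel-level noise for it to latch onto.

```python
def extract_features(ship):
    # Features read straight from the game state: exact values, no sensor
    # noise, no lighting or compression artifacts like in the tank photos.
    return [ship["hull"], ship["shields"], ship["weapon_count"]]

# Two observations of the same ship state are identical -- zero variance,
# unlike two photos of the same tank taken under different conditions.
ship = {"hull": 3000, "shields": 500, "weapon_count": 4}
features = extract_features(ship)
```

The whole "it actually learned photo quality" failure class depends on the input carrying incidental signal; direct state access removes that channel entirely.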
I particularly think the example of hull is not a good one, because the ML/NN will naturally see tons of different hull values on tons of different ships over the course of a million training combats. That's all I was trying to say: small variations that are already covered by the training are probably not going to be an issue.
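As a toy illustration of that point (made-up numbers, nothing from an actual game): a model fit across a wide spread of hull values barely changes its output when a rebalance nudges one hull slightly, because the new value is still inside the range the training already covered.

```python
def fit_line(xs, ys):
    # Ordinary least-squares fit, standing in for a trained model.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Training combats cover hulls from 1000 to 5000; "threat" is a made-up
# target that happens to scale linearly with hull.
hulls = list(range(1000, 5001, 100))
threat = [0.001 * h for h in hulls]
a, b = fit_line(hulls, threat)

# A rebalanced hull of 3050 was never seen exactly, but it sits well inside
# the trained range, so the prediction is a smooth interpolation.
prediction = a * 3050 + b
```

Obviously a real net isn't a straight line, but the same logic applies: values inside the training distribution interpolate; it's the inputs *outside* it (new mechanics, new ship classes) that need retraining.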
The general point, that rebalancing would almost certainly require retraining and mods would be difficult to support, is completely right though. It's something you would do as a research project on a 'finalized' open-source game, not something you would try to implement during the game development process while everything is still changing.
Based on the games I've seen, though (which is quite a few of them), it wasn't that. I mean, it built roaches into Void Rays and almost lost a game that should never in a million years have been close, one that by the end looked like a desynced replay thanks to the AI's baffling decisions. Its whole zerg "strategy" - at least, what I saw of it - was a really, really, really well executed roach timing.
I got the impression from reading the AlphaStar paper that the space of possible inputs in StarCraft was just so large that reinforcement learning couldn't explore it the way it can in chess or Go. They had to resort to starting it out with supervised learning (basically copying some human gameplay), and only then was it allowed to try and improve from there via actual reinforcement learning. I also noticed that they developed some specialized networks that only did particular strategies, and then tried to make generalized networks that could beat all of those specialized ones, so maybe some of the limited strategies you observed were related to that.