Microsoft’s problem isn’t how often it updates Windows—it’s how it develops it

By Peter Bright

Windows 10 during a product launch event in Tokyo in July 2015.

It's fair to say that the Windows 10 October 2018 Update has not been Microsoft's most successful update. Reports of data loss quickly emerged, forcing Microsoft to suspend distribution of the update. It has since been fixed and is currently undergoing renewed testing pending a re-release.

This isn't the first Windows feature update that's had problems—we've seen things like significant hardware incompatibilities in previous updates—but it's certainly the worst. While most of us know the theory of having backups, the reality is that lots of data, especially on home PCs, has no real backup, and deleting that data is thus disastrous.

Windows as a service

Microsoft's ambition with Windows 10 was to radically shake up how it develops Windows. The company wanted to better respond to customer and market needs, and to put improved new features into customers' hands sooner. Core to this was the notion that Windows 10 is the "last" version of Windows: all new development work will be an update to Windows 10, delivered through feature updates several times a year. This new development model was branded "Windows as a Service." And after some initial fumbling, Microsoft settled on a cadence of two feature updates a year: one in April, one in October.

This effort has not been without its successes. Microsoft has used the new model to deliver useful new features without forcing users to wait three years for a new major version upgrade. For example, there's a clever feature to run Edge seamlessly in a virtual machine to provide greater protection from malicious websites. The Windows Subsystem for Linux (WSL), which equips Windows systems to run Linux software natively, has proven a boon for developers and administrators. The benefits for pure consumers may be a little harder to discern—though VR features compatible with SteamVR, improved game performance, and a dark theme, have all been nice additions. While the overall improvements are smaller, the current Windows 10 is certainly better than the one released three years ago.

It's hard to imagine that WSL could ever have become a useful tool in the days that Windows was only updated every three years.

This is a good thing, and I'd even argue that some parts of it could not have been done (or at least, could not have been done as successfully) without Windows as a Service. WSL's development, for example, has been guided by user feedback, with WSL users telling Microsoft of incompatibilities they've found and helping the company prioritize the development of new WSL features. I don't believe WSL could have received the traction it has without the steady progress of updates every six months—nobody would want to wait three years just to get a minor fix so that the package they care about runs properly. Regular updates reward people for reporting bugs, because they can actually see those bugs resolved in a timely manner.

The problem with Windows as a Service is quality. Previous issues with the feature and security updates have already shaken confidence in Microsoft's updating policy for Windows 10. While data is notably lacking, there is at the very least a popular perception that the quality of the monthly security updates has taken a dive with Windows 10 and that installation of the twice-annual feature updates as soon as they're available is madness. These complaints are long-standing, too. The unreliable updates have been a cause for concern since shortly after Windows 10's release.

The latest problem has brought this to a head, with commentators saying that two feature updates a year is too many and Redmond should cut back to one, and that Microsoft needs to stop developing new features and just fix bugs. Some worry that the company is dangerously close to a serious loss of trust over updates, and for some Windows users, that trust may already have been broken.

These are not the first calls for Microsoft to slow down with its feature updates—there have been concerns that there's too much churn for both IT and consumer audiences alike to handle—but with the obvious problems of the latest update, the calls take on a new urgency.

It's not how often, it's how

But saying Microsoft should only produce one update a year instead of two, or criticising the very idea of Windows as a Service, is missing the point. The problem here isn't the release frequency. It's Microsoft's development process.

Why is it the process, and not the timeframe, that's the issue? On the release schedule front, we can look at what other software does to get a feel for what's possible.

Two updates a year is more frequent than macOS, iOS, and Android, so in a sense Microsoft is attempting to overachieve. But it's not unprecedented: Ubuntu sees two releases a year, and Google's Chrome OS, like its Chrome browser, receives updates every six weeks. Beyond the operating system space, Microsoft's Office Insider program has a monthly channel that delivers new features to Office users each month, and it manages to do so without generating too many complaints while delivering a steady trickle of new features and fixes. The Visual Studio team similarly produces frequent updates for its development environment and online services. Clearly, there are teams within Microsoft that have adapted well to a world in which their applications are regularly updated.

Move beyond the world of on-premises software and into online and cloud services, and we see, both within Microsoft and beyond, increasing adoption of continuous delivery. Each update made to a system is automatically deployed onto production servers once it's passed sufficient automated testing.

It's true that none of these projects is as complicated as Windows. Ubuntu may contain a more diverse array of packages, but it benefits from many of these packages being developed as independent units anyway. Windows does, of course, contain many individual components, and Microsoft has done a lot of work to disentangle these. But the fact remains that its scale is unusually large, and unusually integrated. Windows is also, at least in places, extremely old.

These factors certainly make developing Windows challenging—but so challenging as to make two releases a year impractical? That's not clear at all. It just needs the right development process.

Windows 10 circa its 2015 release (Where oh where are all my icons, Start menu?)

Microsoft hasn't exactly revealed the development process being used with Windows 10, but the observable characteristics of the process (the way new features are shipped to insiders, the kinds of bugs that insiders have to put up with) combined with information gleaned from sources within the company betray a process that's flawed—and has a number of key similarities to the process the company used back when there were three years between Windows releases. The timescales are very condensed, but much of the approach to development is unchanged.

In the olden days, when product release cycles were two to three years, Microsoft arrived at a process divided into several phases: design and planning, feature development, integration, stabilization. Perhaps 4-6 months of planning and design, 6-8 weeks of intensive coding, and then 4 months of integration (each feature would typically be developed in its own branch, so they all have to be consolidated and merged together) and stabilization (which is to say: testing and bug fixing). Over the course of a product's development cycle, this cycle of phases would be repeated two or perhaps three times; for Windows, there would be three iterations, the first being a prototype, the next two being real. The lengths of the phases might change, but the basic structure was widely used within the company.

A few things are apparent from this kind of process. Perhaps most striking of all is that there's surprisingly little time spent actually developing new code: for a Windows release, two stints of 6-8 weeks over an entire three year period. There's a long time between the planning and design stage and actually having a working product. This factor, more than anything else, is why the process would not be described as "agile;" by the time you have something that you can put in front of customers to use, new features have been baked into the final product, making them hard to change in response to feedback.
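
To put rough numbers on that, here is a back-of-the-envelope calculation in Python. It is only a sketch: the function name and the 52-weeks-per-year simplification are illustrative assumptions, and it counts just the two 6-8 week coding stints mentioned above against a three-year cycle.

```python
WEEKS_PER_YEAR = 52  # simplification for a rough estimate


def coding_share(stints: int, weeks_per_stint: float, cycle_years: float) -> float:
    """Fraction of a release cycle spent in intensive coding."""
    return (stints * weeks_per_stint) / (cycle_years * WEEKS_PER_YEAR)


# Two coding stints of 6-8 weeks each over a three-year release cycle.
low = coding_share(stints=2, weeks_per_stint=6, cycle_years=3)
high = coding_share(stints=2, weeks_per_stint=8, cycle_years=3)
print(f"{low:.1%} to {high:.1%} of the cycle")
```

Under those assumptions, intensive coding accounts for somewhere between roughly 8 and 10 percent of the whole cycle.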

The decoupling of development and bugfixing is also an issue: during the development and integration phases the reliability and stability of the software will take a giant nosedive. The features being integrated are fundamentally untested (because testing comes later), and have never been used with each other (because they were all developed separately in their own branches prior to the integration phase). The mess of software is then beaten into an acceptable shape through the testing, bug reporting, and bug fixing of the lengthy stabilization phase. In this process, the product's reliability should start improving once more.

Satya Nadella introduces the world to Windows 10 back in 2015.

The new world isn't that new

In the new world, we see the company take perhaps seven or eight months for the full cycle. Though there are only six months between releases, the start of the next cycle happens before the previous cycle is complete—this split becomes explicit to insiders each time the "Skip Ahead" group is reopened.

Each update typically starts with a fairly quiet period with few visible changes, followed by several months in which big changes—and tons of bugs—are introduced. A month or so before the update nears release we see a drastic slowdown in the number of changes made and a strong focus on bug fixes rather than new features.

As Microsoft employees have described it, the final few months of development are split into a "tell" phase, then a one-month "ask" phase. In the "tell" phase, the Windows leadership are told of the changes being made, with a default policy of accepting those changes. In the "ask" phase, the default switches to rejecting; only truly essential modifications are permitted at this stage, typically as few as a couple of changes a day.

So, for example, the first build of the October update (codenamed RS5) was released to insiders on February 14; the stable build of the April update (RS4) followed two months later, on April 16. RS5 didn't receive any significant new features until March 7. Lots of features were added over May, June, and July, before tailing off in August and September, when only small modifications were made. A couple of small features were even removed in August, as they wouldn't be ready in time for the October release.

There are certainly some differences in the process here. For example, we see new capabilities appear in the preview builds over many months. This indicates that integration of new features seems to take place much sooner—as the features are developed, rather than all in one big burst of merging at the end.

Quality takes a dive

But there are also key similarities. The big, fundamental one is that known buggy code is integrated, and the testing and stabilization phase is used to sort out any problems. This point is even acknowledged explicitly: in announcing a new preview build Microsoft warns that "As is normal with builds early in the development cycle, builds may contain bugs that might be painful for some. If this makes you uncomfortable, you may want to consider switching to the Slow ring. Slow ring builds will continue to be higher quality."

We can see an example of this in practice in RS5. Last year's October update introduced a new feature for OneDrive: placeholders to represent files that were stored in OneDrive, but not downloaded locally. Whenever an application tries to open the files, OneDrive will transparently fetch the file from cloud storage and save it locally, without the application ever knowing that the file was initially not available locally. RS5 builds on this to optionally purge cloud-replicated files from local storage if disk space is low.

This is a really clever, useful feature, and makes using cloud storage seamless. It's also all new code; there's a kernel driver that provides the glue between the cloud syncing code (used to download files and upload changes) and the placeholders on the file system. There's an API, too (it looks like third parties can plumb their code into the same system to offer their own sync services).

Preview releases of Windows have a green screen of death instead of a blue one, so that they can be easily distinguished.

A reasonable expectation is that Microsoft would have a set of tests around this new code to verify that it works correctly: create a file, check it syncs properly, delete the local copy leaving a placeholder, open the file to have the real file retrieved, delete the file entirely, and so on and so forth. There's a handful of basic operations around manipulating files and directories, and in any kind of respectable agile development process, there will be tests to verify all the operations work as expected, and make sure that the API does what it's supposed to do.
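
As a rough illustration of what such tests could look like, here is a minimal Python sketch. Everything in it is hypothetical: `FakeSyncFolder` is an invented in-memory stand-in, not the Windows or OneDrive API, and it models only the states described above (a local file, a placeholder, a deleted file). Real tests would exercise the actual driver and sync service.

```python
import unittest


class FakeSyncFolder:
    """Hypothetical in-memory stand-in for a cloud-synced folder.

    This is NOT the Windows or OneDrive API; it only models the
    states discussed in the text: local file, placeholder, deleted.
    """

    def __init__(self):
        self.cloud = {}  # name -> contents held by the sync service
        self.local = {}  # name -> contents, or None for a placeholder

    def create(self, name, data):
        self.local[name] = data
        self.cloud[name] = data  # pretend the sync completes at once

    def free_up_space(self, name):
        self.local[name] = None  # keep a placeholder, drop the bytes

    def is_placeholder(self, name):
        return name in self.local and self.local[name] is None

    def open(self, name):
        if self.is_placeholder(name):
            self.local[name] = self.cloud[name]  # transparent fetch
        return self.local[name]

    def delete(self, name):
        del self.local[name]
        del self.cloud[name]


class PlaceholderLifecycleTest(unittest.TestCase):
    def setUp(self):
        self.folder = FakeSyncFolder()
        self.folder.create("notes.txt", b"hello")

    def test_created_file_is_synced(self):
        self.assertEqual(self.folder.cloud["notes.txt"], b"hello")

    def test_freeing_space_leaves_placeholder(self):
        self.folder.free_up_space("notes.txt")
        self.assertTrue(self.folder.is_placeholder("notes.txt"))

    def test_open_fetches_placeholder_contents(self):
        self.folder.free_up_space("notes.txt")
        self.assertEqual(self.folder.open("notes.txt"), b"hello")
        self.assertFalse(self.folder.is_placeholder("notes.txt"))

    def test_delete_removes_file_everywhere(self):
        self.folder.delete("notes.txt")
        self.assertNotIn("notes.txt", self.folder.local)
        self.assertNotIn("notes.txt", self.folder.cloud)


def run_tests():
    """Run the lifecycle tests and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(PlaceholderLifecycleTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

The point is not the fake itself but the shape of the suite: one small test per basic operation, each cheap enough to run on every change.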

Moreover, one would expect any code change that broke those tests to be rejected and not integrated. The code should be fixed, and it should pass its tests, before it's ever merged into the main Windows code—much less shipped to beta testers.
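
That gating rule can be sketched in a few lines of Python. This is a hypothetical illustration, not Microsoft's or anyone else's actual tooling: `Change`, `Outcome`, and `may_merge` are invented names, and the policy shown (a change with no tests, or with any failing test, is rejected) is simply the rule argued for here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"


@dataclass
class Change:
    """Hypothetical record of a proposed code change and its test results."""
    description: str
    test_results: dict[str, Outcome] = field(default_factory=dict)


def may_merge(change: Change) -> tuple[bool, str]:
    """Return (allowed, reason) under a strict tests-must-pass policy."""
    if not change.test_results:
        return False, "rejected: change has no tests"
    failed = sorted(
        name for name, outcome in change.test_results.items()
        if outcome is Outcome.FAILED
    )
    if failed:
        return False, "rejected: failing tests: " + ", ".join(failed)
    return True, "accepted"


# Example: a change whose directory-deletion test fails is blocked.
buggy = Change(
    "handle deletion of synced directories",
    {"create_file": Outcome.PASSED, "delete_synced_directory": Outcome.FAILED},
)
print(may_merge(buggy))
```

Treating "no tests" as a rejection, rather than a pass, is a deliberate choice in this sketch: it closes the loophole of merging untested code.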

And yet, this is not what happened: many of the preview builds had a bug wherein deleting a directory that was synced to OneDrive crashed the machine. Not only was this bug integrated into the Windows code, it was allowed to ship to end users.

Test the software before you ship it, not after

This tells us some fundamental things about how Windows is being developed. Either tests do not exist at all for this code (and I've been told that yes, it's permitted to integrate code without tests, though I would hope this isn't the norm), or test failures are being regarded as acceptable, non-blocking issues, and developers are being allowed to integrate code that they know doesn't work properly. From outside we can't tell exactly which situation is in play—it could even be a mix of both—but neither is good.

For older parts of Windows that may be a little more excusable—they were developed in an era before the value of automated testing was really recognized, and they may very well not have any real test infrastructure. But the OneDrive placeholders aren't an old part of Windows; they're leveraging a brand new set of capabilities. We might excuse old code being under-tested, but there's no good reason at all that new code shouldn't have a solid set of tests to verify basic functionality. And known defective code certainly shouldn't be merged until it's fixed, let alone shipped to testers.

As a result, the development of Windows 10 is still following a trajectory similar to the one it did before Windows 10. Features get merged and stability and reliability drop. The testing and stabilization phase is expected to shore things up and beat the codebase back into an acceptable shape.

The inadequate automated testing and/or the disregard for test failures means in turn that the Windows developers can't be confident that modifications and fixes do not have ripple effects. This is what gives rise to the "ask" phase of development: the number of changes that are accepted as the update is finalized has to be very low, because Microsoft doesn't have confidence that the scope and impact of each change is isolated. That confidence only comes with massive, disciplined testing infrastructure: you know that a change is safe because all your tests run successfully. Whatever testing the company has in place for Windows, it isn't enough to earn this confidence.

But in other regards, Microsoft acts as if it does have this confidence. The company does have plenty of tests; I've been told that a full test cycle for Windows takes many weeks. That full test cycle does get used—just not on the builds that actually ship. The October 2018 update is a case in point: the code was built on September 15. It went public on October 2. Whatever build of RS5 underwent the full testing cycle, it's not the one that we're actually using, because the full testing cycle takes too long.
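
The arithmetic behind that claim is easy to check. In the Python sketch below, the two dates come from the paragraph above; the length of the full test cycle is an assumption, since "many weeks" is not quantified, and four weeks is used purely as a placeholder.

```python
from datetime import date, timedelta

build_date = date(2018, 9, 15)    # build date stated in the text
release_date = date(2018, 10, 2)  # public release date stated in the text

# ASSUMPTION: "many weeks" is not quantified; four weeks is a
# placeholder chosen only to make the comparison concrete.
assumed_full_test_cycle = timedelta(weeks=4)

window = release_date - build_date
print(f"Days between build and release: {window.days}")
print(f"Full cycle fits in the window: {assumed_full_test_cycle <= window}")
```

The gap is 17 days, so any full cycle longer than about two and a half weeks could not have finished on the shipped build.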

This is a contradictory posture. It might be OK to run the full test cycle on a slightly old build if subsequent code changes were made with high confidence that they didn't break anything. But if Microsoft had high confidence that those changes wouldn't break anything, it wouldn't have to throttle them so severely in the "ask" phase.

Windows 10 can be a well-oiled machine, really.

The contrast with real agile projects is significant. Take, for example, the process Google uses for its ad server. This is a critical piece of infrastructure for the company, but new developers at the company describe making code changes to fix a minor bug and seeing those changes go into production within a day. When the changed code was committed to the source repository, it was automatically rebuilt and subjected to the battery of tests. The developer who owned the area of code then reviewed the change and accepted it, and it was merged into the main codebase, retested, and deployed to production.

Of course, this is a somewhat unfair comparison; cloud services make it much easier to roll back a code change if a bug is discovered. A Windows change that makes systems blue screen on boot is much harder to undo and recover from. But still, the ad server is a critical Google service (it's how the company makes money, after all) and a bad change could easily cost millions of dollars. The testing and automation that Google has built into its development process means that a developer who's only just started at the company can work on this service and have their changes deployed in production within hours, and do so with confidence.

The development mindset is fundamentally different. A new feature might be unstable during its development, but before that feature can be merged into the production code, it has to meet a very high quality bar. Rather than Microsoft's approach of "merge the bugs now, we'll fix them later," the approach is to ensure that code is as bug-free as possible before it gets merged.

Use Chrome's Dev channel and usually the only clue that you're not using the release channel is that your icon looks like this.

While cloud applications do afford a certain amount of flexibility, this approach can be used for desktop software. Google has comparable workflows around Chrome, for example. The Chrome development and beta branches do have occasional bugs, but in general their code is close to "release quality" at all times. Indeed, the Chrome team's working principle is that even the very latest version of the code should be release quality. You can use Chrome's development branch as your regular browser and, except for a different icon, you'd likely never know that you weren't using the "stable" branch. The extensive automated testing and reviewing process enables this: throughout its development process, Chrome's code is high quality, without the quality dip and subsequent repair work that we see with Windows.

Google has also invested in infrastructure to enable this. It has a distributed build system that builds Chrome on a thousand cores, so a full build can be done in just a few minutes. There's disciplined use of branching to try to make merges easy and predictable. Google has an extensive array of both functional tests and performance tests, to detect bugs and regressions as soon as possible. None of this comes for free, but it's critical to enabling Google to ship Chrome on a steady, regular cadence.

Windows' development process has never been great

Microsoft's new development process has, proportionately, a greater amount of time spent writing new features, and a reduced amount of time stabilizing and fixing those features. That would be fine if the quality of the features were higher to start with, with the testing infrastructure to support it and higher standards before new code was integrated. But the experience with Windows 10 thus far is that Microsoft hasn't developed the processes and systems needed to sustain this new approach.

The problem is, cutting the number of releases to one a year doesn't really fix the problem either. I often get the feeling that people look back on the old days of Windows development with rose-tinted glasses. But if we cast our minds back to the days of Windows 7 and before, we actually see very similar problems to what we have today. The regular advice was that you shouldn't upgrade to a new version of Windows until Service Pack 1 was out. Why not? Because the initial release would be unacceptably buggy and unstable, and it would take until Service Pack 1 for most of these problems to be worked out.

The difference is not that the new approach to Windows development is much worse than it used to be, or that the old process delivered better results; it's that we're seeing that "wait for Service Pack 1" moment twice a year. With each new update there's a point at which Microsoft deems the code to be good enough for corporate users, perhaps three or four months after the initial release of a feature update, and that's our "new" Service Pack 1 moment.

As such we're getting the worst of all worlds: from the old Windows development approach, we're seeing releases that just aren't good enough on day one. From the new Windows development approach, we're seeing those releases twice a year, rather than once every three years. That pre-Service Pack instability is with us for much of the year.

The fundamental flaw is that destabilizing your codebase by integrating inadequately tested features, and then hoping to fix up all the problems later, is not a good process. It wasn't good when Windows was released every three years, and it's not good when it's released every six months.

This isn't the job for Insiders

A secondary concern is the nature of the testing being performed. Microsoft used to have a huge number of dedicated testers, with each feature having both developer and testing resources assigned to it. Many of these testers were laid off or reassigned in 2014, with the idea that more of the testing burden be shifted to the developers creating the features in the first place. The Windows Insider program also provides a large amount of informal testing—with many millions of members, it's much bigger than any of the Windows beta programs ever were.

Ninja Cat has been an occasional feature of the Insider program.

It's not certain that the old approach would have necessarily caught the data loss bug; perhaps the dedicated testers wouldn't have tested the particular scenario needed to cause data to be deleted. But it is clear that Microsoft is struggling to handle the bug reports made by the non-expert testers in the Insider program. The data loss was reported as much as three months before the update shipped. Many reports of the bug appear to be low quality, lacking in necessary detail or using improper terminology, but if the company didn't find the problem in three months, it's not at all obvious that an even longer development period would have made a difference. A longer development period would just mean that the bug was ignored for six months rather than three.

Microsoft has promised to change the Insider feedback process to allow bug reporters to indicate the severity of their issue and hopefully call more attention to this kind of problem. This might help, as long as insiders use severity indicators appropriately, but it seems insufficient to tackle the central problem of too many bug reports of too low a quality.

This relates to the code quality issue. The real strength of the Insider program is the diversity in hardware and software that it can expose Windows to, shaking out compatibility bugs and driver issues and so on. Insiders shouldn't, however, be the primary source for the more bread and butter "does this feature actually work" testing. Often, however, it feels as if that's how Microsoft is using the program.

Moreover, the fact that the code quality does take a dip during development means that the preview builds aren't usually suitable for daily driver PCs. They're just not reliable enough. That in turn undermines the value of the Insider testing: insiders aren't, in fact, exposing the new builds to the full range of hardware and software that's out there, because they're not using the builds on their primary machine and with the full range of hardware and software they own and use. They're using lesser-used secondary machines and virtual machines.

You've gotta invest in your tooling

Developing a Chrome-like testing infrastructure for something as complicated and sprawling as Windows would be a huge undertaking. While some parts of Windows can likely be extensively tested as isolated, standalone components, many parts can only be usefully tested when treated as integrated parts of a complete system. Some of them, such as the OneDrive file syncing feature, even depend on external network services to operate. It's not a trivial exercise at all.

Adopting the principle that the Windows code should always be shipping quality—not "after a few months of fixing" but "right now, at any moment"—would be an enormous change. But it's a necessary one. Microsoft needs to be in a position where each new update is production quality from day one; a world where updating to the latest and greatest release is a no-brainer, a choice that can be confidently taken. Feature updates should be non-events, barely noticed by users. Cutting back to one release a year, or one release every three years, doesn't do that, and it never did. It's the process itself that needs to change: not the timescale.