Over the last few weeks, and the last few days in particular, we've experienced queues on our Xcode stacks that have led to delays in builds kicking off in certain time zones. This is what's been going on and what steps we've taken to resolve the issue.
Shiny new hardware
A week ago, this blog post was meant to be a lot cheerier, as we had some good news for everyone using Bitrise to build iOS apps. We've upgraded the hardware used for iOS builds across the board, which should lead to improved performance for all builds that run on our Mac infrastructure. These are the actual changes we made:
Hobby Plan, Dev Plan, Org Standard, and Public Projects
Our standard machines have changed from machines with an 8-core Intel Xeon X5570 2.93GHz CPU and 48GB RAM to Mac Pros with a 12-core Intel Xeon E5-2697 v2 2.70GHz CPU and 64GB RAM.
We'll be running 6 instead of 4 virtual machines on this hardware, so even though the number of cores per build will stay the same, they will run on faster CPU cores. This should lead to a moderate increase in speed, but mostly enables us to do other cool stuff in the near future.
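As a quick sanity check of the numbers above, the per-VM share of cores can be worked out directly. This is only a back-of-the-envelope sketch: it assumes cores are split evenly across the VMs on a host, and the function name is made up for illustration rather than taken from any Bitrise tooling.

```python
def cores_per_vm(host_cores: int, vms_per_host: int) -> float:
    """Evenly divide a host's CPU cores across its virtual machines.

    Assumption for illustration: every VM gets an equal share of the host.
    """
    return host_cores / vms_per_host


# Old standard hosts: 8 cores shared by 4 VMs.
old = cores_per_vm(host_cores=8, vms_per_host=4)
# New standard hosts: 12 cores shared by 6 VMs.
new = cores_per_vm(host_cores=12, vms_per_host=6)

print(f"old: {old} cores per VM, new: {new} cores per VM")
```

Both come out to 2 cores per VM, which is why the gain on these plans comes from the newer CPU generation rather than from a larger core count per build.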
Org Elite users
Our elite machines (those used in the Org Elite plan) have changed from 12-core Mac Pros to machines with 6-core Intel Xeon E5-1650 v2 3.5GHz CPUs and 32GB of RAM.
At the same time, we've changed the way we configure those machines by halving the number of VMs we run on each of them. That means that every VM has the same number of cores and the same amount of RAM at its disposal, but with a 3.5GHz instead of a 2.7GHz CPU and some other new bells and whistles. If you're on an Org Elite plan, these upgrades should lead to a definite reduction in build times.
Virtualization appliance update
Together with the hardware changes, we updated the virtualization appliance that manages the previously mentioned VMs. Even though the impact of this change isn't immediate for you as a user, it will allow us to introduce some fancy new functionality in the very near future. We can't go into a lot of detail just yet, but among other things, it relates to how we handle concurrencies and the configuration of VMs used for your plans or builds.
Occasional queues and delayed Xcode betas
So that should've been the good news. Now onto the unintended consequences:
When you run an iOS build, it runs on physical Mac hardware. You probably realize this because ☝️, but it's important to understand that we're not plugging into Google Cloud or AWS infrastructure that will magically scale as we grow: machines are physically added to our Mac infrastructure pretty much continuously to keep up with the rate at which new developers are coming onto Bitrise.
When we started the upgrade project, we were aware that we wouldn't be able to add those machines while it was under way. Taking into account our expected growth for the upgrade period, we added some overcapacity. We were also aware that during the upgrade, we wouldn't be able to add new Xcode beta stacks. For this last issue, the only resolution was really just finishing as quickly as possible and hoping that Apple wouldn't release a beta in the meantime 🤞
If you're building iOS apps on Bitrise at all, you will probably have noticed that for a few weeks starting in February:
1. There would be occasional queues during peak times, as we didn't add enough overcapacity before starting this project;
2. Apple did actually release a new Xcode beta during this time (multiple, even) which we didn't get out to you as fast as you're used to.
Now the 'excuse' here is that we're growing faster and the upgrade took longer than expected, but in both cases, we should've been better prepared.
In an ideal world, the news above would be what we presented you with last week, along with some overdue apologies, and thanks to the much faster iOS builds you'd experience, we would jointly agree to chalk the upgrade up as a win.
Unfortunately, things haven't gone down quite that smoothly:
Queue issues March 11, 12, 13
To finalize the upgrade process, on Saturday the 9th, some final minor maintenance took place. After that maintenance, everything seemed to be working as expected and continued to do so throughout the weekend until Monday morning.
On Monday, the queues were back with a vengeance. The upgrade and/or re-balancing during maintenance seemed to have introduced significant lag in virtual machine create and destroy times. As we create a fresh VM for every build and destroy it after the build is done, a delay in the create-destroy process across the board created a massive increase in load on our Macs.
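To see why slower VM create and destroy times translate into queues, consider a rough capacity model. This is a simplified sketch, not a description of how Bitrise actually schedules builds: it assumes every build holds one VM slot for the full create, build, and destroy cycle, and all the figures below are made up for illustration.

```python
def builds_per_hour(vm_slots: int, build_min: float, overhead_min: float) -> float:
    """Upper bound on builds per hour when each build holds a VM slot
    for its own duration plus the VM create/destroy overhead."""
    cycle_min = build_min + overhead_min
    return vm_slots * 60 / cycle_min


# Hypothetical figures: 100 VM slots and 10-minute builds.
normal = builds_per_hour(vm_slots=100, build_min=10, overhead_min=2)
lagging = builds_per_hour(vm_slots=100, build_min=10, overhead_min=10)

print(f"normal:  {normal:.0f} builds/hour")
print(f"lagging: {lagging:.0f} builds/hour")
```

With these invented numbers, the same hardware drops from 500 to 300 builds per hour. Once builds arrive faster than that during peak times, a queue forms even though no machines were removed.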
The root cause looks to be connected to the read-write process to the shared storage in our Mac infrastructure. Without digging into the details too much, what seems to be happening is that the changed configuration of our infrastructure introduced read latency where (virtually) none existed before. This is likely related to how we configured all our clusters of Macs to connect to this shared storage. To resolve this issue, several things are happening:
- Both we and our main vendor are applying quick-fix patches to alleviate the most immediate issues;
- We're adding (significantly) more shared storage and we're changing the way we balance the load of our Mac clusters across that storage;
- We're adding more Macs (as described earlier, this happens virtually non-stop, but we're adding even more machines, faster).
We're confident that this will fix this ongoing problem, but it might be a few days before we're completely finished with implementing the last two of these steps, as it involves physically putting new hardware into the data-center.
We did apply a new fix last night that seems to have reduced the issues so far this morning, but it seems too early to already call it a definitive success.
Long term preventative measures
After we have completed all the steps described earlier, the queues will be fixed, and your iOS builds shouldn't just be back to normal: you should actually see an improvement compared to performance before this entire episode kicked off. We're committed not just to solving the issue, though, but also to preventing similar issues from impacting your development process to the degree they currently have.
We have several ideas that should allow us to either severely limit or fully negate this negative customer impact, but one is at the top of the list:
Add a secondary data-center
Our Mac infrastructure is currently housed in a single data-center. This means that, when issues arise in that data-center, everyone is impacted and we can't reroute builds to a secondary location. Implementing a second data-center is something we've investigated extensively and are committed to doing, but this issue has made it clear that we need to speed up that process.
Again, I would like to apologize for the inconvenience all of this has caused many members of the Bitrise community. We could've also done a much better job at communicating with you during this entire process and will strive to do so going forward.
If you have any questions and/or comments regarding these issues, please feel free to comment here on the blog, reach out through email or use the support option when you're logged into Bitrise.
Thanks for your patience and understanding,