The team had worked hard over the previous month or so to finish an update to the Ello iOS app, and we were super excited to get it into the hands of our users. We added support for multi-region post creation and editing, big optimizations for image loading, and in-app indicators for new content. We'd waited the required week in Apple's review queue and had just received approval to launch the update. On October 8th we released it. Users were going to love it!
Casey, our Director of Product (and generally awesome guy), hollered from his desk, "I can't log in to the app. Is anyone else able to log in?" Nope, login was busted for everyone.
After a quick moment of panic we set out to figure out what went wrong. We found the problem fairly quickly: we had shipped a production application that was attempting to log users into our staging server.
Our first thought was to remove the app from the store, get a fix out quickly, and have Apple expedite the review of the update.
Removing the app from the store proved impossible due to a bug(?) in iTunesConnect: every time we attempted to remove the app from sale we received an error message from the website.
Within the hour we submitted an update to the App Store. After several hours of processing time we requested an expedited review and crossed our fingers.
Next we attempted to redirect all traffic hitting our staging API to our production API. We hoped this could happen transparently in the background, so users wouldn't know the difference. Unfortunately, a combination of complications with our bot-blocking service, Distil, and our newly added in-app SSL pinning prevented the redirect from working. All of our users were unable to use the app, and we would have to brave the storm until the update was approved and in users' hands.
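For the curious, here is a minimal sketch of the kind of server-side redirect we attempted. The middleware, class name and hostname are illustrative assumptions, not our actual setup, and in practice the combination of Distil and the app's pinned certificate meant a redirect like this never got the chance to work.

```ruby
require 'rack'

# Hypothetical Rack middleware: bounce any request that reaches the
# staging API over to production. Hostname is illustrative.
class RedirectStagingToProduction
  PRODUCTION_HOST = 'ello.co'.freeze

  def initialize(app)
    @app = app # the staging app, never reached while the redirect is on
  end

  def call(env)
    request = Rack::Request.new(env)
    # 307 preserves the HTTP method and body across the redirect.
    location = "https://#{PRODUCTION_HOST}#{request.fullpath}"
    [307, { 'Location' => location }, []]
  end
end
```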
We started to notice new accounts being created on staging and realized that we needed to shut staging down completely to prevent confusion.
As we waited for app review to approve the update and send it out to our users, we dove into figuring out how we could have submitted, gotten approval for, and released an app that pointed at our primary staging server: a server with thousands of confusing test posts and accounts, not millions of amazing and interesting posts and users.
In Ello Engineering we attempt to automate as many repetitive tasks as possible, including building the production, staging and development apps. We're able to quickly release new builds directly from CI or from the command line. We have rake tasks that trigger build scripts for most of the common tasks related to generating our apps, as well as rake tasks for reconfiguring our local development environments. A common one is rake generate:prod_keys, and we have a similar task for generating staging keys. Both of these tasks take values from an environment variable configuration file and swap out a bunch of variables in the app: auth tokens, debug flags, a few other values... and the base URL of our API endpoints. I bet you can start to glimpse what went wrong.
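To make that concrete, here is a minimal sketch of what such a rake task might look like. The task names match the ones above, but the file paths, key names and generated output are assumptions for illustration, not our actual setup.

```ruby
# Rakefile (sketch). Reads per-environment values from a YAML file and
# writes them into a generated source file the app compiles in.
require 'yaml'

def write_keys(environment)
  keys = YAML.load_file("config/#{environment}_keys.yml") # path is illustrative
  File.open('Generated/Keys.swift', 'w') do |file|
    file.puts "// Generated by rake generate:#{environment}_keys. Do not edit."
    file.puts 'struct Keys {'
    file.puts "  static let baseURL = \"#{keys['base_url']}\""
    file.puts "  static let authToken = \"#{keys['auth_token']}\""
    file.puts "  static let debug = #{keys['debug']}"
    file.puts '}'
  end
end

namespace :generate do
  desc 'Point the app at the production server'
  task :prod_keys do
    write_keys('prod')
  end

  desc 'Point the app at the staging server'
  task :staging_keys do
    write_keys('staging')
  end
end
```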
Normally, our production builds are generated by a build script that ensures the app is configured to point at our production server. With the release of Xcode 7 we had been experiencing an error that prevented the build script from correctly code signing the app, an Apple requirement. After a couple hours of debugging on the day we intended to submit the update, we decided to build the production app the old-fashioned way, through Xcode.
Everything went swimmingly: we archived the app and submitted it to Apple. We knew we were good to go because we'd generated internal beta builds and tested them through Crashlytics.
Our mistake, and the subsequent botched app update, occurred when we built and submitted through Xcode. Archiving and submitting from Xcode does not run the rake task that ensures the application is using the production configuration, and at the time we archived, the computer used to submit to the App Store was configured to use the staging environment. We had also mistakenly convinced ourselves that everything was fine because we had tested the release via our Crashlytics builds (we have both production and staging beta builds), and our Crashlytics builds are automated, so they were correctly configured.
There is one simple step that could have prevented our error: ALWAYS use the internal iTunesConnect testing made available by Apple. Internal testing through iTunesConnect is the only chance you get to test the actual submitted binary that will be sent to users. Crashlytics and HockeyApp require you to build and distribute via an Ad Hoc setup, which is not the same as the App Store setup.
In addition to using iTunesConnect internal testing, we have added a build phase that alerts us at compile time when we are not configured for production. We've also implemented internal deployment checklists that should help us avoid this type of error moving forward.
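As a rough sketch of that build phase (the file path and the check itself are illustrative assumptions; ours differs in the details), a Run Script phase can fail a Release build that still points at staging:

```ruby
#!/usr/bin/env ruby
# Xcode Run Script build phase (sketch). Xcode exposes the current build
# configuration to scripts via the CONFIGURATION environment variable.
generated = 'Generated/Keys.swift' # path is illustrative

if ENV['CONFIGURATION'] == 'Release' && File.read(generated).include?('staging')
  # An "error:" prefix surfaces in Xcode's build log, and exiting
  # non-zero fails the build before a misconfigured binary can ship.
  puts "error: #{generated} points at staging; run rake generate:prod_keys first"
  exit 1
end
```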
In the end our users forgave us, even thanking us for our quick response and open communication about what happened.