What Happened to ShootQ?!
First, a little history. ShootQ was started by a very small team, with no full-time operations. As a result, it made a lot of sense for us at the time to host ShootQ and its associated systems and services with a "cloud"-based provider. This gave us the ability to get started without having to purchase a bunch of hardware, and to rely on a highly skilled set of people at the provider to keep us up and running. It served us well.
About a year ago, ShootQ was acquired by Pictage, which has a large team of amazing operations staff, lots of high-powered hardware, and high-end monitoring systems and procedures. About six months ago, we knew it would be necessary to move ShootQ onto Pictage-controlled hardware, but we needed to wait on new hardware, expanded capacity, and a brand new data center.
Why did we decide this? Because we started to notice that our cloud provider was becoming less reliable. You may have noticed as well. We built systems to minimize the issues, put in better monitoring, and brought in an extra full-time person just to manage and automate ShootQ's operations.
The wait is almost over: we'll be moving in the next few weeks. But it looks like we weren't quite quick enough to avoid all problems, as we were hit with a large-scale issue this morning on our cloud provider's systems. Someone was abusing resources in the same resource pool as our "master database," causing our performance to take a huge hit. We have instant snapshot backups in place, so your data is safe, but as soon as the problems hit, we took ShootQ offline to perform emergency maintenance.
We contacted our provider, who was painfully slow to respond to the issues. We called, emailed, filed tickets, and did everything we could to react, but we are extremely frustrated with our provider over the delay. Finally, our provider killed off the abuser in our resource pool, and for a short time, service was restored. Within about 15 minutes, we started to see problems again, so we took the application offline once more. They've finally determined that the root cause is the hardware RAID controller that handles disk access on the server hosting the ShootQ master database. The team is working on promoting our "slave" database to restore service ASAP.
So, what are we doing to avoid issues like this in the future?
1. First and foremost, we're moving off our existing provider as planned. We'll be moving to a brand new data center, on brand new hardware, that we share with no one. The hardware is many times faster and more reliable than what our current provider can offer, and we're in total control. When issues arise, they will take minutes to solve rather than hours.
2. The move will also enable highly automated monitoring and system notifications that let us see problems coming minutes, and perhaps hours, before they crop up. This gives us an unprecedented ability to resolve problems.
3. We're putting processes in place to ensure that not only operations, but also customer care and other parts of the company, are notified of issues, so when you call customer care you'll get the most up-to-date information about the status of our services.
This move has been planned for months, and we've been moving as quickly as we can to make it happen. I wish we could have completed the migration by now and avoided this issue with our provider, but at the very least you deserve to peek behind the curtain and see how issues like this arise. I want all of our customers to know that we are as frustrated about the downtime as you are, and that we have had a plan to avoid issues like this for a long time. I apologize for any inconvenience this may have caused you, and I will be happy to answer any questions you have here in detail.
Thanks for your patience!
--
Jonathan LaCour
VP, Software Products
ShootQ Code Poet Laureate