We had downtime on Mar 17th between 11:55 and 14:23 UTC. To learn more, visit https://to.short.cm/fail
Incident Report for Short.io
Resolved
This incident has been resolved.
Posted Mar 18, 2019 - 17:02 UTC
Monitoring
1. While running a routine ALTER TABLE operation, we noticed that our main database and failover database had stopped working and did not recover automatically. We started investigating immediately.
2. Our links in the EU region were not affected, but the API, the website, and links from other regions were broken.
3. During the investigation we found that AWS Aurora was crashing and failing to start with a "Segmentation fault, core dump" error. This affected both the master and the replicas.
4. We started recovering the database from the last available backup (11:55 UTC). This also failed and cost us time: the restore took an hour, and the restored database was broken as well.
5. At 12:46 UTC we started processing redirects for all short links from the EU region. Total downtime for links was 51 minutes.
6. At the same time we received a response from Amazon about the database issue. We started recovery from the 11:45 UTC backup.
7. At 14:15 UTC the backup was restored, and at 14:23 UTC the issue was fully fixed.
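
The durations in the timeline above can be checked with a short calculation. This is only an illustrative sketch: it assumes the outage started at 11:55 UTC (the start time in the headline) and that all timestamps fall on Mar 17, 2019; the function and variable names are ours, not part of any Short.io tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two UTC timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the report (assumed to be on the same day, in UTC).
outage_start = "2019-03-17 11:55"    # start of downtime, per the headline
links_restored = "2019-03-17 12:46"  # redirects served from the EU region
fully_resolved = "2019-03-17 14:23"  # issue fixed completely

links_downtime = minutes_between(outage_start, links_restored)
total_downtime = minutes_between(outage_start, fully_resolved)

print(f"Links downtime: {links_downtime} minutes")
print(f"Total downtime: {total_downtime} minutes")
```

Under these assumptions, links were down for 51 minutes, matching step 5, and the full incident lasted 148 minutes (2 hours 28 minutes).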

The issue was:
When we ran ALTER TABLE, Amazon Aurora used its proprietary extension of ALTER TABLE instead of the standard MySQL command. This extension is meant to be fast, but it did not support partitioned tables: instead of updating the table, the server crashed with a segmentation fault and did not recover. Amazon's engineers are working to find and fix the problem.
Posted Mar 17, 2019 - 15:12 UTC