After-hours system changes, such as deployments, database migrations, and infrastructure changes, bring risks of their own. They start as a way to limit customer impact if something goes wrong, but they introduce other risks. An immature engineering org will keep choosing the quick after-hours change over gradual investment in safer processes.
The manual factor. After-hours changes come with long checklists. These are difficult to review and usually have a missing step, so people end up going off script. After the change, monitoring during low-traffic hours means… less traffic, and fewer chances for a bug to be found before peak traffic returns.
The human factor. People like sleeping and are hesitant to wake others up. Working late, they are tempted to go fast and skip the checks that take time. If they need help, they will page only when it is absolutely critical, and the response from other teams will be delayed. They were sleeping!
Engineers should be ready to break their systems on purpose, during working hours, to reduce risk. Feature flagging and other methods can help minimize downtime. As the business expands globally, there will be fewer “after hours” windows available. It’s important that processes for risky changes are well-tested beforehand!
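To make the feature flagging idea concrete, here is a minimal sketch of a percentage-based rollout check. It is an illustration under stated assumptions, not a description of any particular flagging product: the names `rollout_bucket` and `is_enabled`, the flag name, and the user IDs are all hypothetical. The idea is that a change ships dark during the day, is turned on for a small share of users, and can be turned off by setting the percentage back to zero instead of rolling back at 2 a.m.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99.

    Hashing keeps the assignment deterministic, so the same user
    sees the same behavior on every request.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is on for this user at the given rollout level."""
    if rollout_percent <= 0:
        return False  # flag fully off: acts as a kill switch
    if rollout_percent >= 100:
        return True  # flag fully on
    return rollout_bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag and users, for illustration only.
    users = [f"user-{n}" for n in range(1000)]
    enabled = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new code path")
```

Because a user's bucket does not change as the percentage rises, anyone enabled at 10% stays enabled at 50%, which keeps a gradual rollout consistent for users while real daytime traffic exercises the new code path.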