Leap seconds: the extra tick that breaks everything

At 23:59:60 UTC, a second that should not exist appears on the clock. Not midnight. Not the next day. A labelled second between 23:59:59 and 00:00:00, inserted because the planet did not rotate at the tidy pace our civil-time systems wanted.

That is the awkward bargain behind leap seconds. Computers prefer steady counters. Humans prefer noon to remain roughly attached to the Sun. Standards bodies tried to keep both sides happy by occasionally pausing UTC for one extra tick, and distributed systems spent decades discovering that one extra tick is enough to expose every lazy assumption about time.

Leap seconds are small in the way a loose bolt is small. They are only one second wide, but they sit inside operating systems, NTP servers, databases, reservation systems, DNS resolvers, logs, schedulers, and monitoring pipelines. When software assumes that every minute has 60 seconds, that wall-clock time always moves forward, or that two machines will interpret a correction in the same way, the extra second stops being a curiosity and becomes an outage trigger.

Why UTC needed an extra second

UTC is a compromise between two ideas of time. International Atomic Time, or TAI, is built from atomic clocks and advances at a steady rate. UT1 follows the rotation of Earth, which is not steady enough to behave like a computer counter. Tides, mass movement inside the planet, atmosphere, oceans, earthquakes, and longer-term rotational changes all make Earth a poor metronome.

Civil time still cares about Earth rotation because clocks are social infrastructure. Noon drifting away from the Sun by a second is harmless. Letting the difference accumulate without limit is a political, scientific, and operational decision. The modern UTC compromise keeps UTC close to UT1 by inserting a leap second when the predicted UT1 minus UTC difference approaches the 0.9-second boundary described by the international time standards process.¹

The current leap-second era began in 1972 with UTC already offset from TAI by 10 seconds. The IANA time-zone database's 2023c leap-second file, updated through IERS Bulletin C65, lists TAI minus UTC as 10 seconds from 1 January 1972 and 37 seconds from 1 January 2017 onward.² That does not mean 37 leap seconds were inserted after 1972. It means the original 10-second offset plus 27 subsequent positive leap seconds produced the 37-second difference.

That count matters because folklore around leap seconds gets sloppy. By 14 August 2023, 27 positive leap seconds had been applied since the start of the modern UTC arrangement. The last one took effect at the end of 31 December 2016 UTC, becoming visible as the first moment of 1 January 2017 in the leap-second table.² No negative leap second had ever happened.
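The arithmetic behind that count can be checked with a small lookup. This is a sketch only: the dates are transcribed by hand from the published leap-second history, the names `LEAP_SECOND_DAYS` and `tai_minus_utc` are invented for illustration, and real code should read the published leap-seconds.list rather than hard-code it.

```python
from datetime import date

# UTC days whose final minute carried a positive leap second (1972 to 2016).
# Hand-transcribed for illustration; not a substitute for leap-seconds.list.
LEAP_SECOND_DAYS = [
    date(1972, 6, 30), date(1972, 12, 31), date(1973, 12, 31),
    date(1974, 12, 31), date(1975, 12, 31), date(1976, 12, 31),
    date(1977, 12, 31), date(1978, 12, 31), date(1979, 12, 31),
    date(1981, 6, 30), date(1982, 6, 30), date(1983, 6, 30),
    date(1985, 6, 30), date(1987, 12, 31), date(1989, 12, 31),
    date(1990, 12, 31), date(1992, 6, 30), date(1993, 6, 30),
    date(1994, 6, 30), date(1995, 12, 31), date(1997, 6, 30),
    date(1998, 12, 31), date(2005, 12, 31), date(2008, 12, 31),
    date(2012, 6, 30), date(2015, 6, 30), date(2016, 12, 31),
]

INITIAL_OFFSET = 10  # TAI minus UTC on 1 January 1972, in seconds


def tai_minus_utc(day: date) -> int:
    """Return TAI minus UTC in whole seconds at 00:00:00 UTC on `day`."""
    if day < date(1972, 1, 1):
        raise ValueError("the modern leap-second era starts on 1972-01-01")
    # A leap second at the end of day D takes effect from the following day.
    applied = sum(1 for leap_day in LEAP_SECOND_DAYS if leap_day < day)
    return INITIAL_OFFSET + applied
```

With this table, `tai_minus_utc(date(1972, 1, 1))` gives the starting offset of 10 and `tai_minus_utc(date(2017, 1, 1))` gives 37, which is 10 plus the 27 insertions.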

The timestamp is the part programmers remember: 23:59:59, then 23:59:60, then 00:00:00. A human can read that sequence and shrug. Many systems cannot. POSIX-style timestamps, database types, log parsers, JSON schemas, schedulers, and language standard libraries often prefer a world where valid seconds run from 00 through 59. The leap second asks the system to represent a civil-time label that the system may have no clean place to store.
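Python's standard `datetime` type is one example of a representation with no slot for second 60. The helper name `try_build` below is hypothetical; the rejection itself is the standard library's documented behaviour.

```python
from datetime import datetime, timezone


def try_build(second: int):
    """Try to build 2016-12-31 23:59:<second> UTC; return None if rejected."""
    try:
        return datetime(2016, 12, 31, 23, 59, second, tzinfo=timezone.utc)
    except ValueError:
        # datetime only accepts seconds in the range 0 through 59.
        return None


last_ordinary_second = try_build(59)  # a normal datetime
leap_second = try_build(60)           # None: the label cannot be stored
```

Any pipeline that receives 23:59:60 and stores it in a type like this has to pick a policy: reject the record, clamp to 23:59:59, roll forward to 00:00:00, or keep the original string alongside the parsed value. The sketch does not choose one; it only shows that the choice cannot be avoided.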

The minute that does not fit

The hard part is not that the world has one extra second. The hard part is that there is no single universal implementation strategy for applying it.

Some systems step the clock. During a positive leap second, they may repeat a second or move the clock backwards by one second. That preserves the official correction, but it creates a repeated wall-clock interval. Code that measures duration by subtracting two wall-clock timestamps can suddenly observe a negative or duplicated interval. Code that orders events by timestamp can see later events appear to happen earlier.
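A toy model makes the stepped case concrete. It assumes an idealised steady count of seconds and a wall clock that is set back by exactly one second at a chosen instant; `stepped_wall_clock` and `STEP_AT` are invented names, not part of any real time API.

```python
def stepped_wall_clock(uniform_seconds: float, step_at: float) -> float:
    """Reading of a wall clock that is set back one second at `step_at`.

    `uniform_seconds` stands in for a steady count of elapsed seconds,
    the kind of value a monotonic clock would report.
    """
    if uniform_seconds < step_at:
        return uniform_seconds
    return uniform_seconds - 1.0


STEP_AT = 100.0

# Two events 0.3 s apart in real elapsed time, straddling the step.
sent = stepped_wall_clock(99.8, STEP_AT)
received = stepped_wall_clock(100.1, STEP_AT)
apparent_duration = received - sent  # about -0.7 s, not +0.3 s

# Sorting by wall-clock reading puts the later event first.
events = [(sent, "request sent"), (received, "reply received")]
by_timestamp = [name for _, name in sorted(events)]
```

The reply appears to arrive 0.7 seconds before the request was sent, and a log sorted by timestamp lists it first.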

Some systems smear the second. Instead of inserting the extra tick all at once, they slow or speed clocks very slightly across a window of hours. That avoids an abrupt repeated second, but it means the local clock is intentionally not exact UTC during the smear window. Smearing is a pragmatic engineering workaround, not a universal standard. Two networks can both be "handling" the same leap second while disagreeing about the exact time for hours.
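One way to picture a smear is a linear ramp: the clock runs very slightly slow across a window until it has absorbed the full second. The sketch below assumes a linear ramp and a 24-hour window purely for illustration; as noted above, real deployments choose their own windows and algorithms, and the function names are hypothetical.

```python
def smear_offset(uniform_seconds: float, start: float, length: float) -> float:
    """Seconds of the leap second absorbed so far, from 0.0 up to 1.0."""
    if length <= 0:
        raise ValueError("smear window must have positive length")
    fraction = (uniform_seconds - start) / length
    return min(1.0, max(0.0, fraction))


def smeared_clock(uniform_seconds: float, start: float, length: float) -> float:
    """Clock reading that runs slightly slow while the smear is in progress."""
    return uniform_seconds - smear_offset(uniform_seconds, start, length)


WINDOW = 86_400.0  # assumed 24-hour window, starting at t = 0 in this model

halfway = smear_offset(43_200.0, 0.0, WINDOW)   # half a second absorbed
finished = smear_offset(90_000.0, 0.0, WINDOW)  # the whole second absorbed
```

With a 24-hour window the clock runs slow by one part in 86,400, roughly 11.6 parts per million. Readings never repeat or go backwards, but halfway through the window the smeared clock sits half a second away from a clock that has not smeared.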

Some systems ignore the problem until it arrives. That is the expensive option, because leap seconds are rare enough that normal testing seldom exercises them and important enough that failures concentrate in infrastructure no one wants to reboot at midnight UTC.

Google described its leap-smear approach in 2011. It had seen some clustered systems stop accepting work, on a small scale, during the 2005 leap second, and it used that experience to prepare for the 31 December 2008 event. Its Site Reliability Engineering team modified internal NTP servers to adjust time gradually before the leap second, so production machines could continue without seeing the same second happen twice.³ The important lesson is not that every organisation should copy Google's exact smear. The lesson is that time policy is an infrastructure decision. It has to be chosen, documented, deployed consistently, and tested.

The second lesson is harsher: code that needs elapsed time should usually not use wall-clock time. A wall clock is for civil labels: invoices, calendar events, logs, deadlines, and "what time did this happen?" A monotonic clock is for durations: how long did the DNS query take, has the timeout expired, how long has the lock been held, should this retry back off? Leap seconds punish systems that confuse those two jobs.
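In Python the two jobs map onto two different calls. A minimal sketch, with `run_timed` as a hypothetical helper: `datetime.now(timezone.utc)` supplies the civil label and `time.monotonic()` supplies the duration.

```python
import time
from datetime import datetime, timezone


def run_timed(task):
    """Run `task` and return (result, civil_label, elapsed_seconds)."""
    # Wall clock: a label for logs and humans. It may be stepped or smeared.
    civil_label = datetime.now(timezone.utc)

    # Monotonic clock: only differences are meaningful, and it is not
    # affected by adjustments to the system clock.
    started = time.monotonic()
    result = task()
    elapsed_seconds = time.monotonic() - started

    return result, civil_label, elapsed_seconds


result, civil_label, elapsed_seconds = run_timed(lambda: sum(range(1000)))
```

The label answers "when did this happen?" and the elapsed value answers "how long did it take?". Neither is derived from the other.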

The outages were real

On 30 June 2012, the leap second was not a theoretical standards issue. It was an operational incident.

Wired reported that Reddit traced its failure to Linux machines that had not handled the leap second cleanly. The problem was connected to Linux's high-resolution timer subsystem, or hrtimer, which became confused by the time change and drove CPU-heavy behaviour. Reddit was largely unusable for roughly 30 to 40 minutes and entirely offline for about an hour and a half, while Gawker and Mozilla also hit related problems around Linux and Java workloads.⁴

The details matter because "the leap second crashed the web" is too broad. The extra second exposed a specific class of timekeeping bugs in real systems. A subsystem that ordinarily sleeps, wakes, schedules, and measures time encountered a path that had not been exercised under ordinary conditions. That is how rare temporal edge cases fail: not with one dramatic line of bad code, but with a branch that only becomes live when the planet and the clock disagree at midnight UTC.

Australia saw a different shape of failure. The Guardian reported that the Amadeus airline reservation system was disrupted for more than two hours and that more than 400 Qantas flights were delayed, with staff switching to manual check-ins. Amadeus attributed the incident to a Linux bug triggered by the leap second inserted on 30 June 2012.⁵ That is the useful framing: the leap second did not personally delay passengers; it triggered a software fault in a reservation system that sat in the path of passenger movement.

Four and a half years later, Cloudflare published a cleaner post-mortem because the failure was narrower and the root cause was explicit. At midnight UTC on 1 January 2017, a value inside Cloudflare's custom RRDNS software went negative. Some DNS resolutions for Cloudflare-managed properties failed. The issue affected CNAME lookup handling and peaked at around 0.2 percent of DNS queries. The fix was fully rolled out by 06:45 UTC after patching and mitigation.⁶

The root cause was almost painfully ordinary. Code measured the apparent round-trip time to upstream resolvers using time.Now() and assumed the difference between two wall-clock readings would not be negative. During the leap second, time could move backwards from the program's perspective. The negative value was smoothed into resolver performance data and later passed to Go's rand.Int63n, which panics when given a negative argument.⁶

That is the leap-second failure in miniature. A duration calculation used a non-monotonic clock. A value that "should always be zero or positive" became negative. A selection routine inherited the impossible value. A DNS system failed. The interesting part is not that a one-character fix existed later. The interesting part is how many layers had to believe the same falsehood before a single extra second became customer-visible.
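The same chain can be reproduced in a few lines. This is a Python analogue, not Cloudflare's Go code: the function names are invented, and `random.Random.randrange` stands in for Go's `rand.Int63n`. Both refuse a non-positive bound, the Go call by panicking and the Python call by raising `ValueError`.

```python
import random


def wall_clock_rtt_ns(sent_wall_ns: int, received_wall_ns: int) -> int:
    """Round-trip time computed from two wall-clock readings (the bug)."""
    return received_wall_ns - sent_wall_ns


def pick_below(bound_ns: int, rng: random.Random) -> int:
    """Hypothetical selection step: draw a value in [0, bound_ns)."""
    return rng.randrange(bound_ns)  # raises ValueError if bound_ns <= 0


def guarded_rtt_ns(rtt_ns: int) -> int:
    """Defensive clamp so downstream code never sees a non-positive bound."""
    return max(1, rtt_ns)


rng = random.Random(0)

# The wall clock stepped back between the two readings.
rtt = wall_clock_rtt_ns(sent_wall_ns=1_000_000_500, received_wall_ns=1_000_000_200)

try:
    pick_below(rtt, rng)
    failed = False
except ValueError:
    failed = True  # the "impossible" negative value reached the selector

recovered = pick_below(guarded_rtt_ns(rtt), rng)
```

The clamp is a last line of defence. The more direct repair is to take both readings from a monotonic clock so the subtraction cannot go negative in the first place.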

Smearing is a treaty with your own infrastructure

Smearing works because it avoids the worst local surprise. Instead of making one minute contain 61 labelled seconds, a time service slightly changes the apparent rate of the clock over a period. To most application code, time still moves forward. Durations stay non-negative. Schedulers do not see a repeated timestamp. Logs are less likely to contain a civil-time value that downstream parsers reject.

It also creates a new requirement: consistency. A fleet that smears time should know which servers smear, which do not, what smear window they use, and which clients depend on them. Mixing smeared and unsmeared time sources can create disagreement larger than the systems expected to tolerate. A public NTP pool, a cloud provider's internal time service, a GPS-backed appliance, and a database cluster may all have different policies unless someone has deliberately aligned them.
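The size of that disagreement can be estimated before a leap event by comparing policies on paper. The three policies below are hypothetical (a 24-hour linear smear centred on the leap, a 20-hour linear smear ending at the leap, and a plain step); they are not descriptions of any named provider.

```python
from functools import partial


def linear_offset(t: float, start: float, length: float) -> float:
    """Seconds absorbed by a linear smear at uniform time `t` (0.0 to 1.0)."""
    return min(1.0, max(0.0, (t - start) / length))


def step_offset(t: float, leap_at: float) -> float:
    """Seconds absorbed by a clock that steps all at once at `leap_at`."""
    return 0.0 if t < leap_at else 1.0


def max_disagreement(policy_a, policy_b, times) -> float:
    """Largest difference, in seconds, between two policies over `times`."""
    return max(abs(policy_a(t) - policy_b(t)) for t in times)


LEAP_AT = 172_800.0  # arbitrary position of the leap on the model timeline

centred_24h = partial(linear_offset, start=LEAP_AT - 43_200.0, length=86_400.0)
ending_20h = partial(linear_offset, start=LEAP_AT - 72_000.0, length=72_000.0)
stepped = partial(step_offset, leap_at=LEAP_AT)

sample_times = [float(t) for t in range(0, 345_601, 60)]

smear_vs_smear = max_disagreement(centred_24h, ending_20h, sample_times)
smear_vs_step = max_disagreement(centred_24h, stepped, sample_times)
```

In this model both comparisons peak at half a second. That is far more than many clustered systems tolerate between nodes, and it persists for hours rather than for one tick.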

Meta's 2022 engineering post made that operational discomfort explicit. It noted that leap-second smearing has no universal method and that organisations choose different smear durations, start times, and algorithms. It also argued that introducing further leap seconds is risky and supported the industry effort to stop future insertions at the current count of 27.⁷

That is why "just smear it" is not a complete answer. Smearing is a good local tactic, especially for large distributed systems that need monotonic-looking wall time. It is not a global standard, and it does not remove the need to reason about traceability, auditing, external systems, legal timestamps, and scientific use cases that genuinely care about UTC as defined. Time is shared state. Shared state needs governance.

The 2022 standards decision moved in that direction. At its 27th meeting, held from 15 to 18 November 2022, the General Conference on Weights and Measures adopted Resolution 4 on the use and future development of UTC.⁸ The resolution noted that leap-second discontinuities risk serious malfunctions in critical digital infrastructure, including satellite navigation, telecommunications, and energy transmission systems. It also noted that different uncoordinated implementation methods threaten synchronisation resilience.¹

The decision was not worded as "delete leap seconds tomorrow". It decided that the maximum accepted value of UT1 minus UTC will be increased in, or before, 2035, and asked for a plan that would preserve UTC continuity for at least a century.¹ In plainer engineering terms, the world chose to stop making UTC jump every time Earth rotation drifted near the old 0.9-second boundary, while leaving standards bodies to settle the exact long-term control mechanism.

That nuance matters. The practical effect is a phase-out path for routine leap-second insertions by 2035, not a magic rewrite of every clock, protocol, database, and timestamp already deployed. Existing systems still have to parse historical timestamps, understand TAI minus UTC offsets, and survive whatever policies their time sources use until the change is fully implemented.

Engineering as if time is hostile

Leap seconds are a useful embarrassment because they break software that looks mature. They do not require exotic input. They use official time. They arrive with months of warning. They are documented by standards bodies. And still, systems fail because "time" was treated as a primitive instead of an external dependency with policy, latency, drift, representation limits, and weird historical baggage.

The practical rules are boring, which is usually a good sign.

Use monotonic clocks for elapsed time. Wall-clock timestamps are for recording civil time, not measuring how long work took. Any timeout, retry, latency measurement, cache expiry, or scheduler loop that subtracts wall-clock readings deserves suspicion.

Keep time sources consistent inside a trust boundary. A fleet should not casually mix smeared NTP, unsmeared NTP, host clocks, container clocks, database server clocks, and application-level correction logic. If the system needs traceability to UTC, document how that traceability works during a leap event. If the system smears, document the smear.

Avoid hand-rolled calendar and timestamp parsing. Libraries and databases are imperfect, but local cleverness is worse. The valid timestamp 23:59:60 is exactly the sort of value a home-grown parser rejects because nobody writing it expected seconds to be anything other than 0 through 59.

Test the ugly edges before they become production edges. Time tests should include repeated timestamps, backwards jumps, large forward jumps, missing time-zone data, DST transitions, expired leap-second tables, and clocks that differ across nodes. The goal is not to model the universe. The goal is to catch code that accidentally made the universe simpler than it is.
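Such tests become cheap when the clock is injected rather than read directly. In the sketch below, `ScriptedClock` and `Lease` are hypothetical names: the clock replays a scripted sequence of readings, so a test can force a backwards jump, a repeated timestamp, and a large forward jump in one run.

```python
class ScriptedClock:
    """Test double that returns a scripted sequence of clock readings."""

    def __init__(self, readings):
        self._readings = iter(readings)

    def __call__(self) -> float:
        return next(self._readings)


class Lease:
    """Hypothetical code under test: a lease that expires after `ttl` seconds."""

    def __init__(self, ttl: float, clock):
        self._ttl = ttl
        self._clock = clock
        self._granted_at = clock()

    def age(self) -> float:
        # Clamp so a backwards step cannot produce a negative age.
        return max(0.0, self._clock() - self._granted_at)

    def expired(self) -> bool:
        return self.age() >= self._ttl


# Readings: grant, backwards jump, repeated timestamp, large forward jump.
clock = ScriptedClock([100.0, 99.0, 100.0, 131.0])
lease = Lease(ttl=30.0, clock=clock)

age_after_backwards_jump = lease.age()
age_at_repeated_timestamp = lease.age()
expired_after_forward_jump = lease.expired()
```

In production the same class would be given a monotonic source such as `time.monotonic`. The scripted clock exists only so the ugly sequences can be exercised on demand instead of once every few years.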

Most of all, be precise about which kind of time a field represents. An instant, a duration, a local civil time, a date, a recurring calendar rule, a monotonic measurement, and a database update timestamp are not interchangeable just because they all print numbers with colons. Treating them as interchangeable is how one extra second becomes a system incident.
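Types can carry some of that precision. A small sketch, assuming a hypothetical `JobRun` record: each field uses a different standard-library type, and the constructor rejects values that blur the distinction.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone


@dataclass(frozen=True)
class JobRun:
    started_at: datetime  # an instant: must be timezone-aware
    elapsed: timedelta    # a duration: measured with a monotonic clock
    business_day: date    # a calendar date: no time of day, no zone

    def __post_init__(self):
        if self.started_at.utcoffset() is None:
            raise ValueError("started_at must be timezone-aware")
        if self.elapsed < timedelta(0):
            raise ValueError("elapsed must not be negative")


run = JobRun(
    started_at=datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
    elapsed=timedelta(seconds=2),
    business_day=date(2016, 12, 31),
)
```

A naive `datetime` or a negative duration fails at construction, which is a cheaper place to find the confusion than a production incident.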

The leap second is scheduled for retirement because global infrastructure finally admitted that a tiny standards correction had become too expensive to keep applying directly. That does not make the old failures irrelevant. It makes them a warning label. When software says "time", ask which time, from which source, under which rules, and what happens when the clock does something legal but inconvenient.

The impossible second was never impossible. It was only inconvenient. Software is where inconvenient facts go to become outages.

Footnotes

¹ Bureau International des Poids et Mesures. (2022). 'Resolution 4 of the 27th CGPM (2022): On the use and future development of UTC.' BIPM. https://www.bipm.org/en/cgpm-2022/resolution-4
² IANA Time Zone Database. (2023). 'leap-seconds.list, tzdb-2023c.' Internet Assigned Numbers Authority. https://data.iana.org/time-zones/tzdb-2023c/leap-seconds.list
³ Pascoe, Christopher. (2011). 'Time, technology and leaping seconds.' Official Google Blog. https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html
⁴ McMillan, Robert, and Cade Metz. (2012). 'The Inside Story of the Extra Second That Crashed the Web.' Wired. https://www.wired.com/2012/07/leap-second-glitch-explained/
⁵ Arthur, Charles. (2012). 'Leap second hits Qantas air bookings, while Reddit and Mozilla stutter.' The Guardian. https://www.theguardian.com/technology/2012/jul/02/leap-second-amadeus-qantas-reddit
⁶ Graham-Cumming, John. (2017). 'How and why the leap second affected Cloudflare DNS.' The Cloudflare Blog. https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
⁷ Obleukhov, Oleg, and Byagowi, Ahmad. (2022). 'It's time to leave the leap second in the past.' Engineering at Meta. https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/
⁸ Bureau International des Poids et Mesures. (2022). '27th meeting of the CGPM (2022).' BIPM. https://www.bipm.org/en/cgpm-2022
