Mastering time for efficient queries in the NetCloud API
Most endpoints in the NetCloud API offer at least one field representing time. These time fields may be in the form of a timestamp or a timeuuid. In some cases they represent the time a record was created, and in other cases the time of a state change. This article is intended to clarify what these time fields mean and to provide tips for using them effectively.
Demystifying the timeuuid
The timestamp type is relatively self-explanatory to the layperson, but the timeuuid may look alien even to experienced programmers. The timeuuid is represented as a string of hex characters, with a few hyphens mixed in.
What is a timeuuid?
At its core, a timeuuid is simply an opaque, statistically unique value (the UUID part) that has, embedded within, a timestamp (the time part). As a result, a timeuuid can be used like any UUID for the purpose of determining uniqueness, and it can be used like a timestamp for time sorting and filtering.
Are these actually Type 1 UUIDs?
Readers who have some familiarity with UUIDs may recognize this description as suspiciously similar to the Type 1 UUID standard. While a programmer may be able to unpack NetCloud UUIDs using the Type 1 format, this is not supported. We do not guarantee that our UUIDs will be Type 1 UUIDs, and they may diverge from this standard at any time. So, while they behave like Type 1 UUIDs, callers should treat them as opaque, magical strings. Callers should not try to construct their own timeuuids from scratch; send the API only timeuuid values that the API itself previously returned.
Where does the NetCloud API use timeuuids?
The NetCloud API uses timeuuids in records representing time series data. These are data sets where each record represents either a metric sample at a point in time or an event. Each of our time series endpoints has a timeuuid field named "created_at_timeuuid", which represents the time that record was created on our server. Note that created_at_timeuuid is not the time the data was gathered on the router, although it should be close. The decision to use server time for this field is based on the observation that the server clock is more reliable than router clocks, which can be affected by misconfiguration (for example, being unable to contact an NTP server). Using server time gives us a reliable, global ordering across our data.
We always include a created_at column alongside a created_at_timeuuid. This column is a timestamp and simply exposes the time component of created_at_timeuuid in a readable format.
Why not just use a timestamp?
Timestamps are mostly, but not completely, unique. Two events can occur at the same time, within the precision of the timestamp representation. Ties are not usually a problem, as timestamp precision makes them rare, and the concept of two events happening at the same time is not disturbing to most people. However, if we want to sort these samples in a reliable and stable way, ties become a problem. In particular, if we are paging based on time, it is important that ordering is strictly and consistently defined.
For example, assume we have four (simplified) records, "One" through "Four", created in that order, where records "Three" and "Four" share the same created_at timestamp of 2018-11-01 01:02:03.1234 but have distinct created_at_timeuuid values. Assume a page size of three, and that we are paging in ascending time order.
If we sort on created_at:
Our first page will have the records with data "One", "Two", and "Three". When we want to ask for the second page, the highest timestamp we have seen so far is 2018-11-01 01:02:03.1234. It would be natural for us to ask for the next page of records by using the filter condition created_at > 2018-11-01 01:02:03.1234. However, if we use that timestamp, we will skip record "Four", because it has the same timestamp as record "Three". If we change the condition to created_at >= 2018-11-01 01:02:03.1234, we will no longer miss records, but we will see record "Three" twice and must handle duplicates. There is also no guarantee that "Three" will come before "Four" next time we make the same query, since the ordering of two instances of the same value is undefined.
If we sort on created_at_timeuuid:
Observing that created_at_timeuuid has the property of ordering by time, but never producing ties, we can use the condition created_at_timeuuid > d28a6977 for page two, assured that we will see all rows after row "Three" in the stable ordering, without any duplicates. If we sort on timeuuid, the same order will occur every time, because there is only one valid order.
In summary, sorting by timeuuid will result in a stable, consistent ordering, even in the face of timestamp ties. Sorting by timestamp can have unpredictable order and cause paging issues.
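To make this concrete, here is a minimal Python sketch of timeuuid-based paging. The endpoint path, the filter and ordering parameter names (created_at_timeuuid__gt, order_by, limit), and the response shape are illustrative assumptions; check the API documentation for the exact names supported by the endpoint you are calling.

```python
import requests

# Placeholder values; substitute the real base URL, endpoint, and auth headers
# from the NetCloud API documentation.
BASE_URL = "https://example.com/api/v2"
ENDPOINT = f"{BASE_URL}/net_device_signal_samples/"  # hypothetical time series endpoint
HEADERS = {}  # add the authentication headers your account requires


def fetch_all(params):
    """Page through a time series endpoint using created_at_timeuuid as the cursor."""
    last_timeuuid = None
    while True:
        page_params = dict(params, order_by="created_at_timeuuid", limit=500)
        if last_timeuuid is not None:
            # Strictly greater-than: no duplicates from ties, no skipped records.
            page_params["created_at_timeuuid__gt"] = last_timeuuid
        response = requests.get(ENDPOINT, headers=HEADERS, params=page_params)
        response.raise_for_status()
        records = response.json()["data"]  # assumed response shape
        if not records:
            return
        yield from records
        # The last timeuuid on this page is the cursor for the next page.
        last_timeuuid = records[-1]["created_at_timeuuid"]
```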
Why not use "offset" for paging?
You may notice that the time series endpoints do not support "offset" for paging, but use the timeuuid instead. This is important for time series data, as this method is much more efficient for large data sets than "offset". Under the covers, offset is implemented by executing the full query and then throwing away all the rows until the offset number is reached. For a small return set, this is not a problem. However, if we are paging through 10,000 records with a limit of 500, by the time we get to offset 9,500, the database is scanning and throwing away 9,500 records to return 500. In general, each subsequent page requires more work on the server than the previous page and takes longer to return. The timeuuid style of paging, however, uses indexing on the timeuuid field to jump to the start record for any page, rather than scanning all the pages before it. As a result, reading the first page is roughly as efficient as reading the hundredth page.
When should we use timeuuids?
Timeuuids should be preferred over timestamps whenever you are sorting or paging. They have all the benefits of timestamps in these cases, without the problems associated with ties. Note that a daily script that picks up where it left off the previous day is a form of paging.
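As a sketch of that daily-script case, the same cursor can simply be persisted between runs. This example reuses the hypothetical fetch_all generator from the paging sketch above and stores the last-seen timeuuid in a local file; the file name and record handling are placeholders.

```python
import json
import os

CURSOR_FILE = "last_timeuuid.json"  # hypothetical local state file


def run_daily_job():
    """Resume paging from wherever the previous run stopped."""
    params = {}
    if os.path.exists(CURSOR_FILE):
        with open(CURSOR_FILE) as f:
            # Picking up where we left off is just another page boundary.
            params["created_at_timeuuid__gt"] = json.load(f)["last_timeuuid"]
    last_timeuuid = None
    for record in fetch_all(params):  # fetch_all from the earlier sketch
        # ... handle the record (store it, alert on it, etc.) ...
        last_timeuuid = record["created_at_timeuuid"]
    if last_timeuuid is not None:
        with open(CURSOR_FILE, "w") as f:
            json.dump({"last_timeuuid": last_timeuuid}, f)
```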
Timestamps should be used when you need to extract a time value from a record or when you want to put a time-based constraint on an initial query.
Choosing a time window for time series endpoints
As a best practice, every query made to the time series endpoints should specify a closed time window, either by timestamp or by timeuuid.
Lower Bounds
If you do not specify a lower bound (minimum) time, the server will implicitly serve you records going back to the beginning of time. However, the server doesn't keep records longer than 90 days on the time series endpoints, so searching for older records is a waste of cycles. For the most part, the server code is smart enough not to spend much time on this, but it is better to be explicit and not depend on server optimizations when you don't need to.
Upper Bounds
If you do not specify an upper bound (maximum) time, the server will implicitly serve you the most recent records it finds. On the surface, this sounds perfectly fine. However, the results may not be what you expect, particularly if you are running a polling script that expects to see every record the server creates.
The problem with asking for records from "now" is that the query takes a nonzero time to execute. While it is executing, new records may come in that satisfy the conditions of the query. In this case, there are three options for any server:
1. Return only records that existed at the moment of the query submission, and omit any that came through while the query was being processed.
2. Return all records that existed at the moment the query finished, including those that arrived during execution.
3. Return all the records in #1, plus some arbitrary subset of the records that arrived while the query was executing.
In short, #3 is what will happen in the NetCloud API. #1 and #2 require doing some form of locking to coordinate between all relevant servers in the system. Our servers use distributed database technology to provide high availability and scalability. In such an environment, global locks are impractical because they work against both availability and scalability.
Here is an example of how #3 can cause problems if your time window includes the execution period of the query itself.
Assume the database (eventually) contains the same four records used in the previous example.
Assume that a query is made with no upper bound specified and reaches the server at 2018-11-01 01:02:02.1300. At this point record "One" is guaranteed to be in the return set, because it is already historical. Assume that the query takes one second to execute. During that second, records "Two", "Three", and "Four" reach the server. Also during that second, the query processor is polling a cluster of database servers for records. The request to the server responsible for record "Two" may arrive before record "Two" does, but the request to the server responsible for record "Three" may arrive after record "Three" does. Record "Four" may not arrive before the query starts returning results. The result set, in this case, contains "One" and "Three". If we make another request later, we might be inclined to use a lower bound of the time the first query returned, which would miss records "Two" and "Four"; or we could use a lower bound of the time of the most recent record received, "Three", but we would still miss record "Two". There is no way to choose a subsequent time window that is guaranteed to catch every record we missed without also returning records we have already seen.
If instead the query had a fixed upper bound of 2018-11-01 01:02:02.1300, the return set would be deterministic: record "One". It would not, however, be advisable to use a timestamp this close to the execution time, because server clocks should not be assumed to be perfectly synchronized with the caller's clock. A caller should allow a reasonable buffer to account for that potential difference.
In summary, you should always specify an upper bound that will not overlap the execution of the query. This upper bound should be far enough in the past that clock differences between the client and server are unlikely to be a factor. For example, asking for all records up to one minute ago is a safe query. Asking for all records up to one millisecond ago is risky.
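Here is a minimal sketch of building such a closed window in Python. The filter parameter names (created_at__gt, created_at__lte) and ISO 8601 formatting are illustrative assumptions; the one-minute buffer and one-hour lookback are arbitrary example values.

```python
from datetime import datetime, timedelta, timezone


def closed_window_params(lookback=timedelta(hours=1),
                         safety_buffer=timedelta(minutes=1)):
    """Build a closed time window that cannot overlap the query's own execution."""
    now = datetime.now(timezone.utc)
    upper = now - safety_buffer  # far enough in the past to absorb client/server clock skew
    lower = upper - lookback     # keep within the 90-day retention window
    return {
        "created_at__gt": lower.isoformat(),
        "created_at__lte": upper.isoformat(),
    }


# Pass the window to a time series request, e.g. fetch_all(closed_window_params())
# using the paging sketch shown earlier.
```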
Optimizing with "updated at" timestamps
Users of the API should be aware that some endpoints provide the opportunity to only retrieve records that have changed in a particular time window. This allows the caller to avoid pulling information that has already been read and is thus redundant.
For example, assume we want to know when any router in our account goes offline. We could write a polling script that periodically pulls every router record from the routers endpoint, remembers the state of each router, and compares each router's current state with the state from the last poll. This will work, but it will not be efficient. Let's also assume our polling period is 15 minutes, the account contains 30,000 routers, and, on average, no more than 500 routers go offline in any polling period. This approach will require pulling 2,880,000 router records each day. If we request the records using a limit of 500, that maps to 5,760 API calls a day.
Now let's add a filter to our query on the "state_updated_at" field, which indicates when a state change occurs. If we rewrite our script so that it pulls the full router set on the first poll, then filters on state_updated_at for all subsequent polls, to retrieve only the records whose state has changed since the last poll, we reduce our number of calls to effectively one call per 15 minutes. This is a drop from 5,760 API calls/day to 96 API calls/day, a 60x improvement.
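A sketch of that polling loop might look like the following. The routers endpoint path, the state_updated_at__gt / state_updated_at__lte parameter spellings, and the response shape are illustrative assumptions, and paging of the initial full pull is omitted for brevity.

```python
import time
from datetime import datetime, timedelta, timezone

import requests

ROUTERS_ENDPOINT = "https://example.com/api/v2/routers/"  # placeholder URL
HEADERS = {}  # add the authentication headers your account requires
POLL_PERIOD = timedelta(minutes=15)


def poll_state_changes():
    """After an initial full pull, fetch only routers whose state changed since the last poll."""
    last_poll = None  # None means this is the first poll: pull everything
    while True:
        # Leave a small buffer so the window does not overlap the query itself.
        window_end = datetime.now(timezone.utc) - timedelta(minutes=1)
        params = {"limit": 500}
        if last_poll is not None:
            # Only routers whose state changed since the previous poll.
            params["state_updated_at__gt"] = last_poll.isoformat()
            params["state_updated_at__lte"] = window_end.isoformat()
        response = requests.get(ROUTERS_ENDPOINT, headers=HEADERS, params=params)
        response.raise_for_status()
        for router in response.json()["data"]:  # assumed response shape
            print(router.get("id"), router.get("state"))  # e.g. alert when state is offline
        last_poll = window_end
        time.sleep(POLL_PERIOD.total_seconds())
```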
A similar approach could be used for difference detection using other update timestamps in the API.
Conclusion
Understanding time fields in the NetCloud API is important for both correctness and performance. When using the API:
- use timeuuids when available, not timestamps, for sorting and paging.
- specify a closed time window for time series data, with reasonable lower and upper bounds.
- do not allow your time window to overlap the time of the query for time series data; always specify a time window completely in the past.
- use filtering by updated_at fields whenever possible to avoid re-reading unchanged data.
“Behind the Code” is a series of blog posts, written by Cradlepoint engineers, about behind-the-scenes topics and issues affecting software development.