Part 2: The Role of Distributed Systems in Effectively Utilizing Time Series Data
By Jon Bosanac (LinkedIn)
For centuries we’ve been collecting data over time: temperature by time of day, tidal measurements, the number of cars on a street, the number of birds on a lake, stock prices, currency movements and so on. As our methods for capturing and storing that data have improved, its collection has increased. Historical analysis of this data often provides insights that we can use for future planning or to explain certain phenomena. But what if we wish to analyse and use this data in real time? This can require the analysis of millions of data points a second, in thousands of locations. This is where the science of distributed systems comes into play in effectively leveraging time series data at scale.
What Is Time Series Data?
Time series data, in a nutshell, is a sequence of numerical data points arranged in time order, or more simply any data that has a timestamp. Time series data is immutable, meaning that once a value is recorded it doesn’t change. So when structuring resources to gather, store and analyse this type of data, there are some factors that need to be accounted for:
- It often has high write and read requirements due to the volume of data IoT devices can generate.
- To efficiently analyze related time series data, it often needs to be co-located on the same physical storage.
- Time series data is often collected at frequent intervals, a granularity which may no longer be relevant as the data ages. Therefore there is a requirement to be able to roll up and compress data.
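To make the last point concrete, here is a minimal sketch of a roll-up, assuming each data point is a simple (timestamp in seconds, value) pair and that averaging is an acceptable summary. The function name `roll_up` and the sample readings are hypothetical and purely illustrative; a production time series store would typically provide its own downsampling.

```python
from collections import defaultdict
from statistics import mean


def roll_up(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Timestamps are whole seconds; each output point is stamped with the
    start of the bucket it summarises.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical readings taken every 5 seconds for two minutes (24 points).
raw = [(t, float(t % 60)) for t in range(0, 120, 5)]

# As the data ages, keep one averaged point per minute instead (2 points).
per_minute = roll_up(raw, 60)
print(len(raw), "raw points ->", len(per_minute), "rolled-up points")
```

Averaging is only one possible choice. Depending on the application, an older bucket might instead keep the minimum, maximum or a count, which is a decision about what still matters once the data has aged.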
Why the Time Series Data Hype?
In a nutshell, it’s the explosion of connected devices, or the Internet of Things. The ability to connect devices to the Internet and track, analyse and respond to data in real time is exciting consumers, governments and businesses on all fronts. We now have the ability to monitor:
- Water meters and determine in real time when there’s likely to be a costly leak based on historical usage.
- Weather patterns and consumption of certain foods so that retailers can stock foods that are likely to be in high demand.
- The noise output of engines, detecting anomalies which may be early signs of defects or a service requirement.
- Trash levels in connected trash cans so that collection timing can be optimized.
- The noise level of cars at street level to predict road maintenance.
The list of applications for time series data is endless; this page has some great examples. Analysing this data historically can be insightful and useful, but leveraging data insights in real time has been a challenge. How do you deploy systems that can gather and store massive amounts of data centrally but still be responsive and smart locally?
The Role of Distributed Systems
Distributed systems really come into play in time series applications when locality is important, and especially when different rules may apply in different regions. Often you may not want to send all data back to origin. On a regional scale a one-minute sampling of data may suffice, but locally it may be critical to gather and analyse that data every 5 seconds. A local traffic burst may require the timing of traffic lights to be adjusted immediately, while regionally a sampling of traffic levels every minute will suffice. Having a system that can make the required decisions locally but still coordinate and synchronize regionally, nationally and globally is key.
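The traffic example can be sketched in a few lines. The following is an illustration only, not a description of the nuu:bit platform: the `EdgeNode` class, the `burst_threshold` value and the `extend_green_phase` action are all assumptions made up for this sketch. The node reacts to every 5-second sample locally, and keeps only a one-minute average to pass on to the region.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class EdgeNode:
    """Hypothetical edge node: decides locally, summarises for the region."""
    burst_threshold: float      # vehicles per sample treated as a burst (assumed)
    summary_seconds: int = 60   # regional sampling interval
    local_actions: List[Tuple[int, str]] = field(default_factory=list)
    regional_summaries: List[Tuple[int, float]] = field(default_factory=list)
    _window: List[float] = field(default_factory=list)
    _window_start: Optional[int] = None

    def ingest(self, ts: int, vehicles: float) -> None:
        # Local path: every sample is checked immediately.
        if vehicles > self.burst_threshold:
            self.local_actions.append((ts, "extend_green_phase"))
        # Regional path: samples are grouped into one-minute windows.
        bucket = ts - (ts % self.summary_seconds)
        if self._window_start is None:
            self._window_start = bucket
        elif bucket != self._window_start:
            self.flush()
            self._window_start = bucket
        self._window.append(vehicles)

    def flush(self) -> None:
        # Emit the average of the current window as the regional data point.
        if self._window:
            self.regional_summaries.append((self._window_start, mean(self._window)))
            self._window = []


node = EdgeNode(burst_threshold=40.0)
samples = [(0, 10.0), (5, 12.0), (10, 55.0), (15, 11.0), (60, 9.0), (65, 13.0)]
for ts, vehicles in samples:
    node.ingest(ts, vehicles)
node.flush()  # emit the final, partially filled window

print(node.local_actions)       # decisions taken at the edge
print(node.regional_summaries)  # what would be sent upstream
```

In this sketch six local samples produce one immediate local action and only two regional data points, which is the kind of split between local responsiveness and regional coordination described above.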
We recently integrated the nuu:bit CDN Edge Compute platform with Microsoft Azure Cosmos DB. Our service gives site and app publishers the ability to delight their users with fast, responsive, and highly personalized experiences. The new service marries:
- a flexible, multi-model big data store
- a content delivery and acceleration service, and
- CDN Edge Compute
In effect, this will allow publishers not only to take advantage of the enhanced delivery that a content delivery and acceleration service can provide, but also to push personalization logic to the edge and have the data that the logic depends on close at hand, which will result in experiences that delight users. You can read the announcement here.
Having a distributed compute infrastructure that knows which data to include or exclude, which to replicate regionally, and which to store or not store is all part of running a modern, scalable support infrastructure for an application that leverages time series data. Having replicated databases that can respond locally but still coordinate effectively at a regional level is critical to leveraging time series data at scale. If you wish to understand why, please read Part 1.
Please don’t hesitate to reach out if you’d like to understand more about nuu:bit services.