What is topological behavior analysis?
Topological Behavior Analysis (TBA) is the real-time algorithmic analysis of data that originates from complex virtualization and cloud environments. It derives from Topological Data Analysis and leverages K-means clustering as its foundation.
Computer environments have many layers that generate a large volume of statistical data – from the user experience layer (e.g., the press of a button) to the data on the storage system, with many layers in between (cell phone towers, providers, networks, servers, etc.). All of that data needs to be ingested and modeled (trained) so that it can answer, in an automated fashion, the variety of questions IT/DevOps may have, such as:
- Is there a problem?
- What is the root cause?
- What should I do about it?
K-means provides the ability to abstract and define the behavior of workloads and their impact on the infrastructure in the form of clusters (versus individual time series, which would not scale). It also captures seasonal behaviors, which is essential because behavior can be very specific to the industry in which the computer environment is used (e.g., sales fluctuations in retail).
Combining K-means with Topological Data Analysis makes it possible to detect anomalies based on multi-dimensional models that learn the interplay between the features of the statistical data representing the behavior.
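To make the idea concrete, here is a minimal sketch of cluster-based anomaly detection: behavior is learned as K-means clusters over multi-dimensional metrics, and a new observation is scored by its distance to the nearest learned cluster. The metric names, values, and threshold are illustrative assumptions, not figures from any real environment or product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical workload metrics: rows are samples, columns are features
# (e.g. CPU %, disk IOPS, network Mb/s) - illustrative numbers only.
normal = rng.normal(loc=[50.0, 200.0, 30.0], scale=[5.0, 20.0, 3.0], size=(300, 3))

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means: returns learned cluster centroids."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids

centroids = kmeans(normal, k=3)

def anomaly_score(x, centroids):
    """Distance from a new observation to the nearest learned cluster."""
    return np.linalg.norm(centroids - x, axis=1).min()

# Threshold derived from the learned behavior itself (99th percentile).
threshold = np.percentile([anomaly_score(x, centroids) for x in normal], 99)

spike = np.array([95.0, 900.0, 120.0])  # behavior far from anything learned
print(anomaly_score(spike, centroids) > threshold)  # → True
```

The point of the abstraction is scale: the model keeps a handful of centroids per workload rather than every individual time series point.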
How do you combine K-means with Monte Carlo Simulation for TBA?
While developing a product feature that predicts performance issues within a computer environment (virtualization, cloud, etc.), we have developed an algorithm that applies Monte Carlo Simulation on top of K-means based models.
Once again, this approach leverages K-means as the foundation for modeling the behaviors of the workload and its impact on the computer environment. From the learned behavior encapsulated in clusters, which also represent the seasonal behaviors of the data, we derive a prediction of the behavior by:
- Deriving the predicted expected behavior of the workload and its impact on the components of the infrastructure (such as compute, network, storage) by applying Monte Carlo Simulation.
- Performing a “stacking” function once the prediction is derived for each individual workload: the predicted expected behaviors are stacked to determine whether they will reach the capacity of the infrastructure (whether at the compute, network, or storage layer).
Leveraging K-means and Monte Carlo Simulation, we can accurately predict performance issues within the compute environment.
What are the challenges in predicting workloads in servers?
I have mentioned a couple of issues in my prior responses, but let me summarize:
- (a) The amount of data (Big Data),
- (b) The interplay as well as dependency (statistical dependency) between the features,
- (c) The dimensionality of the features,
- (d) The real-time nature of the problem, which demands real-time decisions to avoid failures of critical applications,
- (e) The seasonality of the behavior,
- (f) The dynamic nature of the environment, which moves workloads within the environment and across geographies, as well as the dynamic nature of the workloads themselves, which depend on user interaction and application changes.
While (a) – (e) can be addressed through the algorithms mentioned earlier, (f) requires almost weather-forecast-like analysis.
First, there is a prediction of the future based on learned behaviors. This is analogous to a 7-day weather forecast. However, like a weather forecast, severe storms (or issues in a computer environment) can start and move rapidly, affecting both the forecast and the recommendations that may be made as a result.
That is why in addition to forecasting the future, it is important to identify issues and provide recommendations (automated) in real time on how to address such issues without affecting parts of the system that were not affected by the storm and therefore should continue with the previously forecasted recommendations.
That’s where forecasting based on Monte Carlo Simulations needs to work in unison with Topological Behavior Analysis and causality algorithms (mentioned later) in real time to track all the dynamic changes in the environment.
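A minimal sketch of that unison might look like the following: offline forecasts per infrastructure component are compared against live observations, and only the components whose behavior deviates (the “storm”) are flagged for re-planning, while the rest keep their previously forecast recommendations. The component names, forecast numbers, and 3-sigma rule are illustrative assumptions.

```python
# Hypothetical per-component forecasts (mean, std) produced offline,
# e.g. by Monte Carlo simulation over learned clusters.
forecasts = {
    "compute": (60.0, 5.0),
    "network": (40.0, 4.0),
    "storage": (70.0, 6.0),
}

def check_realtime(observations, forecasts, z_limit=3.0):
    """Flag components whose live metric deviates from the forecast band.

    Only flagged components need new recommendations; the others keep
    the previously forecast plan.
    """
    flagged = {}
    for component, value in observations.items():
        mean, std = forecasts[component]
        flagged[component] = abs(value - mean) / std > z_limit
    return flagged

live = {"compute": 62.0, "network": 95.0, "storage": 68.0}
print(check_realtime(live, forecasts))
# network deviates sharply -> re-plan it; compute/storage keep the forecast
```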
Why can’t you use time series modeling?
Unfortunately, time series modeling is the state of the art for most tools in the IT space. This is the case because most IT tools were built with a Computer Science approach rather than a Data Science approach. Before virtualization and cloud computing became popular, understanding and optimizing computing environments was seen as an infrastructure problem instead of a data problem. Expertise in data and statistical modeling was not a requirement, or even considered. As a result, most IT tools were built with solid knowledge of Computer Science and the IT space (i.e., architecture, design patterns, etc.). Time series analysis became the apogee of Machine Learning implemented in IT tools simply because it is easy to implement and understand. However, time series analysis cannot address challenges (a) – (f) in my response to the previous question:
- The amount of data that radiates from all the layers of the IT operations environment makes it impossible to work with individual data points; a higher level of abstraction capable of representing the behavior (such as clusters) is required, which relates directly to challenges (a) and (c) mentioned earlier.
- Time series modeling cannot capture the multi-dimensionality, interplay, and uncertainty within the features of the data (especially at scale) that is required to accurately identify the meaningful anomalies within the IT operations environment.
- Finally, some important data is not time series data but may include other features (such as data related to changes in the infrastructure, configuration, and code).
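The multi-dimensionality point is worth illustrating. In the sketch below (with illustrative, assumed numbers), a point is unremarkable on each metric individually, so per-series thresholds would miss it, but it breaks the learned relationship between the metrics, which a multivariate score such as the Mahalanobis distance catches.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two strongly correlated metrics, e.g. request rate and CPU load.
n = 2000
x = rng.normal(100, 10, n)
y = 2 * x + rng.normal(0, 2, n)       # CPU tracks request rate
data = np.column_stack([x, y])

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

def mahalanobis(p):
    """Multivariate distance that accounts for feature correlations."""
    d = p - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Within range on each metric alone, but high CPU at a low request rate
# breaks the learned interplay between the two features.
point = np.array([85.0, 230.0])
z_per_metric = np.abs(point - mean) / data.std(axis=0)
print(z_per_metric)        # each well below 3: univariate checks pass
print(mahalanobis(point))  # very large: the multivariate model flags it
```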
As a result, I have identified the gap and an opportunity to develop a new solution that addresses all of the challenges mentioned earlier and will ultimately deliver my vision of a self-driving datacenter that is based on data and data science, eliminating the human guesswork used today.
Why isn’t deep learning an option?
Deep learning is an option.
Today we are just scratching the surface of applying statistical modeling to IT operations data (which is not limited to metrics, but can also include code changes in the application, etc.). Our causality algorithm is already a network (a Bayesian-like network) driven by posterior and conditional probabilities (still a fairly “shallow” model). However, we are experimenting with TensorFlow to introduce “deep”-er networks into our analysis, which will enable us to address larger-scale and more complex use cases (especially relevant to change management, networking, and security, where there are a lot of features to be explored).
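As a toy illustration of what “driven by posterior and conditional probabilities” means (this is a generic Bayes-rule sketch, not the actual product algorithm, and all numbers are invented): given assumed priors over root causes and assumed conditional probabilities of observing a symptom, the most probable cause is the one with the highest posterior.

```python
# Assumed priors over possible root causes of an observed latency spike.
priors = {"bad_code_change": 0.02, "network_fault": 0.01, "none": 0.97}

# Assumed conditionals: P(latency_spike | cause).
likelihood = {"bad_code_change": 0.80, "network_fault": 0.60, "none": 0.01}

# Bayes' rule: posterior(cause) = prior * likelihood / evidence.
evidence = sum(priors[c] * likelihood[c] for c in priors)
posterior = {c: priors[c] * likelihood[c] / evidence for c in priors}

ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])   # most probable root cause given the observed symptom
```

A deeper network would learn these conditionals from data across many features instead of tabulating them by hand, which is where the TensorFlow experiments come in.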
In addition, our current platform operates on-premises, and our goal is to push it into the “cloud,” which would expose it to more compute capacity (for compute-intensive operations, including GPU) and more data – both essential for “deeper” models.
For example, one of the complex use cases (applied in performance and security analysis) is identifying bad code changes that cause a problem and predicting whether the bad code can cause security, reliability, and performance problems. As use cases grow in scale and complexity, deep learning models will allow us to determine the right features and to more dynamically and accurately discover issues as they arise, along with their root cause(s).

As CTO, Sergey is responsible for driving product strategy and innovation at SIOS Technology Corp. A noted authority in advanced analytics and machine learning, Sergey pioneered the application of these technologies in the areas of IT security, media, and speech recognition. He is currently leading the development of innovative solutions based on these technologies that enable simple, intelligent management of applications in complex virtual and cloud environments.
Prior to joining SIOS, Sergey was an architect for EMC storage products and EMC CTO office where he drove initiatives in areas of network protocols, cloud and storage management, metrics, and analytics. Sergey has also served as Principal Investigator (PI), leader in research, development and architecture in areas of big data analytics, speech recognition, telephony, and networking.
Sergey holds a PhD in computer science from the Moscow State Scientific Center of Informatics. He also holds a BS in computer science from the University of South Carolina.
