Gigamon, a leader in cloud visibility and analytics, has cautioned that dealing with a security incident requires not just prompt notification of the incident but also the ability to triage its cause.
The key steps are to carry out forensics; identify which other systems, users, devices and applications have been compromised or affected; and establish the magnitude of the incident, the duration of the activity that led to it, and other relevant factors.
Shehzad Merchant, Gigamon’s Chief Technology Officer, says: “In other words, notification of an incident is simply the first step in a complex journey that could lead to possibly unearthing a major cyber breach, or perhaps writing off a completely benign non-incident.”
While security orchestration, automation and response (SOAR) solutions help to automate and structure these activities, the activities themselves require telemetry data that provides the breadcrumbs to help scope, identify and potentially remedy the situation. This takes on increasing significance in the cloud for a few reasons:
# The public cloud shared security model may lead to gaps in telemetry — for example, lack of telemetry from the underlying infrastructure that could help to correlate breadcrumbs at the infrastructure level to the application level.
# Lack of consistency in telemetry information as applications increasingly segment into microservices, containers and Platform-as-a-Service, and as various modules come from different sources such as internal development, open source, commercial modules, and outsourced development.
# Misconfigurations and misunderstandings as control shifts between DevOps, CloudOps, and SecOps.
# All of the above, coupled with a significant expansion of the attack surface as monolithic applications are decomposed into microservices.
When incidents occur, the ability to quickly assess the scope, impact and root cause of the incident depends directly on the availability of quality data and on how easily that data can be queried, analysed and dissected. As companies migrate to the cloud, logs have become the de facto standard for gathering telemetry.
However, there are a few challenges when relying almost exclusively on logs for telemetry.
The first issue is that attackers commonly turn off logging on a compromised system to cloak their activity and footprint. This creates gaps in telemetry that can significantly delay incident response and recovery.
On occasion, DevOps teams may also reduce logging on end systems and applications to cut CPU usage (and the associated cloud costs), leading to additional gaps in telemetry data.
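One way to make such gaps visible is to look for unusually long silences between consecutive log events. The following is a minimal sketch in Python, not part of Gigamon's guidance; the function name `find_gaps`, the five-minute threshold and the sample timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_silence):
    """Return (start, end) pairs where consecutive events are more than max_silence apart."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_silence:
            gaps.append((earlier, later))
    return gaps


# Illustrative event times only; real input would come from parsed log records.
events = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 1),
    datetime(2024, 1, 1, 10, 2),
    datetime(2024, 1, 1, 10, 47),  # logging resumes after 45 minutes of silence
    datetime(2024, 1, 1, 10, 48),
]

gaps = find_gaps(events, max_silence=timedelta(minutes=5))
for start, end in gaps:
    print(f"no log events between {start} and {end}")
```

A silence on its own does not prove tampering, since a quiet system also produces few logs, so a check like this is best treated as a prompt to consult another telemetry source.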
A second issue is that logs tend to be voluminous and, in many cases, written by developers for developers, which produces too much telemetry data, much of it irrelevant to security. This drives up the cost of storing and indexing that data, lengthens query times, and leaves the incident responder with more data to sift through.
Finally, log levels can be increased or decreased, but the log messages themselves are pre-defined because they are embedded in code. Changing what information the logs emit cannot be done in real time or near real time in response to an incident; it may require code changes, leading to significant delays and impaired incident response capability.
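The limitation can be illustrated with Python's standard `logging` module. In this sketch the function `handle_login`, its messages and the sample values are hypothetical. Raising verbosity from INFO to DEBUG reveals one more pre-written message, but the source IP address never appears, because no log statement in the code emits it.

```python
import io
import logging


def handle_login(logger, user, source_ip):
    # The messages and fields below were fixed when the code was written.
    # source_ip is available here but is never logged.
    logger.debug("login attempt for %s", user)
    logger.info("login ok")


def capture(level):
    """Run handle_login at the given log level and return the emitted lines."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level}")
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(level)
    handle_login(logger, "alice", "203.0.113.7")
    return stream.getvalue().splitlines()


info_lines = capture(logging.INFO)
debug_lines = capture(logging.DEBUG)
print(info_lines)
print(debug_lines)
```

Getting the source IP into the log would mean editing `handle_login` and redeploying it, which is the delay described above.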
This leads us to the three Rs of telemetry — Reliable, Relevant and Real time.
To serve the needs of rapid response, telemetry data needs to be reliable in that it is available when needed, without gaps introduced by malicious actors or even inadvertently by various operators due to misconfiguration or miscommunication.
It needs to be relevant in that it should provide meaningful actionable insights without significantly driving up costs or query times due to excessive, duplicate and irrelevant information.
Finally, it needs to be real time in the sense that the stream of telemetry data can be changed, and new telemetry data or additional telemetry data can be derived at the click of a button.
A great way to complement logs in the cloud and address the three Rs is with telemetry data derived from observing network traffic. After all, command and control activity, lateral movement of malware, and data exfiltration happen over the network.
If end systems or applications are compromised and logging is turned off at the server or application, network activity continues, and network-based telemetry can keep capturing the breadcrumbs that identify the malicious activity.
In that sense, network-based telemetry can provide a reliable stream of information even when endpoints or end systems are compromised or impacted. Metadata generated from network traffic can be surgically tuned to provide a highly relevant and targeted telemetry feed.
Security operations teams can select from thousands of metadata elements specific to their use case (for example, focusing on DNS metadata or metadata associated with Remote Desktop activity) and discard other network metadata that may not be relevant. This reduces cost and, equally important, makes it possible to write targeted queries.
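As a rough illustration of that kind of selection, the sketch below keeps only DNS records and only two fields from each. The record layout, field names and the function `select_metadata` are assumptions made for the example and do not reflect any particular product's schema.

```python
def select_metadata(records, protocol, fields):
    """Keep records for one protocol and only the requested fields from each."""
    selected = []
    for record in records:
        if record.get("protocol") != protocol:
            continue
        selected.append({name: record[name] for name in fields if name in record})
    return selected


# Hypothetical network metadata records.
records = [
    {"protocol": "dns", "src": "10.0.0.5", "query": "example.com", "ttl": 300},
    {"protocol": "http", "src": "10.0.0.5", "host": "example.org", "status": 200},
    {"protocol": "dns", "src": "10.0.0.9", "query": "example.net", "ttl": 60},
]

dns_feed = select_metadata(records, "dns", ["src", "query"])
print(dns_feed)
```

The smaller feed costs less to store and index, and a query over it only has to consider the fields the team actually cares about.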
Should the need arise to expand or change what telemetry data is being acquired, this can be done at the network level without requiring any change to the application. A simple API call can change what network metadata is being captured in near real time.
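What such a call might look like depends entirely on the platform in use. The sketch below only builds an HTTP request and does not send it. The endpoint path `/metadata-profile`, the JSON body shape, the bearer token and the host `visibility.example.invalid` are all hypothetical placeholders, not a real vendor API.

```python
import json
import urllib.request


def build_metadata_update(base_url, protocols, token):
    """Build (but do not send) a request that replaces the captured protocol list."""
    # Hypothetical body shape: a sorted list of protocol names to capture.
    body = json.dumps({"protocols": sorted(protocols)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/metadata-profile",  # hypothetical endpoint
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_metadata_update(
    "https://visibility.example.invalid/api",
    {"rdp", "dns"},
    token="PLACEHOLDER",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a real platform, the request would be sent with `urllib.request.urlopen(request)` using the URL, authentication and payload defined in that platform's own API documentation.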
As organisations look to move to the cloud, complementing their log sources with network-based telemetry will prove invaluable in bolstering their security and compliance posture.
In that sense, network-based telemetry is an essential component in securing the move to the cloud. Organisations should look for a cloud-agnostic platform that achieves precisely this and delivers on the three Rs described above.