Microburst Detection

Overview

This feature requires deploying the WhiteOwl probe.

Microbursts are brief, intense spikes in network traffic that can cause packet loss, increased latency, and buffer overflows, even when average utilization appears normal. Traditional flow monitoring with 1-minute or 5-minute aggregation windows averages these sub-second events away, so they never appear in the data.

WhiteOwl adds microburst detection by analyzing traffic patterns at 10ms granularity within each flow, identifying peak throughput windows that may indicate problematic burst behavior.


How It Works

Probe-Side Collection

The WhiteOwl probe tracks traffic intensity in consecutive 10ms time windows for each flow:

Flow Duration: 5 seconds
Traditional View: 500 KB total, 100 KB/s average ✓ Looks fine

Microburst View:
Window 1 (0-10ms): 2 KB
Window 2 (10-20ms): 3 KB
Window 3 (20-30ms): 1 KB
...
Window 47 (460-470ms): 150 KB ← BURST! 15 MB/s instantaneous
...
Window 500 (4990-5000ms): 1 KB

The probe records:

  • max_bytes_per_window — The highest byte count seen in any single 10ms window
  • max_packets_per_window — The highest packet count seen in any single 10ms window
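
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the probe's actual implementation: the class name FlowBurstTracker, its methods, the assumption that packets arrive in timestamp order, and the reset of the peaks on export are all hypothetical.

```python
from typing import Tuple

WINDOW_MS = 10  # window length taken from the text above


class FlowBurstTracker:
    """Hypothetical per-flow tracker of peak bytes/packets per 10 ms window."""

    def __init__(self) -> None:
        self.max_bytes_per_window = 0
        self.max_packets_per_window = 0
        self._window_index = -1  # index of the window currently being filled
        self._bytes = 0          # bytes seen in the current window
        self._packets = 0        # packets seen in the current window

    def observe(self, timestamp_ms: float, packet_bytes: int) -> None:
        """Account for one packet; assumes non-decreasing timestamps."""
        index = int(timestamp_ms // WINDOW_MS)
        if index != self._window_index:
            self._close_window()
            self._window_index = index
        self._bytes += packet_bytes
        self._packets += 1

    def _close_window(self) -> None:
        """Fold the current window into the running peaks, then clear it."""
        self.max_bytes_per_window = max(self.max_bytes_per_window, self._bytes)
        self.max_packets_per_window = max(self.max_packets_per_window, self._packets)
        self._bytes = 0
        self._packets = 0

    def export(self) -> Tuple[int, int]:
        """Return (peak bytes, peak packets) and reset the peaks.

        Resetting per export interval is an assumption of this sketch.
        """
        self._close_window()
        peaks = (self.max_bytes_per_window, self.max_packets_per_window)
        self.max_bytes_per_window = 0
        self.max_packets_per_window = 0
        return peaks


tracker = FlowBurstTracker()
tracker.observe(5.0, 2_000)          # window 1 (0-10ms)
tracker.observe(12.0, 3_000)         # window 2 (10-20ms)
for _ in range(100):
    tracker.observe(465.0, 1_500)    # window 47 (460-470ms): 100 packets, 150 KB
tracker.observe(4_995.0, 1_000)      # window 500
peak_bytes, peak_packets = tracker.export()
print(peak_bytes, peak_packets)      # 150000 100
```

Note that the two peaks are tracked independently, so in this sketch the reported byte peak and packet peak may come from different windows.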

Why 10ms Windows?

  • Switch buffer timescales — Most switch buffers fill/drain in 1-50ms
  • TCP behavior — RTT-scale bursts affect congestion control
  • Practical detection — Catches bursts that cause real problems without excessive overhead

How Microbursts Are Reported in Flow Records

The WhiteOwl probe continuously monitors traffic intensity by dividing each flow into 10ms measurement windows. For every active flow, the probe counts the bytes and packets in each window and compares them against the current peak. When the flow record is exported (typically every 30 seconds), only the single highest 10ms window is included in the IPFIX record as max_bytes_per_window and max_packets_per_window.

For example, a 30-second export interval contains roughly 3,000 individual 10ms windows. The probe evaluates all of them but reports only the worst-case peak. This design is intentional: microburst detection is about identifying the moment of greatest stress on switch buffers and link capacity, not the average.

The burst ratio, calculated in ClickHouse by comparing the peak window against the flow's average throughput, measures how "spiky" a given flow is relative to its sustained rate.
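
As a rough illustration of the burst ratio, the sketch below divides the peak-window rate by the flow's average rate. The exact expression used in ClickHouse is not given here, so the function name burst_ratio, its parameters, and the formula itself are assumptions.

```python
WINDOWS_PER_SECOND = 100  # one window is 10 ms, so 100 windows per second


def burst_ratio(max_bytes_per_window: int, total_bytes: int,
                duration_seconds: float) -> float:
    """Peak-window throughput divided by average throughput (assumed definition)."""
    if total_bytes <= 0 or duration_seconds <= 0:
        return 0.0  # no meaningful average to compare against
    peak_rate = max_bytes_per_window * WINDOWS_PER_SECOND  # bytes per second
    average_rate = total_bytes / duration_seconds          # bytes per second
    return peak_rate / average_rate


# The 5-second flow from the earlier example: 500 KB total, 150 KB peak window.
ratio = burst_ratio(max_bytes_per_window=150_000,
                    total_bytes=500_000,
                    duration_seconds=5.0)
print(ratio)  # 150.0
```

Under this definition, a perfectly paced flow has a ratio of 1, and the example flow's 15 MB/s peak against a 100 KB/s average gives 150.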

Dashboard Usage

Creating a Microburst Widget

  1. Add Widget → Select visualization type (Bar, Time Series, Table)
  2. Data Source → probe_metrics
  3. Metric → Choose one of:
    • Max Burst (bytes) — for worst-case analysis
    • Max Burst (packets) — for packet-based analysis
    • Avg Burst (bytes) — for trend analysis
    • Burst Ratio — for relative burstiness
  4. Group By → Recommended dimensions:
    • src_addr / dst_addr — Find bursty hosts
    • dst_port / appid — Find bursty applications
    • src_as / dst_as — Find bursty networks

Example Widget Configurations

Top Bursty Sources (Bar Chart)

  • Type: Bar Chart
  • Metric: Max Burst (bytes)
  • Group By: src_addr
  • Shows which source IPs generate the largest bursts

Burst Trends Over Time (Time Series)

  • Type: Time Series
  • Metric: Max Burst (bytes)
  • Group By: dst_port
  • Shows how burst patterns change over time by service

Burst Analysis Table

  • Type: Table
  • Metric: Max Burst (bytes)
  • Group By: src_addr, dst_addr, dst_port
  • Shows detailed burst data per conversation

Interpreting Results

What's a "Bad" Burst?

Context matters, but general guidelines:

Max Burst            Interpretation
Less than 10 KB      Normal, small transfers
10-100 KB            Moderate bursts, usually fine
100 KB - 1 MB        Significant bursts, check if causing issues
Greater than 1 MB    Large bursts, likely causing buffer pressure

Burst Ratio Guidelines

Ratio    Interpretation
1-2      Smooth, well-paced traffic
2-5      Mildly bursty, typical for web traffic
5-20     Bursty, may cause issues on congested links
20+      Very bursty, investigate application behavior
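
For scripted triage, the two guideline tables could be encoded as simple lookups. The sketch below is a hypothetical helper, not part of WhiteOwl. It assumes decimal units (1 KB = 1,000 bytes) and assigns each boundary value to the higher band except for exactly 1 MB, choices the tables themselves do not specify.

```python
def classify_max_burst(max_bytes_per_window: int) -> str:
    """Map a peak 10 ms window size in bytes to the guideline wording."""
    if max_bytes_per_window < 10_000:
        return "Normal, small transfers"
    if max_bytes_per_window < 100_000:
        return "Moderate bursts, usually fine"
    if max_bytes_per_window <= 1_000_000:
        return "Significant bursts, check if causing issues"
    return "Large bursts, likely causing buffer pressure"


def classify_burst_ratio(ratio: float) -> str:
    """Map a burst ratio to the guideline wording."""
    if ratio < 2:
        return "Smooth, well-paced traffic"
    if ratio < 5:
        return "Mildly bursty, typical for web traffic"
    if ratio < 20:
        return "Bursty, may cause issues on congested links"
    return "Very bursty, investigate application behavior"


print(classify_max_burst(150_000))  # Significant bursts, check if causing issues
print(classify_burst_ratio(150.0))  # Very bursty, investigate application behavior
```

As the guidelines say, context matters: these thresholds are starting points, and a burst that is harmless on one link may cause buffer pressure on another.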

Common Causes of Microbursts

  1. TCP Slow Start — New connections ramp up aggressively
  2. Backup Jobs — Large file transfers with no rate limiting
  3. Incast — Multiple servers responding simultaneously (common in distributed storage)
  4. Video Streaming — Chunk-based delivery creates periodic bursts
  5. Application Bugs — Poorly implemented sending loops

Summary

The microburst feature provides visibility into sub-second traffic patterns that traditional flow monitoring misses. By tracking peak throughput within 10ms windows, WhiteOwl can now identify:

  • Which hosts/applications generate problematic bursts
  • When bursts occur (time series analysis)
  • How "bursty" traffic is relative to its average (burst ratio)

This enables proactive identification of traffic patterns that may cause packet loss and latency issues before they impact users.