Data Retention
The Data Retention page controls how long Chompy keeps flow data, syslog data, and SNMP metrics in ClickHouse, and provides options to aggregate older data and archive it to S3 cold storage. This is the primary tool for managing disk usage and balancing storage costs against historical data availability.
Data Lifecycle Pipeline
The visual pipeline at the top of the page illustrates the three stages data passes through:
- Raw Data (Full Detail) — Every individual flow record, syslog entry, or SNMP metric is stored at full granularity with all fields intact. This is the most storage-intensive stage but provides the highest resolution for troubleshooting and forensic analysis.
- Aggregated (user-defined intervals, default 5 minutes) — After the raw retention period expires, data is rolled up into time-bucketed summaries. Aggregation reduces storage significantly while preserving trends, top talkers, and traffic patterns. The aggregation interval is configurable per table.
- S3 Archive (Cold Storage) — After the total retention period, data can optionally be exported to Amazon S3 (or S3-compatible storage) in Parquet format for long-term, cost-effective retention. Archived data is removed from ClickHouse to reclaim disk space but can still be queried via external tables if needed.
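The first two stages map onto ClickHouse TTL rules. As a rough sketch only — the table and column names below are illustrative, not Chompy's actual schema:

```sql
-- Illustrative only: a raw table keeps 1 day of full-detail rows,
-- which ClickHouse drops via TTL during merges; a separate
-- aggregated table holds 5-minute rollups until total retention ends.
CREATE TABLE flows_raw_example
(
    ts       DateTime,
    src_addr IPv4,
    dst_addr IPv4,
    bytes    UInt64
)
ENGINE = MergeTree
ORDER BY ts
TTL ts + INTERVAL 1 DAY;          -- raw retention

CREATE TABLE flows_agg_example
(
    bucket    DateTime,            -- 5-minute bucket start
    src_addr  IPv4,
    dst_addr  IPv4,
    bytes_sum UInt64
)
ENGINE = SummingMergeTree
ORDER BY (bucket, src_addr, dst_addr)
TTL bucket + INTERVAL 30 DAY;     -- total retention
```

The S3 archive stage is handled by a separate export job rather than by TTL, as described under S3 Archive Configuration below.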
Table Retention Policies
Each data table is configured independently. The page shows collapsible cards for each table, displaying the current storage size, row count, and configured retention durations at a glance.
NetFlow Data
The primary flow table (`flows_all`) containing all NetFlow, sFlow, and VPC Flow Log records. This is typically the largest table and the most important to manage.
Each card header shows:
- Storage size — Current disk usage (e.g., 31.31 MiB).
- Row count — Total records in the table (e.g., 443,716 rows).
- Raw / Total — Summary of configured retention (e.g., Raw: 1d, Total: 2d).
Syslog Data
Syslog records ingested from network devices and servers. Storage characteristics depend on log volume and verbosity.
SNMP Metrics
Time-series metrics collected by the SNMP poller including interface counters (bytes in/out, errors, discards), CPU utilization, memory usage, and temperature readings.
Retention Settings
Expand any table card to configure its retention policy. Each table has the same set of controls:
Raw Data Retention
The number of days to keep data at full granularity. During this period, every individual record is preserved with all original fields — every flow with its exact timestamp, source/destination IPs, ports, bytes, packets, TCP flags, tags, and enrichment fields.
Setting this to a lower value reduces storage usage but limits the window for detailed forensic analysis. A typical production setting is 1–7 days for NetFlow data, depending on flow volume.
Enable Aggregation After Raw Retention
When checked, data older than the raw retention period is rolled up into time-bucketed summaries rather than being immediately deleted. This preserves traffic trends and patterns at reduced resolution.
Aggregation Interval — The time bucket size for rolled-up data. Options include:
- 1 minute — Near-raw resolution, minimal storage savings.
- 5 minutes — Good balance of resolution and storage reduction (default).
- 15 minutes — Significant storage savings, suitable for trend analysis.
- 30 minutes — Higher compression for long-term retention.
- 1 hour — Maximum compression, best for capacity planning and historical baselines.
Aggregated records combine multiple flows into summary rows using ClickHouse's AggregatingMergeTree engine. Aggregation sums bytes and packets, averages rates, and preserves the top IP addresses and ports per time bucket. Port-level detail may be dropped at coarser aggregation levels.
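The rollup pattern described above can be sketched with a materialized view. For clarity this sketch uses `SummingMergeTree`, which covers the plain-sum case (Chompy's actual view uses `AggregatingMergeTree` with additional state columns), and all names are illustrative:

```sql
-- Illustrative rollup: bucket raw flows into 5-minute summaries,
-- summing bytes and packets and recording a flow_count so the
-- original number of flows can still be recovered later.
CREATE MATERIALIZED VIEW flows_agg_mv_example
ENGINE = SummingMergeTree
ORDER BY (bucket, src_addr, dst_addr, protocol)
AS
SELECT
    toStartOfInterval(ts, INTERVAL 5 MINUTE) AS bucket,
    src_addr,
    dst_addr,
    protocol,
    sum(bytes)   AS bytes_sum,
    sum(packets) AS packets_sum,
    count()      AS flow_count
FROM flows_raw_example
GROUP BY bucket, src_addr, dst_addr, protocol;
```

Rows sharing the same `ORDER BY` key are merged by summing the numeric columns, which is what makes coarser intervals progressively cheaper to store.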
Aggregated Data Limitations
When viewing data beyond your raw retention period, Chompy automatically queries the aggregated table (`flows_all_agg`) instead of the raw flow table (`flows_all`). The aggregated table stores pre-summarized data at the configured aggregation interval (5 minutes by default), which means some fields and metrics are not available for longer time ranges.
Minimum Time Granularity
Aggregated data is stored at the user-defined aggregation interval (5 minutes by default). When querying time ranges that fall into the aggregated table, the minimum chart resolution is that interval, regardless of the selected time bucket.
Unavailable Fields
The following fields exist in raw flow data but are not present in the aggregated table:
| Field | Description | Impact |
|---|---|---|
| `src_as` / `dst_as` | BGP-learned AS numbers | Queries fall back to `src_asn` / `dst_asn` (IP-to-ASN lookup). BGP-specific AS data is not available. |
| `as_path` | Full BGP AS path | BGP table shows "—" for AS Path on aggregated data. |
| `time_flow_start_ns` / `time_flow_end_ns` | Nanosecond flow timestamps | Per-flow duration calculations are not possible. |
| `max_bytes_per_window` / `max_packets_per_window` | Microburst detection fields | Microburst analysis is not available for aggregated time ranges. |
| `avg_rtt_us` | Per-flow average RTT | Replaced by `avg_rtt_us_sum` and `avg_rtt_us_count` for weighted average calculation. |
Affected Features
- Microburst Detection — Not available. Microburst metrics (max burst bytes, max burst packets, burst ratio) require raw per-flow data and cannot be meaningfully aggregated.
- BGP AS Path Analysis — AS path data is not carried into the aggregated table. The BGP table will display AS numbers but not full paths.
- Per-Flow RTT — Individual flow RTT values are not stored. Average RTT is still available via the weighted sum/count method, but percentile or distribution analysis is not possible.
- Flow Counts — Aggregated data uses `SUM(flow_count)` instead of `COUNT(*)` to accurately represent the original number of flows.
- Ephemeral Ports — Collapsed to a single value (65535) in aggregated data.
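For example, a flow-count query against the aggregated table might look like the following sketch (table and column names follow the conventions above; the exact query Chompy issues may differ):

```sql
-- On the raw table each row is one flow, so count(*) works; on the
-- aggregated table each row summarizes many flows, so the stored
-- flow_count column must be summed instead.
SELECT
    toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS bucket,
    sum(flow_count) AS flows        -- not count(*)
FROM flows_all_agg
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY bucket
ORDER BY bucket;
```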
What Works Normally
The following metrics and dimensions work identically on both raw and aggregated data:
- Traffic metrics — bytes, packets, bits per second, packets per second
- Network dimensions — source/destination IPs, ports, protocols, interfaces
- Application data — app IDs, SNI, hostnames, DPI classifications
- Geographic data — country names, ASN organizations, ASN numbers
- Cloud fields — cloud account, VPC, region, flow action
- Tag arrays — server tags, application tags
- Retransmit counts — available via SUM aggregation
- RTT averages — calculated from sum/count columns
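The weighted-average RTT calculation can be sketched as follows (column names are those listed in the unavailable-fields table; treat the query itself as illustrative):

```sql
-- Dividing the summed RTT by the summed sample count weights each
-- aggregated row by how many flows contributed to it, giving a
-- correct overall average even though per-flow RTTs are gone.
SELECT
    sum(avg_rtt_us_sum) / sum(avg_rtt_us_count) AS avg_rtt_us
FROM flows_all_agg
WHERE timestamp >= now() - INTERVAL 30 DAY;
```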
Configuration
Aggregated data retention is configured in Settings → Data Retention. The system automatically selects the appropriate table based on the query time range relative to the raw retention period.
Total Retention (Including Aggregated)
The total number of days to keep data in ClickHouse, counting both the raw and aggregated periods. For example, with raw retention of 1 day and total retention of 30 days, data is kept at full granularity for 1 day, then as 5-minute aggregates for the remaining 29 days, then deleted (or archived to S3).
ClickHouse enforces this via TTL (Time-To-Live) rules on the table. Expired data is cleaned up automatically during merge operations. You can force immediate cleanup by running OPTIMIZE TABLE <table_name> FINAL in ClickHouse.
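As a hedged illustration of what such a TTL rule and manual cleanup look like (the actual DDL Chompy manages may differ):

```sql
-- Keep rows for 30 days from their timestamp, then let ClickHouse
-- delete them during background merges.
ALTER TABLE flows_all
    MODIFY TTL timestamp + INTERVAL 30 DAY;

-- Force an immediate merge so expired rows are removed right away
-- (can be expensive on large tables).
OPTIMIZE TABLE flows_all FINAL;
```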
Archive to S3 After Retention Period
When checked, data that has reached the end of its total retention period is exported to S3 in Parquet format before being deleted from ClickHouse. This requires S3 Archive Configuration to be enabled and configured (see below).
Run Archive Now
Clicking Run Archive Now immediately archives both the `flows_all` and `flows_all_agg` tables to S3 as a single Parquet file.
Save Settings
Click Save Settings to apply the retention policy for that table. Changes to raw retention and total retention update the ClickHouse TTL rules. Aggregation table creation happens automatically if aggregation is being enabled for the first time.
Reducing retention periods will cause data older than the new retention to be deleted during the next ClickHouse merge cycle. This action is irreversible — deleted data cannot be recovered unless S3 archival is enabled.
S3 Archive Configuration
The S3 Archive Configuration section at the bottom of the page configures cold storage for long-term data retention.
Enable S3 Archival
Toggle to enable or disable the automatic archival process. When enabled, a scheduled job runs daily at 2:00 AM to export and remove data that has exceeded its total retention period.
S3 Settings
When S3 archival is enabled, configure the following:
- S3 Bucket — The bucket name where archived data will be stored.
- Region — The AWS region of the bucket (e.g.,
us-east-1). - Access Key ID — AWS IAM access key with write permissions to the bucket. Stored encrypted in PostgreSQL using AES-256-GCM.
- Secret Access Key — AWS IAM secret key. Stored encrypted in PostgreSQL.
- Endpoint URL (optional) — For S3-compatible storage providers (MinIO, Wasabi, DigitalOcean Spaces, etc.). Leave empty for standard AWS S3.
- Compression — Archive compression format:
gzip,zstd, ornone. Zstandard (zstd) provides the best compression ratio for network data. - Path Prefix — The folder structure within the bucket. Defaults to
clickhouse-archive. Archived files are organized as{prefix}/{table}/year={YYYY}/month={MM}/.
Test Connection
Validates the S3 credentials and bucket access before saving. Ensures the configured IAM credentials have the required permissions (`s3:PutObject`, `s3:GetObject`, `s3:ListBucket`).
Querying Archived Data
Data archived to S3 can still be queried from ClickHouse using the S3 table function or by creating an external table:
```sql
-- Query archived flows directly from S3
SELECT * FROM s3(
    'https://my-bucket.s3.amazonaws.com/clickhouse-archive/flows/*.parquet',
    'ACCESS_KEY', 'SECRET_KEY', 'Parquet'
)
WHERE src_addr = '192.168.1.1'
LIMIT 100;

-- Create a permanent external table for archived data
CREATE TABLE flows_archive
ENGINE = S3(
    'https://my-bucket.s3.amazonaws.com/clickhouse-archive/flows/*.parquet',
    'ACCESS_KEY', 'SECRET_KEY', 'Parquet'
);

-- Query across live and archived data (the subquery ensures the
-- ORDER BY applies to the combined result, not just the last SELECT)
SELECT * FROM
(
    SELECT * FROM flows_all WHERE timestamp >= now() - INTERVAL 7 DAY
    UNION ALL
    SELECT * FROM flows_archive WHERE timestamp < now() - INTERVAL 7 DAY
)
ORDER BY timestamp DESC;
```
Storage Planning
The table below provides rough storage estimates for NetFlow data to help plan retention policies:
| Flow Rate | Raw per Day | 5min Aggregated per Day | 1hr Aggregated per Day |
|---|---|---|---|
| 1,000 flows/sec | ~5 GB | ~500 MB | ~50 MB |
| 10,000 flows/sec | ~50 GB | ~5 GB | ~500 MB |
| 30,000 flows/sec | ~150 GB | ~15 GB | ~1.5 GB |
These estimates vary based on the number of unique IP pairs, enrichment fields, and ClickHouse compression ratios (typically 5–10x for flow data).
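As a rough sanity check of these numbers, assuming on the order of 60 compressed bytes per raw flow record (consistent with the 5–10x compression noted above — this per-record figure is an assumption, not a measured value):

```sql
-- Back-of-envelope: 10,000 flows/sec * 86,400 sec/day * 60 bytes
-- lands near the ~50 GB/day raw estimate in the table.
SELECT formatReadableSize(10000 * 86400 * 60) AS raw_per_day;
```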