Migrating data to ClickStack from Elastic
Parallel Operation Strategy
When migrating from Elastic to ClickStack for observability use cases, we recommend a parallel operation approach rather than attempting to migrate historical data. This strategy offers several advantages:
- Minimal Risk: By running both systems concurrently, you maintain access to existing data and dashboards while validating ClickStack and familiarizing your users with the new system.
- Natural Data Expiration: Most observability data has a limited retention period (typically 30 days or less), allowing for a natural transition as data expires from Elastic.
- Simplified Migration: No need for complex data transfer tools or processes to move historical data between systems.
We demonstrate an approach for migrating essential data from Elasticsearch to ClickHouse in the section "Migrating data". This should not be used for larger datasets, as it is rarely performant: it is limited by Elasticsearch's ability to export data efficiently, with only JSON format supported.
Implementation Steps
- Configure Dual Ingestion
Set up your data collection pipeline to send data to both Elastic and ClickStack simultaneously. How this is achieved depends on your current agents for collection - see "Migrating Agents".
- Adjust Retention Periods
Configure Elastic's TTL settings to match your desired retention period. Set up the ClickStack TTL to maintain data for the same duration.
- Validate and Compare:
- Run queries against both systems to ensure data consistency
- Compare query performance and results
- Migrate dashboards and alerts to ClickStack. This is currently a manual process.
- Verify that all critical dashboards and alerts work as expected in ClickStack
- Gradual Transition:
- As data naturally expires from Elastic, users will increasingly rely on ClickStack
- Once confidence in ClickStack is established, you can begin redirecting queries and dashboards
Long-term Retention
For organizations requiring longer retention periods:
- Continue running both systems in parallel until all data has expired from Elastic
- Use ClickStack's tiered storage capabilities to manage long-term data efficiently.
- Consider using materialized views to maintain aggregated or filtered historical data while allowing raw data to expire.
Migration Timeline
The migration timeline will depend on your data retention requirements:
- 30-day retention: Migration can be completed within a month.
- Longer retention: Continue parallel operation until data expires from Elastic.
- Historical data: If absolutely necessary, consider using the approach in "Migrating data" below to import specific historical data.
Migrating settings
When migrating from Elastic to ClickStack, your indexing and storage settings will need to be adapted to fit ClickHouse's architecture. While Elasticsearch relies on horizontal scaling and sharding for performance and fault tolerance and thus has multiple shards by default, ClickHouse is optimized for vertical scaling and typically performs best with fewer shards.
Recommended settings
We recommend starting with a single shard and scaling vertically. This configuration is suitable for most observability workloads and simplifies both management and query performance tuning.
- ClickHouse Cloud: Uses a single-shard, multi-replica architecture by default. Storage and compute scale independently, making it ideal for observability use cases with unpredictable ingest patterns and read-heavy workloads.
- ClickHouse OSS: In self-managed deployments, we recommend:
- Starting with a single shard
- Scaling vertically with additional CPU and RAM
- Using tiered storage to extend local disk with S3-compatible object storage
- Using `ReplicatedMergeTree` if high availability is required, for example as sketched below. For fault tolerance, one replica of your shard is typically sufficient for observability workloads.
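If high availability is needed in a self-managed deployment, a replicated table can be declared as in the sketch below. This is illustrative only: the cluster name `default` and the `{shard}`/`{replica}` macros are assumptions about your configuration, and in ClickHouse Cloud replication is handled automatically, so a plain `MergeTree` suffices.

```sql
-- Illustrative only: a single-shard table replicated for fault tolerance.
-- Assumes a cluster named 'default' and the standard {shard}/{replica} macros.
CREATE TABLE logs ON CLUSTER default
(
    `timestamp` DateTime,
    `body` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs', '{replica}')
ORDER BY timestamp
```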
When to shard
Sharding may be necessary if:
- Your ingest rate exceeds the capacity of a single node (typically >500K rows/sec)
- You need tenant isolation or regional data separation
- Your total dataset is too large for a single server, even with object storage
If you do need to shard, refer to Horizontal scaling for guidance on shard keys and distributed table setup.
Retention and TTL
ClickHouse uses TTL clauses on MergeTree tables to manage data expiration. TTL policies can:
- Automatically delete expired data
- Move older data to cold object storage
- Retain only recent, frequently queried logs on fast disk
We recommend aligning your ClickHouse TTL configuration with your existing Elastic retention policies to maintain a consistent data lifecycle during the migration. For examples, see ClickStack production TTL setup.
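For illustration, the sketch below combines a delete TTL with a move to object storage; the `tiered` storage policy and `cold` volume names are assumptions about your storage configuration:

```sql
-- Illustrative only: keep 7 days on local disk, then move parts to a 'cold'
-- object-storage-backed volume, and delete rows entirely after 30 days.
-- Assumes a storage policy named 'tiered' with a 'cold' volume is configured.
CREATE TABLE logs
(
    `timestamp` DateTime,
    `body` String
)
ENGINE = MergeTree
ORDER BY timestamp
TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold',
    timestamp + INTERVAL 30 DAY DELETE
SETTINGS storage_policy = 'tiered'
```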
Migrating data
While we recommend parallel operation for most observability data, there are specific cases where direct data migration from Elasticsearch to ClickHouse may be necessary:
- Small lookup tables used for data enrichment (e.g., user mappings, service catalogs)
- Business data stored in Elasticsearch that needs to be correlated with observability data. ClickHouse's SQL capabilities and Business Intelligence integrations make this data easier to maintain and query than Elasticsearch's more limited query options.
- Configuration data that needs to be preserved across the migration
This approach is only viable for datasets under 10 million rows, as Elasticsearch's export capabilities are limited to JSON over HTTP and don't scale well for larger datasets.
The following steps allow the migration of a single Elasticsearch index to ClickHouse.
Migrate schema
Create a table in ClickHouse for the index being migrated from Elasticsearch. Users can map Elasticsearch types to their ClickHouse equivalent. Alternatively, users can simply rely on the JSON data type in ClickHouse, which will dynamically create columns of the appropriate type as data is inserted.
Consider the following Elasticsearch mapping for an index containing syslog data:
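The mapping below is purely illustrative and heavily simplified; your own syslog index mapping will differ:

```json
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "keyword" },
      "host": {
        "properties": {
          "name":          { "type": "keyword" },
          "ip":            { "type": "ip" },
          "mac":           { "type": "keyword" },
          "containerized": { "type": "boolean" },
          "uptime":        { "type": "long" }
        }
      },
      "syslog": { "type": "object" }
    }
  }
}
```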
The equivalent ClickHouse table schema:
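The schema below is a simplified sketch matching the illustrative mapping above; derive your columns from your actual mapping. The timestamp column keeps Elastic's `@timestamp` name so exported documents map 1:1 onto columns, and using a tuple sub-column such as `host.name` in the sorting key assumes a recent ClickHouse release.

```sql
-- Simplified illustration; recent ClickHouse releases are assumed
-- (the JSON and Variant types, and sub-columns in the sorting key).
CREATE TABLE logs_system_syslog
(
    `@timestamp` DateTime,
    `message`    String,
    `host` Tuple(
        name          String,
        ip            Array(Variant(IPv4, IPv6)),
        mac           Array(String),
        containerized UInt8,
        uptime        Int64
    ),
    `syslog` JSON
)
ENGINE = MergeTree
ORDER BY (host.name, `@timestamp`)
```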
Note that:
- We've used Tuples to represent nested structures instead of dot notation
- Used appropriate ClickHouse types based on the mapping:
  - `keyword` → `String`
  - `date` → `DateTime`
  - `boolean` → `UInt8`
  - `long` → `Int64`
  - `ip` → `Array(Variant(IPv4, IPv6))`. We use a `Variant(IPv4, IPv6)` here as the field contains a mixture of IPv4 and IPv6 addresses.
  - `object` → `JSON` for the syslog object, whose structure is unpredictable.
- Note the columns `host.ip` and `host.mac` are explicit `Array` types, unlike in Elasticsearch, where any field can hold an array of values.
- Added an `ORDER BY` clause using the timestamp and hostname for efficient time-based queries
- Used the MergeTree engine, which is optimal for log data
This approach of statically defining the schema and using the JSON type selectively where required is recommended.
This strict schema has a number of benefits:
- Data validation – enforcing a strict schema avoids the risk of column explosion, outside of specific structures.
- Avoids risk of column explosion – although the `JSON` type scales to potentially thousands of columns, where subcolumns are stored as dedicated columns, this can lead to a column file explosion in which an excessive number of column files are created, impacting performance. To mitigate this, the underlying Dynamic type used by `JSON` offers a `max_dynamic_paths` parameter, which limits the number of unique paths stored as separate column files. Once the threshold is reached, additional paths are stored in a shared column file using a compact encoded format, maintaining performance and storage efficiency while supporting flexible data ingestion. Accessing this shared column file is, however, not as performant. Note that a `JSON` column can be used with type hints: "hinted" columns deliver the same performance as dedicated columns.
- Simpler introspection of paths and types – although the `JSON` type supports introspection functions to determine the types and paths that have been inferred, static structures can be simpler to explore, e.g. with `DESCRIBE`.
Alternatively, users can simply create a table with one `JSON` column.
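For example, a minimal sketch, with the table name `syslog_json` taken from the references later in this section and type hints for the paths used in the sorting key (here the timestamp path is Elastic's `@timestamp` field; adjust to your data):

```sql
-- Minimal sketch: everything lands in a single JSON column, with type hints
-- for the paths used in the sorting key.
CREATE TABLE syslog_json
(
    `json` JSON(`host.name` String, `@timestamp` DateTime)
)
ENGINE = MergeTree
ORDER BY (json.`host.name`, json.`@timestamp`)
```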
We provide type hints for the `host.name` and timestamp columns in the JSON definition, as we use them in the ordering/primary key. This helps ClickHouse know these columns won't be null and ensures it knows which sub-columns to use (there may be multiple for each type, so this would otherwise be ambiguous).

This latter approach, while simpler, is best for prototyping and data engineering tasks. For production, use `JSON` only for dynamic sub-structures where necessary.
For more details on using the JSON type in schemas, and how to efficiently apply it, we recommend the guide "Designing your schema".
Install elasticdump
We recommend `elasticdump` for exporting data from Elasticsearch. This tool requires `node` and should be installed on a machine with network proximity to both Elasticsearch and ClickHouse. We recommend a dedicated server with at least 4 cores and 16GB of RAM for most exports.
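For example, a typical global install via npm, assuming Node.js is already available:

```bash
npm install elasticdump -g
```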
`elasticdump` offers several advantages for data migration:
- It interacts directly with the Elasticsearch REST API, ensuring proper data export.
- Maintains data consistency during the export process using the Point-in-Time (PIT) API - this creates a consistent snapshot of the data at a specific moment.
- Exports data directly to JSON format, which can be streamed to the ClickHouse client for insertion.
Where possible, we recommend running Elasticsearch, ClickHouse, and `elasticdump` in the same availability zone or data center to minimize network egress and maximize throughput.
Install ClickHouse client
Ensure ClickHouse is installed on the server on which `elasticdump` is located. Do not start a ClickHouse server - these steps only require the client.
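One option is the standalone binary, which bundles the client:

```bash
# Download the standalone clickhouse binary and check the client runs
curl https://clickhouse.com/ | sh
./clickhouse client --version
```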
Stream data
To stream data between Elasticsearch and ClickHouse, use the `elasticdump` command, piping the output directly to the ClickHouse client. The following inserts the data into our well-structured table `logs_system_syslog`.
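A sketch of the full pipeline is shown below. The hosts, credentials, index name, and connection flags such as `--secure` are placeholders to adapt to your environment:

```bash
elasticdump \
  --input=http://<elasticsearch-host>:9200 \
  --type=data \
  --input-index=<your-index> \
  --output=$ \
  --sourceOnly \
  --searchAfter \
  --pit=true | \
clickhouse-client \
  --host <clickhouse-host> --secure --password <password> \
  --max_insert_block_size=1000 \
  --min_insert_block_size_bytes=0 \
  --min_insert_block_size_rows=1000 \
  --query="INSERT INTO logs_system_syslog FORMAT JSONEachRow"
```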
Note the use of the following flags for `elasticdump`:
- `type=data` - limits the response to only the document content in Elasticsearch.
- `input-index` - our Elasticsearch input index.
- `output=$` - redirects all results to stdout.
- `sourceOnly` - ensures we omit metadata fields in our response.
- `searchAfter` - uses the `searchAfter` API for efficient pagination of results.
- `pit=true` - ensures consistent results between queries using the Point-in-Time API.
Our ClickHouse client parameters here (aside from credentials):
- `max_insert_block_size=1000` - the ClickHouse client will send data once this number of rows is reached. Increasing it improves throughput at the expense of the time taken to form a block, thus increasing the time until data appears in ClickHouse.
- `min_insert_block_size_bytes=0` - turns off server-side block squashing by bytes.
- `min_insert_block_size_rows=1000` - squashes blocks from clients on the server side. In this case, we set it to `max_insert_block_size` so rows appear immediately. Increase to improve throughput.
- `query="INSERT INTO logs_system_syslog FORMAT JSONEachRow"` - inserts the data in JSONEachRow format. This is appropriate when sending to a well-defined schema such as `logs_system_syslog`.
Users can expect throughput in the order of thousands of rows per second.
If inserting into a single JSON column (see the `syslog_json` schema above), the same insert command can be used. However, users must specify `JSONAsObject` as the format instead of `JSONEachRow`, e.g.
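For example, again with placeholder hosts, credentials, and index name:

```bash
elasticdump \
  --input=http://<elasticsearch-host>:9200 \
  --type=data \
  --input-index=<your-index> \
  --output=$ --sourceOnly --searchAfter --pit=true | \
clickhouse-client \
  --host <clickhouse-host> --secure --password <password> \
  --max_insert_block_size=1000 \
  --min_insert_block_size_bytes=0 \
  --min_insert_block_size_rows=1000 \
  --query="INSERT INTO syslog_json FORMAT JSONAsObject"
```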
See "Reading JSON as an object" for further details.
Transform data (Optional)
The above commands assume a 1:1 mapping of Elasticsearch fields to ClickHouse columns. Users often need to filter and transform Elasticsearch data before insertion into ClickHouse.
This can be achieved using the `input` table function, which allows us to execute any `SELECT` query on the data streamed from `elasticdump`.
Suppose we wish to only store the `timestamp` and `hostname` fields from our earlier data. The ClickHouse schema:
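A sketch of such a table follows; the table name `logs_system_syslog_hosts` is hypothetical:

```sql
-- Hypothetical reduced table keeping only the two fields of interest
CREATE TABLE logs_system_syslog_hosts
(
    `timestamp` DateTime,
    `hostname`  String
)
ENGINE = MergeTree
ORDER BY (hostname, timestamp)
```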
To insert from `elasticdump` into this table, we can simply use the `input` table function, using the JSON type to dynamically detect and select the required columns. Note this `SELECT` query could easily contain a filter.
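The sketch below assumes the source documents carry Elastic's standard `@timestamp` field and targets the hypothetical table above; hosts, credentials, and the index name are again placeholders:

```bash
elasticdump \
  --input=http://<elasticsearch-host>:9200 \
  --type=data \
  --input-index=<your-index> \
  --output=$ --sourceOnly --searchAfter --pit=true | \
clickhouse-client \
  --host <clickhouse-host> --secure --password <password> \
  --query="
    INSERT INTO logs_system_syslog_hosts
    SELECT
        parseDateTimeBestEffort(toString(json.\`@timestamp\`)) AS timestamp,
        toString(json.host.name) AS hostname
    FROM input('json JSON')
    FORMAT JSONAsObject"
```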
Note the need to escape the `@timestamp` field name and use the `JSONAsObject` input format.