Architecture
This document outlines the overall architecture of the Parseable Observability Platform, detailing the flow of MELT (metrics, events, logs, and traces) data from ingestion to storage and querying.
It is organized into sections for each sub-system: ingestion, query, search, and index. To understand the specific decisions and trade-offs behind this architecture, refer to the design choices and trade-offs sections below.

Overview
Parseable is shipped as a single unified binary (or container image, if you prefer) that includes both Prism and Parseable DB. There are no additional dependencies to run Parseable.
The binary can be run in different modes. You'd generally run standalone mode to try out Parseable on your laptop or a small test server.
As you move to a production setup, we recommend running the distributed mode, where each node has a specific role, i.e. ingestion, query, or search.
Ingestion
Parseable ingestion nodes follow a shared-nothing architecture, meaning each node independently handles the entire ingestion pipeline. In production, you typically place a load balancer in front of two or more ingestion nodes, allowing ingestion requests to be distributed across nodes seamlessly.
When a node receives an ingestion request (via HTTP or Kafka), it first validates the request, then converts the payload into an Apache Arrow-based file format. During this process, it also performs auto schema detection, enabling Parseable to intelligently classify logs and generate structured schemas on the fly. This makes it easy for users to filter, search, and analyze across diverse log types with minimal upfront configuration.
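To make this concrete, below is a minimal sketch of an ingestion call over HTTP in Python. It assumes a node listening on localhost:8000 with the default credentials; the stream name ("frontend") and the event fields are placeholders for your deployment.

```python
# Send a batch of JSON log events to a Parseable ingestion node.
import requests

events = [
    {"level": "info", "message": "user logged in", "user_id": 42},
    {"level": "error", "message": "payment failed", "user_id": 7},
]

resp = requests.post(
    "http://localhost:8000/api/v1/ingest",
    json=events,
    # X-P-Stream selects the target log stream (dataset).
    headers={"X-P-Stream": "frontend"},
    auth=("admin", "admin"),  # default credentials; change in production
    timeout=10,
)
resp.raise_for_status()
# A success response means the events are staged on the node's local
# disk as Arrow files; the commit to object storage happens asynchronously.
```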
The Arrow files are temporarily staged in a dedicated local disk area. Once the disk write completes, the ingestion node acknowledges the request with a success response.
To ensure data durability during staging, we recommend attaching a small, reliable disk (such as NFS, Azure Files, or EFS) to each ingestion node.
A background job then reads the staged Arrow files, converts them into highly compressed Parquet files, and uploads them to S3 or any configured object store. During this transformation, the ingestion node also generates query metadata, which significantly enhances performance during log searches and queries.
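The sketch below illustrates the core of this transformation with pyarrow and boto3. The file names and bucket are hypothetical, and the actual background job also generates the query metadata mentioned above and handles retries.

```python
# Convert a staged Arrow IPC file into a compressed Parquet file,
# then upload it to object storage.
import boto3
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Read the Arrow file written during ingestion.
reader = ipc.open_file("staging/minute-0001.arrow")
table = reader.read_all()

# Re-encode as columnar Parquet with heavy compression.
pq.write_table(table, "staging/minute-0001.parquet", compression="zstd")

# Upload the compressed file to the configured object store.
s3 = boto3.client("s3")
s3.upload_file("staging/minute-0001.parquet", "parseable-data", "minute-0001.parquet")
```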

Query
Distributed query nodes are available only in the Pro and Enterprise versions; they cannot be deployed with the OSS version.
The query node is primarily responsible for responding to the query API. The query workflow starts when a client calls the query API with a (PostgreSQL-compatible) SQL query and a start and end timestamp. The query node looks up the relevant metadata locally first, falling back to the object store only if it is not found.
Based on this metadata, the node identifies the relevant Parquet files and fetches them via the object store API. Here again, this happens only if the files are not already present locally. Downloading files from object storage adds latency, hence the occasional cold queries.
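A minimal sketch of such a query call in Python, assuming the same localhost node as above and a /api/v1/query endpoint that takes the SQL alongside startTime and endTime fields; the dataset name and SQL are placeholders:

```python
# Run a SQL query over a fixed time range against a query node.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/query",
    json={
        "query": "SELECT level, COUNT(*) AS n FROM frontend GROUP BY level",
        "startTime": "2024-01-01T00:00:00Z",
        "endTime": "2024-01-01T01:00:00Z",
    },
    auth=("admin", "admin"),
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```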
A separate node, the Prism node, serves all the role, user, and dataset management APIs. In Parseable OSS, a query node also serves as the leader node.

Design choices
Low latency writes
Ingested data is staged on local disk before the Parseable API returns success, and then asynchronously committed to an object store like S3. This enables low-latency, high-throughput ingestion. To ensure data durability, we recommend attaching a small, reliable storage volume (EFS, Azure Files, NFS, or equivalent) to the ingesting nodes, so that data is not lost in case of a node failure.
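The pattern can be summarized with the conceptual sketch below; this is not Parseable's actual code, and upload_to_object_store is a hypothetical helper standing in for the object store client.

```python
# Acknowledge once a batch is durably staged on local disk; commit to
# object storage from a background thread.
import os
import queue
import threading

upload_queue: "queue.Queue[str]" = queue.Queue()

def upload_to_object_store(path: str) -> None:
    ...  # hypothetical helper, e.g. a boto3 upload

def committer() -> None:
    # Background loop: drain staged files to object storage.
    while True:
        upload_to_object_store(upload_queue.get())

threading.Thread(target=committer, daemon=True).start()

def handle_ingest(batch: bytes, path: str) -> None:
    # 1. Stage the batch on local disk and fsync for durability.
    with open(path, "wb") as f:
        f.write(batch)
        f.flush()
        os.fsync(f.fileno())
    # 2. Hand off to the committer; the caller can now be acknowledged.
    upload_queue.put(path)
```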
Atomic ingestion
Each ingestion batch received via the API is concurrently appended to the same Arrow file within a one-minute window. When the file is converted from Arrow to Parquet, entries are reordered so that the latest data appears first.
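As a sketch of the reordering step with pyarrow (the p_timestamp column name is an assumption):

```python
# Sort a one-minute batch so the newest entries come first, then
# write it out as compressed Parquet.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "p_timestamp": [1, 3, 2],
    "message": ["first", "third", "second"],
})

latest_first = table.sort_by([("p_timestamp", "descending")])
pq.write_table(latest_first, "minute-0001.parquet", compression="zstd")
```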
Efficient storage
Parseable stores data as heavily compressed Parquet files on one of the most cost-efficient storage options available, i.e. object storage. This leads to significant cost savings, especially for large datasets.
Smart caching
Frequently accessed logs are cached in memory and NVMe SSDs on query nodes for faster access. The system prioritizes recent data, manages cache eviction automatically, and minimizes object store API calls using Parseable manifest files and Parquet footers.
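The toy sketch below shows the fetch-through idea behind such a cache; Parseable's actual cache spans memory and NVMe SSDs and consults manifest files and Parquet footers to avoid object store reads altogether.

```python
# A least-recently-used, fetch-through cache in front of the object store.
from collections import OrderedDict

class LruCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str, fetch) -> bytes:
        if key in self.entries:
            self.entries.move_to_end(key)      # hit: mark recently used
            return self.entries[key]
        data = fetch(key)                      # miss: cold object store read
        self.entries[key] = data
        if len(self.entries) > self.capacity:  # evict least recently used
            self.entries.popitem(last=False)
        return data
```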
Index on demand
By default, data is stored in columnar Parquet files, allowing fast aggregations, filtering on numerical columns, and SQL queries. Parseable can additionally index specific chunks of data on demand, enabling text search on log data as and when needed.
Stateless high availability
High availability (HA) is ensured through a distributed mode in which multiple ingestion and query servers operate independently.
Object storage first
There is no separate consensus layer, eliminating complex coordination and reducing operational overhead. Object storage manages all concurrency control.
SQL for querying
We chose SQL as the query language for Parseable because it is widely used and understood, making it easier for users to interact with the system. SQL allows users to filter, aggregate, and join data from multiple sources. SQL is also very well supported by modern LLMs to generate queries from plain text.
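For example, a single statement can filter one dataset and aggregate the results (the dataset and column names here are hypothetical):

```sql
-- Count error logs per service over the queried time range.
SELECT service, COUNT(*) AS errors
FROM backend
WHERE level = 'error'
GROUP BY service
ORDER BY errors DESC;
```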
Trade-offs
Staged writes
Staging data locally on the ingestor node for at least a minute leads to a minor lag before that data is queryable. We trade immediate persistence for low-latency ingestion.
Occasional cold queries
The query layer fetches indexes from object storage (e.g., S3) and uses intelligent caching to accelerate future access. During the initial cache warm-up, some queries may access data directly from cold storage, resulting in higher latencies.
Timed queries
A query call requires a start and an end timestamp. This ensures data is queried across a fixed, definite set of files. Parseable ensures the query response includes both staged data and data committed to object storage, as required.