Architecture
This document outlines the overall architecture of the Parseable Observability Platform, detailing the flow of MELT (metrics, events, logs, and traces) data from ingestion to storage and querying.
It is organized into sections for each sub-system: ingestion, query, search, and index. To understand the specific decisions and trade-offs behind this architecture, refer to the design choices and trade-offs sections below.

Overview
Parseable is shipped as a single unified binary (or container image, if you prefer) that includes both Prism and Parseable DB. There are no additional dependencies to run Parseable.
The binary can be run in different modes. You'd generally run standalone mode to try out Parseable on your laptop or a small test server.
As you move to a production setup, we recommend running the distributed mode, where each node has a specific role, i.e. ingestion, query, or search.
Ingestion
Parseable ingestion nodes follow a shared-nothing architecture, meaning each node independently handles the entire ingestion pipeline. In production, you typically place a load balancer in front of two or more ingestion nodes, allowing ingestion requests to be distributed across nodes seamlessly.
When a node receives an ingestion request (via HTTP or Kafka), it first validates the request, then converts the payload into an Apache Arrow-based file format. During this process, it also performs auto schema detection, enabling Parseable to intelligently classify logs and generate structured schemas on the fly. This makes it easy for users to filter, search, and analyze across diverse log types with minimal upfront configuration.
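To make this concrete, below is a minimal sketch of an ingestion call over HTTP in Python. It assumes a node listening on localhost:8000 with the default credentials; the stream name ("frontend") and the event fields are placeholders for your deployment.

```python
# Send a batch of JSON log events to a Parseable ingestion node.
import requests

events = [
    {"level": "info", "message": "user logged in", "user_id": 42},
    {"level": "error", "message": "payment failed", "user_id": 7},
]

resp = requests.post(
    "http://localhost:8000/api/v1/ingest",
    json=events,
    # X-P-Stream selects the target log stream (dataset).
    headers={"X-P-Stream": "frontend"},
    auth=("admin", "admin"),  # default credentials; change in production
    timeout=10,
)
resp.raise_for_status()
# A success response means the events are staged on the node's local
# disk as Arrow files; the commit to object storage happens asynchronously.
```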
The Arrow files are temporarily staged in a dedicated local disk area. Once the disk write completes, the ingestion node acknowledges the request with a success response.
To ensure data durability during staging, we recommend attaching a small, reliable disk (such as NFS, Azure Files, or EFS) to each ingestion node.
A background job then reads the staged Arrow files, converts them into highly compressed Parquet files, and uploads them to S3 or any configured object store. During this transformation, the ingestion node also generates query metadata, which significantly enhances performance during log searches and queries.
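The sketch below illustrates the core of this transformation with pyarrow and boto3. The file names and bucket are hypothetical, and the actual background job also generates the query metadata mentioned above and handles retries.

```python
# Convert a staged Arrow IPC file into a compressed Parquet file,
# then upload it to object storage.
import boto3
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Read the Arrow file written during ingestion.
reader = ipc.open_file("staging/minute-0001.arrow")
table = reader.read_all()

# Re-encode as columnar Parquet with heavy compression.
pq.write_table(table, "staging/minute-0001.parquet", compression="zstd")

# Upload the compressed file to the configured object store.
s3 = boto3.client("s3")
s3.upload_file("staging/minute-0001.parquet", "parseable-data", "minute-0001.parquet")
```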

Query
Distributed query nodes are available only in the Pro and Enterprise versions; they cannot be deployed with the OSS version.
The query node is primarily responsible for responding to the query API. The query workflow starts when a client calls the query API with a (PostgreSQL-compatible) SQL query and a start and end timestamp. The query node looks up the relevant metadata locally first, falling back to the object store only if it is not found.
Based on this metadata, the node identifies the relevant Parquet files and fetches them via the object store API. Here again, this happens only if the files are not already present locally. Downloading files from object storage adds latency, hence the occasional cold queries.
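A minimal sketch of such a query call in Python, assuming the same localhost node as above and a /api/v1/query endpoint that takes the SQL alongside startTime and endTime fields; the dataset name and SQL are placeholders:

```python
# Run a SQL query over a fixed time range against a query node.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/query",
    json={
        "query": "SELECT level, COUNT(*) AS n FROM frontend GROUP BY level",
        "startTime": "2024-01-01T00:00:00Z",
        "endTime": "2024-01-01T01:00:00Z",
    },
    auth=("admin", "admin"),
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```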
A separate node, the Prism node, serves all the role, user, and dataset management APIs. In Parseable OSS, a query node also serves as the leader node.

Design choices
Low latency writes
Ingested data is staged on local disk before the Parseable API returns success, and then asynchronously committed to an object store like S3. This enables low-latency, high-throughput ingestion. To ensure data durability, we recommend attaching a small, reliable storage volume (EFS, Azure Files, NFS, or equivalent) to the ingesting nodes, so that data is not lost in case of a node failure.
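The pattern can be summarized with the conceptual sketch below; this is not Parseable's actual code, and upload_to_object_store is a hypothetical helper standing in for the object store client.

```python
# Acknowledge once a batch is durably staged on local disk; commit to
# object storage from a background thread.
import os
import queue
import threading

upload_queue: "queue.Queue[str]" = queue.Queue()

def upload_to_object_store(path: str) -> None:
    ...  # hypothetical helper, e.g. a boto3 upload

def committer() -> None:
    # Background loop: drain staged files to object storage.
    while True:
        upload_to_object_store(upload_queue.get())

threading.Thread(target=committer, daemon=True).start()

def handle_ingest(batch: bytes, path: str) -> None:
    # 1. Stage the batch on local disk and fsync for durability.
    with open(path, "wb") as f:
        f.write(batch)
        f.flush()
        os.fsync(f.fileno())
    # 2. Hand off to the committer; the caller can now be acknowledged.
    upload_queue.put(path)
```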
Atomic ingestion
Each ingestion batch received via the API is concurrently appended to the same Arrow file within a one-minute window. When the file is converted from Arrow to Parquet, entries are reordered so that the latest data appears first.
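As a sketch of the reordering step with pyarrow (the p_timestamp column name is an assumption):

```python
# Sort a one-minute batch so the newest entries come first, then
# write it out as compressed Parquet.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "p_timestamp": [1, 3, 2],
    "message": ["first", "third", "second"],
})

latest_first = table.sort_by([("p_timestamp", "descending")])
pq.write_table(latest_first, "minute-0001.parquet", compression="zstd")
```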
Efficient storage
Parseable stores data as heavily compressed Parquet files on one of the most cost-efficient storage options available, i.e. object storage. This leads to significant cost savings, especially for large datasets.
Smart caching
Frequently accessed logs are cached in memory and NVMe SSDs on query nodes for faster access. The system prioritizes recent data, manages cache eviction automatically, and minimizes object store API calls using Parseable manifest files and Parquet footers.
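The toy sketch below shows the fetch-through idea behind such a cache; Parseable's actual cache spans memory and NVMe SSDs and consults manifest files and Parquet footers to avoid object store reads altogether.

```python
# A least-recently-used, fetch-through cache in front of the object store.
from collections import OrderedDict

class LruCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str, fetch) -> bytes:
        if key in self.entries:
            self.entries.move_to_end(key)      # hit: mark recently used
            return self.entries[key]
        data = fetch(key)                      # miss: cold object store read
        self.entries[key] = data
        if len(self.entries) > self.capacity:  # evict least recently used
            self.entries.popitem(last=False)
        return data
```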
Index on demand
By default, data is stored in columnar Parquet files, allowing fast aggregations, filtering on numerical columns, and SQL queries. Parseable can additionally index specific chunks of data on demand, enabling text search on log data as and when needed.
Stateless high availability
High availability (HA) is ensured through a distributed mode in which multiple ingestion and query servers operate independently.
Object storage first
There is no separate consensus layer, eliminating complex coordination and reducing operational overhead. Object storage manages all concurrency control.
SQL for querying
We chose SQL as the query language for Parseable because it is widely used and understood, making it easier for users to interact with the system. SQL allows users to filter, aggregate, and join data from multiple sources. SQL is also very well supported by modern LLMs to generate queries from plain text.
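For example, a single statement can filter one dataset and aggregate the results (the dataset and column names here are hypothetical):

```sql
-- Count error logs per service over the queried time range.
SELECT service, COUNT(*) AS errors
FROM backend
WHERE level = 'error'
GROUP BY service
ORDER BY errors DESC;
```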
Trade-offs
Staged writes
Staging data locally on the ingestor node for at least a minute leads to a minor lag before that data is queryable. We trade immediate persistence for low-latency ingestion.
Occasional cold queries
The query layer fetches indexes from object storage (e.g., S3) and uses intelligent caching to accelerate future access. During the initial cache warm-up, some queries may access data directly from cold storage, resulting in higher latencies.
Timed queries
A query call requires a start and an end timestamp. This ensures data is queried across a fixed, definite set of files. Parseable ensures the query response includes both staged data and data committed to object storage, as required.