Storage Architecture

The following sections describe the main concepts in the LogScale storage architecture. It is assumed you are already familiar with the LogScale Internal Architecture.

Note

Primary and secondary storage are collectively referred to as local storage.

The three main types of storage in the LogScale storage architecture are summarized here:

  • Primary storage (hot storage) - Data ingested into a repository is initially stored on primary local storage. This is the main drive, usually mounted by the container (if used) or located in the install directory; it can also be configured using the DIRECTORY configuration variable (see the sketch at the end of this item). On a bare-metal install, primary storage is typically based on fast NVMe drives or other SSDs. If LogScale is running in a cloud-based deployment such as AWS, instance stores using NVMe are recommended for maximum performance. However, because instance stores are considered ephemeral, bucket storage must also be configured to persist data over a longer time frame.

    Read more about primary storage here.
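
    A minimal configuration sketch, assuming the usual environment-variable configuration file; the path shown is illustrative, and only the variable name comes from this section:

        # Primary (hot) storage location; defaults to the install or container data directory
        DIRECTORY=/data/logscale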

  • Secondary storage (warm storage) - To give LogScale fast access to more data than primary storage can hold, you can optionally configure secondary storage; it is not enabled by default. You configure it using the variable SECONDARY_DATA_DIRECTORY. Whereas primary storage is based on faster (but more expensive) options such as NVMe-based drives or instance stores, secondary storage can be based on cheaper but higher-capacity options such as lower-cost SSDs or spinning disk arrays.

    When secondary storage is configured, segment files are moved from primary to secondary once usage of primary exceeds the percentage threshold set by the variable PRIMARY_STORAGE_PERCENTAGE. Secondary storage can also be capped with SECONDARY_STORAGE_MAX_FILL_PERCENTAGE, which sets the maximum share of its capacity to use (both are shown in the sketch at the end of this item).

    Note that if PRIMARY_STORAGE_MAX_FILL_PERCENTAGE or SECONDARY_STORAGE_MAX_FILL_PERCENTAGE is exceeded, the processing of logs stops until more disk space is made available.

    Read more about secondary storage here.
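
    A configuration sketch of the variables discussed above; the paths and percentage values are illustrative, not recommendations:

        # Optional secondary (warm) storage location
        SECONDARY_DATA_DIRECTORY=/warm/logscale
        # Move segments from primary to secondary once primary usage passes this threshold
        PRIMARY_STORAGE_PERCENTAGE=80
        # Hard limits; if exceeded, log processing stops until disk space is freed
        PRIMARY_STORAGE_MAX_FILL_PERCENTAGE=95
        SECONDARY_STORAGE_MAX_FILL_PERCENTAGE=95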

  • Bucket storage (cold storage) - Because the capacity of primary and secondary storage is finite, and often dictated by price-performance trade-offs, it can be exceeded at high ingest volumes.

    Furthermore, data is typically subject to data retention policies, so it might not be kept on local storage for the longer term; this is where bucket storage comes into play. Bucket storage is usually much cheaper than local storage such as NVMe and SSD, as it typically uses cloud-based options such as AWS S3. The capacity of bucket storage is also, for practical purposes, unlimited (although cost and data retention policies may put an upper limit on what is actually retained).

    When data on local storage (primary plus secondary) exceeds the capacity threshold set by the variable LOCAL_STORAGE_PERCENTAGE (85% by default, as shown in the sketch at the end of this item), data is moved to bucket storage for persistence, based on the @timestamp of the data (older data is moved first).

    When the data stored on primary and secondary storage exceeds the configured thresholds, it is automatically deleted from local storage. This is why it is important to have a bucket storage strategy, so that all data is persisted for the duration required by the use case.

    Note that if a query needs older data (contained in segment files) that is no longer on local storage, the relevant segment files are copied from bucket storage back to local storage.

    While replication can mitigate short-term node failure, bucket storage provides longer-term backup capabilities in the event that a node containing local storage fails.

    Note that bucket storage is optional, but highly recommended. Also note that data copied to bucket storage is always encrypted by LogScale.

    Read more about bucket storage here.
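
    A sketch of the threshold discussed above (85% is the stated default; the bucket itself, such as an AWS S3 bucket, is configured separately as described in the bucket storage documentation):

        # Move data out of local storage once primary plus secondary usage passes this threshold
        LOCAL_STORAGE_PERCENTAGE=85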

In addition to the main storage types, there are some more concepts that will frequently arise as you configure your storage architecture. These are summarized here, with links to more information in the guidance panel at the bottom of this page.

  • Ephemeral storage - When using LogScale in a cloud environment such as AWS, as opposed to installing directly on physical (bare-metal) machines, it is possible, and recommended, to use NVMe-based instance stores as primary storage. It's important to note, however, that because of how these instance stores are provisioned in cloud environments, data on them is lost in the event of either of the following:
    • The instance stops or terminates
    • The underlying host fails

    For this reason, these instance stores are referred to as ephemeral storage: the data can potentially be lost in certain circumstances, so it is important to have a bucket storage strategy in place. If setting up LogScale initially without bucket storage, it is important to set USING_EPHEMERAL_DISKS to false; later, once bucket storage is implemented and verified, set USING_EPHEMERAL_DISKS to true (see the sketch below) to denote that primary storage is ephemeral and should not be relied upon for persistence. If data on ephemeral disks is lost for some reason, LogScale must copy the relevant segments from bucket storage back to primary storage, incurring a performance penalty.
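
    A sketch of the setting discussed above:

        # Declare that primary storage is ephemeral and must not be relied upon for persistence.
        # Only set this to true once bucket storage is configured and verified.
        USING_EPHEMERAL_DISKS=true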

  • Data retention - Data retention policies relate to how long you keep data. For example, you may have legal requirements to retain personal data for a certain length of time before deleting it. There can also be legal requirements not to hold personal data beyond a certain period. You might also want to reduce storage consumption by deleting data no longer required. Data retention policies control how data is handled using the following parameters:

    • Ingest limit in GB (uncompressed)
    • Storage size limit in GB (compressed)
    • Time limit in days

    When data retention thresholds are exceeded, the oldest data is deleted first.

    Note

    Data retention policies are set on a per-repository basis.

    Note though that if data is deleted from primary or secondary storage through data retention policies, its corresponding data in bucket storage is also deleted.

    Read more about data retention here.

  • Compression - LogScale uses data compression extensively. This has a number of benefits: data consumes less storage space, data transfers require less network bandwidth, and costs are reduced. When raw data is initially compressed into mini-segments, LZ4 is used, as it is extremely fast while offering good compression ratios. Full segments are compressed using Zstd, which provides greater compression ratios while still being nearly as fast as LZ4. Your effective overall compression ratio is shown in your cluster statistics.

    Understanding compression is important, because you need to take it into account when sizing your primary and optionally secondary storage requirements.

    To calculate your storage requirements you need to take into account the following: compression ratio (you can assume a value of 10 initially, and refine it later as required), replication (the number of copies of the stored data across all nodes in the LogScale cluster), the ingest rate, and the data retention duration. For example, with an ingest of 50TB per day, a data retention of 30 days, and the default replication factor of 3, the requirement would be 50 * 30 * 3 / 10 = 450TB; a worked sketch follows this item. For practical purposes this would be split across the primary, secondary, and bucket storage of the cluster.

    Read more about storage sizing calculations.
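
    The sizing calculation above, expressed as a short Python sketch; the function name is hypothetical, while the formula and figures are taken from the example in the text:

        # Estimate total cluster storage (TB) across primary, secondary, and bucket storage
        def required_storage_tb(ingest_tb_per_day, retention_days, replication_factor, compression_ratio=10):
            return ingest_tb_per_day * retention_days * replication_factor / compression_ratio

        # Example from the text: 50 TB/day ingest, 30 days retention, replication factor 3
        print(required_storage_tb(50, 30, 3))  # 450.0 TB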

  • Monitoring - It is important to monitor your storage usage to avoid running out of disk space and having LogScale stop processing inbound data.

    To monitor the data storage:

    • Data storage across individual nodes can be monitored using the Cluster nodes page

    • To monitor the amount of data stored across the cluster and the effects of compression, see Cluster statistics

    • For more detailed and historic information, use the humio/insights dashboard.

For further information see the following:

Primary Storage

This section describes primary storage. This is the key storage to consider for ensuring fast and real-time queries.

Secondary Storage

Active data is stored on local disks within each node of the Falcon LogScale cluster. Primary disks should be high-performance SSDs. For additional local storage (secondary storage), a lower-performance SSD can be used, for example. Falcon LogScale will automatically move segment files to secondary storage once the primary disk reaches a configured limit.

Bucket Storage

To store larger volumes of data, bucket storage can be used. Similar to secondary storage, Falcon LogScale will move segments to solutions such as Amazon S3 or Google Cloud Storage. Bucket storage also allows for deploying nodes, expanding an existing cluster, and maintaining backups in case a node or a cluster crashes.

S3 Archiving

Ingested log data can be archived to Amazon S3. Archiving stores a copy of the ingested log data, but the archived data is not searchable by Falcon LogScale as it is when stored in bucket storage. Archived data can optionally be re-ingested or read by other software.

Data Retention

To avoid servers reaching their maximum storage capacity, Falcon LogScale can be configured to expire (delete) data when approaching a given threshold, based on compressed file size, uncompressed file size, or the age of the data.

Storage Capacity

Provides some background information on storage capacity planning.

Instance Sizing

Provides information on sizing AWS instances, with suggestions for sizing primary storage.
