Data Management Guide
This guide is intended for Loom administrators who need to control and adjust disk-space usage. When a disk-space issue arises, Sophie will notify you automatically with one of the following alerts:
- "Sophie is running low on disk, data ingestion will soon be stopped. Follow the Data Management guide or contact firstname.lastname@example.org" This means the available disk space does not meet your current utilization. Administrators need to configure optimal utilization of the available disk space.
- "Events are being discarded due to the retention settings. Please refer to the Data Management guide." This means one of the following:
- Old events are accidentally being streamed into Sophie. Please make sure you are streaming the correct data.
- Retention settings within Sophie do not cover the dates of the events being streamed. Administrators need to configure the retention in accordance with their data’s dates.
Configure Optimal Utilization of the Available Disk Space
If Sophie is running low on disk, perform the following steps:
Verify Direct-Attached Storage
First, make sure you're running on a locally attached disk. Running on spinning disks, or even on a SAN, has a severe impact on performance.
To verify whether the data disk spins, run the following command from the terminal:
If fs.data.spins is true, then the data does indeed spin.
Another option is to run cat /sys/block/sde/queue/rotational (replace sde with the disk holding the Elastic data):
- 1 = spinning (bad)
- 0 = SSD (good)
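To check every visible disk at once, the rotational flags can be listed in one pass. This is a generic sketch: device names and visibility vary by host, and inside a container no devices may be visible at all.

```shell
# Print every visible block device with its rotational flag:
# 1 = spinning disk, 0 = SSD.
for f in /sys/block/*/queue/rotational; do
  [ -e "$f" ] || continue   # no block devices visible (e.g. in a container)
  dev=${f#/sys/block/}; dev=${dev%%/*}
  printf '%s %s\n' "$dev" "$(cat "$f")"
done
```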
Steps to Improve Throughput and Disk Usage
Remove data you don't need:
At your Sophie instance, go to Settings->General->Storage and check the "heaviest" sources. For each "heavy" source:
- Review its structure - are there any properties you can remove? Decreasing the number of properties in the stored events has the biggest impact.
- Do you need the rawMessage property?
It's used in notifications and free-text correlations, and it is somewhat helpful in Kibana - but it doubles the size of a document. If the structuring is good, remove this field by going to the source settings and setting elastic.store_raw_event to false.
- Try to be selective with what you drop. For example, you might prefer to drop low-severity events at the data input.
- Consider sub-sampling (i.e. taking one every X events). This can be controlled per-source via the elasticsearch.subsampling_ratio setting.
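To illustrate what a sub-sampling ratio does (this is a generic illustration, not Sophie's actual implementation), a ratio of 10 keeps one event out of every ten:

```shell
# Keep one line in every 10 -- equivalent to a 1:10 sub-sample.
# Out of 100 input events, 10 survive.
seq 1 100 | awk 'NR % 10 == 0' | wc -l
```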
Steps to Improve the Throughput
Optimize the bulk-indexing interval
Adjust the following general-settings:
Use the operational dashboard to measure the effect; the objective is to get the documents.write metric as high as possible. Note that at some point you might start seeing errors in the Elastic logs - which means you're bombarding it with more than it can take.
Keep indices no larger than the RAM allocated to Elastic
For example, if Elastic has 30GB RAM, keep your indices smaller than that.
If some of your indices grow larger, then consider either:
- Increasing the number of shards (even when working with a single instance).
- Changing the index rotation to hourly instead of daily.
Both of these settings can be found under the source settings (elasticsearch.number_of_shards and elasticsearch.index_time_interval). Increasing the number of shards is almost always better, but if the daily volume of a source is more than 20 times the memory allocated to Elastic, switch to hourly indices.
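The 20x rule of thumb can be sanity-checked with quick arithmetic. The numbers below are hypothetical (30GB allocated to Elastic, one source ingesting 700GB per day):

```shell
# Rule of thumb from above: switch to hourly indices when a source's
# daily volume exceeds 20x the memory allocated to Elastic.
elastic_ram_gb=30
daily_volume_gb=700
threshold_gb=$((elastic_ram_gb * 20))   # 600
if [ "$daily_volume_gb" -gt "$threshold_gb" ]; then
  echo "set elasticsearch.index_time_interval to hourly"
else
  echo "daily indices are fine; increase elasticsearch.number_of_shards instead"
fi
```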
Remove/disable read-heavy modules
The heavy readers are:
- Custom Alerts (especially ones querying event-* or with lengthy periods)
- ARCA modules (entity analysis, highlight analysis)
Also, each alert creation involves querying Elastic, so make sure you're not generating too many alerts (a small number of incidents might be misleading - check the number of alerts per incident). If there are many hundreds of daily alerts, consider tweaking the anomaly-detection engine.
Steps to Reduce Disk Usage
Compress large indices
Under source-settings, change elasticsearch.index_codec to be best_compression. Note that this will only take effect for new indices.
Assessing disk performance
There are several ways to do this, but the recommended one is to run iostat -xd while the system is running. Check the disk holding the Elastic data and look at the r_await and w_await columns. Decent values are up to a very few milliseconds; ten milliseconds or more means the disk is too slow.
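To pull just the relevant columns out of the iostat -xd report, something like the following works. It is a sketch: column positions differ between sysstat versions, which is why the awk locates r_await and w_await by header name rather than by position.

```shell
# Print device, r_await and w_await from extended iostat output,
# finding the columns by header name (positions vary across versions).
iostat -xd | awk '
  /r_await/ { for (i = 1; i <= NF; i++) { if ($i == "r_await") r = i; if ($i == "w_await") w = i }; next }
  r && NF   { print $1, $r, $w }
'
```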
Configure Incident and Detection History Retention Settings
Incidents & Anomaly-Detection history usually takes much less space than raw data; the default retention for both is 120 days.
You can review how much space is being used for storing incidents, and reconfigure the retention, under Settings --> Storage:
To review the size, expand the “Database table sizes” section.
To change a setting, click on the number itself, then enter the new number:
Careful! The change takes effect immediately – once you approve the new number, older incidents are deleted.
Managing Anomaly-Detection Models Retention
It is very important to choose the right setting, as the retention of anomaly-detection models cannot be modified without deleting all existing data.
This setting is controlled via the YAML configuration file on the server.
Connect to the server, then open the configuration file:
sudo vi /opt/loom/.conf/loom_config.yaml
In this file, under the graphite section, change retentions to the desired value.
The retentions format is frequency:history.
Frequencies and histories are specified using the following suffixes:
- s – second
- m – minute
- h – hour
- d – day
- w – week
- y – year
For example, 1m:21d will store a data-point every minute, and keep the data-points for 21 days.
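Putting it together, the relevant fragment of loom_config.yaml might look like the following. The exact nesting around the graphite section may differ in your installation; only the retentions value is the part to change.

```yaml
graphite:
  # One data-point per minute, kept for 21 days (format: frequency:history)
  retentions: 1m:21d
```

At that resolution, a single series holds 21 × 24 × 60 = 30,240 data-points.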