InfluxDB and Telegraf Overview
If you need high-efficiency, high-performance storage, InfluxDB is a database worth evaluating. It is designed to handle monitoring, real-time analytics, Internet of Things (IoT) sensor data, and application metrics at high volumes. Telegraf, likewise, is a tool for collecting metrics and data that can support these high-demand environments. In this article, I'll explore the basics of InfluxDB and Telegraf and some key considerations to be aware of when choosing these open source technologies.
- What Is InfluxDB?
- How Does InfluxDB Work?
- What is Telegraf?
- How Does Telegraf Work?
- InfluxDB and Telegraf Concerns
What Is InfluxDB?
InfluxDB is an open source time series database (TSDB). It includes APIs for storing and querying data, background processing for ETL, monitoring, and alerting, user dashboards, tools for visualizing and exploring the data, and more.
Although it can be used for large deployments, InfluxDB excels at smaller ingestion rates. Benchmarks published by two of its major competitors (TimescaleDB and TDengine) both support this.
The true power of InfluxDB (at least InfluxDB 1.x) has traditionally been in pairing it with the rest of the TICK stack: Telegraf, InfluxDB, Chronograf, and Kapacitor.
It should be noted that while some functionality is available only in InfluxData’s enterprise version, both InfluxDB 2.x and 3.x remain under the MIT and Apache licenses. Also, InfluxDB is moving from Go to Rust, though unless you plan to get involved in fixing bugs, this is probably not a huge concern.
How Does InfluxDB Work?
Depending on the environment, compression can be very important for optimizing writes to InfluxDB. Data should be compressed both when it is transferred between pipeline nodes and when it is stored on them; uncompressed, 5-10x more data gets transferred and stored. We confirmed that the pipeline discussed in this article is configured to compress data effectively.
You can customize the InfluxDB configuration by using influxd command-line flags, setting environment variables, or defining options in the configuration file.
By default, the configuration file is located at /etc/influxdb/config.toml (INFLUXD_CONFIG_PATH=/etc/influxdb/config.toml).
To change the storage engine location, set the engine-path option in the configuration file, for example engine-path = "/engine".
See https://docs.influxdata.com/influxdb/v2.7/reference/config-options/#engine-path for details.
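As a minimal sketch of the three mechanisms described above, the same storage engine setting could be applied in any of the following ways. The --engine-path flag and the INFLUXD_ENGINE_PATH variable are the flag and environment-variable forms of the engine-path option on the documentation page linked above. The function names are hypothetical wrappers used only for illustration, and nothing starts influxd until you call one of them.

```shell
#!/usr/bin/env bash
# Three ways to set the storage engine path for InfluxDB 2.x.
# "/engine" is the example value used in the article.
ENGINE_PATH="/engine"

# 1. influxd command-line flag
start_with_flag() {
  influxd --engine-path="$ENGINE_PATH"
}

# 2. Environment variable, set only for this influxd process
start_with_env() {
  INFLUXD_ENGINE_PATH="$ENGINE_PATH" influxd
}

# 3. Configuration file: print the TOML line to place in config.toml
config_file_line() {
  printf 'engine-path = "%s"\n' "$ENGINE_PATH"
}

config_file_line
```

Whichever form you choose, it is worth keeping the setting in one place so that a flag or environment variable does not silently disagree with the configuration file.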
The influx “backup” command backs up data stored in InfluxDB to a specified directory. We tested it, and the process took 10 minutes per 1.5 GB of data. If the “System_metrics_15s” bucket holds 150 GB on average, which is 100 times the tested volume, the anticipated backup time is 10 x 100 = 1,000 minutes (almost 17 hours). That is too slow.
sudo influx backup /mnt/INFLUX_BACKUP/backup_$(date '+%Y-%m-%d_%H-%M') -t xBp3km-FD83Do308o8Z8Vkv-2vwec1LelMnSa-QQH6p8lcQSDN1REv59Brlv0hM0oRvzoGZAg4xrZswL0wRdDA== -b openlogic --compression gzip
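The backup estimate above can be reproduced with a few lines of shell arithmetic. This is only a sketch: it assumes backup time scales linearly with data size, and the function and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Estimate backup duration by scaling the measured rate linearly.
# Measured in the article: 10 minutes for 1.5 GB.
MEASURED_MINUTES=10
MEASURED_TENTHS_OF_GB=15   # 1.5 GB, expressed in tenths of a GB to keep integer math

estimate_backup_minutes() {
  local target_gb=$1
  echo $(( target_gb * 10 * MEASURED_MINUTES / MEASURED_TENTHS_OF_GB ))
}

minutes=$(estimate_backup_minutes 150)
echo "Estimated backup time: ${minutes} minutes (about $(( minutes / 60 )) h $(( minutes % 60 )) min)"
```

For 150 GB this gives 1,000 minutes, or about 16 hours 40 minutes, which matches the figure in the text.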
Another option is logical volume (LVM) snapshots.
While the real-world process may differ significantly (for example, it could be extended by data offloading to external storage), a simplified backup procedure can be done in five steps:
1. Create a Logical Volume and Move All Influx Data Into It.
2. Stop InfluxDB Service and Create a Snapshot:
sudo lvcreate -s -L 200GB -n mysnap1 /dev/mapper/Influx_VG-Influx_LV
Check that the snapshot is “active”:
sudo lvdisplay /dev/Influx_VG/Influx_LV
Mind the size of the snapshot: if it is undersized relative to the original volume, it can fill up and become invalid (inactive). Size your volume groups and snapshots accordingly.
3. How to Restore: If you need to continue with the snapshotted data, unmount the original volume and mount the snapshot in its place.
sudo umount /mnt/INFLUX_DATA
sudo mount /dev/Influx_VG/mysnap1 /mnt/INFLUX_DATA
4. Test Snapshot: Stop the service, mount the snapshot, then connect to the console and check that the connection is successful. If it is, you have valid data to restore.
sudo mount /dev/Influx_VG/mysnap1 /mnt/INFLUX_DATA
5. Revert to Origin: Now revert to the original volume:
sudo umount /mnt/INFLUX_DATA
sudo mount -a # picks up the fstab configuration
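The snapshot steps above could be collected into a small script along the following lines. This is a sketch under several assumptions rather than a tested procedure: the systemd unit is assumed to be named influxdb, the service is assumed to be restarted after each step, the snapshot size is written as 200G, and the function names and the DRY_RUN switch are hypothetical. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -e

DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands only, 0 = execute them
VG="Influx_VG"               # volume group from the article
LV="Influx_LV"               # logical volume holding the InfluxDB data
SNAP="mysnap1"               # snapshot name
SNAP_SIZE="200G"             # size the snapshot for the changes you expect
MOUNT_POINT="/mnt/INFLUX_DATA"
SERVICE="influxdb"           # assumed systemd unit name

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

create_snapshot() {
  run sudo systemctl stop "$SERVICE"
  run sudo lvcreate -s -L "$SNAP_SIZE" -n "$SNAP" "/dev/$VG/$LV"
  run sudo lvdisplay "/dev/$VG/$LV"      # check the snapshot status is active
  run sudo systemctl start "$SERVICE"
}

mount_snapshot() {
  run sudo systemctl stop "$SERVICE"
  run sudo umount "$MOUNT_POINT"
  run sudo mount "/dev/$VG/$SNAP" "$MOUNT_POINT"
  run sudo systemctl start "$SERVICE"
}

revert_to_origin() {
  run sudo systemctl stop "$SERVICE"
  run sudo umount "$MOUNT_POINT"
  run sudo mount -a                      # remounts the origin from fstab
  run sudo systemctl start "$SERVICE"
}

create_snapshot
```

Running it as is prints the four commands for the snapshot step; setting DRY_RUN=0 would execute them, which should only be done after reviewing the names and sizes against your own volume layout.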
What is Telegraf?
Telegraf is a server-based agent for collecting and sending metrics and events from databases, systems, and IoT sensors. It is written in Go, compiles into a single binary with no external dependencies, and requires a minimal memory footprint.
How Does Telegraf Work?
In a word, plugins. The Telegraf GitHub describes the tool as “The plugin-driven server agent for collecting and reporting metrics.” The code for most of the plugins is available at https://github.com/influxdata/telegraf/tree/release-1.28/plugins (replace the release version if you want a different version). A look in that directory shows the different types of plugins: aggregators, common, inputs, outputs, parsers, processors, secretstores, and serializers. The plugin directory breaks them down into input, output, aggregator, processor, and external. Both ways of breaking them down are a little confusing, but we will leave the code for another time and discuss the breakdown in the plugin directory.
One confusing thing about the plugin directory is that the plugin types are not mutually exclusive. For example, all 12 of the “external” plugins are also input plugins. Input plugins make up the majority of plugins, with 255 as opposed to 59 output plugins.
So what makes them external? These 12 plugins are not hosted in the Telegraf project. For example, the BigBlueButton input plugin code is hosted at https://github.com/bigblueswarm/bigbluebutton-telegraf-plugin/blob/main/README.md
Another confusing thing is that not all the plugins make it into the directory. While there are 12 external plugins in the plugin directory, the source code for 1.28 lists several others, including ones that may be of interest to OpenLogic customers, such as libvirt.
The plugin types can also be combined. You can build a flow that looks like this: input plugins -> processors/aggregators -> outputs.
This leaves the parsers, serializers, secretstores, and commons. The parsers look at different kinds of data, such as csv, avro, or json.
Once a plugin hands off responsibility to Telegraf, Telegraf is written to make substantial use of the CPU without keeping large chunks of data in RAM, as noted in the screenshot below.
In our work with customers, we have tested three main topologies:
1. A single configuration file with several instances of the same plugin type.
2. Telegraf instances installed on different physical machines.
3. Several Telegraf instances executed as parallel processes from the same configuration file.
We have determined that in many use cases, from a “cost/performance (value)” point of view, option #3 is the best. In our lab tests, option #1 could not keep up with data coming from 25K devices.
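As an illustration of option #3, several Telegraf processes can be started from one shared configuration file with a simple loop. This is a minimal sketch, not the setup used in our lab: the function name, the TELEGRAF_BIN override, and the configuration path in the example are assumptions. Only the --config flag is Telegraf's own.

```shell
#!/usr/bin/env bash
# Start N Telegraf processes from one shared configuration file.
launch_telegraf_instances() {
  local count=$1 config=$2
  local bin="${TELEGRAF_BIN:-telegraf}"   # hypothetical override, handy for dry runs
  local i
  for (( i = 1; i <= count; i++ )); do
    "$bin" --config "$config" &           # each instance runs as its own process
  done
  wait                                    # block until every instance exits
}

# Example (assumed path):
# launch_telegraf_instances 3 /etc/telegraf/telegraf.conf
```

In production you would more likely run each instance under a process supervisor such as systemd so that a crashed instance is restarted, but the principle of many processes sharing one file is the same.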
Below is the version of the telegraf.conf file that processed data best in our environment. Bold items will not work when used outside of the [agent] block:
[global_tags]

[agent]
interval = "10s" # no effect on kafka input
round_interval = true # no effect on kafka input
metric_batch_size = 10000 # there is impact
metric_buffer_limit = 1000000 # no direct impact
collection_jitter = "0s" # no impact
flush_interval = "10s" # no impact
flush_jitter = "0s" # no impact
precision = "0s" # no impact
logtarget = "file"
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false
debug = true

[[inputs.internal]]
collect_memstats = true

[[inputs.kafka_consumer]]
brokers = ["10.250.49.19:9092","10.250.49.22:9092","10.250.49.3:9092","10.250.49.15:9092","10.250.49.23:9092","10.250.49.10:9092"]
topics = ["raw-host-influx-metrics-nonprod-linux"]
consumer_group = "telegraf_1_metrics_consumers"
offset = "oldest"
max_message_len = 1000000 # metric_batch_size=10000
max_undelivered_messages = 20000 # metric_buffer_limit = 250000
#data_type = "string"
data_format = "json"
#data_format = "value"
json_strict = true
json_query = ""
tag_keys = [ "cpu","device","mode","url" ]
json_string_fields = ["domainname","machine","nodename","release","sysname","version"]
json_name_key = "measurement"
json_time_key = "@timestamp"
json_time_format = "unix"
json_timezone = ""
max_processing_time="400ms"
compression_codec = 1

#[[outputs.file]]
# files = ["stdout"]

[[outputs.influxdb_v2]]
urls = ["http://10.250.49.15:8086"]
token = "9D2VyE_EypMLmrDTYsmBC_Cxbfgt5RTf8yKRU42Of_Z-OSfMbJ9Da3xLrTgBAuE9aKr0OHbArw-7CdqVBzO2TA=="
organization = "openlogic"
namedrop = ["internal_*"]
bucket = "System_metrics_15s"
content_encoding = "gzip"

[[outputs.influxdb_v2]]
urls = ["http://10.250.49.185:8086"]
token = "9D2VyE_EypMLmrDTYsmBC_Cxbfgt5RTf8yKRU42Of_Z-OSfMbJ9Da3xLrTgBAuE9aKr0OHbArw-7CdqVBzO2TA=="
organization = "openlogic"
bucket = "telegraf_internal"
namepass = ["internal_*"]
As you can see if you read the config closely, our use case involved Apache Kafka. Each running Telegraf process (instance) is an individual consumer from a Kafka point of view. Kafka assigns topic partitions to the consumers in a group so that each partition is assigned to exactly one consumer in the group. This ensures that records are processed in parallel and that consumers do not step on each other's toes, so to speak. If we have two partitions and only one consumer, the consumer reads from both. That is why there are 18 Telegraf instances in the pipeline: the number of Telegraf instances matches the number of Kafka topic partitions. Eighteen Telegraf instances have been proven experimentally to keep up with data coming from 25,000 scraper processes.
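The relationship between partitions and consumers can be illustrated with a toy model. The sketch below uses a simple round-robin rule, which is an assumption made for illustration: Kafka's actual assignment strategies are more involved, and the function name is hypothetical. It only shows that every partition ends up with exactly one consumer in the group.

```shell
#!/usr/bin/env bash
# Simplified round-robin assignment of topic partitions to the consumers
# (Telegraf instances) of one consumer group.
assign_partitions() {
  local partitions=$1 consumers=$2
  local p
  for (( p = 0; p < partitions; p++ )); do
    echo "partition $p -> consumer $(( p % consumers ))"
  done
}

assign_partitions 2 1     # one consumer reads both partitions
assign_partitions 18 18   # one partition per Telegraf instance
```

In this model, as in Kafka, adding more consumers than there are partitions gains nothing because the extra consumers receive no partition, which is why the instance count is matched to the partition count.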
Kafka acts as a buffer layer here. If the pipeline is running at its maximum limits in an environment, or if the InfluxDB server experiences a long downtime:
- Kafka will keep on buffering the incoming records, i.e., the lag will be growing.
- Once InfluxDB operations have been restored, it will take time to process the lag, and the bigger the lag is, the longer it will take to catch up.
- It will not be possible to query the records that are within the lag timeframe. If, for example, the lag is one hour, the most recent data available will be more than one hour old.
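To get a feel for how long catching up takes, a rough estimate is the lag divided by the difference between the rate at which the pipeline drains records and the rate at which new records arrive. The sketch below assumes both rates are constant, which real pipelines rarely are. The function name and the numbers in the example are hypothetical and are not measurements from our environment.

```shell
#!/usr/bin/env bash
# Rough catch-up estimate after an outage, assuming constant rates.
#   lag      records waiting in Kafka
#   arrival  records per second still arriving
#   drain    records per second the pipeline can write to InfluxDB
catch_up_seconds() {
  local lag=$1 arrival=$2 drain=$3
  local net=$(( drain - arrival ))
  if (( net <= 0 )); then
    echo "never"   # the lag cannot shrink if writes are no faster than arrivals
    return 1
  fi
  echo $(( (lag + net - 1) / net ))   # round up to whole seconds
}

# Hypothetical numbers: 3,600,000 records of lag, 1,000/s arriving, 1,500/s draining
catch_up_seconds 3600000 1000 1500
```

With these example numbers the estimate is 7,200 seconds, or two hours. A pipeline that already runs near its limit has very little headroom, so even a modest lag can take a long time to clear.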
InfluxDB and Telegraf Concerns
Many of our readers will be familiar with Kubernetes and operators. Telegraf has an operator, but as of this writing, it has not had a release in over a year. As a matter of policy, we tend to consider a project dead after a year without a release. However, the project has had commits in that timeframe, and the last release is only just over a year old. Please check the Telegraf operator GitHub at the time of your implementation and decide whether you are comfortable with the security implications of a project that lacks frequent releases.
In contrast, both the Influxdb-operator and the Influxdata-operator are archived on GitHub. InfluxData offers a hosted solution, so development of those operators is likely now in-house.
It’s important to note that Influx projects are not maintained by a third party such as the Linux Foundation/CNCF or the Apache Software Foundation. Licensing changes at MongoDB, Elasticsearch, MariaDB, and the HashiCorp software suite should give people pause before relying on a technology. OpenLogic has a Professional Services team that can be engaged for certain security fixes; however, these are evaluated on a case-by-case basis. While OpenLogic provides EOL support for certain versions of Java, CentOS, and AngularJS, these are specific projects. As noted at the beginning of the article, InfluxDB and Telegraf remain under open source licenses.