diff --git a/notebooks/05-operations/01-druid-logs-configuration.ipynb b/notebooks/05-operations/01-druid-logs-configuration.ipynb new file mode 100644 index 00000000..e708a41b --- /dev/null +++ b/notebooks/05-operations/01-druid-logs-configuration.ipynb @@ -0,0 +1,522 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c9fc0614-4eec-4c39-ad22-ea10afa00e9d", + "metadata": {}, + "source": [ + "# Apache Druid logging\n", + "\n", + "Log configuration and maintenance is an essential task for operators of an Apache Druid cluster. Druid uses the [Apache Log4j](https://logging.apache.org/log4j/2.x/) logging framework to emit logs that are useful for day-to-day monitoring and for troubleshooting. They not only enable you to investigate issues and solve problems, but to understand how each of Druid processes work in isolation and in collaboration with one another.\n", + "\n", + "In this notebook, you will take a tour of the out-of-the-box Log4J configuration in Apache Druid and use terminal commands to locate and examine its contents." + ] + }, + { + "cell_type": "markdown", + "id": "cd9a55db-61ff-4fa4-9655-2fef292ed2aa", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "This tutorial works with Druid 29.0.0 or later, and is designed to be run on a Mac.\n", + "\n", + "Launch this tutorial and all prerequisites using the `jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).\n", + "\n", + "__DO NOT__ use the `jupyter-druid` profile with this tutorial as it will conflict with your locally running copy." + ] + }, + { + "cell_type": "markdown", + "id": "151ede96-348f-423c-adc7-0e23c6910b9e", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "To use this notebook, you must have Druid running locally.\n", + "\n", + "You will also make extensive use of the terminal, which you can place alongside this notebook or on another screen.\n", + "\n", + "### Install required tools\n", + "\n", + "Open a local terminal window.\n", + "\n", + "If you haven't installed `wget` or `multitail` yet, run the following commands to install them using `brew`.\n", + "\n", + "```bash\n", + "brew install multitail ; brew install wget\n", + "```\n", + "\n", + "To fetch the default configuration for `multitail` to your home folder, execute the following command. 
Skip this step if you are already running `multitail` as it will overwrite your own configuration.\n", + "\n", + "```bash\n", + "curl https://raw.githubusercontent.com/halturin/multitail/master/multitail.conf > ~/.multitailrc\n", + "```\n", + "\n", + "### Install Apache Druid\n", + "\n", + "Run the following to create a dedicated folder for learn-druid in your home directory:\n", + "\n", + "```bash\n", + "cd ~ ; mkdir learn-druid-local\n", + "cd learn-druid-local\n", + "```\n", + "\n", + "Pull and extract a compatible version of Druid.\n", + "\n", + "```bash\n", + "wget https://dlcdn.apache.org/druid/29.0.0/apache-druid-29.0.0-bin.tar.gz\n", + "tar -xzf apache-druid-29.0.0-bin.tar.gz\n", + "```\n", + "\n", + "Use the following commands to rename the folder.\n", + "\n", + "```bash\n", + "mv apache-druid-29.0.0 apache-druid\n", + "cd apache-druid\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "1d07feb7-92dd-42a9-ad6c-c89d5fdad8d5", + "metadata": {}, + "source": [ + "# Review the log file configuration\n", + "\n", + "The log file configuration is set in the `log4j2.xml` alongside Druid configuration files.\n", + "\n", + "Run this command to view the `auto` configuration file for logs used by the `learn-druid` script:\n", + "\n", + "```bash\n", + "more ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "The [`Configuration`](https://logging.apache.org/log4j/log4j-2.4/manual/configuration.html#ConfigurationSyntax) element contains the following elements:\n", + "\n", + "* [`Properties`](https://logging.apache.org/log4j/2.x/manual/configuration.html#PropertySubstitution) provide key/values pairs that may be used throughout the configuration file.\n", + "* [`Appenders`](https://logging.apache.org/log4j/2.x/manual/appenders.html) designate the format of log messages and determine the target for the messages.\n", + "* [`Loggers`](https://logging.apache.org/log4j/2.x/manual/configuration.html#Loggers) filter the log messages and dispense them to Appenders. Loggers can filter messages based on the Java package and/or class and by message priority." + ] + }, + { + "cell_type": "markdown", + "id": "ca62693b-50f4-4831-a092-dbc8e06d8191", + "metadata": {}, + "source": [ + "## Properties\n", + "\n", + "Druid uses the `Properties` element to set a location for all logs. The location for log files is available at start-up.\n", + "\n", + "By default, this location is a \"log\" folder at the root of your Druid installation. You can overwrite it using the [log directory](https://druid.apache.org/docs/latest/configuration/logging/#log-directory)." 
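+ "\n",
+ "If you would rather keep logs outside the Druid installation folder, you can override this location before starting Druid. The following is a minimal sketch, not a required step for this tutorial; it assumes the `DRUID_LOG_DIR` environment variable described at the link above, and the folder name is only an example.\n",
+ "\n",
+ "```bash\n",
+ "# Hedged sketch: DRUID_LOG_DIR overrides the default \"log\" folder for any\n",
+ "# Druid instance launched from this shell session.\n",
+ "export DRUID_LOG_DIR=~/learn-druid-local/druid-logs\n",
+ "mkdir -p \"$DRUID_LOG_DIR\"\n",
+ "\n",
+ "# Confirm the variable is set before starting Druid.\n",
+ "echo \"Druid logs will be written to $DRUID_LOG_DIR\"\n",
+ "```"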
+ ] + }, + { + "cell_type": "markdown", + "id": "bec19737-2773-4c66-a05e-d0b1a707c4c8", + "metadata": {}, + "source": [ + "## Appenders\n", + "\n", + "There are two `appenders`:\n", + "\n", + "* \"Console\": [`Console`](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#ConsoleAppender) appender for `SYSTEM_OUT`\n", + "* \"FileAppender\": [`RollingRandomAccessFile`](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender) appender for detailed process logs\n", + "\n", + "### Start a Druid instance\n", + "\n", + "The default Configuration for Druid does not include a [`monitorInterval`](https://logging.apache.org/log4j/log4j-2.4/manual/configuration.html#AutomaticReconfiguration) property, so changes to logging configuration are only recognised when a process restarts.\n", + "\n", + "Run the following command to add a `monitorInterval` property to the Configuration:\n", + "\n", + "```bash\n", + "sed -i '' 's///' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Start Druid with the following command:\n", + "\n", + "```bash\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", + "```\n", + "\n", + "### Look at the standard log files\n", + "\n", + "Since Druid is a distributed system, it contains log files for each Druid process. In addition, Druid also captures the output written to the standard output. Use the following command to take a look at what is here:\n", + "\n", + "```bash\n", + "cd ~/learn-druid-local/apache-druid/log ; ls\n", + "```\n", + "\n", + "This results in two sets of files being created:\n", + "\n", + "* `process name.stdout.log`: file containing information written by the processes to the output stream\n", + "* `process name.log`: file containing status, error, warning, and debug messages" + ] + }, + { + "cell_type": "markdown", + "id": "5258912e-bf23-4acd-9267-a7f4ea3bc6f4", + "metadata": {}, + "source": [ + "### Log filenames\n", + "\n", + "The \"FileAppender\" `RollingRandomAccessFile` appender has both `fileName` and `filePattern` properties. 
The `fileName` property sets the name of the log file being written to at the moment, while `filePattern` is applied when the log rolls over.\n", + "\n", + "Run the following command to change the default filename of the \"FileAppender\" in the Log4j configuration file so that all future detailed process logs will have a name suffixed with the hostname.\n", + "\n", + "```bash\n", + "sed -i '' 's/{sys:druid.node.type}.log/{sys:druid.node.type}-${hostName}.log/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "List the contents of the log folder again to see the changes:\n", + "\n", + "```bash\n", + "ls\n", + "```\n", + "\n", + "Revert the change with the following command:\n", + "\n", + "```bash\n", + "sed -i '' 's/{sys:druid.node.type}-${hostName}.log/{sys:druid.node.type}.log/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Since \"FileAppender\" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender), you can adjust the `TimeBasedTriggeringPolicy` to change when `fileName` log files are rolled over to `filePattern` log files.\n", + "\n", + "Run the following in your terminal to adjust the `filePattern` to include the hours and minutes:\n", + "\n", + "```bash\n", + "sed -i '' 's/{yyyyMMdd}/{yyyyMMdd-HH:mm}/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Run this command to see the new file pattern:\n", + "\n", + "```bash\n", + "grep filePattern ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Since the `TimeBasedTriggeringPolicy` is set to 1 by default, a change in the least granular element of the `filePattern` triggers a rollover.\n", + "\n", + "Run the following command a few times to see files being created every minute:\n", + "\n", + "```bash\n", + "ls -l\n", + "```\n", + "\n", + "Use other Log4j policies in the usual way. For example, you can adjust the `filePattern` to `%d{yyyyMMdd-HH:mm}-%i.log` and then use [`SizeBasedTriggeringPolicy`](https://logging.apache.org/log4j/2.x/manual/appenders.html#sizebased-triggering-policy) instead of a `TimeBasedTriggeringPolicy` to have log files emitted when a log file hits a particular size, rather than based on the timestamp." 
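+ "\n",
+ "As a hedged sketch of that approach (not a step in this tutorial, and the exact policy text in your `log4j2.xml` may differ, so check it with `grep` before editing), the swap could look like this:\n",
+ "\n",
+ "```bash\n",
+ "# Inspect the current triggering policy line; the attributes in your copy of\n",
+ "# log4j2.xml may differ from what the sed expression below assumes.\n",
+ "grep -n 'TriggeringPolicy' ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n",
+ "\n",
+ "# Hypothetical swap: roll each process log over at roughly 10 MB rather than\n",
+ "# on a time boundary. Adjust the matched text to whatever grep reported, and\n",
+ "# remember to add the %i counter to filePattern as described above.\n",
+ "sed -i '' 's|<TimeBasedTriggeringPolicy interval=\"1\"/>|<SizeBasedTriggeringPolicy size=\"10 MB\"/>|' \\\n",
+ "  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n",
+ "```"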
+ ] + }, + { + "cell_type": "markdown", + "id": "4cab489e-aa10-4482-a62d-3c106bdb06cc", + "metadata": {}, + "source": [ + "### Log patterns\n", + "\n", + "The content of log files is specified in the [`PatternLayout`](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout).\n", + "\n", + "Run the following command to see the current Log4J configuration for `PatternLayout` for the \"Console\":\n", + "\n", + "```bash\n", + "xmllint -xpath Configuration/Appenders/Console/PatternLayout ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Each element refers to a specific piece of recordable information.\n", + "\n", + "* Timestamp (`%d{ISO8601}`)\n", + "* Log level (`%p`)\n", + "* Thread name (`[%t]`)\n", + "* Logger name (`%c`)\n", + "* Message (`%m%n`)\n", + "\n", + "Run this command to follow along as log entries are added to the process logs for the Coordinator, Overlord, and Broker:\n", + "\n", + "```bash\n", + "multitail -CS log4jnew -du -P a \\\n", + " -f coordinator-overlord.log \\\n", + " -f broker.log\n", + "```\n", + "\n", + "You can assign different log levels to different entries in the log to indicate how severe a log entry is. In the log, you will see entries with these levels:\n", + "\n", + "* FATAL (system failure)\n", + "* ERROR (functional failure)\n", + "* WARN (non-fatal issue)\n", + "* INFO (notable event)\n", + "* DEBUG (program debugging messages)\n", + "* TRACE (highly granular execution event)\n", + "\n", + "The thread name and logger name are helpful for diagnosis, especially for WARN and ERROR conditions.\n", + "\n", + "Messages provide insights into events, process states, and significant variables within the system.\n", + "\n", + "In your `multitail` window, press `q` to quit and return to the terminal." + ] + }, + { + "cell_type": "markdown", + "id": "df2f17b3-f359-4df8-a682-7398e7d2adf8", + "metadata": {}, + "source": [ + "### Log retention\n", + "\n", + "Since \"FileAppender\" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender), you can adjust `DefaultRolloverStrategy` to control retention of logs by adjusting the `Delete` section.\n", + "\n", + "`IfFileName` and `IfLastModified` are used in conjunction to remove any files from the log folder that match the rules. The default is to remove matching files older than two months, based on the date last modified.\n", + "\n", + "Run the following command to change the trigger policy to a [duration](https://logging.apache.org/log4j/2.x/javadoc/log4j-core/org/apache/logging/log4j/core/appender/rolling/action/Duration.html#parseCharSequence) of \"PT2M\". 
As a consequence, any files matching the `IfFileName` filter older than two minutes will be deleted automatically by Log4J.\n", + "\n", + "Looking back at your previous `ls -l` results, run this command and repeat to see how the `Delete` rule is applied.\n", + "\n", + "```bash\n", + "sed -i '' 's/IfLastModified age=\"7d\"/IfLastModified age=\"PT2M\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Notice how the Overlord removes all files, not just the rollover logs.\n", + "\n", + "```bash\n", + "ls\n", + "```\n", + "\n", + "Run the following command to change the retention policy to one day:\n", + "\n", + "```bash\n", + "sed -i '' 's/IfLastModified age=\"PT2M\"/IfLastModified age=\"P1D\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Run the following command to return the file pattern to the default:\n", + "\n", + "```bash\n", + "sed -i '' 's/{yyyyMMdd-HH:mm}/{yyyyMMdd}/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "4e6e8716-621c-4c94-90f2-360f568629d1", + "metadata": {}, + "source": [ + "## Loggers\n", + "\n", + "This section of the configuration controls what types of events are logged and what data is recorded.\n", + "\n", + "Before beginning, restart your Druid instance to recreate any logs that were deleted by the retention rules above.\n", + "\n", + "Run the following command to kill your Druid instance:\n", + "\n", + "```bash\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "Restart Druid with the following command:\n", + "\n", + "```bash\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "da8e5519-88d7-4da1-982d-c8ba6fe4e83c", + "metadata": {}, + "source": [ + "### Logging level\n", + "\n", + "The out-of-the-box Druid configuration for Log4J sets a `Root` level for the `FileAppender` of `INFO`, meaning that only messages with a level of `INFO` and above are recorded.\n", + "\n", + "Run this command to see the configuration for your instance:\n", + "\n", + "```bash\n", + "xmllint -xpath Configuration/Loggers/Root ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Other base levels are set at a class level, reducing log noise. For example:\n", + "\n", + "```xml\n", + " \n", + " \n", + " \n", + " \n", + "```\n", + "\n", + "Run this command to monitor several log files:\n", + "\n", + "```bash\n", + "multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \\\n", + " -f coordinator-overlord.log \\\n", + " -f broker.log \\\n", + " -f middlemanager.log \\\n", + " -f historical.log\n", + "```\n", + "\n", + "Open a new terminal window and run this command to amend the base logging level for all Druid processes:\n", + "\n", + "```bash\n", + "sed -i '' 's/Root level=\"info\"/Root level=\"debug\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Within a short period of time you will see many more log messages appearing.\n", + "\n", + "Revert the logging level to INFO before proceeding.\n", + "\n", + "```bash\n", + "sed -i '' 's/Root level=\"debug\"/Root level=\"info\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Leave `multitail` running and your second terminal open." 
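+ "\n",
+ "You can also list the class-level `Logger` entries directly from the configuration file, using the same `xmllint` approach as for the `Root` element. If your build of `xmllint` expects the `--xpath` form, use that instead.\n",
+ "\n",
+ "```bash\n",
+ "# List every class-level Logger element in the active configuration to see\n",
+ "# which packages and classes have their own logging level.\n",
+ "xmllint -xpath Configuration/Loggers/Logger ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n",
+ "```"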
+ ] + }, + { + "cell_type": "markdown", + "id": "7e0a46b8-77f4-4792-9484-9acba0407b3d", + "metadata": {}, + "source": [ + "## Examples\n", + "\n", + "Two recommended approaches for working with log files are:\n", + "\n", + "1. Search individual log files for indications of a problem, such as WARN, ERROR, and FATAL, and of Java exceptions in stack traces (e.g. `java.net.ConnectionException: Connection refused`.\n", + "2. Read log files more like a novel, starting at the beginning of the file or some known intermediate point and follow the story logically.\n", + "\n", + "In the first approach, it's possible to work back through the history, using the time field as a key. You may notice large time gaps, or use filtering to remove any events from threads or classes that you know are not relevant to the error.\n", + "\n", + "The second approach requires more time, but helps to solve more complex problems. It's also an important learning aid when delving into how Druid processes collaborate to realize services.\n", + "\n", + "This section contains examples of using log files to diagnose issues with Druid." + ] + }, + { + "cell_type": "markdown", + "id": "f9137fc3-9f53-4887-a2b2-b8cabb061cc5", + "metadata": {}, + "source": [ + "### Failure in Apache ZooKeeper\n", + "\n", + "Druid processes rely on [Apache ZooKeeper](https://zookeeper.apache.org/) for [inter-process communication and configuration](https://druid.apache.org/docs/latest/dependencies/zookeeper.html). Run the following commands to simulate a failure in ZooKeeper.\n", + "\n", + "* The first `kill` prevents ZooKeeper from being recovered by the `learn-druid` script's `supervisor`.\n", + "* The second `kill` stops the ZooKeeper process.\n", + "\n", + "```bash\n", + "kill -STOP $(ps -ef | grep 'perl' | awk 'NF{print $2}' | head -n 1)\n", + "kill $(ps -ef | grep 'zoo' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "In your `multitail` window, you will now see multiple `INFO` messages relating to ZooKeeper connection issues, such as:\n", + "\n", + "```\n", + "2024-02-22T11:54:31,662 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x10001277b290002 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.\n", + "\n", + "java.net.ConnectException: Connection refused\n", + "```\n", + "\n", + "Run the following command to restart the \"supervisor\", which will restart ZooKeeper automatically.\n", + "\n", + "```bash\n", + "kill -CONT $(ps -ef | grep 'perl' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "You can use the `-e` switch on `multitail` to filter the logs that are displayed. For example, try closing your `multitail` window with `q` and run the following command to filter logs to only show WARN messages in the Coordinator and Overlord, and to ignore all INFO messages on the Broker, Middle Manager, and Historical. 
You may then want to repeat the above.\n", + "\n", + "```bash\n", + "multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \\\n", + " -e \"WARN\" -f coordinator-overlord.log \\\n", + " -ev \"INFO\" -f broker.log \\\n", + " -ev \"INFO\" -f middlemanager.log \\\n", + " -ev \"INFO\" -f historical.log\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "1023e5a3-cedc-435f-a448-30f8c3ae1535", + "metadata": {}, + "source": [ + "# Clean up\n", + "\n", + "Run the following command to stop Druid:\n", + "\n", + "```bash\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "Remove the `learn-druid-local` folder from your home folder in the usual way." + ] + }, + { + "cell_type": "markdown", + "id": "57d5921b-636e-4a85-ae4d-dc0239bfef0b", + "metadata": {}, + "source": [ + "## Learn more\n", + "\n", + "\n", + "For more information:\n", + "\n", + "* Watch [Druid optimizations for scaling customer facing analytics at Conviva](https://youtu.be/zkHXr-3GFJw?t=746) by Amir Youssefi and Pawas Ranjan from Conviva that describes how useful this information can be to tuning Druid clusters.\n", + "* Read about Druid [logging](https://druid.apache.org/docs/latest/configuration/logging.html) in the official documentation.\n", + "* See more ways to use and run `multitail` on the [official site](https://www.vanheusden.com/multitail/index.html).\n", + "* Read about Log4j in [Apache Logging Services](https://logging.apache.org/) documentation.\n", + "* Review [simple date format](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) for `filePattern`.\n", + "* Review [pattern layout](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout) for `PatternLayout`." + ] + }, + { + "cell_type": "markdown", + "id": "4c081295-750e-41a3-9693-7fe50af401fa", + "metadata": {}, + "source": [ + "Notice that the Logger Name shows a fully-qualified class name. Here are some examples of searches you can conduct on your logs to deepen your knowledge of what your instance is doing.\n", + "\n", + "| Log | Search Term |\n", + "|---|---|\n", + "| Any | __NodeRoleWatcher__
Across all the processes, watch as each one detects changes to the set of processes running in the cluster, and see how it responds. |\n", + "| Any | __org.apache.druid.initialization.Initialization__
These messages all relate to process start-up. It can be interesting to see exactly what each process does and whether it runs into issues. |\n", + "| Coordinator / Overlord | __org.apache.druid.metadata.SQLMetadataRuleManager__
This is the Coordinator polling the rules in the metadata database, getting ready to apply them. The log tells you how many rules it picks up and how many data sources they cover. |\n", + "| Coordinator / Overlord | __org.apache.druid.metadata.SqlSegmentsMetadataManager__
Messages show how many segments the cluster considers “used”, that is, available for queries. |\n", + "| Coordinator / Overlord | __org.apache.druid.indexing.overlord.RemoteTaskRunner__
Gives interesting information about what’s happening with ingestion resources in the cluster, including when they first advertise themselves. |\n", + "| Coordinator / Overlord | __org.apache.druid.server.coordinator.rules__
Provides plenty of information about how the retention rules are applied. |\n", + "| Coordinator / Overlord | __org.apache.druid.server.coordinator.duty.BalanceSegments__
Here you can see what Druid decides to do when balancing the workload, such as when a server is lost or added. |\n", + "| Historical | __org.apache.druid.server.coordination.BatchDataSegmentAnnouncer__
You can see individual segments being announced as available for query by each Historical server as it loads them. |\n", + "| Historical | __org.apache.druid.server.coordination.SegmentLoadDropHandler__
As well as seeing how the Historical checks its local segment cache on startup, you can watch as the Historical picks up the instructions from the Coordinator and then does something about them. When there are ERRORs like “Failed to load segment for dataSource”, you get traces about what the process was trying to do – quite often something pointing to an error with its connection to deep storage. |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa902836-0738-4290-b3e9-76ba2c7d53d0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}