Monitor your Databricks clusters via their multiple APIs!
This OneAgent extension allows you to collect metrics from the embedded Ganglia instance, the Apache Spark APIs, and/or the Databricks API on your Databricks cluster.
NOTE: Databricks Runtime v13+ no longer supports Ganglia; use the Spark and Databricks API options within the configuration instead.
This is intended for users who:
Have Databricks cluster(s) on which they would like to monitor job statuses and other important job- and cluster-level metrics
Want to analyze uptime and autoscaling issues of their Databricks cluster(s)
This enables you to:
Define in the configuration which metrics you'd like to collect from your Databricks Clusters
Set up a global init script on your Databricks Cluster to download the Dynatrace OneAgent
Restart your Databricks Cluster to enable the Dynatrace OneAgent and this extension.
Before you begin, enable the Extension Execution Controller in your environment settings (/ui/settings/builtin:eec.local) by turning on the first two options.
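If you'd rather script this than click through the UI, here is a minimal sketch using the Dynatrace Settings 2.0 API. The endpoint and schemaId come from the step above; the value field names and the HOST scope are assumptions, so check the builtin:eec.local schema in your environment for the exact keys.

```bash
# Hedged sketch: enable the Extension Execution Controller on one host via
# the Settings 2.0 API. ASSUMPTIONS: the value keys and the HOST scope below
# are illustrative -- verify them against the builtin:eec.local schema.
curl -X POST "https://<TENANT>.live.dynatrace.com/api/v2/settings/objects" \
  -H "Authorization: Api-Token <Settings-Write API-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '[{
        "schemaId": "builtin:eec.local",
        "scope": "HOST-0000000000000000",
        "value": { "enabled": true }
      }]'
```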
Create a Databricks API token from inside your Databricks workspace
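Token creation is normally done in the Databricks UI, but for reference, a sketch using the Databricks Token API is below; the lifetime and comment values are illustrative.

```bash
# Sketch: create a personal access token with the Databricks Token API,
# authenticating with an existing token. Lifetime (90 days) and comment
# are illustrative values.
curl -X POST "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/token/create" \
  -H "Authorization: Bearer <EXISTING-DATABRICKS-TOKEN>" \
  -d '{"lifetime_seconds": 7776000, "comment": "Dynatrace Databricks extension"}'
# The response's "token_value" field is the token to save for later steps.
```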
Copy your Databricks URL
Copy the Linux OneAgent installation wget command from your environment's deployment page (#install/agentlinux;gf=all)
NOTE: Databricks clusters can go up and down quickly, creating multiple HOST entities within Dynatrace. Databricks reuses IP addresses, so if you'd like to keep the same HOST entities for your clusters, add the flag --set-host-id-source="ip-addresses" to the OneAgent installation command in your global init script.
For example:

```bash
/bin/sh Dynatrace-OneAgent-Linux.sh --set-infra-only=true --set-app-log-content-access=true --set-host-id-source="ip-addresses"
```
Configuration for Apache Spark & Databricks API Metrics (Recommended)
Set up Global Init Script on Databricks Cluster
The Dynatrace-OneAgent-Linux.sh file can also be uploaded manually to your Databricks DBFS; in that case, modify the script below to use that location instead of the wget command.

```bash
#!/usr/bin/env bash
wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
/bin/sh Dynatrace-OneAgent-Linux.sh --set-infra-only=true --set-app-log-content-access=true --set-host-id-source="ip-addresses"
```
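If you manage init scripts programmatically, the script above can also be registered with the Databricks Global Init Scripts API rather than pasted into the UI; the local file name and script name below are illustrative.

```bash
# Sketch: register the init script above as a global init script via the
# Databricks API. ASSUMPTION: the script was saved locally as
# dynatrace-oneagent-init.sh; base64 -w0 is the GNU coreutils flag.
SCRIPT_B64=$(base64 -w0 dynatrace-oneagent-init.sh)
curl -X POST "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/global-init-scripts" \
  -H "Authorization: Bearer <DATABRICKS-TOKEN>" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"dynatrace-oneagent\", \"script\": \"${SCRIPT_B64}\", \"enabled\": true, \"position\": 0}"
```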
Configure the OneAgent extension in your Dynatrace environment from the Hub (/ui/hub/ext/com.dynatrace.databricks) -> Add Monitoring Configuration -> select your Databricks hosts -> select which feature sets of metrics you'd like to capture
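The same monitoring configuration can be created through the Extensions 2.0 API; treat the sketch below as a starting point, since the value payload (in particular the feature-set identifiers) is an assumption you should verify against the extension's configuration schema.

```bash
# Hedged sketch: create a monitoring configuration via the Extensions 2.0
# API. ASSUMPTIONS: the HOST scope and feature-set names are placeholders.
curl -X POST "https://<TENANT>.live.dynatrace.com/api/v2/extensions/com.dynatrace.databricks/monitoringConfigurations" \
  -H "Authorization: Api-Token <Extensions API-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '[{
        "scope": "HOST-0000000000000000",
        "value": { "enabled": true, "featureSets": ["<FEATURE-SET-NAME>"] }
      }]'
```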
Restart your Databricks Clusters
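Restarting can also be automated with the Databricks Clusters API; the cluster ID below is a placeholder.

```bash
# Sketch: restart a cluster via the Databricks Clusters API so the new
# global init script runs. Replace the placeholder cluster_id.
curl -X POST "https://adb-XXXXXXXXX.XX.azuredatabricks.net/api/2.0/clusters/restart" \
  -H "Authorization: Bearer <DATABRICKS-TOKEN>" \
  -d '{"cluster_id": "<CLUSTER-ID>"}'
```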
Verify that metrics show up on the HOST screen of your Databricks cluster's driver node. All the metrics will be attached to that HOST entity.
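You can also confirm ingestion from the command line with the Dynatrace Metrics API v2; the metric key below is one of the executor metrics listed later on this page.

```bash
# Sketch: query the last 30 minutes of one extension metric to confirm
# data points are arriving. Requires a token with the metrics.read scope.
curl -G "https://<TENANT>.live.dynatrace.com/api/v2/metrics/query" \
  -H "Authorization: Api-Token <Metrics-Read API-TOKEN>" \
  --data-urlencode "metricSelector=databricks.spark.executor.active_tasks" \
  --data-urlencode "from=now-30m"
```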
Configuration for Ganglia (Legacy)
Create a Dynatrace API token with ReadConfig permissions
Set up Global Init Script on Databricks Cluster
```bash
#!/usr/bin/env bash
wget -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
/bin/sh Dynatrace-OneAgent-Linux.sh --set-infra-only=true --set-app-log-content-access=true --set-host-id-source="ip-addresses"

# token with 'ReadConfig' permissions
wget -O custom_python_databricks_ganglia.zip "https://<TENANT>.live.dynatrace.com/api/config/v1/extensions/custom.python.databricks_ganglia/binary" --header="Authorization: Api-Token <ReadConfig API-TOKEN>"
unzip custom_python_databricks_ganglia.zip -d /opt/dynatrace/oneagent/plugin_deployment/

# Add Databricks workspace URL and token environment variables
cat <<EOF | sudo tee /etc/databricks_env
DB_WS_URL=https://adb-XXXXXXXXX.XX.azuredatabricks.net
DB_WS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX
EOF
```
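As a sanity check (assuming you can open a shell on the driver node after the init script runs), verify that the credentials written to /etc/databricks_env can reach the workspace API:

```bash
# Sketch: source the env file the init script wrote, then hit a simple
# Databricks endpoint with those credentials.
source /etc/databricks_env
curl -H "Authorization: Bearer ${DB_WS_TOKEN}" "${DB_WS_URL}/api/2.0/clusters/list"
```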
Create a Dynatrace API token with entities.read & entities.write permissions.
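Token creation can likewise be scripted with the Dynatrace API Tokens endpoint; the token name below is illustrative, and the call itself requires a token with the apiTokens.write scope.

```bash
# Sketch: create the entities.read / entities.write token via the API
# Tokens endpoint instead of the UI. Token name is a placeholder.
curl -X POST "https://<TENANT>.live.dynatrace.com/api/v2/apiTokens" \
  -H "Authorization: Api-Token <Admin API-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name": "databricks-extension", "scopes": ["entities.read", "entities.write"]}'
```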
Configure the OneAgent extension in your Dynatrace environment from the Hub (/ui/hub/ext/com.dynatrace.databricks)
Restart your Databricks cluster(s)
Verify that metrics are showing up on the included dashboard
Below is a complete list of the feature sets provided in this version. To ensure a good fit for your needs, individual feature sets can be activated and deactivated by your administrator during configuration.
Metric name | Metric key | Description | Unit |
---|---|---|---|
Databricks Cluster Upsizing Time | databricks.cluster.upsizing_time | Time spent upsizing cluster | MilliSecond |
Metric name | Metric key | Description | Unit |
---|---|---|---|
Executor RDD Blocks | databricks.spark.executor.rdd_blocks | - | Count |
Executor Memory Used | databricks.spark.executor.memory_used | - | Byte |
Executor Disk Used | databricks.spark.executor.disk_used | - | Byte |
Executor Active Tasks | databricks.spark.executor.active_tasks | - | Count |
Executor Failed Tasks | databricks.spark.executor.failed_tasks | - | Count |
Executor Completed Tasks | databricks.spark.executor.completed_tasks | - | Count |
Executor Total Tasks | databricks.spark.executor.total_tasks | - | Count |
Executor Duration | databricks.spark.executor.total_duration.count | - | MilliSecond |
Executor Input Bytes | databricks.spark.executor.total_input_bytes.count | - | Byte |
Executor Shuffle Read | databricks.spark.executor.total_shuffle_read.count | - | Byte |
Executor Shuffle Write | databricks.spark.executor.total_shuffle_write.count | - | Byte |
Executor Max Memory | databricks.spark.executor.max_memory | - | Byte |
Executor Alive Count | databricks.spark.executor.alive_count.gauge | - | Count |
Executor Dead Count | databricks.spark.executor.dead_count.gauge | - | Count |
Metric name | Metric key | Description | Unit |
---|---|---|---|
CPU User % | databricks.hardware.cpu.usr | - | Percent |
CPU Nice % | databricks.hardware.cpu.nice | - | Percent |
CPU System % | databricks.hardware.cpu.sys | - | Percent |
CPU IOWait % | databricks.hardware.cpu.iowait | - | Percent |
CPU IRQ % | databricks.hardware.cpu.irq | - | Percent |
CPU Steal % | databricks.hardware.cpu.steal | - | Percent |
CPU Idle % | databricks.hardware.cpu.idle | - | Percent |
Memory Used | databricks.hardware.mem.used | - | Byte |
Memory Total | databricks.hardware.mem.total | - | KiloByte |
Memory Free | databricks.hardware.mem.free | - | KiloByte |
Memory Buff/Cache | databricks.hardware.mem.buff_cache | - | KiloByte |
Metric name | Metric key | Description | Unit |
---|---|---|---|
Job Status | databricks.spark.job.status | - | Unspecified |
Job Duration | databricks.spark.job.duration | - | Second |
Job Total Tasks | databricks.spark.job.total_tasks | - | Count |
Job Active Tasks | databricks.spark.job.active_tasks | - | Count |
Job Skipped Tasks | databricks.spark.job.skipped_tasks | - | Count |
Job Failed Tasks | databricks.spark.job.failed_tasks | - | Count |
Job Completed Tasks | databricks.spark.job.completed_tasks | - | Count |
Job Active Stages | databricks.spark.job.active_stages | - | Count |
Job Completed Stages | databricks.spark.job.completed_stages | - | Count |
Job Skipped Stages | databricks.spark.job.skipped_stages | - | Count |
Job Failed Stages | databricks.spark.job.failed_stages | - | Unspecified |
Metric name | Metric key | Description | Unit |
---|---|---|---|
Stage Active Tasks | databricks.spark.job.stage.num_active_tasks | - | Count |
Stage Completed Tasks | databricks.spark.job.stage.num_complete_tasks | - | Count |
Stage Failed Tasks | databricks.spark.job.stage.num_failed_tasks | - | Count |
Stage Killed Tasks | databricks.spark.job.stage.num_killed_tasks | - | Count |
Stage Executor Run Time | databricks.spark.job.stage.executor_run_time | - | MilliSecond |
Stage Input Bytes | databricks.spark.job.stage.input_bytes | - | Byte |
Stage Input Records | databricks.spark.job.stage.input_records | - | Count |
Stage Output Bytes | databricks.spark.job.stage.output_bytes | - | Byte |
Stage Output Records | databricks.spark.job.stage.output_records | - | Count |
Stage Shuffle Read Bytes | databricks.spark.job.stage.shuffle_read_bytes | - | Byte |
Stage Shuffle Read Records | databricks.spark.job.stage.shuffle_read_records | - | Count |
Stage Shuffle Write Bytes | databricks.spark.job.stage.shuffle_write_bytes | - | Byte |
Stage Shuffle Write Records | databricks.spark.job.stage.shuffle_write_records | - | Count |
Stage Memory Bytes Spilled | databricks.spark.job.stage.memory_bytes_spilled | - | Byte |
Stage Disk Bytes Spilled | databricks.spark.job.stage.disk_bytes_spilled | - | Byte |
Metric name | Metric key | Description | Unit |
---|---|---|---|
Application Count | databricks.spark.application_count.gauge | - | Count |
Metric name | Metric key | Description | Unit |
---|---|---|---|
Streaming Batch Duration | databricks.spark.streaming.statistics.batch_duration | - | MilliSecond |
Streaming Receivers | databricks.spark.streaming.statistics.num_receivers | - | Count |
Streaming Active Receivers | databricks.spark.streaming.statistics.num_active_receivers | - | Count |
Streaming Inactive Receivers | databricks.spark.streaming.statistics.num_inactive_receivers | - | Count |
Streaming Completed Batches | databricks.spark.streaming.statistics.num_total_completed_batches.count | - | Count |
Streaming Retained Completed Batches | databricks.spark.streaming.statistics.num_retained_completed_batches.count | - | Unspecified |
Streaming Active Batches | databricks.spark.streaming.statistics.num_active_batches | - | Count |
Streaming Processed Records | databricks.spark.streaming.statistics.num_processed_records.count | - | Count |
Streaming Received Records | databricks.spark.streaming.statistics.num_received_records.count | - | Count |
Streaming Avg Input Rate | databricks.spark.streaming.statistics.avg_input_rate | - | Byte |
Streaming Avg Scheduling Delay | databricks.spark.streaming.statistics.avg_scheduling_delay | - | MilliSecond |
Streaming Avg Processing Time | databricks.spark.streaming.statistics.avg_processing_time | - | MilliSecond |
Streaming Avg Total Delay | databricks.spark.streaming.statistics.avg_total_delay | - | MilliSecond |
Metric name | Metric key | Description | Unit |
---|---|---|---|
RDD Count | databricks.spark.rdd_count.gauge | - | Count |
RDD Partitions | databricks.spark.rdd.num_partitions | - | Count |
RDD Cached Partitions | databricks.spark.rdd.num_cached_partitions | - | Count |
RDD Memory Used | databricks.spark.rdd.memory_used | - | Byte |
RDD Disk Used | databricks.spark.rdd.disk_used | - | Byte |
New Feature Set - Hardware Metrics
DXS-1597
Aggregate Dimensions for Spark API Metrics
Updates to how Spark API is called
UA Screen updates
DXS-1920
Adds ability to ingest Spark Jobs as traces
Adds ability to ingest Spark Config as Log Messages
## v1.02