Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is essential to ensure data quality, job performance, and security, notably by limiting access to production workspaces.
Spark application metrics, logs, and events produced by a Databricks workspace can be customized, sent, and centralized to various monitoring platforms, including Azure Monitor Logs. This tool, formerly called Log Analytics by Microsoft, is an Azure cloud service integrated into Azure Monitor that collects and stores logs from cloud and on-premises environments. It provides a means of querying logs from the collected data using a read-only query language named "Kusto", of building "Workbooks" dashboards, and of setting up alerts on identified patterns.
This article focuses on automating the export of Databricks logs to a Log Analytics workspace using the Spark-monitoring library at workspace scale.
Overview of Databricks log sending
This section is an overview of the architecture. More detailed information and the associated source code are provided further down in the article.
Spark-monitoring is a Microsoft Open Source project to export Databricks logs at the cluster level. Once downloaded, the library is built locally with Docker or Maven according to the Databricks Runtime version of the cluster to configure (Spark and Scala versions). The build of the library generates two jar files:
- spark-listeners_$spark-version_$scala_version-$version: collects data from a running cluster;
- spark-listeners-loganalytics_$spark-version_$scala_version-$version: extends spark-listeners by collecting data, connecting to a Log Analytics workspace, parsing and sending logs via the Data Collector API
In the documentation, once the jars are built, they are placed on the DBFS. An init script spark-monitoring.sh is edited locally with the workspace and cluster configurations and manually added through the Databricks interface at cluster level.
At cluster launch, logs are streamed in JSON format to the Log Analytics Data Collector API and stored in 3 different tables, one for each type of log sent:
- SparkMetric_CL: execution metrics of Spark applications (memory usage, number of jobs, stages and tasks submitted/completed/running);
- SparkListenerEvent_CL: all events intercepted by the SparkListener during the execution of the Spark application (jobs, stages and tasks start/end);
- SparkLoggingEvent_CL: logs from the log4j appender.
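Once logs start flowing, ingestion into the three tables can be checked from the command line. The sketch below only prints an Azure CLI command rather than executing it (the workspace GUID is a placeholder, and `az monitor log-analytics query` requires the Azure CLI `log-analytics` extension and a real workspace); the Kusto query counts ingested rows per table using the built-in `Type` column.

```shell
#!/bin/bash
# Placeholder Log Analytics workspace GUID -- replace with your own.
WORKSPACE_ID="00000000-0000-0000-0000-000000000000"

# Kusto query counting rows per custom table ('Type' is the built-in
# column holding the table name in every Log Analytics table).
QUERY='union SparkMetric_CL, SparkListenerEvent_CL, SparkLoggingEvent_CL | summarize count() by Type'

# Print the command instead of running it, since it needs a live workspace:
echo az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "$QUERY"
```

If the three counts come back non-zero, the cluster-side setup described below is working end to end.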
Some configurations make it possible to automate the setup of log sending at the workspace level by configuring all the clusters of a given workspace. It involves downloading the project, building it with Docker or Maven, and editing the spark-monitoring.sh script and the clusters' environment variables. Once all the configurations have been made, running the PowerShell script configures the Databricks workspace. It is based on 3 bash scripts:
- spark-monitoring-vars.sh: defines the workspace's environment variables;
- spark-monitoring.sh: sends the logs in streaming to Log Analytics;
- spark-monitoring-global-init.sh: this workspace-scale script runs spark-monitoring-vars.sh, then spark-monitoring.sh, at the launch of every cluster.
The PowerShell script dbx-monitoring-deploy.ps1 runs locally and deploys the configuration at workspace level. It fills spark-monitoring-vars.sh with the workspace variables, copies the scripts and jars to DBFS, and posts the global init script to Databricks.
Configuration of a workspace
1. Building the jar files
Clone the Spark-monitoring repository and build the jar files locally with Docker or Maven, for the Databricks Runtime versions of all the clusters that need to be configured in the workspace, according to the documentation.
In the root of the spark-monitoring folder, run the build command with the desired Spark and Scala versions. In this example, the library is built for Scala 2.12 and Spark 3.0.1.
docker run -it --rm -v "$(pwd)":/spark-monitoring -v "$HOME/.m2":/root/.m2 -w /spark-monitoring/src maven:3.6.3-jdk-8 mvn install -P "scala-2.12_spark-3.0.1"
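When several clusters run different Databricks Runtimes, the build must be repeated once per Spark/Scala pair. A minimal sketch of that loop, assuming the Maven profile naming convention `scala-<scala>_spark-<spark>` shown above; the `profile_for` helper and the version pairs are illustrative, so check the profiles actually available in the project's pom before relying on them:

```shell
#!/bin/bash
# Hypothetical helper mapping a Scala/Spark version pair to the Maven
# profile name used by the spark-monitoring build.
profile_for() {
  local scala="$1" spark="$2"
  echo "scala-${scala}_spark-${spark}"
}

# One build per runtime to support (version pairs are examples):
for versions in "2.12 3.0.1" "2.11 2.4.5"; do
  set -- $versions  # split "scala spark" into $1 and $2
  echo "building with profile: $(profile_for "$1" "$2")"
done
```

Each iteration would then invoke the same `docker run … mvn install -P "<profile>"` command as above.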
Jars are built in the spark-monitoring/src/target folder. The spark-monitoring.sh script is located inside the
All these steps are explained in the chapter Build the Azure Databricks monitoring library of the Microsoft patterns & practices GitHub repository.
2. Setting the Log Analytics environment variables
The Log Analytics workspace Id and Key are stored in Azure Key Vault secrets and referenced in the environment variables of all the configured clusters. Azure Databricks accesses the Key Vault through a Databricks workspace Secret Scope.
After creating the secrets for the Log Analytics workspace Id and Key, configure each cluster manually to reference the secrets, following the instructions on how to set up an Azure Key Vault-backed Secret Scope.
LOG_ANALYTICS_WORKSPACE_KEY={{secrets/secret-scope-name/pw-log-analytics}}
LOG_ANALYTICS_WORKSPACE_ID={{secrets/secret-scope-name/id-log-analytics}}
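These values use the Databricks secret-reference syntax `{{secrets/<scope>/<key>}}`, which the cluster resolves at start time so the raw Id and Key never appear in the configuration. A small sketch to sanity-check that an environment-variable value is a secret reference rather than a plain-text credential; the `is_secret_ref` helper is illustrative, not part of the library:

```shell
#!/bin/bash
# Hypothetical check: does a value follow {{secrets/<scope>/<key>}}?
is_secret_ref() {
  echo "$1" | grep -Eq '^\{\{secrets/[^/]+/[^/]+\}\}$'
}

is_secret_ref '{{secrets/secret-scope-name/id-log-analytics}}' && echo "valid"
is_secret_ref 'plain-text-id' || echo "invalid"
```

Such a check could run in CI over cluster definitions to catch credentials committed in clear text.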
3. Adding the spark-monitoring-global-init.sh and spark-monitoring-vars.sh scripts
jars folder, add all the jars and configuration files respecting the following file tree:
spark-monitoring-global-init.sh: this script is run at the launch of every cluster in the workspace.
#!/bin/bash
STAGE_DIR=/dbfs/databricks/spark-monitoring
VARS_SCRIPT=$STAGE_DIR/spark-monitoring-vars.sh
MONITORING_SCRIPT=$STAGE_DIR/spark-monitoring.sh

if [ -d "$STAGE_DIR" -a -f "$VARS_SCRIPT" -a -f "$MONITORING_SCRIPT" ]; then
  /bin/bash "$VARS_SCRIPT";
  /bin/bash "$MONITORING_SCRIPT";
else
  echo "Directory $STAGE_DIR does not exist or one of the needed scripts is missing"
fi
spark-monitoring-vars.sh: this script is a template of all the environment variables needed at the cluster and workspace level.
#!/bin/bash
DB_HOME=/databricks
SPARK_HOME=$DB_HOME/spark
SPARK_CONF_DIR=$SPARK_HOME/conf

tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Id of Azure subscription
export AZ_SUBSCRIPTION_ID="$AZ_SUBSCRIPTION_ID"
# Resource group name of workspace
export AZ_RSRC_GRP_NAME="$AZ_RSRC_GRP_NAME"
export AZ_RSRC_PROV_NAMESPACE=Microsoft.Databricks
export AZ_RSRC_TYPE=workspaces
# Name of Databricks workspace
export AZ_RSRC_NAME="$AZ_RSRC_NAME"
EOF
4. Editing and adding spark-monitoring.sh
Take spark-monitoring.sh from the cloned project, add it to the file tree and edit the environment variables as follows:
DB_HOME=/databricks
SPARK_HOME=$DB_HOME/spark
SPARK_CONF_DIR=$SPARK_HOME/conf

tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Export cluster id and name from environment variables
export DB_CLUSTER_ID=$DB_CLUSTER_ID
export DB_CLUSTER_NAME=$DB_CLUSTER_NAME
EOF
Given the large storage costs associated with a Log Analytics workspace, in the context of Spark metrics, apply filters based on REGEX expressions to keep only the most relevant log information. The event filtering documentation gives you the different variables to set.
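For example, keeping only job start/end listener events could look like the sketch below. The variable name `LA_SPARKLISTENEREVENT_REGEX` is an assumption to be checked against the event-filtering documentation of your library version, and the `matches` helper only emulates locally how a filter regex selects event names:

```shell
#!/bin/bash
# Assumed filter variable name -- verify against the event-filtering docs.
export LA_SPARKLISTENEREVENT_REGEX="SparkListenerJobStart|SparkListenerJobEnd"

# Local emulation of the filter: does an event name match the regex?
matches() { echo "$1" | grep -Eq "$LA_SPARKLISTENEREVENT_REGEX"; }

matches "SparkListenerJobStart" && echo "kept"
matches "SparkListenerStageCompleted" || echo "filtered"
```

Dry-running candidate regexes this way helps estimate how much log volume (and cost) a filter will cut before deploying it.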
5. Editing, adding and launching the PowerShell script
The script dbx-monitoring-deploy.ps1 is used to configure the export of cluster logs from a Databricks workspace to Log Analytics.
It performs the following actions:
- Fills spark-monitoring-vars.sh with the correct values for the workspace.
- Copies spark-monitoring-vars.sh, spark-monitoring.sh and all the jar files on the workspace's DBFS.
- Posts the content of the global init script through the Databricks API.
It assumes there are 3 different Azure subscriptions (DEV/PREPROD/PROD) to separate the development, test and production stages of a continuous integration. A preproduction subscription is used for integration tests and business acceptance testing before going into production.
Edit this section according to your subscriptions.
param(
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$p,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$e,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$n,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$rg,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$w
)

$armFolder = $p.TrimEnd("/")
$deploymentName = $n.ToLower()
$varsTemplatePath = "$armFolder/spark-monitoring-vars.sh"

if ($e -like "dev") {
    $AZ_SUBSCRIPTION_ID = ""
} elseif ($e -like 'prod') {
    $AZ_SUBSCRIPTION_ID = ""
} elseif ($e -like 'preprod') {
    $AZ_SUBSCRIPTION_ID = ""
} else {
    Write-Output "no environment provided - exiting"
    Exit-PSSession
}

$AZ_RSRC_GRP_NAME = $rg
$AZ_RSRC_NAME = $w
$environment = $e.ToLower()
$parametersPath = "$armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh"

$template = Get-Content "$varsTemplatePath" -Raw
$filledTemplate = Invoke-Expression "@`"`r`n$template`r`n`"@"

mkdir -p $armFolder/$environment/$deploymentName
Out-File -FilePath $parametersPath -InputObject $filledTemplate

try {
    $context = Get-AzContext
    if (!$context) {
        Write-Output "No context, please connect !"
        $Credential = Get-Credential
        Connect-AzAccount -Credential $Credential -ErrorAction Stop
    }
    if ($environment -like "dev") {
        Set-AzContext "AD-DEV01" -ErrorAction Stop
    } elseif ($environment -like 'prod') {
        Set-AzContext "AD-PROD01" -ErrorAction Stop
    } elseif ($environment -like 'preprod') {
        Set-AzContext "AD-PREPROD01" -ErrorAction Stop
    } else {
        Write-Output "no context found for provided environment - exiting"
        Exit
    }
} catch {
    Write-Output "error setting context - exiting"
    Exit
}

$mydbx = Get-AzDatabricksWorkspace -ResourceGroupName $AZ_RSRC_GRP_NAME -Name $AZ_RSRC_NAME
$hostVar = "https://" + $mydbx.Url

$myToken = Get-AzAccessToken -Resource "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
$env:DATABRICKS_AAD_TOKEN = $myToken.Token

databricks configure --aad-token --host $hostVar
databricks fs mkdirs dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/spark-monitoring.sh dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh dbfs:/databricks/spark-monitoring/spark-monitoring-vars.sh
databricks fs cp --recursive --overwrite $armFolder/jars dbfs:/databricks/spark-monitoring

$inputfile = "$armFolder/spark-monitoring-global-init.sh"
$fc = Get-Content $inputfile -Encoding UTF8 -Raw
$By = [System.Text.Encoding]::UTF8.GetBytes($fc)
$etext = [System.Convert]::ToBase64String($By, 'InsertLineBreaks')

$Body = @{
    name     = "monitoring"
    script   = "$etext"
    position = 1
    enabled  = "true"
}
$JsonBody = $Body | ConvertTo-Json
$Uri = "https://" + $mydbx.Url + "/api/2.0/global-init-scripts"
$Header = @{ Authorization = "Bearer $env:DATABRICKS_AAD_TOKEN" }

Invoke-RestMethod -Method Post -Uri $Uri -Headers $Header -Body $JsonBody
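A detail worth noting in the script above: the global-init-scripts API expects the script body base64-encoded, which is what the `[System.Convert]::ToBase64String` call does. The equivalent encoding step, and its round trip, can be checked locally in bash (the tiny init script here is only an example):

```shell
#!/bin/bash
# Encode a tiny example init script the way the API expects it...
script_body='#!/bin/bash
echo hello'
payload="$(printf '%s' "$script_body" | base64 | tr -d '\n')"
echo "$payload"

# ...and verify the round trip by decoding it back.
printf '%s' "$payload" | base64 -d
```

If the decoded output differs from the input (encoding mishaps, BOMs, CRLF line endings), the posted init script will fail silently at cluster start, so this check is cheap insurance.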
Enrich and launch the script with these parameters:
| Parameter | Description |
| --- | --- |
| p | Path to the script |
| e | Environment (DEV, PREPROD, PROD) |
| n | Deployment name |
| rg | Workspace resource group |
| w | Workspace name |
Call the script like the following:
pwsh dbx-monitoring-deploy.ps1 -p /home/Documents/pwsh-spark-monitoring/pwsh-deploy-dbx-spark-monitoring -e DEV -n deploy_log_analytics_wksp_sales -rg rg-dev-datalake -w dbx-dev-datalake-sales
Thanks to this script, you can easily deploy the Spark-monitoring library on all your Databricks workspaces.
The logs natively sent make it possible to monitor cluster health and job execution, and to report errors from notebooks. Another way to monitor daily data processing is to perform custom logging using the log4j appender. This way, you can add steps to implement data quality validation over ingested and cleaned data, with custom checks against a predefined list of expectations to validate the data.
We can consider using custom logs to log bad records, apply checks and constraints on the data, and then send quality metrics to Log Analytics for reporting and alerting. To do so, you can build your own data quality library or use existing tools like Apache Griffin or Amazon Deequ.