Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is essential to ensure data quality, job performance, and security, notably by limiting access to production workspaces.
Spark application metrics, logs, and events produced by a Databricks workspace can be customized, sent, and centralized to various monitoring platforms, including Azure Monitor Logs. This tool, formerly called Log Analytics by Microsoft, is an Azure cloud service integrated into Azure Monitor that collects and stores logs from cloud and on-premises environments. It provides a means of querying logs from the collected data using a read-only query language named "Kusto", of building "Workbooks" dashboards, and of setting up alerts on identified patterns.
This article focuses on automating the export of Databricks logs to a Log Analytics workspace using the Spark-monitoring library at workspace scale.
Overview of Databricks log sending
This section is an overview of the architecture. More detailed information and the associated source code are provided further down in the article.
Spark-monitoring is a Microsoft Open Source project to export Databricks logs at the cluster level. Once downloaded, the library is built locally with Docker or Maven according to the Databricks Runtime version of the cluster to configure (Spark and Scala versions). The build of the library generates two jar files:
- spark-listeners_$spark-version_$scala_version-$version: collects data from a running cluster;
- spark-listeners-loganalytics_$spark-version_$scala_version-$version: extends spark-listeners by collecting data, connecting to a Log Analytics workspace, parsing and sending logs via the Data Collector API
In the documentation, once the jars are built, they are placed on the DBFS. An init script spark-monitoring.sh is edited locally with the workspace and cluster configurations and manually added through the Databricks interface at cluster level.
At cluster launch, logs are streamed in JSON format to the Log Analytics Data Collector API and stored in 3 different tables, one for each type of log sent:
- SparkMetric_CL: execution metrics of Spark applications (memory usage, number of jobs, stages and tasks submitted/completed/running);
- SparkListenerEvent_CL: all events intercepted by the SparkListener during the execution of the Spark application (jobs, stages and tasks start/end);
- SparkLoggingEvent_CL: logs from the log4j appender.
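Once logs start flowing, ingestion into the three tables can be checked from the command line. The sketch below only prints an Azure CLI command rather than executing it (the workspace GUID is a placeholder, and `az monitor log-analytics query` requires the Azure CLI `log-analytics` extension and a real workspace); the Kusto query counts ingested rows per table using the built-in `Type` column.

```shell
#!/bin/bash
# Placeholder Log Analytics workspace GUID -- replace with your own.
WORKSPACE_ID="00000000-0000-0000-0000-000000000000"

# Kusto query counting rows per custom table ('Type' is the built-in
# column holding the table name in every Log Analytics table).
QUERY='union SparkMetric_CL, SparkListenerEvent_CL, SparkLoggingEvent_CL | summarize count() by Type'

# Print the command instead of running it, since it needs a live workspace:
echo az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "$QUERY"
```

If the three counts come back non-zero, the cluster-side setup described below is working end to end.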
Some configurations make it possible to automate the setup of log sending at the workspace level by configuring all the clusters of a given workspace. It involves downloading the project, building it with Docker or Maven, and editing the spark-monitoring.sh script and the clusters' environment variables. Once all the configurations have been made, running the PowerShell script configures the Databricks workspace. It is based on 3 bash scripts:
- spark-monitoring-vars.sh: defines the workspace's environment variables;
- spark-monitoring.sh: sends the logs in streaming to Log Analytics;
- spark-monitoring-global-init.sh: this workspace-scale script runs spark-monitoring-vars.sh, then spark-monitoring.sh, at the launch of every cluster.
The PowerShell script dbx-monitoring-deploy.ps1 runs locally and deploys the configuration at workspace level. It fills spark-monitoring-vars.sh with the workspace variables, copies the scripts and jars to DBFS, and posts the global init script to Databricks.
Configuration of a workspace
1. Building the jar files
Clone the Spark-monitoring repository and build the jar files locally with Docker or Maven, for the Databricks Runtime versions of all the clusters that need to be configured in the workspace, according to the documentation.
In the root of the spark-monitoring folder, run the build command with the desired Spark and Scala versions. In this example, the library is built for Scala 2.12 and Spark 3.0.1.
docker run -it --rm -v "$(pwd)":/spark-monitoring -v "$HOME/.m2":/root/.m2 -w /spark-monitoring/src maven:3.6.3-jdk-8 mvn install -P "scala-2.12_spark-3.0.1"
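When several clusters run different Databricks Runtimes, the build must be repeated once per Spark/Scala pair. A minimal sketch of that loop, assuming the Maven profile naming convention `scala-<scala>_spark-<spark>` shown above; the `profile_for` helper and the version pairs are illustrative, so check the profiles actually available in the project's pom before relying on them:

```shell
#!/bin/bash
# Hypothetical helper mapping a Scala/Spark version pair to the Maven
# profile name used by the spark-monitoring build.
profile_for() {
  local scala="$1" spark="$2"
  echo "scala-${scala}_spark-${spark}"
}

# One build per runtime to support (version pairs are examples):
for versions in "2.12 3.0.1" "2.11 2.4.5"; do
  set -- $versions  # split "scala spark" into $1 and $2
  echo "building with profile: $(profile_for "$1" "$2")"
done
```

Each iteration would then invoke the same `docker run … mvn install -P "<profile>"` command as above.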
Jars are built in the spark-monitoring/src/target folder. The spark-monitoring.sh script is located inside the
All these steps are explained in the chapter Build the Azure Databricks monitoring library of the Microsoft patterns & practices GitHub repository.
2. Setting the Log Analytics environment variables
The Log Analytics workspace Id and Key are stored in Azure Key Vault secrets and referenced in the environment variables of all the configured clusters. Azure Databricks accesses the Key Vault through a Databricks workspace Secret Scope.
After creating the secrets for the Log Analytics workspace Id and Key, configure each cluster manually to reference the secrets, following the instructions on how to set up an Azure Key Vault-backed Secret Scope.
LOG_ANALYTICS_WORKSPACE_KEY={{secrets/secret-scope-name/pw-log-analytics}}
LOG_ANALYTICS_WORKSPACE_ID={{secrets/secret-scope-name/id-log-analytics}}
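These values use the Databricks secret-reference syntax `{{secrets/<scope>/<key>}}`, which the cluster resolves at start time so the raw Id and Key never appear in the configuration. A small sketch to sanity-check that an environment-variable value is a secret reference rather than a plain-text credential; the `is_secret_ref` helper is illustrative, not part of the library:

```shell
#!/bin/bash
# Hypothetical check: does a value follow {{secrets/<scope>/<key>}}?
is_secret_ref() {
  echo "$1" | grep -Eq '^\{\{secrets/[^/]+/[^/]+\}\}$'
}

is_secret_ref '{{secrets/secret-scope-name/id-log-analytics}}' && echo "valid"
is_secret_ref 'plain-text-id' || echo "invalid"
```

Such a check could run in CI over cluster definitions to catch credentials committed in clear text.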
3. Adding the spark-monitoring-global-init.sh and spark-monitoring-vars.sh scripts
jars folder, add all the jars and configuration files respecting the following file tree:
spark-monitoring-global-init.sh: this script is run at the launch of every cluster in the workspace.
#!/bin/bash
STAGE_DIR=/dbfs/databricks/spark-monitoring
VARS_SCRIPT=$STAGE_DIR/spark-monitoring-vars.sh
MONITORING_SCRIPT=$STAGE_DIR/spark-monitoring.sh

if [ -d "$STAGE_DIR" -a -f "$VARS_SCRIPT" -a -f "$MONITORING_SCRIPT" ]; then
  /bin/bash "$VARS_SCRIPT";
  /bin/bash "$MONITORING_SCRIPT";
else
  echo "Directory $STAGE_DIR does not exist or one of the needed scripts is missing"
fi
spark-monitoring-vars.sh: this script is a template of all the environment variables needed at the cluster and workspace level.
#!/bin/bash
DB_HOME=/databricks
SPARK_HOME=$DB_HOME/spark
SPARK_CONF_DIR=$SPARK_HOME/conf

tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Id of Azure subscription
export AZ_SUBSCRIPTION_ID="$AZ_SUBSCRIPTION_ID"
# Resource group name of workspace
export AZ_RSRC_GRP_NAME="$AZ_RSRC_GRP_NAME"
export AZ_RSRC_PROV_NAMESPACE=Microsoft.Databricks
export AZ_RSRC_TYPE=workspaces
# Name of Databricks workspace
export AZ_RSRC_NAME="$AZ_RSRC_NAME"
EOF
4. Editing and adding spark-monitoring.sh
Take spark-monitoring.sh from the cloned project, add it to the file tree and edit the environment variables as follows:
DB_HOME=/databricks
SPARK_HOME=$DB_HOME/spark
SPARK_CONF_DIR=$SPARK_HOME/conf

tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Export cluster id and name from environment variables
export DB_CLUSTER_ID=$DB_CLUSTER_ID
export DB_CLUSTER_NAME=$DB_CLUSTER_NAME
EOF
Given the large storage costs associated with a Log Analytics workspace, in the context of Spark metrics, apply filters based on REGEX expressions to keep only the most relevant log information. The event filtering documentation gives you the different variables to set.
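For example, keeping only job start/end listener events could look like the sketch below. The variable name `LA_SPARKLISTENEREVENT_REGEX` is an assumption to be checked against the event-filtering documentation of your library version, and the `matches` helper only emulates locally how a filter regex selects event names:

```shell
#!/bin/bash
# Assumed filter variable name -- verify against the event-filtering docs.
export LA_SPARKLISTENEREVENT_REGEX="SparkListenerJobStart|SparkListenerJobEnd"

# Local emulation of the filter: does an event name match the regex?
matches() { echo "$1" | grep -Eq "$LA_SPARKLISTENEREVENT_REGEX"; }

matches "SparkListenerJobStart" && echo "kept"
matches "SparkListenerStageCompleted" || echo "filtered"
```

Dry-running candidate regexes this way helps estimate how much log volume (and cost) a filter will cut before deploying it.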
5. Editing, adding and launching the PowerShell script
The script dbx-monitoring-deploy.ps1 is used to configure the export of cluster logs from a Databricks workspace to Log Analytics.
It performs the following actions:
- Fills spark-monitoring-vars.sh with the correct values for the workspace.
- Copies spark-monitoring-vars.sh, spark-monitoring.sh and all the jar files on the workspace's DBFS.
- Posts the content of the global init script through the Databricks API.
It assumes there are 3 different Azure subscriptions (DEV/PREPROD/PROD) to separate the development, test and production stages of a continuous integration. A preproduction subscription is used for integration tests and business acceptance testing before going into production.
Edit this section according to your subscriptions.
param(
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$p,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$e,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$n,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$rg,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$w
)

$armFolder = $p.TrimEnd("/")
$deploymentName = $n.ToLower()
$varsTemplatePath = "$armFolder/spark-monitoring-vars.sh"

if ($e -like "dev") {
    $AZ_SUBSCRIPTION_ID = ""
} elseif ($e -like 'prod') {
    $AZ_SUBSCRIPTION_ID = ""
} elseif ($e -like 'preprod') {
    $AZ_SUBSCRIPTION_ID = ""
} else {
    Write-Output "no environment provided - exiting"
    Exit-PSSession
}

$AZ_RSRC_GRP_NAME = $rg
$AZ_RSRC_NAME = $w
$environment = $e.ToLower()
$parametersPath = "$armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh"

$template = Get-Content "$varsTemplatePath" -Raw
$filledTemplate = Invoke-Expression "@`"`r`n$template`r`n`"@"

mkdir -p $armFolder/$environment/$deploymentName
Out-File -FilePath $parametersPath -InputObject $filledTemplate

try {
    $context = Get-AzContext
    if (!$context) {
        Write-Output "No context, please connect !"
        $Credential = Get-Credential
        Connect-AzAccount -Credential $Credential -ErrorAction Stop
    }
    if ($environment -like "dev") {
        Set-AzContext "AD-DEV01" -ErrorAction Stop
    } elseif ($environment -like 'prod') {
        Set-AzContext "AD-PROD01" -ErrorAction Stop
    } elseif ($environment -like 'preprod') {
        Set-AzContext "AD-PREPROD01" -ErrorAction Stop
    } else {
        Write-Output "no context found for provided environment - exiting"
        Exit
    }
} catch {
    Write-Output "error setting context - exiting"
    Exit
}

$mydbx = Get-AzDatabricksWorkspace -ResourceGroupName $AZ_RSRC_GRP_NAME -Name $AZ_RSRC_NAME
$hostVar = "https://" + $mydbx.Url

$myToken = Get-AzAccessToken -Resource "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
$env:DATABRICKS_AAD_TOKEN = $myToken.Token

databricks configure --aad-token --host $hostVar
databricks fs mkdirs dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/spark-monitoring.sh dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh dbfs:/databricks/spark-monitoring/spark-monitoring-vars.sh
databricks fs cp --recursive --overwrite $armFolder/jars dbfs:/databricks/spark-monitoring

$inputfile = "$armFolder/spark-monitoring-global-init.sh"
$fc = Get-Content $inputfile -Encoding UTF8 -Raw
$By = [System.Text.Encoding]::UTF8.GetBytes($fc)
$etext = [System.Convert]::ToBase64String($By, 'InsertLineBreaks')

$Body = @{
    name     = "monitoring"
    script   = "$etext"
    position = 1
    enabled  = "true"
}
$JsonBody = $Body | ConvertTo-Json
$Uri = "https://" + $mydbx.Url + "/api/2.0/global-init-scripts"
$Header = @{ Authorization = "Bearer $env:DATABRICKS_AAD_TOKEN" }

Invoke-RestMethod -Method Post -Uri $Uri -Headers $Header -Body $JsonBody
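A detail worth noting in the script above: the global-init-scripts API expects the script body base64-encoded, which is what the `[System.Convert]::ToBase64String` call does. The equivalent encoding step, and its round trip, can be checked locally in bash (the tiny init script here is only an example):

```shell
#!/bin/bash
# Encode a tiny example init script the way the API expects it...
script_body='#!/bin/bash
echo hello'
payload="$(printf '%s' "$script_body" | base64 | tr -d '\n')"
echo "$payload"

# ...and verify the round trip by decoding it back.
printf '%s' "$payload" | base64 -d
```

If the decoded output differs from the input (encoding mishaps, BOMs, CRLF line endings), the posted init script will fail silently at cluster start, so this check is cheap insurance.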
Enrich and launch the script with these parameters:
| Parameter | Description |
| --- | --- |
| p | Path to the script |
| e | Environment (DEV, PREPROD, PROD) |
| n | Deployment name |
| rg | Workspace resource group |
| w | Workspace name |
Call the script like the following:
pwsh dbx-monitoring-deploy.ps1 -p /home/Documents/pwsh-spark-monitoring/pwsh-deploy-dbx-spark-monitoring -e DEV -n deploy_log_analytics_wksp_sales -rg rg-dev-datalake -w dbx-dev-datalake-sales
Thanks to this script, you can easily deploy the Spark-monitoring library on all your Databricks workspaces.
The logs natively sent make it possible to monitor cluster health and job execution, and to report errors from notebooks. Another way to monitor daily data processing is to perform custom logging using the log4j appender. This way, you can add steps to implement data quality validation over ingested and cleaned data, with custom checks against a predefined list of expectations to validate the data.
We can consider using custom logs to log bad records, apply checks and constraints on the data, and then send quality metrics to Log Analytics for reporting and alerting. To do so, you can build your own data quality library or use existing tools like Apache Griffin or Amazon Deequ.