Databricks logs collection with Azure Monitor at a workspace scale

Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality, job performance, and security, notably by limiting access to production workspaces.

Spark application metrics, logs, and events produced by a Databricks workspace can be customized, sent, and centralized to various monitoring platforms, including Azure Monitor Logs. This tool, formerly called Log Analytics by Microsoft, is an Azure cloud service integrated into Azure Monitor that collects and stores logs from cloud and on-premises environments. It provides a means of querying the collected data with a read-only query language named “Kusto”, building “Workbooks” dashboards, and setting up alerts on identified patterns.

This article focuses on automating the export of Databricks logs to a Log Analytics workspace at the workspace scale, using the Spark-monitoring library.

Overview of Databricks log sending


Overview of Spark-monitoring library

This section is an overview of the architecture. More detailed information and the related source code are provided further down in the article.

Spark-monitoring is a Microsoft Open Source project to export Databricks logs at the cluster level. Once downloaded, the library is built locally with Docker or Maven according to the Databricks Runtime version of the cluster to configure (Spark and Scala versions). The build of the library generates two jar files:

  • spark-listeners_$spark-version_$scala_version-$version: collects data from a running cluster;
  • spark-listeners-loganalytics_$spark-version_$scala_version-$version: extends spark-listeners by collecting data, connecting to a Log Analytics workspace, and parsing and sending logs via the Data Collector API

In the documentation, once the jars are built, they are uploaded to DBFS. An init script spark-monitoring.sh is edited locally with the workspace and cluster configurations, then manually added through the Databricks interface at the cluster level.

At the cluster's launch, logs are streamed in JSON format to the Log Analytics Data Collector API and stored in 3 different tables, one for each type of log sent (a sample query over these tables is shown after the list):

  • SparkMetric_CL: execution metrics of the Spark applications (memory usage, number of jobs, stages and tasks submitted/completed/running);
  • SparkListenerEvent_CL: all events intercepted by the SparkListener during the execution of the Spark application (jobs, stages and tasks start/end);
  • SparkLoggingEvent_CL: logs from the log4j appender.
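
Once these tables are populated, they can be queried from the Log Analytics workspace with the Kusto language. As an illustration only (a sketch assuming the Azure CLI with the log-analytics extension and the workspace GUID at hand; column names carry the suffixes added by the Data Collector API and may differ in your workspace), the following command returns the latest error logs captured by the log4j appender:

# Hypothetical Log Analytics workspace GUID, replace with your own
WORKSPACE_ID="00000000-0000-0000-0000-000000000000"

# Requires the log-analytics extension: az extension add --name log-analytics
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query 'SparkLoggingEvent_CL
    | where Level_s == "ERROR"
    | order by TimeGenerated desc
    | take 20'

The same query can be saved in the Log Analytics portal and reused in a Workbook or an alert rule.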

Some configurations allow automating the setup of log sending at the workspace level, by configuring all the clusters of a given workspace. It involves downloading the project, building it with Docker or Maven, and editing the spark-monitoring.sh script and the cluster environment variables. Once all the configurations have been made, running the PowerShell script configures the Databricks workspace. It is based on 3 bash scripts:

  • spark-monitoring-vars.sh: defines the workspace environment variables;
  • spark-monitoring.sh: sends the logs in streaming to Log Analytics;
  • spark-monitoring-global-init.sh: this script at workspace scale runs spark-monitoring-vars.sh, then spark-monitoring.sh.

The PowerShell script dbx-monitoring-deploy.ps1 runs locally and deploys the configuration at the workspace level. It fills spark-monitoring-vars.sh with the workspace variables, copies the scripts and jars to DBFS, and posts the global init script to Databricks.

Configuration of a workspace

1. Building the jar files

Clone the Spark-monitoring repository and build the jar files locally, with Docker or Maven, for the Databricks Runtime versions of all the clusters that need to be configured in the workspace, as described in the documentation.

With Docker:

In the root of the spark-monitoring folder, run the build command with the desired Spark and Scala versions. In this example, the library is built for Scala 2.12 and Spark 3.0.1.

 docker run -it --rm -v $(pwd):/spark-monitoring -v "$HOME/.m2":/root/.m2 -w /spark-monitoring/src maven:3.6.3-jdk-8 mvn install -P "scala-2.12_spark-3.0.1"

The jars are built in the spark-monitoring/src/target folder. The spark-monitoring.sh script is located in the spark-monitoring/src/spark-listeners/scripts folder.
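
Alternatively, if Docker is not available, the same profile can be built with a local Maven installation (a sketch assuming Maven 3.6+ and a JDK 8 on the PATH):

# From the root of the cloned spark-monitoring repository
cd src
mvn install -P "scala-2.12_spark-3.0.1"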

All these steps are explained in the chapter Build the Azure Databricks monitoring library of the Microsoft patterns & practices GitHub repository.

2. Setting the Log Analytics environment variables

The Log Analytics workspace Id and Key are stored in Azure Key Vault secrets and referenced in the environment variables of all the configured clusters. Azure Databricks accesses the Key Vault through the Databricks workspace Secret Scope.

After creating the secrets for the Log Analytics workspace Id and Key, configure each cluster manually to reference the secrets, following the instructions on how to set up an Azure Key Vault-backed Secret Scope.

LOG_ANALYTICS_WORKSPACE_KEY={{secrets/secret-scope-name/pw-log-analytics}}
LOG_ANALYTICS_WORKSPACE_ID={{secrets/secret-scope-name/id-log-analytics}}
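
As an illustration, the Key Vault-backed secret scope referenced above can be created with the Databricks CLI (a sketch assuming the legacy databricks-cli authenticated on the workspace with an AAD token; the scope name and the Key Vault identifiers are placeholders):

# Placeholders for the Key Vault backing the scope
KV_RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.KeyVault/vaults/<kv-name>"
KV_DNS_NAME="https://<kv-name>.vault.azure.net/"

# Create the Azure Key Vault-backed secret scope used by the clusters
databricks secrets create-scope --scope secret-scope-name \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "$KV_RESOURCE_ID" \
  --dns-name "$KV_DNS_NAME"

# Check that the scope exists
databricks secrets list-scopes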

3. Adding the spark-monitoring-global-init.sh and spark-monitoring-vars.sh scripts

Create a jars folder, then add all the jars and configuration files respecting the following file tree:


Project file tree

  • spark-monitoring-global-init.sh: This script is launched at the start of every cluster in the workspace.

    #!/bin/bash

    STAGE_DIR=/dbfs/databricks/spark-monitoring
    VARS_SCRIPT=$STAGE_DIR/spark-monitoring-vars.sh
    MONITORING_SCRIPT=$STAGE_DIR/spark-monitoring.sh

    # Run the variables script then the monitoring script if both are present
    if [ -d "$STAGE_DIR" -a -f "$VARS_SCRIPT" -a -f "$MONITORING_SCRIPT" ]; then
      /bin/bash "$VARS_SCRIPT";
      /bin/bash "$MONITORING_SCRIPT";
    else
      echo "Directory $STAGE_DIR does not exist or one of the needed scripts is missing"
    fi
  • spark-monitoring-vars.sh: This script is a template of all the environment variables needed at the cluster and workspace level.

    #!/bin/bash

    DB_HOME=/databricks
    SPARK_HOME=$DB_HOME/spark
    SPARK_CONF_DIR=$SPARK_HOME/conf

    # Append the workspace-level variables to the Spark environment file
    tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
    # Id of the Azure subscription
    export AZ_SUBSCRIPTION_ID="$AZ_SUBSCRIPTION_ID"
    # Resource group name of the workspace
    export AZ_RSRC_GRP_NAME="$AZ_RSRC_GRP_NAME"
    export AZ_RSRC_PROV_NAMESPACE=Microsoft.Databricks
    export AZ_RSRC_TYPE=workspaces
    # Name of the Databricks workspace
    export AZ_RSRC_NAME="$AZ_RSRC_NAME"
    EOF

4. Editing and adding spark-monitoring.sh

Copy spark-monitoring.sh from the cloned project, add it to the file tree and edit the environment variables as follows:

DB_HOME=/databricks
SPARK_HOME=$DB_HOME/spark
SPARK_CONF_DIR=$SPARK_HOME/conf

tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Export the cluster id and name from the environment variables
export DB_CLUSTER_ID=$DB_CLUSTER_ID
export DB_CLUSTER_NAME=$DB_CLUSTER_NAME
EOF

Given the large storage costs associated with a Log Analytics workspace, in the context of Spark metrics, apply filters based on REGEX expressions to only keep the most relevant log information. The event filtering documentation gives you the different variables to set; a sketch is shown below.
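
For example, the following lines can be appended to spark-monitoring.sh (a sketch: the variable names are taken from the library's event filtering documentation and should be checked against the version you build; the REGEX values are purely illustrative):

tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Keep only job and stage SparkListener events
export LA_SPARKLISTENEREVENT_REGEX="SparkListenerJobStart|SparkListenerJobEnd|SparkListenerStageCompleted"
# Keep only driver JVM heap metrics
export LA_SPARKMETRIC_REGEX="driver\.jvm\.heap"
# Keep only log4j events emitted by your application loggers
export LA_SPARKLOGGINGEVENT_NAME_REGEX="com\.mycompany\..*"
EOF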

5. Editing, adding and launching the PowerShell script

The dbx-monitoring-deploy.ps1 script is used to configure the export of cluster logs from a Databricks workspace to Log Analytics.

It performs the following actions:

  1. Fills spark-monitoring-vars.sh with the correct values for the workspace.
  2. Uploads spark-monitoring-vars.sh, spark-monitoring.sh and all the jar files to the workspace's DBFS.
  3. Posts the content of the global init script through the Databricks API.

It assumes there are 3 different Azure subscriptions (DEV/PREPROD/PROD) to separate the development, test and production phases of a continuous integration. A preproduction subscription is used for integration tests and business acceptance testing before going into production.

Edit this section according to your subscriptions.


param(
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$p,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$e,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$n,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$rg,
    [Parameter(Mandatory=$true,ValueFromPipeline=$true)]$w
    )

$armFolder = $p.TrimEnd("/")
$deploymentName = $n.ToLower()
$varsTemplatePath = "$armFolder/spark-monitoring-vars.sh"

# Map the environment to the matching Azure subscription
if ($e -like "dev") {
    $AZ_SUBSCRIPTION_ID = ""
}
elseif ($e -like 'prod') {
    $AZ_SUBSCRIPTION_ID = ""
}
elseif ($e -like 'preprod') {
    $AZ_SUBSCRIPTION_ID = ""
}
else {
    Write-Output "no environment provided - exiting"
    Exit-PSSession
}

$AZ_RSRC_GRP_NAME = $rg
$AZ_RSRC_NAME = $w
$environment = $e.ToLower()

$parametersPath = "$armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh"

# Fill the spark-monitoring-vars.sh template with the workspace values
$template = Get-Content "$varsTemplatePath" -Raw
$filledTemplate = Invoke-Expression "@`"`r`n$template`r`n`"@"

mkdir -p $armFolder/$environment/$deploymentName
Out-File -FilePath $parametersPath -InputObject $filledTemplate

# Select the Azure context matching the environment
try {
    $context = Get-AzContext
    if (!$context) {
        Write-Output "No context, please connect!"
        $Credential = Get-Credential
        Connect-AzAccount -Credential $Credential -ErrorAction Stop
    }
    if ($environment -like "dev") {
        Set-AzContext "AD-DEV01" -ErrorAction Stop
    }
    elseif ($environment -like 'prod') {
        Set-AzContext "AD-PROD01" -ErrorAction Stop
    }
    elseif ($environment -like 'preprod') {
        Set-AzContext "AD-PREPROD01" -ErrorAction Stop
    }
    else {
        Write-Output "no context found for the provided environment - exiting"
        Exit
    }
}
catch {
    Write-Output "error setting context - exiting"
    Exit
}

# Retrieve the workspace URL and get an AAD token for the Databricks CLI
$mydbx = Get-AzDatabricksWorkspace -ResourceGroupName $AZ_RSRC_GRP_NAME -Name $AZ_RSRC_NAME
$hostVar = "https://" + $mydbx.Url

$myToken = Get-AzAccessToken -Resource "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
$env:DATABRICKS_AAD_TOKEN = $myToken.Token

databricks configure --aad-token --host $hostVar

# Upload the scripts and the jars to DBFS
databricks fs mkdirs dbfs:/databricks/spark-monitoring

databricks fs cp --overwrite $armFolder/spark-monitoring.sh dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh dbfs:/databricks/spark-monitoring/spark-monitoring-vars.sh
databricks fs cp --recursive --overwrite $armFolder/jars dbfs:/databricks/spark-monitoring

# Post the global init script (base64-encoded) through the Databricks API
$inputfile = "$armFolder/spark-monitoring-global-init.sh"
$fc = Get-Content $inputfile -Encoding UTF8 -Raw
$By = [System.Text.Encoding]::UTF8.GetBytes($fc)
$etext = [System.Convert]::ToBase64String($By, 'InsertLineBreaks')

$Body = @{
    name = "monitoring"
    script = "$etext"
    position = 1
    enabled = $true
}

$JsonBody = $Body | ConvertTo-Json

$Uri = "https://" + $mydbx.Url + "/api/2.0/global-init-scripts"

$Header = @{ Authorization = "Bearer $env:DATABRICKS_AAD_TOKEN" }

Invoke-RestMethod -Method Post -Uri $Uri -Headers $Header -Body $JsonBody

Fill in and launch the script with the following parameters:

Parameter  Description
p          Path to the script
e          Environment (DEV, PREPROD, PROD)
n          Deployment name
rg         Workspace resource group
w          Workspace name

Call the script as follows:

pwsh dbx-monitoring-deploy.ps1 -p /home/Documents/pwsh-spark-monitoring/pwsh-deploy-dbx-spark-monitoring -e DEV -n deploy_log_analytics_wksp_sales -rg rg-dev-datalake -w dbx-dev-datalake-sales

Thanks to this script, you can easily deploy the Spark-monitoring library on all your Databricks workspaces.
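
For instance, a small wrapper (a sketch with hypothetical deployment names, resource groups, and workspace names) can chain the deployments over several workspaces of the same environment:

#!/bin/bash
# Hypothetical "deployment-name:resource-group:workspace" triplets
WORKSPACES=(
  "deploy_log_analytics_wksp_sales:rg-dev-datalake:dbx-dev-datalake-sales"
  "deploy_log_analytics_wksp_hr:rg-dev-datalake:dbx-dev-datalake-hr"
)

for entry in "${WORKSPACES[@]}"; do
  IFS=':' read -r name rg wksp <<< "$entry"
  pwsh dbx-monitoring-deploy.ps1 \
    -p /home/Documents/pwsh-spark-monitoring/pwsh-deploy-dbx-spark-monitoring \
    -e DEV -n "$name" -rg "$rg" -w "$wksp"
done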

The logs sent natively allow monitoring cluster health, job execution, and errors reported from notebooks. Another way to monitor daily data processing is to perform custom logging using a log4j appender. This way, you can add steps to implement data quality validation over the ingested and cleaned data, and custom tests with a predefined list of expectations to validate the data against.

We can consider using custom logs to log bad records, apply checks and constraints on the data, and then send quality metrics to Log Analytics for reporting and alerting. To do so, you can build your own data quality library or use existing tools like Apache Griffin or Amazon Deequ.