
Automate notifications on Slack for Amazon Redshift query monitoring rule violations


In this post, we walk you through how to set up automated notifications of query monitoring rule (QMR) violations in Amazon Redshift to a Slack channel, so that Amazon Redshift users can take timely action.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. One of the challenges is to protect the data warehouse workload from poorly written queries that can consume significant resources. Amazon Redshift query monitoring rules are a feature of workload management (WLM) that allow automated handling of poorly written queries. Rules that are applied to a WLM queue allow queries to be logged, canceled, hopped (only available with manual WLM), or to have their priority changed (only available with automatic WLM). The reason to use QMRs is to protect against wasteful use of the cluster. You can also use these rules to log resource-intensive queries, which provides the opportunity to establish governance for ad hoc workloads.

The Amazon Redshift cluster automatically collects query monitoring rules metrics. This convenient mechanism lets you view attributes like the following:

  • Query runtime, in seconds
  • Query returned row count
  • The CPU time for a SQL statement

It also makes Amazon Redshift Spectrum metrics available, such as the number of Redshift Spectrum rows and MBs scanned by a query.

When a query violates a QMR, Amazon Redshift logs the violation in the STL_WLM_RULE_ACTION system view. If the action is abort for the queries that violate a QMR, end-users see an error that indicates query failure due to violation of QMRs. We recommend that administrative team members periodically examine violations listed in the STL_WLM_RULE_ACTION table and coach the involved end-users on how to avoid future rule violations.
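For example, an administrator can review recent violations with a query like the following (a simple sketch against the same system view that the automation in this post uses):

select userid, query, service_class, trim(rule) as rule,
       trim(action) as action, recordtime
from stl_wlm_rule_action
where userid > 1
order by recordtime desc;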

Alternatively, a centralized team, using a Slack channel for collaboration and monitoring, can configure Amazon Redshift events and alarms to be sent to their channel, so that they can take timely action. In the following sections, we walk you through how to set up automated notifications of QMR violations to a Slack channel through the use of Slack events and alarms. This allows Amazon Redshift users to be notified and take timely action without the need to query the system view.

Solution overview

To demonstrate how you can receive automated notification to a Slack channel for QMR violations, we have designed the following architecture. As shown in the following diagram, we have mixed workloads: extract, transform, and load (ETL), business intelligence (BI) dashboards, and analytics applications that are powered by an Amazon Redshift cluster. The solution relies on AWS Lambda and Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift QMR violations to Slack.

To implement this solution, you create an Amazon Redshift cluster and attach a custom parameter group.

Amazon Redshift provides one default parameter group for each parameter group family. The default parameter group has preset values for each of its parameters, and it can't be modified. If you want to use different parameter values than the default parameter group, you must create a custom parameter group and then associate your cluster with it.

In the parameter group, you can use automatic WLM and define a few workload queues, such as a queue for processing ETL workloads and a reporting queue for user queries. You can name the default queue adhoc. With automatic WLM, Amazon Redshift determines the optimal concurrency and memory allocation for each query that's running in each queue.

For each workload queue, you can define one or more QMRs. For example, you can create a rule to abort a user query if it runs for more than 300 seconds or returns more than 1 billion rows. Similarly, you can create a rule to log a Redshift Spectrum query that scans more than 100 MB.

The Amazon Redshift WLM evaluates metrics every 10 seconds. It records details about actions that result from QMR violations associated with user-defined queues in the STL_WLM_RULE_ACTION system table. In this solution, a Lambda function is scheduled to monitor the STL_WLM_RULE_ACTION system table every few minutes. When the function is invoked, if it finds a new entry, it publishes a detailed message to an SNS topic. A second Lambda function, created as the target subscriber to the SNS topic, is invoked whenever any message is published to the SNS topic. This second function invokes a pre-created Slack webhook, which sends the message that was received through the SNS topic to the Slack channel of your choice. (For more information on publishing messages by using Slack webhooks, see Sending messages using incoming webhooks.)

To summarize, the solution involves the following steps:

  1. Create an Amazon Redshift custom parameter group and add workload queues.
  2. Configure query monitoring rules.
  3. Attach the custom parameter group to the cluster.
  4. Create an SNS topic.
  5. Create a Lambda function and schedule it to run every 5 minutes by using an Amazon EventBridge rule.
  6. Create the Slack resources.
  7. Add an incoming webhook and authorize the Slack app to post messages to a Slack channel.
  8. Create the second Lambda function and subscribe it to the SNS topic.
  9. Test the solution.

Create an Amazon Redshift custom parameter group and add workload queues

In this step, you create an Amazon Redshift custom parameter group with automatic WLM enabled. You also create the following queues to separate the workloads in the parameter group:

  • reporting – The reporting queue runs BI reporting queries that are performed by any user who belongs to the Amazon Redshift database group named reporting_group
  • adhoc – The default queue, renamed adhoc, performs any query that's not sent to any other queue

Complete the following steps to create your parameter group and add workload queues:

  1. Create a parameter group, named csblog, with automatic WLM enabled.
  2. On the Amazon Redshift console, select the custom parameter group you created.
  3. Choose Edit workload queues.
  4. On the Modify workload queues page, choose Add queue.
  5. Fill in the Concurrency scaling mode and Query priority fields as needed to create the reporting queue.
  6. Repeat these steps to add the adhoc queue.

For more information about WLM queues, refer to Configuring workload management.
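If you prefer to script this step rather than use the console, the following sketch (assuming the boto3 Redshift client and the parameter group name used in this post) creates the custom parameter group; the queues and rules themselves are applied through the wlm_json_configuration parameter, a sketch of which appears in the next section:

import boto3

redshift = boto3.client('redshift')

# Create the custom parameter group used throughout this post.
# 'redshift-1.0' is currently the only parameter group family.
redshift.create_cluster_parameter_group(
    ParameterGroupName='csblog',
    ParameterGroupFamily='redshift-1.0',
    Description='Custom parameter group with automatic WLM and QMRs'
)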

Configure query monitoring rules

In this step, you add QMRs to each workload queue. For instructions, refer to Creating or modifying a query monitoring rule using the console.

For the reporting queue, add the following QMRs:

  • nested_loop – Logs any query involved in a nested loop join that results in a row count greater than 10,000,000 rows.
  • long_running – Stops queries that run for more than 300 seconds (5 minutes).

For the adhoc queue, add the following QMRs (a sketch of the combined WLM JSON configuration follows this list):

  • returned_rows – Stops any query that returns more than 1,000,000 rows back to the calling client application (this isn't practical and can degrade the end-to-end performance of the application).
  • spectrum_scan – Stops any query that scans more than 1000 MB of data from an Amazon Simple Storage Service (Amazon S3) data lake by using Redshift Spectrum.
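For reference, QMRs defined through the console ultimately live in the parameter group's wlm_json_configuration parameter. The following is a trimmed, illustrative sketch of how the queues and rules above might be expressed there (rule, predicate, and metric names follow the WLM JSON syntax; other required queue attributes are omitted, so verify the exact schema against the Amazon Redshift documentation before using it):

[
  {
    "name": "reporting",
    "user_group": ["reporting_group"],
    "auto_wlm": true,
    "rules": [
      {"rule_name": "nested_loop",
       "predicate": [{"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 10000000}],
       "action": "log"},
      {"rule_name": "long_running",
       "predicate": [{"metric_name": "query_execution_time", "operator": ">", "value": 300}],
       "action": "abort"}
    ]
  },
  {
    "name": "adhoc",
    "auto_wlm": true,
    "rules": [
      {"rule_name": "returned_rows",
       "predicate": [{"metric_name": "return_row_count", "operator": ">", "value": 1000000}],
       "action": "abort"},
      {"rule_name": "spectrum_scan",
       "predicate": [{"metric_name": "spectrum_scan_size_mb", "operator": ">", "value": 1000}],
       "action": "abort"}
    ]
  }
]

A configuration like this can be applied by passing it as the value of the wlm_json_configuration parameter in a modify_cluster_parameter_group API call.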

Attach the custom parameter group to the cluster

To attach the custom parameter group to your provisioned Redshift cluster, follow the instructions in Associating a parameter group with a cluster. If you don't already have a provisioned Redshift cluster, refer to Create a cluster.

For this post, we attached our custom parameter group csblog to an already created provisioned Amazon Redshift cluster.

Create an SNS topic

In this step, you create an SNS topic that receives a detailed message of QMR violations from the Lambda function that checks the Amazon Redshift system table for QMR violation entries. For instructions, refer to Creating an Amazon SNS topic.

For this post, we created an SNS topic named redshiftqmrrulenotification.
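If you script your setup, the topic can also be created with a short boto3 call (a sketch using the topic name from this post):

import boto3

sns_client = boto3.client('sns')

# Create the topic; the returned TopicArn is later passed to the
# monitoring Lambda function through its sns_arn environment variable.
response = sns_client.create_topic(Name='redshiftqmrrulenotification')
print(response['TopicArn'])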

Create a Lambda function to monitor the system table

In this step, you create a Lambda function that monitors the STL_WLM_RULE_ACTION system table. Whenever any record is found in the table since the last time the function ran, the function publishes a detailed message to the SNS topic that you created earlier. You also create an EventBridge rule to invoke the function every 5 minutes.

For this post, we create a Lambda function named redshiftqmrrule that is scheduled to run every 5 minutes through an EventBridge rule named Redshift-qmr-rule-Lambda-schedule. For instructions, refer to Building Lambda functions with Python.

The following screenshot shows the function that checks the pg_catalog.stl_wlm_rule_action table.

To create an EventBridge rule and associate it with the Lambda function, refer to Create a Rule.

The following screenshot shows the EventBridge rule Redshift-qmr-rule-Lambda-schedule, which calls the function every 5 minutes.
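The same scheduling can be wired up in code. The following sketch (the account ID and function ARN are placeholders, not values from this post) creates the rule, targets the function, and grants EventBridge permission to invoke it:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder ARN for the redshiftqmrrule function; replace with your own.
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:redshiftqmrrule'

# Create (or update) a rule that fires every 5 minutes.
rule = events.put_rule(
    Name='Redshift-qmr-rule-Lambda-schedule',
    ScheduleExpression='rate(5 minutes)'
)

# Point the rule at the function, and allow EventBridge to invoke it.
events.put_targets(
    Rule='Redshift-qmr-rule-Lambda-schedule',
    Targets=[{'Id': 'redshiftqmrrule-target', 'Arn': function_arn}]
)
lambda_client.add_permission(
    FunctionName='redshiftqmrrule',
    StatementId='eventbridge-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)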

We use the following Python 3.9 code for this Lambda function. The function uses an Amazon Redshift Data API call that uses GetClusterCredentials for temporary credentials.

import time
import sys
import os
import boto3
from datetime import datetime

# initiate redshift-data client in boto3
client = boto3.client("redshift-data")

query = "select userid,query,service_class,trim(rule) as rule,trim(action) as action,recordtime from stl_wlm_rule_action WHERE userid > 1 AND recordtime >= current_timestamp AT TIME ZONE 'UTC' - INTERVAL '5 minute' order by recordtime desc;"
sns = boto3.resource('sns')
sns_arn = os.environ['sns_arn']
platform_endpoint = sns.PlatformEndpoint('{sns_arn}'.format(sns_arn=sns_arn))

def status_check(client, query_id):
    # Poll the Data API for the status of an asynchronous statement.
    desc = client.describe_statement(Id=query_id)
    status = desc["Status"]
    if status == "FAILED":
        raise Exception('SQL query failed:' + query_id + ": " + desc["Error"])
    return status.strip('"')

def execute_sql(sql_text, redshift_database, redshift_user, redshift_cluster_id):
    print("Executing: {}".format(sql_text))
    res = client.execute_statement(Database=redshift_database, DbUser=redshift_user, Sql=sql_text,
                                   ClusterIdentifier=redshift_cluster_id)

    query_id = res["Id"]
    print("query id")
    print(query_id)
    # Wait until the statement either finishes or fails.
    done = False
    while not done:
        time.sleep(1)
        status = status_check(client, query_id)
        if status in ("FAILED", "FINISHED"):
            print("status is: {}".format(status))
            break
    return query_id

def publish_to_sns(message):
    try:
        # Publish a message.
        response = platform_endpoint.publish(
                  Subject="Redshift Query Monitoring Rule Notifications",
                  Message=message,
                  MessageStructure="string"
            )
        return response

    except:
        print('Failed to publish message to SNS topic: exception %s' % sys.exc_info()[1])
        return 'Failed'

def lambda_handler(event, context):

    rsdb = os.environ['rsdb']
    rsuser = os.environ['rsuser']
    rscluster = os.environ['rscluster']
    res = execute_sql(query, rsdb, rsuser, rscluster)
    print("res")
    print(res)
    response = client.get_statement_result(
        Id=res
    )
    # datetime object containing current date and time
    now = datetime.now()
    dt_string = now.strftime("%d-%b-%Y %H:%M:%S")
    print(response)
    if response['TotalNumRows'] > 0:
        # Build one human-readable report line per QMR violation record.
        messageText = '################## Reporting Begin' + ' [' + str(dt_string) + ' UTC] ##################\n\n'
        messageText = messageText + 'Total number of queries affected by QMR rule violation for Redshift cluster "' + rscluster + '" is ' + str(len(response['Records'])) + '.' + '\n' + '\n'
        for i in range(len(response['Records'])):
            messageText = messageText + 'It was reported at ' + str(response['Records'][i][5]['stringValue'])[11:19] + ' UTC on ' + str(response['Records'][i][5]['stringValue'])[0:10] + ' that a query with Query ID - ' + str(response['Records'][i][1]['longValue']) + ' had to ' + str(response['Records'][i][4]['stringValue']) + ' due to violation of QMR rule "' + str(response['Records'][i][3]['stringValue']) + '".\n'
        messageText = messageText + '\n########################### Reporting End ############################\n\n'
        query_result_json = messageText
        response = publish_to_sns(query_result_json)
    else:
        print('No rows to publish to SNS')

We use four environment variables for this Lambda function:

  • rscluster – The Amazon Redshift provisioned cluster identifier
  • rsdb – The Amazon Redshift database where you're running these tests
  • rsuser – The Amazon Redshift user who has the privilege to run queries on pg_catalog.stl_wlm_rule_action
  • sns_arn – The Amazon Resource Name (ARN) of the SNS topic that we created earlier
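One way to set these variables in code (a sketch with placeholder values; substitute your own cluster identifier, database, user, and topic ARN) is a single update_function_configuration call:

import boto3

lambda_client = boto3.client('lambda')

# Placeholder values; replace each one with your own resources.
lambda_client.update_function_configuration(
    FunctionName='redshiftqmrrule',
    Environment={'Variables': {
        'rscluster': 'my-redshift-cluster',
        'rsdb': 'dev',
        'rsuser': 'awsuser',
        'sns_arn': 'arn:aws:sns:us-east-1:123456789012:redshiftqmrrulenotification'
    }}
)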

Create Slack resources

In this step, you create a new Slack workspace (if you don't have one already), a new private Slack channel (only if you don't have one or don't want to use an existing one), and a new Slack app in the Slack workspace. For instructions, refer to Create a Slack workspace, Create a channel, and Creating an app.

For this post, we created the following resources on the Slack website and in the Slack desktop app:

  • A Slack workspace named RedshiftQMR*****
  • A private channel, named redshift-qmr-notification-*****-*******, in the newly created Slack workspace
  • A new Slack app in the Slack workspace, named RedshiftQMRRuleNotification (using the From Scratch option)

Add an incoming webhook and authorize the Slack app

In this step, you enable and add an incoming webhook to the Slack workspace that you created. For full instructions, refer to Enable Incoming Webhooks and Create an Incoming Webhook. You also authorize your Slack app so that it can post messages to the private Slack channel.

  1. In the Slack app, under Settings in the navigation pane, choose Basic Information.
  2. Choose Incoming Webhooks.
  3. Turn on Activate Incoming Webhooks.
  4. Choose Add New Webhook to Workspace.
  5. Authorize the Slack app RedshiftQMRRuleNotification so that it can post messages to the private Slack channel redshift-qmr-notification-*****-*******.

The following screenshot shows the details of the newly added incoming webhook.
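Before wiring the webhook into Lambda, you can verify that it works by posting a test message to it. The following sketch uses the same urllib3 pattern as the second Lambda function later in this post (the webhook URL is a placeholder):

import json
import urllib3

http = urllib3.PoolManager()

# Placeholder URL; use the webhook URL generated for your Slack app.
webhook_url = 'https://hooks.slack.com/services/T0000000/B0000000/XXXXXXXXXXXXXXXXXXXXXXXX'

resp = http.request(
    'POST', webhook_url,
    body=json.dumps({'text': 'QMR notification test'}).encode('utf-8'),
    headers={'Content-Type': 'application/json'}
)
print(resp.status)  # 200 indicates the message reached Slack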

Create a second Lambda function and subscribe to the SNS topic

In this step, you create a second Lambda function that is subscribed to the SNS topic that you created earlier. For full instructions, refer to Building Lambda functions with Python and Subscribing a function to a topic.

For this post, we create a second function named redshiftqmrrulesnsinvoke, which is subscribed to the SNS topic redshiftqmrrulenotification. The second function sends a detailed QMR violation message (received from the SNS topic) to the designated Slack channel named redshift-qmr-notification-*. This function uses the incoming Slack webhook that we created earlier.

We also create an SNS subscription of the second Lambda function to the SNS topic that we created previously.
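In code form, the subscription, together with the permission that lets Amazon SNS invoke the function, could look like the following sketch (the ARNs are placeholders):

import boto3

sns_client = boto3.client('sns')
lambda_client = boto3.client('lambda')

# Placeholder ARNs; replace with your topic and function ARNs.
topic_arn = 'arn:aws:sns:us-east-1:123456789012:redshiftqmrrulenotification'
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:redshiftqmrrulesnsinvoke'

# Allow SNS to invoke the function, then subscribe it to the topic.
lambda_client.add_permission(
    FunctionName='redshiftqmrrulesnsinvoke',
    StatementId='sns-invoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic_arn
)
sns_client.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)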

The following is the Python 3.9 code used for the second Lambda function:

import urllib3
import json
import os

http = urllib3.PoolManager()

def lambda_handler(event, context):

    url = os.environ['webhook']
    channel = os.environ['channel']
    # Build the Slack message payload from the SNS notification.
    msg = {
        "channel": channel,
        "username": "WEBHOOK_USERNAME",
        "text": event['Records'][0]['Sns']['Message'],
        "icon_emoji": ""
    }

    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', url, body=encoded_msg)
    print({
        "message": event['Records'][0]['Sns']['Message'],
        "status_code": resp.status,
        "response": resp.data
    })

We use two environment variables for the second Lambda function:

  • channel – The Slack channel that we created
  • webhook – The Slack webhook that we created

Test the solution

To show the effect of the QMRs, we ran queries that violate the QMRs we set up.

Test 1: Returned rows

Test 1 looks for violations of the returned_rows QMR, in which the returned row count is over 1,000,000 for a query that ran in the adhoc queue.

We created and loaded a table named lineitem in a schema named aquademo, which has more than 18 billion records. You can refer to the GitHub repo to create and load the table.

We ran the following query, which violated the returned_rows QMR, and the query was stopped as specified in the action set in the QMR.

select * from aquademo.lineitem limit 1000001;

The following screenshot shows the view from the Amazon Redshift client after running the query.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 2: Long-running queries

Test 2 looks for violations of the long_running QMR, in which query runtime is over 300 seconds for a user who belongs to reporting_group.

In the following code, we created a new Amazon Redshift group named reporting_group and added a new user, named reporting_user, to the group. reporting_group is granted USAGE and SELECT privileges on all tables in the retail and aquademo schemas.

create group reporting_group;
create user reporting_user in group reporting_group password 'Test12345';
grant usage on schema retail,aquademo to group reporting_group;
grant select on all tables in schema retail,aquademo to group reporting_group;

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the long_running QMR, and the query was stopped as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session to off;
select * from aquademo.lineitem;

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 3: Nested loops

Test 3 looks for violations of the nested_loop QMR, in which the nested loop join row count is over 10,000,000 for a user who belongs to reporting_group.

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the nested_loop QMR, and the query logged the violation as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session to off;
select ss.*,cd.*
from retail.store_sales ss
, retail.customer_demographics cd;

Before we ran the original query, we also checked the explain plan and noted that this nested loop would return more than 10,000,000 rows. The following screenshot shows the query explain plan.
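To reproduce this check yourself, you can run the join with the EXPLAIN prefix (a simple sketch follows) and inspect the estimated row count on the nested loop step before running the query for real:

explain
select ss.*, cd.*
from retail.store_sales ss, retail.customer_demographics cd;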

The following screenshot shows the notification we received in our Slack channel.

Test 4: Redshift Spectrum scans

Test 4 looks for violations of the spectrum_scan QMR, in which Redshift Spectrum scans exceed 1000 MB for a query that ran in the adhoc queue.

For this example, we used store_sales data (unloaded from an Amazon Redshift table that was created by using the TPC-DS benchmark data) loaded in an Amazon S3 location. The data in Amazon S3 is non-partitioned under one prefix and is around 3.9 GB in size. We created an external schema (qmr_spectrum_rule_test) and external table (qmr_rule_store_sales) in Redshift Spectrum.

We used the following steps to run this test with the sample data:

  1. Run an unload SQL command:
    unload ('select * from store_sales')
    to 's3://<<Your Amazon S3 Location>>/store_sales/'
    iam_role default;

  2. Create an external schema from Redshift Spectrum:
    CREATE EXTERNAL SCHEMA if not exists qmr_spectrum_rule_test
    FROM DATA CATALOG DATABASE 'qmr_spectrum_rule_test' region 'us-east-1'
    IAM_ROLE default
    CREATE EXTERNAL DATABASE IF NOT exists;

  3. Create an external table in Redshift Spectrum:
    create external table qmr_spectrum_rule_test.qmr_rule_store_sales
    (
    ss_sold_date_sk int4 ,            
      ss_sold_time_sk int4 ,     
      ss_item_sk int4  ,      
      ss_customer_sk int4 ,           
      ss_cdemo_sk int4 ,              
      ss_hdemo_sk int4 ,         
      ss_addr_sk int4 ,               
      ss_store_sk int4 ,           
      ss_promo_sk int4 ,           
      ss_ticket_number int8 ,        
      ss_quantity int4 ,           
      ss_wholesale_cost numeric(7,2) ,          
      ss_list_price numeric(7,2) ,              
      ss_sales_price numeric(7,2) ,
      ss_ext_discount_amt numeric(7,2) ,             
      ss_ext_sales_price numeric(7,2) ,              
      ss_ext_wholesale_cost numeric(7,2) ,           
      ss_ext_list_price numeric(7,2) ,               
      ss_ext_tax numeric(7,2) ,                 
      ss_coupon_amt numeric(7,2) , 
      ss_net_paid numeric(7,2) ,   
      ss_net_paid_inc_tax numeric(7,2) ,             
      ss_net_profit numeric(7,2)                     
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '|' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://<<Your Amazon S3 Location>>/store_sales/'
    TABLE PROPERTIES (
      'averageRecordSize'='130', 
      'classification'='csv', 
      'columnsOrdered'='true', 
      'compressionType'='none', 
      'delimiter'='|', 
      'recordCount'='11083990573', 
      'sizeKey'='1650877678933', 
      'typeOfData'='file');

  4. Run the following query:
    select *
    FROM qmr_spectrum_rule_test.qmr_rule_store_sales
    where ss_sold_date_sk = 2451074;

The query violated the spectrum_scan QMR, and the query was stopped as specified in the action set in the QMR.

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Clean up

When you're finished with this solution, we recommend deleting the resources you created to avoid incurring any further costs.
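For example, the supporting resources created in this post can be removed with boto3 calls like the following sketch (the topic ARN is a placeholder; the Amazon Redshift cluster and parameter group, if no longer needed, can be deleted from the console):

import boto3

# Remove the EventBridge schedule (targets must be removed before the rule).
events = boto3.client('events')
events.remove_targets(Rule='Redshift-qmr-rule-Lambda-schedule', Ids=['redshiftqmrrule-target'])
events.delete_rule(Name='Redshift-qmr-rule-Lambda-schedule')

# Delete both Lambda functions.
lambda_client = boto3.client('lambda')
lambda_client.delete_function(FunctionName='redshiftqmrrule')
lambda_client.delete_function(FunctionName='redshiftqmrrulesnsinvoke')

# Deleting the topic also deletes its subscriptions.
boto3.client('sns').delete_topic(
    TopicArn='arn:aws:sns:us-east-1:123456789012:redshiftqmrrulenotification')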

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. In this post, we discussed how you can automate notification of misbehaving queries on Slack by using query monitoring rules. QMRs can help you maximize cluster performance and throughput when supporting mixed workloads. Use these instructions to set up your Slack channel to receive automated notifications from your Amazon Redshift cluster for any violation of QMRs.


About the Authors

Dipankar Kushari is a Senior Specialist Solutions Architect in the Analytics team at AWS.

Harshida Patel is a Specialist Senior Solutions Architect in the Analytics team at AWS.
