It is no secret that getting your head around the capabilities or syntax of Esper EPL can be difficult and that finding examples which fit your needs can be just as hard. Even then, you may still be left with more questions than answers. In this blog, we take a look at a rule from an educational perspective. With the information provided, you should be able to massage the rule into a variety of use cases or apply the concepts to brand new requirements.
A quick note about documentation: Google queries have often led me to old versions of the Esper documentation, so double-check the URL when you're following links. Here's a link to the latest documentation; I'd particularly suggest reading section 2. https://esper.espertech.com/release-8.9.0/reference-esper/html_single/
Alerts based on the rate of traffic matching an arbitrary query can be useful for a SOC monitoring suspicious activity or for engineering teams ensuring proper data flow. To avoid repetitive alerting, this rule is meant to act similarly to Health & Wellness alarms:
A threshold condition (total event count) is met within a time period
The condition is sustained for a number of time periods
An alert fires and will not fire again until the condition resolves (count falls below the threshold)
Follow-up actions from this alert should include a manual query to better assess the situation causing the rule to fire. Alternatively, integration with SOAR could trigger queries or reports to obtain the relevant information.
This rule was specifically designed without sliding or batch time windows because of their high memory overhead, which traditionally requires time windows to be short and to hold only a small number of events. Given that only a count is being kept here, it should be suitable for high rates and long periods with little performance impact. However, as always, the rule should be deployed as a trial rule and monitored closely after the initial deployment.
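To illustrate why keeping a running count is cheaper than keeping a time window, here is a minimal Python sketch. It is not EPL and it is not how the ESA engine is implemented; the class name `PeriodCounter` and its methods are hypothetical. It stores a single integer per period and emits one True/False result when the period closes, which loosely mirrors the approach taken by the rule.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PeriodCounter:
    """Hypothetical model of one counting period: stores a count, never the events."""
    threshold: int
    count: int = 0

    def on_event(self, matches_filter: bool) -> None:
        # Only the integer changes; the event itself is discarded.
        if matches_filter:
            self.count += 1

    def close_period(self) -> Tuple[bool, int]:
        # Emit the result for the finished period and reset for the next one.
        result = (self.count >= self.threshold, self.count)
        self.count = 0
        return result


counter = PeriodCounter(threshold=3)
for matched in [True, False, True, True]:
    counter.on_event(matched)
first_period = counter.close_period()   # three matches reach the threshold of 3
second_period = counter.close_period()  # no events arrived in this period
```

Memory use in this model is constant regardless of how many events arrive, whereas a sliding or batch window would have to hold every event until it expires.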
At the time of this writing, I've observed that the Java engine can collect statistics from the rule only in specific situations. This might make more sense as you read further, but imagine the following events in the window, where S is stat collection start, E is end, and F T T T is the search pattern.
* * * * * * * *
F F F T F T F T T T T T F T T F T T F T T T T T T F F
S E S E
I've talked to the engineering team about this and it appears to be a limitation of objects exposed from the Esper engine to the NetWitness integration while the search pattern is incomplete. Regardless, during the periods where statistics are available, I've found this rule is only using 25-100KB of memory and very little CPU, even when matching over 1,000,000 events per hour.
Future enhancements could include modifying the rule to alert when a rate is sustained below a threshold (easy), dynamically calculating the threshold based on standard deviation (complicated), or considering the pass/fail criteria based on followed-by or otherwise complex conditions. The rule could also be simplified by alerting on a total count within a time period or set of periods and then using @RSAAlert to suppress alerts for an acceptable period of time. Otherwise, for the purpose of alerting on a sustained rate above a threshold, the sample rule provided here is easily customizable to fit other use cases by modifying the event filter criteria.
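As a rough idea of what a threshold based on standard deviation could look like, here is a small Python sketch. It is only one possible reading of that enhancement and is not part of the rule: the function name `dynamic_threshold`, the multiplier `k`, and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def dynamic_threshold(history: List[int], k: float = 2.0) -> float:
    """Hypothetical threshold: mean of past period counts plus k standard deviations."""
    return mean(history) + k * pstdev(history)


# Made-up per-period counts, used only to show the arithmetic.
past_counts = [400, 500, 600]
threshold = dynamic_threshold(past_counts, k=2.0)
```

In the rule itself, this would mean replacing the fixed threshold variable with a value recalculated from the counts already stored in the history window, which is where the complexity comes from.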
Consider an example where the rule is looking for time periods containing more than 700 events matching the selection criteria and sustained for two periods. Also requiring the two periods to be preceded by a period that failed to meet the threshold prevents the alert from firing again until the situation causing the elevated rate has ended.
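The example above can be checked with a short Python simulation. This is a sketch of the logic only, not the EPL used by the rule; the function name `alert_periods` and the sample counts are hypothetical. It marks each period as exceeding the threshold or not, then reports the periods at which one failing period is immediately followed by the required number of exceeding periods.

```python
from typing import List


def alert_periods(counts: List[int], threshold: int, sustained: int) -> List[int]:
    """Return the indexes of periods at which an alert would fire.

    An alert fires at period i when the `sustained` periods ending at i all
    exceed the threshold and the period just before them did not.
    """
    exceeded = [c > threshold for c in counts]
    alerts = []
    for i in range(sustained, len(exceeded)):
        run = exceeded[i - sustained + 1 : i + 1]
        preceded_by_fail = not exceeded[i - sustained]
        if preceded_by_fail and all(run):
            alerts.append(i)
    return alerts


# Made-up counts per period: F T T T F T T against a threshold of 700.
counts = [100, 800, 900, 950, 300, 750, 800]
alerts = alert_periods(counts, threshold=700, sustained=2)

# A rate that is already high from the first period never produces an alert.
already_high = alert_periods([800, 900, 950], threshold=700, sustained=2)
```

With these counts, the alert fires at the third period, stays quiet during the fourth even though the rate is still high, and fires again only after the rate has dropped and recovered.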
The example below looks for three 60-minute periods, each containing at least 500,000 events. The event criteria are inbound DNS with the error = "no name" meta. The error meta key is a multivalued meta key and, as such, will need to be defined in the correlation service. Don't forget to sync the keys when this value is updated in the service config. https://community.netwitness.com/t5/netwitness-platform-online/update-your-esa-rules-for-the-required-multi-value-and-single/ta-p/669426
Including direction IS NOT NULL may or may not be required. In the past, it was advised to ensure that at least one condition in a statement would match to avoid EPL errors. I will update the blog if I get clarification on this point.
WARNING: If you are deploying multiple versions of this rule, the named window, context, variable, and schema names must be unique because these are global structures shared by the ESA deployment. In a text editor, you can find/replace the rule suffix in those names (rule1 in the rule below) with a unique value to avoid issues.
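If you would rather script the renaming than do it by hand, a minimal Python sketch follows. The function name `renumber_rule` is hypothetical, and it assumes the names end in a suffix such as rule1, as they do in the rule in this post.

```python
import re


def renumber_rule(epl_text: str, old_suffix: str, new_suffix: str) -> str:
    """Replace a rule suffix such as 'rule1' wherever it ends an identifier."""
    # The \b after the suffix keeps 'rule1' from matching inside 'rule10'.
    pattern = re.compile(re.escape(old_suffix) + r"\b")
    return pattern.sub(new_suffix, epl_text)


sample = (
    "CREATE VARIABLE integer var_session_threshold_rule1 = 500000;\n"
    "CREATE VARIABLE integer var_window_minutes_rule10 = 60;"
)
renamed = renumber_rule(sample, "rule1", "rule2")
```

Review the result before deploying, since the replacement also touches any comment text that happens to contain the same suffix.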
The comments in the rule below explain each component used, or should at least give you enough information to find the relevant sections in the Esper EPL documentation.
//If you are deploying multiple versions of this rule, the following must be unique because these are globally shared by the ESA deployment. Update all occurrences in the rule by replacing "rule#" with a unique rule number:
// window names
// context names
// variable names
// schema names
//Modify the variables, the FROM clause in the statement with the event filter criteria, and the pattern match occurrences (numbers in {#} next to E1/E2).
//This alert will NOT populate the Respond alert with actual session data. It will simply notify you that your criteria has been met which should then be followed up with a manual query.
//If the rule is deployed when the matching traffic rate is already above the threshold, the rule will not begin tracking until the rate drops below the threshold again.
//The time period must contain at least this many sessions in order to insert a "True" value into the history window.
CREATE VARIABLE integer var_session_threshold_rule1 = 500000;
//The length of time that must pass before the final comparison of the total number of sessions in the window against the threshold.
CREATE VARIABLE integer var_window_minutes_rule1 = 60;
//100 should be large enough to handle most use cases. This window's main purpose is to hold True/False values for the match_recognize statement.
//It is intentionally not part of the context because we need to track results spanning multiple context instances.
//@RSAPersist will keep the contents of the window stored on the ESA appliance at /var/netwitness/correlation-server/esa-windows/. When the service restarts, the window contents will be restored from the file.
//When the stored events are written back into the window (oldest to newest), this will trigger the SELECT statement and may fire alerts. The @RSAAlert suppression would likely mean only one alert would fire.
//It is not necessary to use @RSAPersist but it can be helpful for debugging through service restarts.
@RSAPersist
CREATE WINDOW thresholdHistoryWindow_rule1.win:length(100) (bool_exceeded bool, window_ending string, matched_count long);
//Contexts are a way to segment event streams. In this case, we start a context as soon as the rule deploys, terminate it after the defined number of minutes, and start a new one after the defined number of minutes.
CREATE CONTEXT contextTimePeriod_rule1 INITIATED @now AND PATTERN [every timer:at(*/var_window_minutes_rule1, *, *, *, *)] TERMINATED after var_window_minutes_rule1 minutes;
//By including an operator (>=) in the SELECT clause, the result of the statement becomes a boolean value which is then inserted into the history window.
//"Event(...)" means that only sessions matching the criteria in parentheses will be counted.
//Esper EPL maintains a count of matching events via count(). It will not retain the events (thus reducing memory requirements) unless a batch or sliding window is created.
//OUTPUT LAST WHEN TERMINATED prevents the SELECT clause result from being inserted until the context terminates.
//Newlines/indentation are not necessary but help make the rule easier to read. Note that the statement begins at CONTEXT and ends with the semicolon after TERMINATED.
CONTEXT contextTimePeriod_rule1
INSERT INTO thresholdHistoryWindow_rule1
SELECT
    count(*) >= var_session_threshold_rule1 as bool_exceeded,
    current_timestamp().toDate().toString() as window_ending,
    count(*) as matched_count
FROM Event(
    direction IS NOT NULL
    AND isOneOfIgnoreCase(error,{ 'no name' })
    AND direction.toLowerCase() IN ( 'inbound' )
    AND service IN ( 53 )
)
OUTPUT LAST WHEN TERMINATED;
//oneInSeconds=600 prevents alert spam in case the rule behaves unintentionally.
//match_recognize is used to ensure that the pattern is matched explicitly in the order described without any intermediate events being considered for the match.
//Modify the numbers in curly braces {#} in the pattern to set how many occurrences of each event are needed for the alert to fire.
@RSAAlert(oneInSeconds=600)
SELECT *
FROM thresholdHistoryWindow_rule1
MATCH_RECOGNIZE(
    MEASURES E1 as e1_data,
             E2 as e2_data
    PATTERN (E1{1} E2{3})
    DEFINE E1 as (E1.bool_exceeded = false),
           E2 as (E2.bool_exceeded = true)
);
Note: If you plan to use this rule in the online EPL tryout tool, EsperTech Esper EPL Online, you will need to use direction IN ( 'inbound' ) because the toLowerCase() function is custom (but built into the ESA deployment). The @RSAAlert annotation should be removed as well. Sample input for the tryout tool is attached but is not maintained to exactly match the rule above, so you may need to adjust it.
Also, you should understand how the @RSAPersist annotation behaves, especially if you use it in other ESA rules. See the ESA Annotations documentation.
After deploying rules with named windows, you can use the Named Windows tool under the ESA Rules tab of the UI to view the window contents. In this screenshot, I have a field called "windowEnding" but later changed the script above to "window_ending" for consistency.
Hopefully you learned something from this dissection of an ESA rule and can see how its concepts could be applied to other use cases. I thought it was important to share not only what works, but why certain methods were chosen and which methods didn't work. The use of contexts here instead of sliding or batch windows, as well as the match_recognize syntax, should provide a good deal of flexibility. If you implement a version of this rule, please share your use case and experience!