Health checks serve as the first line of defense for ensuring data ingestion into Workbench. In most cases, you are responsible for resolving a failed health check that results in an unhealthy device, but there are a few exceptions.
About Health Checks
Types of Health Checks
There are three types of health checks:
- API health checks apply to devices with “direct” integrations to Workbench via an API connection.
  - These checks focus on ensuring there is a viable connection between your integration(s) and Workbench so that events can be ingested.
- Via SIEM health checks apply to devices that are connected to Workbench via a SIEM (because a direct API connection is not available), and there are two checks available.
  - The configuration check looks for the correct credentials and query substitution fields to verify the query syntax.
  - The parent SIEM check looks at the health of the parent SIEM device and updates the status of any child devices accordingly.
- Device Health Monitoring (DHM) health checks are done by Workbench to identify possible ingestion anomalies.
  - DHM leverages statistical forecasting.
  - See Device Health Monitoring (DHM) Philosophy to learn more about how this works.
Note
Health checks are not available for collector devices, and this will be reflected in the status shown for those devices.
Device Health Monitoring (DHM) Philosophy
When we are monitoring the health of your device, we are not just looking for gaps in connectivity (these are your API and via SIEM health checks) but also evaluating the overall pattern of data ingestion behavior. Our DHM health checks apply time-series forecasting models (Prophet and ARIMA) to determine what “healthy” or “normal” looks like for each device.
When a possible ingestion anomaly is identified, we triage the anomaly and work to determine if the model produced a false positive or a true positive. We will reach out if we identify a true positive.
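To make the idea of a learned "normal" concrete, here is a minimal sketch. It does not use Prophet or ARIMA and is not how Workbench implements DHM. As a simple stand-in, it compares each hourly event count against the mean and standard deviation of the preceding window. The function name, window size, and threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_ingestion_anomalies(hourly_counts, window=24, threshold=3.0):
    """Return indices of counts that fall outside the expected band.

    Illustrative stand-in for a forecasting model: the "forecast" is the
    mean of the previous `window` points, and the tolerance is
    `threshold` standard deviations of that same window.
    """
    anomalies = []
    for i in range(window, len(hourly_counts)):
        history = hourly_counts[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        # A perfectly flat history has zero spread; fall back to a small band.
        band = threshold * spread if spread > 0 else max(1.0, 0.1 * baseline)
        if abs(hourly_counts[i] - baseline) > band:
            anomalies.append(i)
    return anomalies


# Three days of steady volume, then one hour in which ingestion drops to zero.
steady = [100, 102, 98, 101, 99, 100, 103, 97] * 3
print(find_ingestion_anomalies(steady + [100, 0, 101]))
```

A flagged index is only a candidate. As described above, each possible anomaly is still triaged to decide whether it is a false positive or a true positive.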
Health Check Failures
If any health check fails, the status of the device (and any via SIEM child devices) immediately changes to unhealthy. You will also automatically be alerted to a change in device status via your default Workbench notifications (to a channel like Slack, Teams, Webhooks, etc.) if you have set up the integration for your organization. Email notifications can be enabled as well.
Because device health is part of your default Workbench notifications, it is not necessary to manually check Workbench for a device health status change after you have enabled organization notifications. You will receive an alert from Ruxie when the status changes.
Note
For via SIEM devices, the child device will become unhealthy if the parent device becomes unhealthy (and it will return to a healthy status only when the parent device is healthy). For example, if a Zscaler (via SIEM) device is connected to Workbench through a Splunk device, and the Splunk device becomes unhealthy, then the Zscaler (via SIEM) device will also become unhealthy.
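The parent and child relationship described in this note can be modeled in a few lines. This is a sketch of the rule only, not Workbench code; the `Device` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical model of a security device and its optional parent SIEM."""
    name: str
    own_check_passed: bool = True
    parent: Optional["Device"] = None

    def is_healthy(self) -> bool:
        # A device is unhealthy if its own check failed...
        if not self.own_check_passed:
            return False
        # ...or if the parent SIEM it connects through is unhealthy.
        return self.parent is None or self.parent.is_healthy()


splunk = Device("Splunk")
zscaler = Device("Zscaler (via SIEM)", parent=splunk)

splunk.own_check_passed = False
print(zscaler.is_healthy())  # the child follows the parent

splunk.own_check_passed = True
print(zscaler.is_healthy())  # the child recovers once the parent is healthy
```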
Recommended Preventive Device Health Tasks
While Workbench can detect when data ingestion stops, it cannot determine the root cause. To prevent unnecessary interruptions to data ingestion and the resulting unhealthy device status, we recommend you:
- Monitor for planned maintenance or configuration changes that might affect data ingestion
- Inform Expel of any scheduled downtime
View Device Health
There are multiple ways to view the health of your device. The Security Devices page (Organization Settings > Security Devices) is one of them, but you can also use the Alert Analysis Dashboard.
For a high-level snapshot of a device's connectivity status, use the Status column to check on the connection. If a problem exists, you will see a message here along with a timestamp for when the device became unhealthy.
- Most devices should indicate a Healthy API connection soon after the new device configuration has been saved. If this does not happen for a new device, your first step should be to check your device configuration for errors.
- See the Reference section for a full list of unhealthy statuses.
Note
You may occasionally see a status of "No alerts mode" which indicates that polling has been intentionally disabled or is not supported, and we are not ingesting data at this time. Contact support if you need help with this status or if it appears unexpectedly on a device.
Fix an Unhealthy Device
If you see an unhealthy message in the status column or receive a notification about an unhealthy device from Ruxie, you will usually need to resolve the issue yourself (there are a few exceptions). For via SIEM devices, you may need to fix the parent device in order to restore the child device to a healthy status.
To fix a newly added device that shows as unhealthy after onboarding:
- Check your device configuration for errors.
- If you are unable to find the issue within your configuration, or if your device becomes healthy but does not begin polling within 15 minutes and receiving data within 30 minutes, contact support for help.
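The two time windows in the last step can be written as a small check. This sketch assumes that both windows are measured from the moment the device became healthy, which this article does not state explicitly. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

POLL_WINDOW = timedelta(minutes=15)  # device should begin polling within this time
DATA_WINDOW = timedelta(minutes=30)  # device should be receiving data within this time


def should_contact_support(
    healthy_since: datetime,
    first_poll: Optional[datetime],
    first_data: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True when a healthy device has missed either window."""
    elapsed = now - healthy_since
    if first_poll is None and elapsed > POLL_WINDOW:
        return True
    if first_data is None and elapsed > DATA_WINDOW:
        return True
    return False


healthy_at = datetime(2024, 1, 1, 9, 0)
# 20 minutes after becoming healthy and still no polling: time to ask for help.
print(should_contact_support(healthy_at, None, None, datetime(2024, 1, 1, 9, 20)))
```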
To fix an existing device that was previously healthy, do one of the following:
- Follow the instructions given to you by Ruxie, or
- Go into the device details (see View Security Device Details for more information about this screen and how to access it) and look for a matching "unhealthy" banner, then select the small arrow to display the steps to fix. For via SIEM devices, you may need to also check the parent device.
Note
See the Reference for a full list of health statuses, error messages, and the general actions required to resolve each one.
Reference
Health Statuses You Must Resolve
The following chart shows some specific definitions for unhealthy statuses that must be resolved by you, and the required action(s) to take.
| Connection Type | Status Message | Meaning | Your Required Action(s) |
| API | Connection refused: Check that device is running and can be reached. | The security device refused our connection attempt. | View the security device details and validate the configuration. Verify that you can log into the device. Verify that your firewalls allow traffic to the device. |
| API | Failed to connect: Check device hostname is correct and Assembler has correct DNS server. | Workbench cannot connect due to a DNS lookup failure. | View the security device details and verify the device hostname is correct. Check the Assembler and verify it has the correct DNS server. |
| API | Device error: <error message>. If necessary, contact Expel for further assistance | The security device is reporting an internal error. | View the security device details and verify that it is polling and receiving data. Follow any troubleshooting procedures from the device vendor. |
| API | Failed to connect to device API: Check that server URL is correct. If necessary, contact Expel for further assistance. | We received an unexpected response when we tried to access the security device, and cannot connect. | View the security device details and verify that the server URL is correct. Verify that the server URL is using the correct port. |
| API | Device has invalid credentials. | We are unable to access the integration with the credentials supplied. | Edit the security device and update the credentials. |
| API | License expired. | The security device is reporting an expired vendor license and is preventing us from using it. | Renew the license with the security product vendor, or delete the device from Workbench if you are no longer using it. |
| API | Failed to connect: Check device address is correct. | Workbench cannot connect to the integration from the Assembler due to a network routing problem. | View the security device details and validate the configuration. |
| API | Device credentials don't have permission to connect. | The supplied credentials are valid, but Workbench does not have permission to perform the necessary actions. | Review the onboarding guide to ensure the correct permissions are assigned to the Expel user account. |
| API | Timeout problem: If necessary, contact Expel for further assistance. | Our attempt to retrieve data from the security device timed out. | View the security device details and validate the configuration. Verify that your firewalls allow traffic to the device. |
| API | Unsupported version:<error message>. | We detected that this security device is running a version we do not support. | Modify the version or service tier for the device. |
| via SIEM | <unique error message from device> | The SIEM’s query syntax is improperly defined due to incorrect credentials or query substitution fields. | Validate that the correct credentials and query substitution fields are used. |
| via SIEM | Parent SIEM is not active. | The SIEM through which this child (via SIEM) security device is connected to Workbench is unhealthy. | Go to the parent SIEM and resolve the unhealthy status message. This action will restore the child device to a healthy status. |
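Several of the API rows above describe distinct network failures: a refused connection, a DNS lookup failure, and a timeout. When you verify reachability from your own network, a short script can tell these apart. The following is a rough sketch using Python's standard `socket` module. The category names are assumptions made for this example and are not Workbench status values, and a result from your own machine will not necessarily match what Workbench or the Assembler sees.

```python
import socket


def classify_connection_error(exc: BaseException) -> str:
    """Map a socket-level exception to a rough failure category."""
    if isinstance(exc, socket.gaierror):
        return "dns_lookup_failed"    # hostname could not be resolved
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"   # host answered, but nothing accepts on that port
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"              # no answer in time, often a firewall dropping traffic
    if isinstance(exc, OSError):
        return "other_network_error"  # for example, no route to the host
    return "unknown"


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and return 'ok' or a failure category."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connection_error(exc)
```

For example, `probe("device.example.com", 443)` (a placeholder hostname) returns `"ok"` when a TCP connection succeeds. A successful TCP connection does not prove that the credentials or server URL are correct, so treat it as a first check only.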
Health Statuses Expel Must Resolve
The following statuses must be resolved by Expel.
| Connection Type | Status Message | Meaning | Expel's Required Action(s) |
| API | Device is rate limited: Please increase query rate limit for the device. | You have hit your rate limit. This may require further investigation. | Expel must investigate the cause of the rate limit, and work to resolve any issues or contact you directly to discuss options for an increase. |
| API | Unknown error: Expel will investigate and contact you if action is needed. | An unknown error has occurred. | Expel must investigate the root cause and work to resolve the error. |
| via SIEM | Unknown error: Expel will investigate and contact you if action is needed. | One or more of the detection queries for this (via SIEM) device is not working. | Expel must investigate and resolve the issue. |
| N/A | No alerts mode | Polling has been intentionally disabled or is not supported, and we are not ingesting data at this time. | This is an intentional status set by Expel. Contact support if you need help with this status or if it appears unexpectedly on a device. |
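If you route device health notifications into your own tooling, the split between the two tables in this Reference section can be expressed as a simple lookup. This is a sketch only. It matches on the leading text of the status messages listed above and assumes every other message is yours to resolve. The function name and return values are hypothetical, and a device-generated via SIEM message that happens to start with one of these prefixes would be misclassified.

```python
# Leading text of the statuses listed under "Health Statuses Expel Must Resolve".
EXPEL_RESOLVED_PREFIXES = (
    "Device is rate limited",
    "Unknown error",
    "No alerts mode",
)


def resolution_owner(status_message: str) -> str:
    """Return 'Expel' or 'customer' for a given unhealthy status message."""
    message = status_message.strip()
    # str.startswith accepts a tuple and returns True if any prefix matches.
    if message.startswith(EXPEL_RESOLVED_PREFIXES):
        return "Expel"
    return "customer"


print(resolution_owner("Device has invalid credentials."))
print(resolution_owner("Unknown error: Expel will investigate and contact you if action is needed."))
```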
AWS Errors
AWS users may also see the following errors as a result of health checks, and these errors must be resolved by you. The error itself will show in the Device Details within "Steps to Fix."
| Error (located in Device Details) | Status Message | Meaning | Your Required Action(s) |
| STS:AssumeRole ClientError: An error occurred (AccessDenied) when calling the AssumeRole operation. | Device has invalid credentials. | Expel does not have the permission to use AssumeRole. | Verify that the correct organization GUID is applied to the ExternalId in the IAM policy. Refer to the onboarding guide for assistance. |
| AWS_FILE_DECRYPT_FORBIDDEN | Device has invalid credentials. | The Expel role within your AWS environment does not have permission to decrypt the CloudTrail log file. | Verify the KMS:Decrypt permission is set to “Allow” in the KMS Key ARN for your policy. Verify the KMS:Decrypt permission is applied to the correct KMS Key ARN resource. |
| AWS_FILE_READ_FORBIDDEN | Device has invalid credentials. | The Expel role within your AWS environment does not have permission to read the CloudTrail log file. | Verify the get object permission is set to “Allow” in the IAM policy. Verify the object permission has the correct S3 bucket resource ARN applied. |
| AWS_INVALID_JSON | Device has invalid credentials. | The CloudTrail log files are not in JSON format. | Find the log files in your S3 bucket and make sure they are in JSON format. |
| AWS_LOG_FILE_NOT_FOUND | Device has invalid credentials. | No log file was found in the S3 bucket. | Make sure the notification service is connected to the correct S3 bucket. Confirm the log file was not deleted. |
| AWS_NON_TAR_FILE | Device has invalid credentials. | The CloudTrail log file is in the incorrect format. | Confirm the S3 bucket contains compressed JSON files ending with the .gz file extension. |
| AWS_NO_RECORDS_IN_S3_FILE | Device has invalid credentials. | No records were found in the S3 log file. | Check your configuration to be sure CloudTrail is writing logs to the correct S3 bucket. Confirm the log file was not deleted. |
| INVALID_SQS_MESSAGE | Device has invalid credentials. | The format of the SQS message is incorrect. | Check your SQS dashboard and verify that it has properly formatted messages. |
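Three of the errors above (AWS_NON_TAR_FILE, AWS_INVALID_JSON, and AWS_NO_RECORDS_IN_S3_FILE) concern the format of the log file itself, which you can check locally after downloading a file from the bucket. The sketch below assumes the standard CloudTrail layout of a gzip-compressed JSON document with a top-level `Records` array. It is not the check Expel runs, and the function name and return strings are hypothetical.

```python
import gzip
import json
import zlib


def check_cloudtrail_log(raw: bytes) -> str:
    """Inspect the bytes of one downloaded log file and describe any format problem."""
    try:
        payload = gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        # Not gzip data, or a truncated or corrupted archive.
        return "not gzip-compressed"
    try:
        document = json.loads(payload)
    except ValueError:
        # Covers both malformed JSON and undecodable bytes.
        return "not valid JSON"
    records = document.get("Records") if isinstance(document, dict) else None
    if not records:
        return "no records"
    return "ok"


sample = gzip.compress(b'{"Records": [{"eventName": "AssumeRole"}]}')
print(check_cloudtrail_log(sample))
```

To check a real file, read it in binary mode, for example `check_cloudtrail_log(open(path, "rb").read())`, where `path` points to a log file you downloaded from the bucket.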