SBC Routing & Load Balancing Techniques

Quick Summary:

Session Border Controllers (SBCs) are vital for securely managing voice and video communications and for ensuring good quality of service. SBCs route voice and video calls according to rules for cost, priority, and compliance, while continuously checking the health of every link to provide the best possible call performance. SBCs provide load balancing at the signaling or media level, and they apply Quality of Service tagging and parameters so that real-time traffic (voice and video) is prioritized. In overload situations, SBCs apply admission control that lets current calls complete but limits new calls; new calls are either rejected or admitted according to priorities configured on the SBC. An SBC can also help meet compliance obligations, such as law enforcement directives or Do Not Disturb rules. Finally, SBCs provide near real-time session monitoring data and SLA-based automated failover to deliver reliable, compliant, and cost-effective communications.

Introduction

Session Border Controllers (SBCs) are important in voice and video networks because they sit at the edge of the network and handle security, data, media, and control information. In real-time voice and video sessions, two of their main functions are routing and load balancing. Together, routing and load balancing ensure that each session uses the optimal path, that costs are properly managed, and that high availability is achieved for many users. Environments are becoming increasingly complex: deployments span cloud and hybrid infrastructure, connect to one or more carriers, and mix voice and video over protocols such as SIP, RTP, and RTCP. In these environments, SBCs aggregate traffic and provide control and resilience. This article highlights important features of SBC routing and load balancing, along with best practices for robust and efficient communications.

Routing Basics

Routing Criteria

When the SBC receives an INVITE, it selects an outbound link based on the destination prefix, the caller identity, and configured policies. The dial plan maps destination prefixes to route groups made up of carriers or peer SBCs, each with an assigned cost value and priority value. To select the outbound path, the SBC considers cost, priority, and the codec capabilities of both the caller and the callee.

For example, if the call destination begins with country code +44, the SBC may select the route to Carrier A because it offers lower rates.

The SBC normally weighs all of these routing values, but for certain high-priority calls cost has no bearing on path selection; for such calls the SBC prefers the lowest-latency option. High-priority calls can include emergency calls or VIP traffic.
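The prefix-to-route-group mapping can be illustrated with a minimal Python sketch. The dial plan contents, the route group names, and the use of longest-prefix matching are all assumptions made for illustration; a real SBC has its own dial plan syntax and matching rules.

```python
from typing import Dict, Optional

# Hypothetical dial plan: destination prefix -> route group name.
DIAL_PLAN: Dict[str, str] = {
    "+44": "uk-carriers",
    "+4420": "london-carriers",
    "+1": "north-america-carriers",
}


def match_route_group(destination: str, dial_plan: Dict[str, str]) -> Optional[str]:
    """Return the route group of the longest prefix matching the destination."""
    best_prefix = ""
    for prefix in dial_plan:
        if destination.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix = prefix
    # dict.get returns None when no prefix matched (best_prefix is still "").
    return dial_plan.get(best_prefix)
```

With this sketch, a London number resolves to the more specific group, while any other +44 number falls back to the general UK group.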

Health Checks

SBCs run periodic health checks on their links, using either SIP OPTIONS messages or test calls, to measure current link metrics such as reliability, delay, packet loss, and jitter. If a link fails or degrades repeatedly, the SBC marks it unavailable and shifts traffic away so that calls avoid the poor-quality link.

The frequency of health check traffic should be tuned carefully, because faster failure detection comes at the cost of more processing overhead.
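One way to picture the "repeated failure" logic is a small state tracker that only changes a link's availability after several consecutive probe results. This is a sketch under assumptions: the class name, the threshold values, and the idea of requiring consecutive successes before recovery are illustrative and not taken from any particular SBC.

```python
from dataclasses import dataclass


@dataclass
class LinkHealth:
    """Tracks probe results for one link; thresholds are illustrative."""
    fail_threshold: int = 3      # consecutive failures before marking unavailable
    recover_threshold: int = 2   # consecutive successes before marking available
    available: bool = True
    _fails: int = 0
    _successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result (e.g. a SIP OPTIONS reply) and return availability."""
        if probe_ok:
            self._successes += 1
            self._fails = 0
            if not self.available and self._successes >= self.recover_threshold:
                self.available = True
        else:
            self._fails += 1
            self._successes = 0
            if self.available and self._fails >= self.fail_threshold:
                self.available = False
        return self.available
```

Higher thresholds make the link state more stable but slow down detection, which mirrors the tuning trade-off between detection speed and overhead.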

Geography and Compliance-Based Routing

Routing often also includes geographic or compliance rules. Calls to certain areas may be limited to specific licensed carriers, and for regulatory reasons the media of a call may need to stay within a designated region to meet data sovereignty requirements. New sessions are tagged with their compliance requirements at this stage, and those tags can affect routing decisions.

Least Cost and Priority Routing

Least Cost Logic

Least cost routing manages expenses by assigning a cost value to each link. The SBC ranks eligible carriers in order of cost and routes sessions to the cheapest carrier that passes its health checks. If that carrier fails its health checks, the SBC routes to the next cheapest carrier. Administrators can respond to carrier rate changes by updating the cost values so that routing reflects current rates.
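The ranking described above can be sketched in a few lines of Python. The `Carrier` fields and the example rates are hypothetical; the point is only that unhealthy carriers are filtered out before the cheapest one is chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Carrier:
    name: str
    cost_per_minute: float  # administrator-assigned cost value
    healthy: bool           # result of the most recent health checks


def ranked_routes(carriers: List[Carrier]) -> List[Carrier]:
    """Healthy carriers ordered from cheapest to most expensive."""
    return sorted((c for c in carriers if c.healthy), key=lambda c: c.cost_per_minute)


def least_cost_route(carriers: List[Carrier]) -> Optional[Carrier]:
    """Cheapest healthy carrier, or None if every carrier is unhealthy."""
    routes = ranked_routes(carriers)
    return routes[0] if routes else None
```

Because the ranking is recomputed from the current cost and health values, a rate change or a failed health check takes effect on the next call without any other change.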

Priority Overrides

Priority overrides allow specific traffic, such as emergency calls or high-priority customers, to bypass cost-based selection. For such traffic the SBC routes to the links with the best quality metrics, such as latency or packet loss. Appropriate policy and traffic tagging ensure that only eligible traffic is prioritized.
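A priority override can be thought of as switching the sort key used for link selection. The sketch below assumes a hypothetical `Link` record and ranks priority calls by latency first and packet loss second; the exact metrics and their ordering are a policy decision, not something fixed by the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    name: str
    cost: float
    latency_ms: float
    packet_loss_pct: float


def select_link(links: List[Link], priority_call: bool) -> Link:
    """Pick the best-quality link for priority calls, the cheapest link otherwise."""
    if priority_call:
        # Cost is ignored: rank by latency, then by packet loss.
        return min(links, key=lambda l: (l.latency_ms, l.packet_loss_pct))
    return min(links, key=lambda l: l.cost)
```

The `priority_call` flag stands in for whatever tagging the policy uses to identify eligible traffic.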

Dynamic Path Selection and Failover

Real Time Path Selection

Real-time path selection enables high availability. When a health check detects degradation, the SBC stops routing new sessions over paths that use the affected link. Degraded links can still carry low-priority traffic, while critical or premium calls are routed to unaffected links. When the link recovers, it is automatically returned to service. This minimizes disruption and preserves call continuity.
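The idea that degraded links stay usable for low-priority traffic but not for critical calls can be sketched with a simple state filter. The three link states and the function name are assumptions for illustration.

```python
from enum import Enum
from typing import Dict, List


class LinkState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def eligible_links(link_states: Dict[str, LinkState], critical: bool) -> List[str]:
    """Links a new session may use, given the current state of each link."""
    if critical:
        allowed = {LinkState.HEALTHY}
    else:
        # Low-priority traffic may still use degraded links.
        allowed = {LinkState.HEALTHY, LinkState.DEGRADED}
    return [name for name, state in link_states.items() if state in allowed]
```

When a health check moves a link back to `HEALTHY`, it reappears in both lists on the next call, which corresponds to the automatic return to service.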

Single Node Failover

In a single-node deployment, the primary SBC is paired with a standby (ideally a hot standby) that takes over if the primary fails. The two SBCs replicate state information for active sessions, including signaling state and media information. Dropped calls can be prevented if this state is replicated reliably and accurately. Administrators should regularly test failover and verify state replication.

Clustered Environments

In a clustered environment, multiple SBC nodes replicate session state as well as some resource metrics. A load balancer directs new sessions to the least-loaded node, and if a node fails, its peers pick up the active sessions without disruption to the organization or its users. This depends on state replication, so there should be a plan in place to minimize the data lost when replication lags or fails. Administrators should schedule regular failover tests to maintain cluster resiliency.

Load Balancing Types

Signaling Level Balancing

Signaling load balancing distributes SIP control traffic across SBC nodes. It can be implemented with DNS round robin or, at larger scale, with a front-end load balancer. A front-end load balancer inspects each SIP INVITE and sends it to an SBC node based on the node's number of active sessions or its CPU load. This approach balances call control traffic across nodes but leaves media out of consideration. Because it does not consider media load, a node that handles many active sessions, or that absorbs traffic after another node drops, can become a media bottleneck.
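A front-end load balancer's node choice can be sketched as a "least loaded" selection. The `SbcNode` fields are hypothetical, and using active sessions as the primary key with CPU load as a tie-breaker is one possible policy among several.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SbcNode:
    name: str
    active_sessions: int
    cpu_percent: float
    alive: bool = True


def pick_signaling_node(nodes: List[SbcNode]) -> Optional[SbcNode]:
    """Choose the live node with the fewest active sessions (CPU breaks ties)."""
    candidates = [n for n in nodes if n.alive]
    return min(candidates, key=lambda n: (n.active_sessions, n.cpu_percent), default=None)
```

Note that nothing in this selection looks at media channels or bandwidth, which is exactly the limitation of signaling-only balancing.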

Media Level Balancing

In media level balancing, the SBC handles both signaling and media on the same node. The SBC acts as a media relay between devices, which also lets it handle Network Address Translation (NAT) traversal where required. Load balancing decisions need to consider whether the SBC is transcoding, how many media channels are in use, and how much bandwidth is being routed. New calls are therefore sent to nodes with available resources, while overloaded nodes are bypassed. This method provides consistent media handling for the lifetime of the session, although it requires robust state synchronization to prevent orphaned media streams.

Choosing a Method

Signaling level balancing is easier to implement than media level balancing, and it suffices when an administrator deploys a SIP server with separate media gateways. Media level balancing works better for an integrated SBC deployment where media quality is important. In either case, administrators should understand the expected call volume, network topology, and Quality of Service (QoS) requirements in order to choose the appropriate method.
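A media-aware selection adds a capacity check before choosing a node. In this sketch the resource fields, the units, and the choice to prefer the node with the most free bandwidth are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MediaNode:
    name: str
    free_media_channels: int
    free_transcoding_slots: int
    free_bandwidth_kbps: int


def can_host(node: MediaNode, needs_transcoding: bool, bandwidth_kbps: int) -> bool:
    """True if the node has the media resources this call would consume."""
    if node.free_media_channels < 1:
        return False
    if needs_transcoding and node.free_transcoding_slots < 1:
        return False
    return node.free_bandwidth_kbps >= bandwidth_kbps


def pick_media_node(nodes: List[MediaNode], needs_transcoding: bool,
                    bandwidth_kbps: int) -> Optional[MediaNode]:
    """Among nodes that can host the call, prefer the one with the most free bandwidth."""
    eligible = [n for n in nodes if can_host(n, needs_transcoding, bandwidth_kbps)]
    return max(eligible, key=lambda n: n.free_bandwidth_kbps, default=None)
```

A call that needs transcoding may land on a different node than one that does not, even when both arrive at the same moment.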

Quality of Service Integration

QoS Tagging

SBCs tag SIP and RTP packets with DSCP values and/or VLAN IDs so that traffic can be prioritized. Common DSCP values are EF for voice packets and AF classes for video packets. Downstream devices read the tags and prioritize those real-time packets accordingly, which reduces latency and packet loss during congestion.
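To make DSCP tagging concrete, the sketch below shows how a DSCP code point maps onto the IP TOS byte and how an application could request that marking on a UDP socket using Python's standard `socket` module. EF is code point 46; AF41 (34) is used here as one common choice for video, which is an assumption rather than something the text prescribes. Whether the operating system and network honor the marking depends on platform and policy.

```python
import socket

DSCP_EF = 46    # Expedited Forwarding, commonly used for voice
DSCP_AF41 = 34  # An Assured Forwarding class; one common choice for video


def dscp_to_tos(dscp: int) -> int:
    """DSCP occupies the upper six bits of the IP TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def open_tagged_udp_socket(dscp: int) -> socket.socket:
    """Create an IPv4 UDP socket whose outgoing packets request the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

For example, `dscp_to_tos(DSCP_EF)` yields 184 (0xB8), the TOS byte value often seen in packet captures of voice traffic.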

QoS Monitoring

SBCs monitor packet loss, jitter, and delay continuously. If a metric violates its threshold, the SBC marks the path as degraded and does not allow new sessions to use it. Administrators define the thresholds, typically based on service level agreements or user expectations. QoS monitoring thus combines several metrics to drive dynamic routing decisions that steer calls away from poor-quality paths at call setup.
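Threshold evaluation can be sketched as a comparison of a measured sample against configured limits. The metric names and the threshold values below are placeholders; real values would come from the SLA or from user expectations.

```python
from typing import Dict, List

# Illustrative thresholds only; not recommendations.
THRESHOLDS: Dict[str, float] = {
    "packet_loss_pct": 1.0,
    "jitter_ms": 30.0,
    "delay_ms": 150.0,
}


def violated_metrics(sample: Dict[str, float],
                     thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Names of the metrics in the sample that exceed their threshold."""
    return [name for name, limit in thresholds.items()
            if sample.get(name, 0.0) > limit]


def is_degraded(sample: Dict[str, float]) -> bool:
    """A path is treated as degraded if any single metric violates its threshold."""
    return bool(violated_metrics(sample))
```

Returning the list of violated metrics, rather than only a boolean, makes it easier to explain in logs or alerts why a path was taken out of rotation.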

Session Admission Control

Admission Policies

Session Admission Control (SAC) prevents the SBC from being overloaded during surge events. Administrators define limits for concurrent sessions, CPU, memory, and media channels. When a limit is reached, the SBC rejects new calls with a SIP 503 (Service Unavailable) response and protects ongoing sessions. SAC policies can allow certain classes of service to bypass the limits as defined by business rules, and load shedding policies can throttle lower-priority traffic during extreme surges.
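An admission decision can be sketched as a check of current usage against the configured limits. In this sketch, priority traffic is handled by reserving part of the session capacity for it; that reservation scheme, the field names, and the limit values are assumptions, and only sessions and CPU are checked for brevity.

```python
from dataclasses import dataclass
from typing import Optional

SIP_SERVICE_UNAVAILABLE = 503


@dataclass
class AdmissionLimits:
    max_sessions: int
    max_cpu_percent: float
    reserved_for_priority: int = 0  # sessions held back for priority traffic


def admission_decision(active_sessions: int, cpu_percent: float,
                       limits: AdmissionLimits, priority: bool = False) -> Optional[int]:
    """Return None to admit the call, or a SIP status code to reject it."""
    if priority:
        session_cap = limits.max_sessions
    else:
        session_cap = limits.max_sessions - limits.reserved_for_priority
    if active_sessions >= session_cap or cpu_percent >= limits.max_cpu_percent:
        return SIP_SERVICE_UNAVAILABLE
    return None
```

Existing sessions are never touched by this check; it only governs whether a new INVITE is accepted.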

Coordinated SAC in Clusters

All nodes in a cluster share resource metrics through either a central resource store or a messaging bus. This coordination ensures that all nodes apply the same admission control policy, so that calls a busy node cannot accept are redirected to a node with available resources, which prevents inconsistent load distribution. Administrators must also watch synchronization latency, because stale metrics reduce the effectiveness of cluster-wide SAC decisions and can lead to underutilization of some nodes and overload of others.

Monitoring and Reporting

Real Time Dashboards

SBC dashboards report the number of concurrent sessions, media resource utilization, CPU and memory usage, call setup times, and overall Quality of Service metrics. Trend views help administrators identify hotspots and adjust policies proactively. Alerts notify the designated team when thresholds are breached, so that an operator or technician can remediate the problem before it significantly affects end users or customers.

Historical Analysis

The SBC stores call records, logs, and performance data that give operators detailed information about usage. Over time, these metrics reveal usage patterns, peak-period load, and peaks within carrier networks. This information feeds capacity planning and can demonstrate the need for additional SBC nodes or SIP trunks. Historical data is also useful in troubleshooting, where it helps correlate past events with performance issues that users report.

Integration with Management Tools

Integrating SBC metrics with network management systems via Simple Network Management Protocol (SNMP), syslog, or Application Programming Interfaces (APIs) provides an overall health view of the infrastructure. Tools such as Nagios, Zabbix, or Splunk can consume these metrics. Alerts can trigger scripts or programs that add SBC instances or reroute traffic to healthy systems based on current conditions. Combining capacity data with network measurements in this way helps the overall system meet its SLA performance targets.

Common Challenges and Traps

Misconfigured Routing Rules

Misconfigured routing rules can result in call blackholing, security vulnerabilities, or unexpected costs. Administrators should periodically audit and document their dial plans, checking routing across prefixes and carriers, so that they can fix issues before customers notice them.
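Part of a dial plan audit can be automated. The sketch below assumes a hypothetical dial plan structure (prefix mapped to an ordered list of carrier names) and flags two simple problems: prefixes with no carriers, which would blackhole calls, and references to carriers that are not defined. Real audits would cover more cases, such as overlapping prefixes or stale cost values.

```python
from typing import Dict, List, Set


def audit_dial_plan(dial_plan: Dict[str, List[str]],
                    known_carriers: Set[str]) -> List[str]:
    """Return human-readable findings for a prefix -> carriers dial plan."""
    findings: List[str] = []
    for prefix, carriers in sorted(dial_plan.items()):
        if not carriers:
            findings.append(f"{prefix}: no carriers assigned (calls would be blackholed)")
        for carrier in carriers:
            if carrier not in known_carriers:
                findings.append(f"{prefix}: unknown carrier '{carrier}'")
    return findings
```

Running a check like this whenever the dial plan changes gives an early warning before a misconfiguration reaches production traffic.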

State Synchronization Failures

State synchronization failures in clusters can lead to orphaned sessions and failed failovers. Administrators should test failover scenarios regularly, monitor replication latency, and confirm that state transfer actually works. Synchronization failures need immediate investigation to understand their impact on services.

Media Path Congestion

Balancing at the media level requires media paths to be aligned with the network topology. Routing that considers only signaling load can force media onto links that are congested. Congested media links cause packet loss and degrade call quality. Network teams are encouraged to map media flows, validate available bandwidth and service availability, and run periodic performance tests so that they can mitigate media congestion.

Best Practices

Keep Clear Dial Plans

Keep dial plans under version control and up to date with the latest changes to carriers, regulatory requirements, and network topology. Clear administrative policies help avoid future confusion about what normal operation looks like. Document cost values, priority rules, geographic rules, codec preferences, and similar settings.

Implement Comprehensive Health Checks

Use synthetic test calls that negotiate codecs and simulate real media flows. A ping test alone is not a sufficient serviceability test. A full suite of granular health check calls provides the most confidence in routing decisions when issues arise, and the greatest ability to keep calls off degraded links.

Set Conservative Admission Thresholds

Avoid oversubscription by setting conservatively engineered thresholds that leave room to grow. Tracking utilization against these thresholds provides useful data for future maintenance cycles, especially for adjusting limits as traffic grows. Implement auto scaling that provisions additional SBC instances as utilization approaches the maximum thresholds. In SAC policies, prioritize the most critical traffic so that essential calls are admitted first.

Automate Monitoring & Alerts

SBC metrics exposed through XML or API interfaces should always be fed into a centralized monitoring system. Configure alerts on these metrics for CPU, session count, packet loss, jitter, and synchronization status. Automated alerts help personnel respond quickly when things go wrong, and periodically reviewing metrics and thresholds keeps alerts meaningful and avoids alarm fatigue.

Conduct Regular Failover Drills

Simulate carrier outages, node failures, and traffic surges to validate that routing, load balancing, and failover mechanisms work as intended. Administrators should document the results of every failover or workaround drill and interpret the outcomes in light of the configuration in place at the time. A post-drill review is valuable, and better still if its findings are included in the configuration documentation.

Align QoS Policies End to End

Ensure that the DSCP or VLAN tags applied by your SBCs are honored consistently by every component in the network. Work with the network team to test your QoS marking and then the end-to-end behavior, to confirm that real-time traffic is protected first when links are congested.

Conclusion

Routing and load balancing on SBCs are fundamental to delivering secure, reliable, and scalable VoIP and SIP communication services. Mechanisms such as dynamic routing and continuous health monitoring, together with safety nets such as admission control and a QoS-enabled network, address both resource optimization and service quality. Best practices such as clear, documented dial plans, comprehensive health checks, and coordinated monitoring are the key to resilient SBC deployments that can meet evolving network needs and provide a consistent end-user experience.

Are You Prepared to Improve Your VoIP System?

At Sheerbit, we specialize in advanced VoIP development, custom SBC projects, and intelligent load balancing schemes that provide secure, highly available, high-performance communication systems. If you are managing high call volumes or need multiple carrier interconnects, real-time failover, or QoS compliance, our expert SBC engineers can develop, deploy, and support a tailored SBC solution for your organization.

Work with an SBC development company you can trust and deploy dynamic SBC routing, real-time monitoring, and session control systems that can be scaled securely and reliably.

Contact Sheerbit today if you are interested in hiring SBC specialists, and learn how our custom VoIP development services can help your enterprise save costs, improve uptime, and remain compliant.