
Troubleshooting Skype for Business: Advanced


Written by Dave Bottomley

Implementing Skype for Business into your organization is one project; maintaining and ensuring high performance of Skype for Business is an entirely different ask. Troubleshooting Skype for Business sits at the center of maintaining high performance. Recently we described where the main Skype for Business challenges lie and we also helped identify where you should start troubleshooting Skype for Business. Now let's look at more advanced Skype for Business troubleshooting using Prognosis.

Troubleshooting Skype for Business with Prognosis 

The Prognosis offering has three overlapping components: Testing, Path Insight, and Prognosis UC. To maximize your productivity while troubleshooting Skype for Business, let's navigate through all three of these components and demonstrate how you can leverage them for complete visibility in your environment.

1. Manage Endpoint Call Quality with Soft Phone Metrics

Soft Phone Host Metrics allows you to get a sense of the health of endpoints that are actively participating in UC activities like calls, conferences or screen sharing. Prognosis obtains these metrics using WMI, which can also be used to give visibility of the top running processes, tell whether IOPS are beyond acceptable ranges, show if the CPU is experiencing high utilization, or determine if the memory is saturated. We use this same access method to identify resource constraints on the endpoints (as long as they can be queried with WMI).
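As a rough illustration of the kind of check described above, the following Python sketch classifies one sample of endpoint metrics against thresholds. The field names and threshold values are assumptions made for this example, not Prognosis defaults, and in practice the sample would come from a WMI query rather than a hard-coded value.

```python
from dataclasses import dataclass


@dataclass
class EndpointSample:
    """One snapshot of endpoint health (hypothetical field names)."""
    cpu_percent: float
    memory_percent: float
    disk_iops: float


# Illustrative thresholds only; tune them for your own environment.
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_iops": 500.0}


def find_constraints(sample: EndpointSample) -> list[str]:
    """Return the names of metrics that exceed their threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if getattr(sample, name) > limit
    ]


if __name__ == "__main__":
    sample = EndpointSample(cpu_percent=97.0, memory_percent=62.0, disk_iops=120.0)
    print(find_constraints(sample))
```

Keeping the thresholds in one table makes it easy to add further metrics, such as network traffic, without touching the check itself.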

Above is an example screenshot showing a remote worker's metrics using Soft Phone Host Metrics in Prognosis. Here we have the ability to dig into the servers playing various roles in our Skype interface, giving insight into the health of each endpoint. On the right are the average CPU load, memory utilization, and network traffic. On the left is a list of the 20 busiest processes. We can obtain this visibility as long as we have access and a clear path to the network.


In the above screenshot, we can see that the user was on a MacBook Air with a 2011 Skype client and the computer's built-in microphone and speakers. Aside from being located off-campus, the user has a number of components that could be changed to improve the experience. This is an example of the insight we can get on endpoints, which rounds out the core Prognosis visibility as illustrated in the blue circle on the Venn diagram at the beginning of this post (in this case allowing us to see the health of the servers and infrastructure along with the health of the endpoint itself). The devices plugged into the endpoint shouldn't be overlooked either, because they can improve or degrade the experience.

2. Extend Visibility with Path Insight

The Path Insight module extends the visibility from the application layer down into the network layer, including OSI layers 1 through 4. This gives you the forensic evidence necessary to initiate a dialog with the network team if you're not in a position to manage the network yourself. With this visibility, you can demonstrate that your server infrastructure appears solid and that your end users aren't doing anything that would be cause for alarm. The smoking gun can be isolated to the performance of the network.

Prognosis culls data from Active Directory user objects and uses WMI for performance data on endpoints and server components. It also pulls CDR and QoE metric data out of the monitoring server and stitches that together with data from the SDN API interface. Additionally, the Path Insight module uses SNMP v1, v2 or v3 for network components (the vast majority of organizations still use SNMP v2). We source all our atomic data using these methods, and we currently do not reuse existing caches or data. While it is possible to pull data from additional sources, this would be done through an extended solution.

In this screenshot, you can see the internal network hops for a call, where one hop took 235 milliseconds. A recent addition to the hop diagrams shows at a glance how much loss a call has experienced. Hovering the mouse cursor over a monitored router displays a tooltip with packet errors and packet loss. Within the Path Insight module, you can dig into each device (in this case, a router in Sydney) to see if there are any potential issues that need to be addressed. This is the kind of useful forensic evidence that UC administrators should show to people on the network team.
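To make the hop-by-hop idea concrete, here is a small Python sketch that picks out the hop contributing the most latency on a call path. The `Hop` structure, the device names, and every figure except the 235 ms value are invented for illustration; this is not how Path Insight stores or exposes its data.

```python
from typing import NamedTuple


class Hop(NamedTuple):
    """One network hop on a call path (hypothetical structure)."""
    device: str
    latency_ms: float
    loss_percent: float


def slowest_hop(hops: list[Hop]) -> Hop:
    """Return the hop that contributed the most latency."""
    return max(hops, key=lambda hop: hop.latency_ms)


def lossy_hops(hops: list[Hop], max_loss_percent: float = 1.0) -> list[Hop]:
    """Return hops whose packet loss exceeds an illustrative limit."""
    return [hop for hop in hops if hop.loss_percent > max_loss_percent]


if __name__ == "__main__":
    path = [
        Hop("edge-router", 4.0, 0.0),
        Hop("sydney-router", 235.0, 2.5),
        Hop("core-switch", 6.0, 0.1),
    ]
    print(slowest_hop(path).device)
    print([hop.device for hop in lossy_hops(path)])
```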

Discover Why the Problem Occurred, and How to Fix It

Aside from the graphical depiction, the network prescription shows the possible reasons why you are encountering the error rate. Digging into the advanced stats, you will see which reading is causing the flag to occur along with what you should do to fix the problem. In this case, a dramatic increase in single collision frames might be caused by an overloaded network segment. In terms of server health, network health, and endpoints, this represents the network visibility component (green circle) of the Venn diagram, allowing you to see what occurs on the specific ports supporting UC.
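A "dramatic increase" in a collision counter can be expressed as a comparison between the latest polling interval and the earlier ones. The Python sketch below assumes the readings are successive values of a 32-bit SNMP counter (for example, the EtherLike-MIB object dot3StatsSingleCollisionFrames) polled at a fixed interval. The spike factor of 5 is an arbitrary illustrative choice, and this is not a description of how Prognosis computes its flags.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous: int, current: int) -> int:
    """Difference between two Counter32 readings, allowing for one wrap."""
    return (current - previous) % COUNTER32_MODULUS


def collision_spike(readings: list[int], factor: float = 5.0) -> bool:
    """Flag a spike when the latest interval's increase exceeds `factor`
    times the average increase over the earlier intervals."""
    if len(readings) < 3:
        return False  # need at least two intervals to compare
    deltas = [counter_delta(a, b) for a, b in zip(readings, readings[1:])]
    baseline = sum(deltas[:-1]) / len(deltas[:-1])
    # With a zero baseline, any increase at all counts as a spike.
    return deltas[-1] > factor * baseline


if __name__ == "__main__":
    print(collision_spike([100, 110, 120, 500]))
    print(collision_spike([100, 110, 120, 130]))
```

Working on deltas rather than raw values matters because SNMP counters only ever increase (until they wrap), so the raw number says little about what happened in the most recent interval.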

The Issues tab is a recent addition. Items with a 'c' contain a hyperlink that explains why there may be a misconfiguration with that particular device. (The more devices we have in our inventory, the quicker we can stitch together, hop by hop, an overview of the performance.) The Gremlins tab allows you to take a more filtered view of issues, for instance, isolating results to the last 30 minutes for all interfaces on all devices. This is a very useful way to see what is going on in the network. Call Simulator is a tool that comes with Path Insight, which allows you to run an end-to-end test, a link troubleshooting test, an RTP transmitter/receiver test, a UDP firewall test, or a test to see if DSCP is being stripped.
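The idea behind a DSCP stripping test can be sketched in a few lines of Python: compare the DSCP marking on a packet as sent with the marking on the same packet as received. The function names are hypothetical and this is not how Call Simulator is implemented; the only fixed facts used are that DSCP occupies the upper six bits of the IP ToS/Traffic Class byte and that Expedited Forwarding (value 46) is a common marking for voice media.

```python
EF_DSCP = 46  # Expedited Forwarding, commonly used for voice media


def dscp_from_tos(tos_byte: int) -> int:
    """Extract the 6-bit DSCP value from an IPv4 ToS / IPv6 Traffic Class byte."""
    return (tos_byte & 0xFF) >> 2


def dscp_was_stripped(sent_tos: int, received_tos: int) -> bool:
    """True when the DSCP marking changed between sender and receiver."""
    return dscp_from_tos(sent_tos) != dscp_from_tos(received_tos)


if __name__ == "__main__":
    sent = EF_DSCP << 2          # ToS byte 0xB8 carries DSCP 46
    print(dscp_was_stripped(sent, 0x00))  # marking reset to best effort
    print(dscp_was_stripped(sent, sent))  # marking preserved
```

The lower two bits of the byte carry ECN, so the comparison deliberately ignores them.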

3. Generate Real Traffic with Active Testing

Testing from the outside-in shows what happens when people try to use your system the way it's intended to be used. Active testing creates real traffic using the public telephone network as the access method. Let's cover the two types of test activity, StressTest load/performance testing and HeartBeat availability/experience testing, in detail. StressTest is typically used for new implementations or after something has changed within an environment, such as a patch update or version upgrade. Stress testing generates large loads into the system in a controlled fashion so you can confirm that the system is 100% connected to the telephone network and all SIP pipes are available.

Verify that all Channels are Working

If your system should be able to receive 1,000 concurrent calls, stress testing makes sure that every single one of those lines is connected and accessible without having to find 1,000 people to make calls at the same time. Stress testing performed in a controlled fashion will demonstrate whether or not the system is capable of handling production loads. In a Skype for Business environment with potentially tens of thousands of numbers, a dial plan ringout can be an effective and efficient way of confirming that the numbers are properly defined. You can make a telephone call (whether it's coming in as a DID or running off an internal environment) with auto attendant functionality to replace an army of people that would otherwise have had to call every one of those numbers manually.
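In outline, a dial plan ringout is a loop over every number in the plan that records which ones fail to connect. The Python sketch below shows that shape only. `place_call` is a hypothetical callable standing in for whatever actually generates the call, and the numbers are made up.

```python
from typing import Callable, Iterable


def ring_out(numbers: Iterable[str], place_call: Callable[[str], bool]) -> list[str]:
    """Call every number once and return the ones that did not connect.

    `place_call` is a hypothetical callable supplied by the caller; it should
    return True when the call connected as expected.
    """
    failed = []
    for number in numbers:
        if not place_call(number):
            failed.append(number)
    return failed


if __name__ == "__main__":
    # Stand-in for a real call generator: pretend one extension is misconfigured.
    def fake_place_call(number: str) -> bool:
        return number != "5550102"

    dial_plan = ["5550100", "5550101", "5550102"]
    print(ring_out(dial_plan, fake_place_call))
```

Passing the call function in as a parameter keeps the loop testable without touching a telephone network.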

Ensure Channels Continue to Function Reliably

Once all the channels on your system are up and running, you need a way to make sure everything continues to function as intended. HeartBeat availability and experience testing can check at periodic intervals (e.g., every 20 minutes) depending on the urgency of the situation. An automated process makes a real telephone call from the outside-in. After, for example, the call attempts to log into a conference bridge and create an open bridge for participants to join, HeartBeat will periodically interact with the system as intended and trigger alerts if necessary. It is a completely automatic way to make sure your systems continue to be accessible. If HeartBeat detects an issue, it will generate an alarm that brings you directly into the website.
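The periodic check described above can be pictured as a probe that runs on a fixed schedule, with every failed run raising an alarm. The following Python sketch is an assumption-laden illustration rather than HeartBeat's actual mechanism: `probe` is a hypothetical callable that returns True when the interaction succeeded, and the schedule is computed instead of slept through so that the example finishes immediately.

```python
from datetime import datetime, timedelta
from typing import Callable


def run_drill(
    probe: Callable[[], bool],
    start: datetime,
    runs: int,
    interval: timedelta = timedelta(minutes=20),
) -> list[tuple[datetime, bool]]:
    """Run `probe` once per scheduled slot and record each result.

    A real scheduler would wait `interval` between runs; here only the
    scheduled timestamps are computed.
    """
    return [(start + i * interval, probe()) for i in range(runs)]


def alarms(results: list[tuple[datetime, bool]]) -> list[datetime]:
    """Return the scheduled times at which the probe failed."""
    return [when for when, ok in results if not ok]


if __name__ == "__main__":
    outcomes = iter([True, False, True])  # simulated probe results
    results = run_drill(lambda: next(outcomes), datetime(2024, 1, 1, 9, 0), runs=3)
    print(alarms(results))
```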

In a typical environment, there might be several running HeartBeat interactions, each accessing different elements of the environment. One might be making sure voicemail is running while another is making sure the remote branch connection is still working. Each of those periodic interactions is called a drill, which is represented by the horizontal bars in the screenshot above. The screenshot shows five different drills, with the bottom drill made up of a completely red bar, indicating that this conference bridge needs to be investigated. The donut at the top is the summary status of everything happening under this particular login. The donut and bars demonstrate the integration of Prognosis and show how the testing products intermix with Path Insight to give you a complete picture of what is happening in your environment.

Clicking on any bar will show the details of that particular drill (pictured above). The spreadsheet at the bottom shows a list of every call that has been made over a given time period, giving you a view of which calls were successful (red indicates an error). Clicking on a spreadsheet item shows the details of what happened in that particular call:

This was supposed to be a multistep call, but it got dropped after a short period of time. To see how often this is happening across the board, we click on the green rectangle labeled Step1.

The scatter diagram above shows the history over the past week, including the types of errors. This historical data allows us to see if this is a common problem, which is useful while troubleshooting. For systems that are instrumented with Prognosis, every telephone in the environment is tagged, which allows Prognosis-trained engineers to dig in precisely to get information about what was happening on that particular phone call.

The screenshot above shows the Call Details in which HeartBeat is interacting with the system and actively looking for issues. Instead of taking a day or more for somebody to bring these problems to your attention, an automated process like HeartBeat will let you know as soon as they occur. The issue is tagged and takes you right into Prognosis so that you can get data about the problematic telephone call. There are multiple ways of tracking issues down, including using filters to look through your recent call history. If you know the SIP URI, you can simply filter by that number to see how the recent calls have fared. (This is where you can use the hop information to map the call and determine where any issues may lie.) The testing component refers back to the red circle in the Venn diagram and completes the illustration showing the overlap between testing, Prognosis UC, and Path Insight, giving you the ability to navigate in any of these directions using a single tool.
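Filtering recent call history by SIP URI amounts to selecting the calls in which that URI was one of the parties and then looking at how they fared. The Python sketch below uses a made-up `CallRecord` structure and example URIs; it does not reflect the Prognosis data model.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """Minimal call record (hypothetical fields for illustration)."""
    caller_uri: str
    callee_uri: str
    succeeded: bool


def calls_for_uri(history: list[CallRecord], sip_uri: str) -> list[CallRecord]:
    """Return calls in which `sip_uri` was either party (exact match)."""
    return [
        call for call in history
        if sip_uri in (call.caller_uri, call.callee_uri)
    ]


def success_rate(calls: list[CallRecord]) -> float:
    """Fraction of calls that succeeded; 0.0 for an empty list."""
    if not calls:
        return 0.0
    return sum(call.succeeded for call in calls) / len(calls)


if __name__ == "__main__":
    history = [
        CallRecord("sip:alice@example.com", "sip:bob@example.com", True),
        CallRecord("sip:carol@example.com", "sip:alice@example.com", False),
        CallRecord("sip:bob@example.com", "sip:carol@example.com", True),
    ]
    mine = calls_for_uri(history, "sip:alice@example.com")
    print(len(mine), success_rate(mine))
```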

Unified Communications is more than Just Skype for Business

In the Microsoft world, everyone uses Skype out of the box, which gives you the ability to monitor everything in the environment using Skype for Business internally. Unfortunately, the world is more complex than that: a unified communications environment is built from many other pieces that allow us to communicate, including Cisco components, Avaya components, routers, and other monitoring software. Although Skype does provide useful out-of-the-box information, we provide monitoring of the whole world of unified communications. When you combine Skype's out-of-the-box capabilities with Prognosis' UC capabilities, you have much more visibility.

Visibility is also more than just having access to data; it's about being able to tell the story of why you're having a problem. A dictionary contains every single word you could ever find in a novel, but it doesn't tell a story. A novel, on the other hand, tells a story from end to end. (You've probably never seen someone sitting at the fireplace, curled up with a good dictionary.) Data isn't useful unless it's correlated piece by piece to tell the story of your environment. People want to sit down and read an engrossing novel because it tells a story, and that's exactly what Prognosis does.

