This is a game changer, and this will be a long blog post. Cisco is flipping the script on QoS. Quality of Service – will now become Quality of Experience. This isn’t a marketing term either. Come along for a ride as I explain.
First some references: the amazing team at Tech Field Day – www.techfieldday.com – and the Cisco team who presented at Tech Field Day Xtra at Cisco Live this year provided so much insight. As I talk about this, I will provide some links to videos, or specific parts in that presentation. Some of my graphics have been pulled from that content. Tim Szigeti is an amazingly knowledgeable professional and a true leader in the field, and Ramit Kanda provides an amazing demo of this great new technology.
A history lesson…
QoS… Since the day I took the Cisco CVOICE course, I have been learning about protocols and methods of quality of service. The construct is simple – we need important stuff to go first. Quickly this became a topic even the top network professionals – CCIEs – couldn’t handle.
Cisco Enterprise has a vision: “Transform our customers’ businesses through powerful yet simple networks” – powerful.. yes.. simple.. not so far…
As networks became constricted in bandwidth (mostly in the WAN) we needed a way to constrain less important traffic. The start of QoS was in the VoIP world – as people like me (hard core telephony guys from the TDM days) started to work on VoIP, we wanted circuit switched performance over packet switched networks. Zero packet drops, little jitter and delay.
We started with ToS (Type of Service) – a small field in the IPv4 header that gave us some bits we could set. 3 bits should be enough for anyone – yeah right, just like “640KB should be enough for anyone”. For most enterprises 8 classes is enough – but for service providers, not so much.
Then there were vendors who treated ToS and DSCP bits differently, or put them into different queues.
QoS is second only to routing in the network when it comes to adoption – but how many customers are deploying it properly? Stay with me – we have new tools for you.
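To make the "3 bits" point concrete, here is a minimal sketch of how the same header byte can be read two ways: as the original 3-bit IP Precedence, or as the 6-bit DSCP that DiffServ later defined over it. The function names are mine, purely for illustration.

```python
def ip_precedence(tos_byte: int) -> int:
    """Top 3 bits of the ToS byte: the original IP Precedence value (0-7)."""
    return (tos_byte >> 5) & 0b111


def dscp(tos_byte: int) -> int:
    """Top 6 bits of the same byte: the DiffServ code point (0-63)."""
    return (tos_byte >> 2) & 0b111111


# EF (Expedited Forwarding), commonly used for voice, is DSCP 46.
# Shifted into the byte it becomes 0b10111000 = 184.
EF_TOS_BYTE = 46 << 2

print(ip_precedence(EF_TOS_BYTE))  # 5
print(dscp(EF_TOS_BYTE))           # 46
print(2 ** 3, 2 ** 6)              # 8 64 -> possible values with 3 vs 6 bits
```

A device that only looks at the top three bits sees "precedence 5", while a DSCP-aware device sees "46", which is one reason two boxes can treat the same packet differently.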
“It takes [us] 4 months and $1M to push a QoS Change… ” says a Wall Street Financial company.
“It took us 3 months to deploy a 2 line ACL change across 10K devices, which slowed down onboarding of our Jabber application” – says a Cisco Network Architect
QoS is Too Hard
“With QOS – the #1 TAC case report – is missing or incorrect classification and marking” – says Tim Szigeti – Cisco Systems
In a recent group of CCIEs, and some others whom I also respect greatly for their knowledge, they all agreed “QoS is too difficult” – just get more bandwidth. Let me provide some illustration. This is the way a 2P6Q3T router would classify these categories into queues.
As I go across my network – each device I have has a different QoS architecture
Let me save you – don’t bother reading the below graphic – you get the point. Can you, as a professional, trap and trace a packet as it flows across the network to ensure it is getting the treatment you want? Can you design how to deploy a new application into this many different queuing mechanisms? Do you even want to?
What if I wanted to provide QoS for all 1400 applications that a network device supported?
Here is a hint: you don’t want to do that.
“We have done more to advance QoS technology in the last year, than in the last 10” says Tim Szigeti from Cisco Systems.
So Cisco made it better – but this is still too much.
Cisco Validated Design – classification, marking, and best practices – 2 lines of code. This is a huge day for QoS design. This will be consistent across ROUTERS AND SWITCHES – all products, all lines. So even if you are doing this in the CLI this is good news. Cisco is moving to a single design in hardware as well in the future. 5 queuing structures will be the future – but still only a single reference design. Why can they not create a single structure? Cost. However, now there is a reference design.
More Bandwidth Does Not Solve It!
HOLD THAT THOUGHT – No, more bandwidth does not solve QoS problems. It might sound like it does on the surface – let’s dig down a bit.
“Bandwidth and Utilization is not an accurate way of assessing if there is a QoS Problem” – says Tim Szigeti of Cisco Systems
- Security – As a construct, QoS has a place: we can limit risky, questionable or scavenger traffic so that it cannot overwhelm our network and shut us down, and we can slow the spread of attacks.
- Cost – You cannot simply add bandwidth forever – your costs would simply continue to go up and up. On that note, until now, it has been cheaper to deploy more bandwidth than configure QoS – in some situations, but that does not address the security concern or….
- Buffers – That’s right, buffers. Micro bursts – even with the highest performance switching ASIC – at 1% port utilization, with a micro burst we could see traffic being dropped.
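The buffer point is easy to check with back-of-the-envelope arithmetic. Below is a rough sketch with made-up numbers (a 10 Gbps egress port, a 2 MB buffer, and four 10G senders bursting together for 2.5 ms); none of these figures come from the presentation, they are only there to show how 1% average utilization and drops can coexist.

```python
def microburst(egress_bits_per_us, burst_bits_per_us, burst_us,
               buffer_bits, window_us=1_000_000):
    """Return (average utilization, peak backlog in bits, dropped bits)
    for one burst inside a measurement window (default: one second)."""
    offered_bits = burst_bits_per_us * burst_us
    avg_utilization = offered_bits / (egress_bits_per_us * window_us)
    # While the burst lasts, traffic arrives faster than the port can drain it.
    peak_backlog = max(0, (burst_bits_per_us - egress_bits_per_us) * burst_us)
    dropped_bits = max(0, peak_backlog - buffer_bits)
    return avg_utilization, peak_backlog, dropped_bits


# Hypothetical numbers: 10 Gbps = 10_000 bits/us, 40 Gbps = 40_000 bits/us.
util, backlog, dropped = microburst(
    egress_bits_per_us=10_000,
    burst_bits_per_us=40_000,
    burst_us=2_500,
    buffer_bits=16_000_000,  # 2 MB of buffer
)
print(f"average utilization: {util:.0%}")  # average utilization: 1%
print(f"peak backlog: {backlog} bits")     # peak backlog: 75000000 bits
print(f"dropped: {dropped} bits")          # dropped: 59000000 bits
```

The utilization graph for that second shows 1%, yet the port still had to throw traffic away, which is why a utilization number alone cannot tell you whether you have a QoS problem.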
Cisco DNA – Automation
If you recall in my recent article we talked about automation being at the heart of DNA. If we want to make things simpler, automation is the only answer.
Wait a second – isn’t this SDN? No, this is automation! Most SDN solutions – including Cisco’s own ACI – involve a forklift upgrade.
Cisco APIC-EM for QoS works with existing networks (brownfield!). You can even back out of an APIC-EM EasyQOS deployment at any time. So if you deploy EasyQOS as I am about to show you, but decide afterwards that you do not like it, you can remove it. Even if you made other network changes later, it tracks every single change and will set back exactly what it changed for QoS, and QoS only.
“People that are really serious about software should build their own hardware” (Alan Kay – 1982) that is why Cisco developed the UADP (Unified Access Data Plane – Code Name Doppler) and the QFP (QuantumFlow Processor – Code Name Yoda)
This is all about controlling and automating that high performance hardware and pushing that configuration in a consistent way down to the network
Wait a second, did you not say many of the queue architectures are different? How do you address that?
EasyQOS – The APIC-EM Secret Weapon for Quality of Experience
Why is this important – the idea is simply this. EasyQoS will allow you to program BUSINESS INTENT in your network. You tell the EasyQoS application in APIC-EM how you want traffic to be treated, classified and prioritized. The APIC will figure out how to apply that business intent – against all of the various QoS architectures in the routing and switching platforms that you have.
QoE via EasyQOS – How It Works
It goes without saying – this is an APIC-EM app. So – go and get APIC-EM installed, and then come back.
The key architectural thing you need to understand is – 3 policy constructs are used here, to abstract 12 classes. You will see that in a minute.
Step 1: Create a scope
Create a scope in the APIC-EM, and then add the appropriate devices to the scope.
Step 2: Define Applications
Within EasyQOS there are 1300+ applications that are pre-defined, plus you can define your own applications based on a variety of factors.
Each application is assigned a traffic class.
You really want to create “Favourites” here. Within the interface you can “star” your applications to mark them as favourites, which is a good way to track which apps you are actually creating policies for.
Step 3: Define Policy
We need to apply these applications to a policy, within the policy we have classes of traffic – but think of this as business intent – not QoS.
There are three basic classes. You simply drag and drop each application into each policy.
- Business Relevant – This has 10 classes within it based on the application, but do not worry, the APIC will automatically assign the business relevant apps to an appropriate class. This is all under the covers.
- Default – Traffic you don’t really care about, this is your Best Effort class
- Business Irrelevant – This is your scavenger class
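To picture how three policy constructs can abstract 12 classes, here is a minimal sketch. The post does not list the ten business-relevant classes, so the class names and DSCP markings below are an assumption based on the RFC 4594-style 12-class model; the function name and intent strings are hypothetical and are not the EasyQOS API.

```python
# ASSUMPTION: class names and DSCP values follow an RFC 4594-style
# 12-class model; they are not taken from the post or from APIC-EM itself.
BUSINESS_RELEVANT_CLASSES = {
    "voip-telephony": ("EF", 46),
    "broadcast-video": ("CS5", 40),
    "real-time-interactive": ("CS4", 32),
    "multimedia-conferencing": ("AF41", 34),
    "multimedia-streaming": ("AF31", 26),
    "network-control": ("CS6", 48),
    "signaling": ("CS3", 24),
    "ops-admin-mgmt": ("CS2", 16),
    "transactional-data": ("AF21", 18),
    "bulk-data": ("AF11", 10),
}


def marking_for(intent, traffic_class=None):
    """Map a business intent (plus the app's traffic class) to a marking."""
    if intent == "business-irrelevant":
        return ("CS1", 8)   # scavenger
    if intent == "default":
        return ("DF", 0)    # best effort
    if intent == "business-relevant":
        return BUSINESS_RELEVANT_CLASSES[traffic_class]
    raise ValueError(f"unknown intent: {intent}")


print(marking_for("business-relevant", "voip-telephony"))  # ('EF', 46)
print(marking_for("default"))                              # ('DF', 0)
print(marking_for("business-irrelevant"))                  # ('CS1', 8)
print(len(BUSINESS_RELEVANT_CLASSES) + 2)                  # 12
```

The operator only ever chooses one of the three intents; the class lookup is the part that stays under the covers.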
Step 4: Apply Policy
The policy uses various types of connections, today it uses SSH – and YES you can validate the commands before they are sent.
Any interface changes are detected by SNMP, or through polling every 30 minutes in case you change things by hand. The changes are sent out immediately.
If during the provisioning you realise something is wrong, or something fails – the APIC tracks every transaction on every device. You can abort a provisioning half way through – and it will back out each individual change.
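The "tracks every transaction and backs out each change" behaviour is essentially a journal with undo. Here is a minimal sketch of that idea using an in-memory stand-in for a device; the `Device` class, `provision` function and config strings are all hypothetical and say nothing about how APIC-EM implements it internally.

```python
class Device:
    """Hypothetical in-memory stand-in for a router or switch."""

    def __init__(self, name, fail=False):
        self.name = name
        self.qos_config = "legacy-policy"
        self.fail = fail

    def push(self, config):
        if self.fail:
            raise RuntimeError(f"{self.name}: push failed")
        self.qos_config = config


def provision(devices, new_config):
    """Apply new_config to every device; on any failure, restore the
    devices already changed, most recent first. Returns True on success."""
    journal = []  # (device, previous config) for each successful change
    try:
        for dev in devices:
            previous = dev.qos_config
            dev.push(new_config)
            journal.append((dev, previous))
    except RuntimeError:
        for dev, previous in reversed(journal):
            dev.qos_config = previous
        return False
    return True


fleet = [Device("sw1"), Device("rtr1", fail=True), Device("sw2")]
print(provision(fleet, "easyqos-policy"))  # False
print([d.qos_config for d in fleet])       # ['legacy-policy', 'legacy-policy', 'legacy-policy']
```

Because only the QoS changes are journaled, the undo touches nothing else on the device, which matches the "QoS and QoS only" promise described earlier.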
Now we have this running. We have some other cool tools that make our life easier.
The first is a history engine: any changes are tracked so you can see the changes in the policy over time. If you make changes and then realise you had an adverse effect, a simple fix is to hit “Rollback” – keep in mind, this could be 500 devices on the network. The old way, you spend a month making QoS changes, only to realise those changes are detrimental, and then spend a month removing them. In APIC-EM you can make, and roll back, these types of changes in literally minutes. Huge cost and time savings here.
The second one is pretty crazy sounding: for VoIP and video, we cannot always identify the traffic by application, because the flows are encrypted or dynamic.
So the way this works is – Jabber or Lync sends a call setup – the APIC is informed of this call, and the APIC sends a NEW QoS policy — for just that call — to all the network devices in the path.
If you are reading this and thinking “So you are telling me my QoS Config is going to be modified every time someone makes a call” — Yes that is exactly what I am saying. I am not sure I am on board with this idea – that is a lot of dynamic network changes. Cisco says “it works!”
Show Me The Money – Path Flow Analysis
This is the most compelling part of APIC-EM EasyQOS. Bar None – Hands Down – Mic Drop.
You can perform Path Flow Analysis, on every device – instantly.
- Including interface stats
- QOS Stats
- ACL Rules blocking traffic
Step 1: Input the path trace data
Step 2: Flow Visibility
Prepare to be blown away. Here is the application flow. It even looks inside CAPWAP tunnels. If you had to do this by hand, you would have to do it per flow, on every single device. Setting this up alone would take you hours; then you would have to analyze the data, and then remove that config.
The APIC-EM does all of this for you – in seconds.
Device health, performance stats, packet loss, DSCP values, jitter, even routing protocol information. Router CPU level, memory use. If you are troubleshooting a network – this is literally gold. “All hail the packet – for it runs on the network” – did Denise Fishburne herself call someone up and help them build this? They should call this the APIC-EM Network Detective!
Here is a great example of an ACL block – imagine if you had 200-300 ACLs on this device, finding the one that is causing problems would take you forever.
Even Asymmetric Flows. Every device, every hop. Even if you didn’t use EasyQOS this is worth the time to deploy APIC-EM.
Watch the last few minutes of our video from Tech Field Day and be BLOWN AWAY. A room of CCIEs clapping tells you how amazing this is.
Prove it – with Validation of Experience VoE
The functional architecture of the validation of experience is an analytics engine. I would like to put a caveat on this discussion – this is still a bit of a proof of concept discussion. There is limited actual capability that you can deploy at this moment – but this is the functional way this will work.
Functional Layer 1 – Instrumentation
Collect all the right things, no silent drops in hardware – collect all the relevant metrics. Right down to the application layer if we can, as an example – Jabber. This means not just network information, but application level metrics like video or audio frame drops. If we want to monitor experience – we need to go all the way to layer 7
Functional Layer 2 – On-Device Analytics
We may not need to collect and return everything, but some of these are critical. So we need to analyse them on the device, decide what is critical and then return that.
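As a rough illustration of "decide what is critical on the device and only return that", here is a minimal sketch. The metric names and thresholds are invented for the example; the presentation did not specify them.

```python
# Hypothetical metrics and thresholds, for illustration only.
THRESHOLDS = {
    "packet_loss_pct": 1.0,
    "jitter_ms": 30.0,
    "video_frame_drops": 5,
}


def critical_only(sample):
    """Keep only the metrics in this sample that exceed their threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    }


sample = {
    "packet_loss_pct": 0.2,
    "jitter_ms": 48.0,
    "video_frame_drops": 0,
    "cpu_pct": 12,
}
print(critical_only(sample))  # {'jitter_ms': 48.0}
```

Everything below threshold stays on the device, so the telemetry layer only has to ship the handful of values that actually matter.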
Functional Layer 3 – Telemetry
Get the critical information off the device – we don’t want that data sitting there, we need to collect it to the analytics platform. (Cisco is still working on the analytics platform). SNMP/MIB is simply not enough.
Functional Layer 4 – Real-Time Monitoring
We need to get alerts. Real-time, not in an hour. If we make a change, and we cause a negative effect on the network, we need to know now. Real-time monitoring of application experience and performance.
Functional Layer 5 – Scalable Storage and Efficient Retrieval
Store these analytics somewhere, with an interface to access this data. Scalable storage – even in the cloud. All the information from all of the devices in the same location. This is key: without a complete picture, from all devices and applications in the network, we cannot validate or analyze the true experience of the user.
Functional Layer 6 – Analytics
Correlation of data now results in information about network quality. We can identify where problems are in the network or applications.
Functional Layer 7 – Troubleshooting
Now we can identify the root cause of problems with the network. Remember the quote from earlier – the #1 QoS TAC ticket is incorrect classification and marking.
Functional Layer 8 – Self Remediation\Troubleshooting
The holy grail – find the root cause – and fix it.
Summary – Justin’s Opinion
So, after all of that – what do I think about this? Game changer. The troubleshooting tools save hours and hours of time. One of my colleagues mentioned “Mean Time to Innocence” (MTTI) – how long it takes to prove it wasn’t the network at fault. With path flow analysis like this, we can prove the network out in seconds.
The ability for us to take BUSINESS INTENT and map it to technology in an intelligent way that is automated is how this will program the network to “Intrinsically know what the business needs, and then just does it” — that is delivering on the promise of the marchitecture.
QoS has been way too difficult for way too long. We NEED this type of tool. The cool part is that the REST APIs are all published, so other vendors are already starting to take advantage of EasyQOS in their own applications. I cannot wait to see what comes out of Cisco DevNet. Just imagine the packet analysis and tracing tools that could use the troubleshooting engine in interesting ways.
We are not fully there, or fully baked yet. VoE is still a bit conceptual. The holy grail for me would be the following:
- Program Business Intent via EasyQOS – Quality of Experience
- Monitor my network for experience, provide validation of experience alerts.
- When problems occur either automatically fix them – or recommend changes.
We are not far from this – the team at Cisco says “it’s in the pipeline”
My recommendation – if you are not up to speed with APIC-EM – you better start, because networks have finally burst the bounds of our brains when it comes to understanding everything that is going on – so you need this automation in order to tackle these complex network and application needs.
Tech Field Day Extra – 2016 – Cisco APIC-EM Controller Discussion
Tech Field Day Extra – 2016 – Cisco Validation of Experience with Tim Szigeti
Tech Field Day Extra – 2016 – APIC-EM EasyQoS Demo