Monitoring
Considerations
Monitorama, 2013
John Allspaw
SVP, Technical Operations
I want to warn you that I will lift references from various sources this morning, and I’ll make
sure to point to those further readings I’ll touch on when I post slides.
You can feel free to view those readings as HOMEWORK. Unsurprisingly to anyone who knows
me, a large number of them will be in the field of Human Factors and Safety.
WHO HERE HAS EVER WRITTEN MONITORING SOFTWARE? (alerts, dashboards, graphs, metrics
collection, analysis, display, etc.)
“In the long term, Operations as a science
needs to be elevated.”
Chris Brown
Velocity London, 2012
We are at an interesting time in our field.
We are still naive.
We express indignation in terse remarks about our challenges.
We also believe that certainty is something we can attain through the use of technology alone.
This makes the field of web engineering as a whole ADORABLE.
Dr. Richard Cook, Velocity US 2012
http://www.youtube.com/watch?v=R_PDc0HFdP0
Dr. Cook explains how the research done in Human Factors and Systems Safety is highly relevant to the
operation of web infrastructures.
“Anytime you find a world in which you have high consequences, high-tempo operations, time pressure, and
lots of complexity...and people are called upon to manage that, you’re going to have these kinds of issues
arise.”
Aviation, patient safety, military, power generation and distribution, space travel, etc.....they
are attractive because we see something in them that is familiar.
While we have an opportunity to take ADVANTAGE of LESSONS LEARNED in other fields of
high-tempo/complexity/consequences, it behooves us to think on how we are DIFFERENT
from the other fields.
We also have an opportunity to SIDESTEP some of the quagmires those fields have found
themselves in.
This talk is a tiny effort towards this direction.
LANGUAGE
In order to support this, I will argue that we need to start paying attention to our language.
1. OTHER DOMAINS ALREADY HAVE A LEXICON, WE CAN BORROW SOME TERMS FROM THEM
2. How we discuss our challenges can play a very large role in how we surmount them.
There are a number of concepts, words, and ideas that need to enter our lexicon, especially
when it comes to monitoring and the challenge of making sense of where, what, how, and why
complex systems behave the way they do.
BETTER
QUESTIONS
One of the OTHER things that has become clear to me is that as a field, we need to ask
BETTER QUESTIONS instead of quickly jumping to CORRECT ANSWERS or SOLUTIONS.
ASKING TERRIBLE QUESTIONS WILL GUARANTEE TERRIBLE SOLUTIONS.
I’m increasingly convinced that the road to progress on such a broad and complicated topic
as monitoring is paved with BETTER QUESTIONS, not NEWER TOOLS.
So you may hear me asking some questions today.
They may or may not be good questions, but I’ll take a stab at it anyway.
DOWN and IN
“Down and In”
As storage prices continue to decline and accessible processing power explodes, we have an
ever-expanding ability to zoom in deeply on the ways servers and services talk to each other
and process information.
WE CAN ZOOM IN ON THE RELATIONSHIPS and BEHAVIORS of SEEMINGLY DISPARATE PIECES
OF DATA...
... AND WE CAN DISCOVER AND DETECT DISRUPTIONS IN SOMETIMES SURPRISING PLACES.
THIS IS INTERESTING.
BUT IT IS ALSO WOEFULLY INCOMPLETE IF WE ARE TO MAKE ANY PROGRESS IN OPERATIONS.
UP and OUT
...it is INCOMPLETE because as we ZOOM OUT, what we find is a much-ignored environment
which includes one of the most powerful CONTEXT-SENSITIVE and INCREDIBLY ADAPTIVE
anomaly detection and response agents in the world:
HUMANS
Do we have ANOMALY DETECTION problems? Certainly. One can argue (I will, if you’d like,
later at the bar) that we will ALWAYS have them.
BUT: What I’m interested in is NOT how software can be used to detect anomalies
automatically.
(well, I’m interested, but I don’t doubt that you all will continue to get better at it)
... It is how people navigate this boundary between themselves and the machines they work
with.
The BOUNDARY between humans and machines, as we observe our use of tools, is worthy of
focus IN and OF ITSELF.
If we have any hope of making progress in monitoring complex systems, we must take this
boundary into account.
BUT ABOUT HUMANS: A few observations with respect to tools and monitoring
in general.
1. We don’t use a single tool to gain insight into the architectures we build. And we
will not.
2. Teams of people are the NORM, which means communication and coordination
become as important (if not more important) than surfacing anomalies themselves.
3. We bring our BIASES, EXPECTATIONS, TRUST, and PERCEPTIONS to the table. No
tool or piece of automation will change that.
4. Understanding the breakdowns at these boundaries between people and machines
should be a part of how we approach design of tools and organizational behaviors.
LESS CODE
MORE PSYCHOLOGY
SPECIFICALLY:
ALGORITHMS ALONE WILL NOT DELIVER US TO A BETTER AND SAFER PLACE.
OODA Loop
Observe Orient Decide Act
credit: http://blog.b3k.us/ooda.html
WHO IS FAMILIAR WITH Col. John Boyd’s OODA Loop?
Observation and orientation is a place where we can look for making progress.
When we get alerted, look at dashboards, graphs and logs, we’re looking to make sense of
the past and project into the future.
NOTE: Observe and Orient are not Unix commands, they are HUMAN ACTIVITIES.
We need to understand how people make sense of
what is going on
SO: Writing code to TELL COMPUTERS WHAT TO LOOK AT is quite different from making sure
that the code’s human supervisors are equipped or aided in what to look at when an alert
goes off.
How people make sense of what is going on (in diagnosis? In planning? In response? In
control?) is just plain HARD.
We need to understand how normal
work is getting done by normal people
in normal situations.
If we don’t understand how people consume, adapt to, work around, and make use of tools
under “normal” operating conditions, how can we have confidence that our designs will
perform under uncertain or escalating scenarios?
Work As Imagined
Work As Done
Our clues about how we THINK we work guide our design decisions.
But there is a gap between how we think we work, and how we actually work.
How large is this gap? How will we know when it’s too large?
Where is design?
“The system should therefore be designed so
that human adaptation is ENHANCED.”
Erik Hollnagel
Expertise and Technology: Cognition & Human-Computer
Cooperation, 1995
Design thought should be in tools, displays, controls, and processes.
What do we have to work with, though?
“It is the expertise of the human operator
that makes it possible to adapt the
performance of the joint system, in real
time, to unexpected events and
disturbances. Every working day, across the
whole spectrum of human enterprise, a large
number of near-misses are prevented from
turning into accidents only because human
operators intervene...
Whether we know it or not, we are ALL designers now, if we build tools intended to aid
monitoring.
I’m not just talking about UI and garden-variety HCI work, but those topics should be
considered table stakes.
Where is design?
http://www.perceptualedge.com/articles/visual_business_intelligence/time_on_the_horizon.pdf
VISUAL PERCEPTIONS and UI approaches are integral to our field, so we should try to
understand them as deeply as we can.
Armed with the knowledge that every element of design can (and will) be mis-used (like these
Horizon Graphs), we are left with a dilemma:
How can we understand what can augment human capabilities without getting in the way,
and without having to first re-start our careers as Human Factors experts?
WE FAKE IT UNTIL WE MAKE IT
http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdf
Salience
For example, this illustration of the concept of SALIENCE, the
“quality of an item that stands out relative to neighboring items,”
comes from a great whitepaper called
“Dashboard Design for Real-Time Situation Awareness” by Stephen Few.
http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdf
Salience
So SALIENCE is an important quality.
Principles of Display Design
• Principle of information need
• Principle of legibility
• Principle of display integration/proximity
• Principle of pictorial realism
• Principle of the moving part
• Principle of predictive aiding
• Principle of discriminability: status versus command
Wickens, Lee, Liu, Becker
An Introduction to Human Factors Engineering
Here is another great pointer on display design, from “AN INTRODUCTION TO HUMAN
FACTORS ENGINEERING”.
Cognition In The Wild
“It is notoriously difficult to generalize
laboratory findings to real-world
situations.”
So let’s leave design for a moment and talk about how we can VALIDATE our design choices.
We CANNOT hope to understand how people behave in real-world scenarios BY USING OUR
IMAGINATION alone.
How many of you work at a company where funnel or clickstream analysis is being done?
How many of you have done clickstream or funnel analysis on your monitoring dashboards, graphs,
and displays?
What sort of information might we find when we gather data on how people navigate metric data
during varying scenarios?
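As a toy illustration of that last question (entirely mine, with an invented event format), here is a minimal Python sketch of a "clickstream" on monitoring displays: which dashboard did responders look at next, given the one they were on?

```python
# A toy sketch (invented log format) of turning dashboard access logs into
# a navigation "funnel": which displays people visit, and in what order.
from collections import Counter

# (user, timestamp, dashboard) events, e.g. parsed from web server logs
events = [
    ("alice", 1, "alert"), ("alice", 2, "api_latency"), ("alice", 3, "db_io"),
    ("bob",   1, "alert"), ("bob",   2, "db_io"),
]

transitions = Counter()
last_page = {}
for user, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
    prev = last_page.get(user)
    if prev:
        transitions[(prev, page)] += 1  # count each page-to-page hop
    last_page[user] = page

for (src, dst), n in transitions.most_common():
    print(f"{src} -> {dst}: {n}")
```

Even something this crude, run separately for quiet periods and outages, would start to show how people actually navigate metric data.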
ALERT
DESIGN
- Who has ever gotten a page and ignored it?
Endsley: At a safety expert conference, in a 300-person hall, only 3 people got up for a fire alarm.
- How many alerts were received in the past week that were not actionable? (no human action was
required?)
- How many alerts were received in the past week as a result of known work being done, but alerts
were not silenced during that period?
- How many alerts were received as a result of a previously silenced alert (because work was being
done) that was mistakenly un-silenced?
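One way to start answering these questions is from your own alert history. A hypothetical sketch (the log format and outcome labels here are invented for illustration):

```python
# Tally alert outcomes from a CSV export of last week's pages.
# Assumed columns: check, fired_at, outcome (e.g. "action_taken",
# "no_action", "known_work", "mistakenly_unsilenced").
import csv
from collections import Counter

def audit(alert_log_path):
    outcomes = Counter()
    with open(alert_log_path) as f:
        for row in csv.DictReader(f):
            outcomes[row["outcome"]] += 1
    total = sum(outcomes.values())
    for outcome, n in outcomes.most_common():
        print(f"{outcome:>22}: {n:4d}  ({n / total:.0%})")

# e.g. audit("alerts_last_week.csv") might show most pages ending in
# "no_action" or "known_work", which is the conversation-starter.
```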
Jack Garman
Flight controller
NASA Mission Control
Apollo Program (Murray and Cox 1990)
“A program alarm could be triggered by trivial problems that
could be ignored altogether.
Or it could be triggered by problems that called for an immediate
abort.
How to decide which was which?
"We wrote ourselves little rules like
'If this alarm happens and it only happens once, don't worry
about it. If it happens repeatedly, but other indicators are okay,
don't worry about it.'"
Operator interviewed at the Three Mile Island
nuclear power plant following the
accident (Kemeny 1979)
“I would have liked to have thrown away the alarm panel. It
wasn't giving us any useful information."
Comment by one operator at the Three Mile Island nuclear
power plant
to the official inquiry following the TMI accident (Kemeny 1979).
Physician, explaining how they
respond to a nuisance alarm on a
device in the operating room.
(Cook, Potter, Woods and McDonald 1991)
"When the alarm kept going off then we kept shutting it [the
device] off [and on] and when the alarm would go off [again],
we’d shut it off.”
“... so I just reset it [a device control] to a higher temperature. So
I kinda fooled it [the alarm]...”
SIGNAL
DETECTION
THEORY
Signal Detection Theory
- Too sensitive, and you’ll get false alarms
- Not sensitive enough, and you’ll get missed alarms
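To make the tradeoff concrete, here is a minimal Python sketch (mine, not from the talk): two simulated latency populations, one healthy and one degraded, with a page threshold swept across them. No threshold eliminates both kinds of error; you only choose which one you buy more of.

```python
# Signal-detection tradeoff: sweep an alert threshold over simulated
# "healthy" and "faulty" latency samples (values in ms are arbitrary).
import random

random.seed(42)
healthy = [random.gauss(100, 15) for _ in range(10_000)]  # normal latency
faulty = [random.gauss(160, 15) for _ in range(10_000)]   # degraded latency

for threshold in (110, 125, 140, 155):
    false_alarms = sum(x > threshold for x in healthy) / len(healthy)
    misses = sum(x <= threshold for x in faulty) / len(faulty)
    print(f"threshold={threshold}ms  "
          f"false alarms={false_alarms:.1%}  misses={misses:.1%}")
```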
ALERT DESIGN
Mica Endsley
Designing for Situation Awareness
What about the context people are in when they
experience a FALSE ALERT?
Or a MISSED ALERT?
[Diagram: the cognitive processing of an alarm signal. An Alarm Signal passes through Integration and Interpretation to a Decision and Response, shaped by Expectancies, Past History, the operator’s Mental Model, and Other Situational Information.]
Designing for Situation Awareness, Mica Endsley
The cognitive processing of an alarm signal.
When we DESIGN ALERTS, we HAVE to think about the
various ways that the ALERT could be interpreted or
acted on. Oftentimes, we will PUNT on aiding the
operator with CONTEXT.
Critical Care & Anesthesiology
• Monitors & alarms designed to “never miss”
• 566 deaths reported related to alarms
(2005-2008)
• Most associated with the silencing function
• ECRI’s #1 health technology hazard, 2012 & 2013
And you have complaints about Nagios’ “set downtime” feature?
The Emergency Care Research Institute (ECRI) recently
identified alarms as the “number one health technology hazard”
for 2012.
And you have complaints about Nagios’ “set downtime” feature?
ALERT DESIGN
Confirmation
- Because false alarms are a problem, people will spend time not
reacting to an alert, but confirming that the alert is legit.
- Pilots delay responding to GPWS (Ground Proximity Warning
System) 73% of the time, because they’re looking out the window
to confirm it’s true, and how true it is.
What are ways we can SUPPORT CONFIRMATION or
VALIDATION in our alert design?
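One hypothetical answer (all field names below are invented for illustration): bundle the confirming evidence into the alert itself, so the responder doesn’t have to hunt for it before deciding whether to act.

```python
# Sketch: an alert payload that carries its own confirming context.
import json

def build_alert(check_name, value, threshold, graph_url, recent_deploys):
    """Attach context that lets a human confirm (or dismiss) the alert."""
    return {
        "check": check_name,
        "observed": value,
        "threshold": threshold,
        # Evidence a responder would otherwise look up by hand:
        "context": {
            "graph": graph_url,                # a trend, not just a point
            "recent_changes": recent_deploys,  # was this self-inflicted?
        },
    }

alert = build_alert(
    "api_p99_latency", value=840, threshold=500,
    graph_url="https://graphs.example.com/api_p99?window=1h",
    recent_deploys=["api v2.3.1 deployed 12 minutes ago"],
)
print(json.dumps(alert, indent=2))
```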
ALERT DESIGN
Expectancy
- People’s expectancies can also affect their interpretation of alerts.
- In many cases, people EXPECT the alert to go off, as the result of their own actions.
- In a study in 2001, 6% of operating room alarms were found to be expected or anticipated.
- This can become a nuisance, and further degrade the trust in the alerts.
- Example: disk space alerts that happen during a backup, and then recover.
- Example: someone on the team doing work, and not silencing the alerts temporarily.
BONUS: when the time period for which an alert is silenced passes, and the condition isn’t acceptable yet.
(downtime expiring)
What are ways that we could SUPPORT EXPECTANCY in our alert design?
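A blunt sketch of one answer, assuming a hypothetical in-memory silence store: suppress alerts during declared work, and when the silence window expires, re-check the condition rather than blindly unmuting.

```python
# Sketch: expectancy-aware paging with a silence window and expiry re-check.
from datetime import datetime, timedelta

silences = {}  # check_name -> silence expiry time (hypothetical store)

def declare_work(check_name, minutes):
    silences[check_name] = datetime.utcnow() + timedelta(minutes=minutes)

def should_page(check_name, condition_is_bad):
    expiry = silences.get(check_name)
    if expiry and datetime.utcnow() < expiry:
        return False  # expected: someone told us work is happening
    if expiry:
        del silences[check_name]
        # Downtime just expired: page only if the condition is STILL bad,
        # which covers the "silence expired but not recovered" case above.
    return condition_is_bad

declare_work("disk_space_db1", minutes=60)  # backup in progress
print(should_page("disk_space_db1", condition_is_bad=True))  # False: expected
```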
ALERT DESIGN
• Signal:Noise can be difficult
• Easy to err on more false alarms
• Decay in trust
• Origins: Undetectable conditions
- Signal:Noise can be difficult to get right
- General view: err on the side of too many false alarms. This ignores the detrimental effect
of them on humans.
- A 1998 study of new ATC systems found missed-alert rates of 0.2% and false alarm rates of 65%.
- Underlying false alerts: not the functioning of algorithms themselves, but the CONDITIONS
AND FACTORS THAT THE ALARM SYSTEMS CANNOT DETECT OR INTERPRET
Ex: Cincinnati Airport: rising terrain on the riverbank leading up to a runway triggers an alarm
because the system can’t detect that the terrain plateaus at the runway. Pilots familiar with
the airport ignore the alarms.
Information is not a
scarce resource.
Attention is.
Herb Simon, 1991
http://csel.eng.ohio-state.edu/productions/woodscta/media/diagnosis.pdf
Directed Attention
• Attention focusing
• Attention switching
• Dynamic Prioritization
We work in a COGNITIVELY NOISY WORLD, even when there is NOT an outage going on.
Alerts are ESSENTIALLY ATTENTION DIRECTORS.
The main challenge for DYNAMIC FAULT MANAGEMENT (HF term) in design is to support:
- ATTENTION FOCUSING
- ATTENTION SWITCHING
- DYNAMIC PRIORITIZATION
By getting to know how human attention works (and its relationship to context, perception,
etc.), we can hope to design better alerts.
Interrupts AND
Underspecification
1. “Here is the data I want you to see”
2. “Here is why I think you would find it interesting”
An alert is essentially an INTERRUPT.
TWO STATES:
1 - HERE IS THE DATA I WANT YOU TO SEE
2 - HERE IS WHY I THINK YOU WOULD FIND IT INTERESTING
What can we do to support #2?
Paradox
Of
Directed Attention
An alert is essentially an interruption to everyday work, and there is a paradox at the heart of
DIRECTED ATTENTION.
1. We are always busy!
2. Shifting attention has a very real cost!
3. Not all signals are worth paying attention to; context-sensitivity will always vary.
4. So how can you SKILLFULLY IGNORE a SIGNAL that should NOT SHIFT YOUR ATTENTION
WITHOUT first processing it....IN WHICH CASE IT HASN’T BEEN IGNORED.
“Given that the supervisory agent is loaded by various other task related demands, how does one
interpret information about the potential need to switch attentional focus without interrupting or
interfering with the tasks or lines of reasoning already under attentional control. We can state this
paradox in another way: how can one skillfully ignore a signal that should not shift attention within
the current context, without first processing it -- in which case it hasn't been ignored.” - David
Woods
David Woods has suggested some ways to break this paradox; he calls it PREATTENTIVE
REFERENCE.
I’ll let you discover his suggestions on your own.
Directed Attention
Sorting through an avalanche of data
Picking up on subtle early indications of a fault
This idea of an alert DIRECTING OUR ATTENTION can exist in two views:
SORTING THROUGH AN AVALANCHE or PICKING UP SUBTLE/EARLY INDICATIONS....
So....which is it?
IT CAN BE BOTH!
“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data -- a data overload problem. This
is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the
background of a quiescent monitored process.”
Context Sensitivity
The background and context in which a SIGNAL arrives can play a huge role in how it can
HELP or HINDER us.
If the background is one of QUIET, contrast is HIGH. <- this is what most designers plan for
If the background is ONGOING DIAGNOSIS, then the SIGNAL can SUPPORT/CONTRADICT an existing
hypothesis.
If the background is EXECUTING A RESPONSE, then SIGNAL can cue the RESPONSE is WRONG
or INCOMPLETE.
In any case, the ALERT’s MEANING will change as CONTEXT and BACKGROUND changes.
Data Overload
This is simply a tough problem.
There are approaches to solve it, but none of them to date are effective given the rate at
which new pieces of data are being collected and stored.
There is significant agreement among those who study data overload phenomena that the
critical piece to understand is CONTEXT SENSITIVITY.
Some HF researchers have pointed at something that may help reduce the effects of data overload:
depicting RELATIONSHIPS between data in a known FRAME of REFERENCE, as opposed to the
raw data.
What can we do as designers to aid surfacing those relationships?
How have I taken the
OPERATOR into account?
PEOPLE use monitoring tools.
Arguably, MACHINES use monitoring tools we build, as well.
But only PEOPLE can adapt and improvise with a given tool outside of the original intentions
of its designer.
Am I hurting or helping:
•Data overload or underload?
•Salience?
•Directed attention?
•Interruptibility?
When we design alerts and monitoring tools, we should be asking these questions.
In addition: HOW WILL WE KNOW WHEN THIS DESIGN HURTS those things?
Joint Cognitive Systems
One final thought: what if, instead of the view that the BOUNDARY is a large barrier to be
hurdled only by our writing increasingly complex code...we view that boundary as a place for
an actual cooperative RELATIONSHIP?
Joint Cognitive Systems
What if we viewed an alerting system
as a PARTNER, instead of a subordinate?
What if we viewed alerting systems as a PARTNER, instead of a subordinate or otherwise
dumb messenger delivering news to us?
What does the world look like if we designed alerts to COOPERATE with us?
If TRUST in alerting systems is such a big deal....
WHAT can we learn from how HUMANS learn to trust each other, and let that influence our
design decisions?
In other words: how can we design alerts that SUPPORT our confirming their legitimacy, or
our expectations when an alert will fire? Is context-sensitivity part of this?
We see some blunt versions of these notions:
1 - Time periods for alerts, so that people aren’t woken up for things that can wait until
morning (the machine has been given some context about our availability to pay attention to
an alert)
2 - Rough dependency relationships, so we don’t send a bazillion alerts when a known SPOF
dies
What other examples can we think of, where the COMPUTERS can attempt to understand,
predict, or observe US, as we work?
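A minimal sketch of the second blunt notion, with an invented topology: if a known parent (a switch, say) is already down, suppress the flood of child alerts that merely restate it.

```python
# Dependency-aware alert suppression: page on root causes, not symptoms.
depends_on = {
    "web01": "switch-a",
    "web02": "switch-a",
    "db01": "switch-b",
}

def alerts_to_send(down_hosts):
    """Send one alert per root cause, not one per affected host."""
    down = set(down_hosts)
    roots = []
    for host in down_hosts:
        parent = depends_on.get(host)
        if parent in down:
            continue  # implied by the parent's alert; skip it
        roots.append(host)
    return roots

# switch-a dying takes web01 and web02 with it; only switch-a should page.
print(alerts_to_send(["switch-a", "web01", "web02"]))  # ['switch-a']
```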
The End
My hope is that I’ve been able to ask BETTER QUESTIONS, and I can kick off this conference
with food for thought.
You can tell me how that food tastes at the bar later.
Can We Ever Escape From Data
Overload?
A Cognitive Systems Diagnosis
Woods, Patterson, Roth 1999
http://csel.eng.ohio-state.edu/productions/woodscta/media/diagnosis.pdf
The Alarm Problem and
Directed Attention in
Dynamic Fault Management
Woods 1995
http://csel.eng.ohio-state.edu/woods/foundations/directed%20att.pdf