This can mean that the bottleneck is not the Node process at all, but rather an I/O operation
Diagnose: Use clinic bubbleprof to explore asynchronous delays – run clinic bubbleprof -h to get started.
Understanding the analysis
Node.js provides a platform for non-blocking I/O.
Unlike languages that typically block for I/O (e.g. Java, PHP), Node.js passes I/O operations
to an accompanying C++ library (libuv) which delegates these operations to the Operating System.
Once an operation is complete, the notification bubbles up from the OS, through libuv, which can then
trigger any registered JavaScript functions (callbacks) for that operation. This is the typical
flow for any asynchronous I/O (whereas synchronous APIs block, and should never be used in a
server/service request handling context).
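As an illustration, here is a minimal sketch of that flow (the file path is purely illustrative): the read is handed off to libuv and the Operating System, and the callback only runs once the operation completes, leaving the JavaScript thread free in the meantime.

const fs = require('fs')

// handed off to libuv / the Operating System; the JavaScript thread is not blocked
fs.readFile('/tmp/example.json', (err, data) => {
  // invoked later, once the OS signals that the read has completed
  if (err) throw err
  console.log('read %d bytes', data.length)
})

console.log('this line runs before the callback above')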
The profiled process has been observed to be unusually idle under load. Typically this means
it's waiting for external I/O, because there's nothing else to do until the I/O completes.
To solve I/O issues we have to track down the asynchronous call(s) which are taking an
abnormally long time to complete.
I/O root cause analysis is mostly a reasoning exercise. Clinic.js Bubbleprof is a tool developed specifically to inform and ease this kind of reasoning.
Next Steps
Use clinic bubbleprof to create a diagram of the application's asynchronous flow.
See clinic bubbleprof --help for how to generate the profile
Explore the Bubbleprof diagram. Look for long lines and large circles representing persistent delays, then drill down to reveal the lines of code responsible
Pay particular attention to "userland" delays, originating from code in the profiled application itself.
Identify possible optimization targets using knowledge of the application's I/O touch points (the I/O to and from the Node.js process, such as databases, network requests, and filesystem access). For example:
Look for operations in series which could be executed in parallel (see the sketch after this list)
Look for slow operations that can be optimised externally (for example with caching or indexing)
Consider whether a large process has good reasons for being almost constantly in the queue (for example, some server handlers)
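As referenced above, a minimal sketch of running independent operations in parallel instead of in series (getUser and getOrders are hypothetical placeholders for any two independent asynchronous operations, inside an async function):

// serial: the second call only starts once the first has completed
const user = await getUser(id)
const orders = await getOrders(id)

// parallel alternative: both calls are in flight at the same time, so the
// total wait is roughly the slower of the two rather than the sum
// const [user, orders] = await Promise.all([getUser(id), getOrders(id)])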
This can mean that your application is spending more time than expected in memory allocations
Diagnose: Use clinic heapprofiler to discover which functions allocate the most memory on the heap – run clinic heapprofiler -h
Understanding the analysis
JavaScript is a garbage-collected language. Rather than being freed manually, objects are
simply "cleaned away" at some point after all references to them have been removed.
At a basic level, the Garbage Collector traverses the JavaScript objects at various intervals to find any
"orphaned" objects (objects which no longer have any references). If there are too many
objects, and/or too many orphaned objects, this can cause performance issues – because the Garbage
Collector uses the same thread as the JavaScript event loop. In other words, JavaScript execution
pauses while the Garbage Collector clears away de-referenced objects.
At a more detailed level, garbage collection is triggered by memory activity rather than by time, and
objects are classified by the GC into young and old. "Young" objects are
traversed (scavenged) more frequently, while "old" objects will stay in memory for longer. So there
are actually two GC types, a frequent scavenge of new space (short lived objects) and a less regular traversal of
old space (objects that survived enough new space scavenges).
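One way to observe this behaviour directly is to start Node.js with GC tracing enabled (app.js is a placeholder for the profiled application); in the resulting log, Scavenge entries correspond to new space collections, while Mark-sweep/Mark-compact entries correspond to the less frequent collections that cover old space:

node --trace-gc app.js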
Several heuristics may trigger detection of a GC issue, but they all center around high
memory usage.
One possible cause of a detected GC issue is a memory leak, where objects are accidentally
kept alive and so never become eligible for collection. However, there are other (more common) cases
where there is no leak but the memory strategy needs to be adapted.
One such common case is when large objects (such as those generated for big JSON payloads) are
created during periods of high activity (e.g. under request load). This can cause the objects
to be moved into old space – if they survive two (by default) GC scavenges – where they will live
for longer due to the less frequent scavenges. Objects can then build up in "old space" and
cause intermittent process stalling during Garbage Collection.
Depending on the use case, this may be solved in different ways. For instance, if the goal is to write
out serialized objects, then the output could be written to the response as strings (or buffers) directly
instead of creating the intermediate objects (or a combined strategy where part of the object is written out
from available state). It may simply be that a functional approach (which is usually recommended) is
leading to the repeated creation of very similar objects, in which case the logical flow between functions
in a hot path could be adapted to reuse objects instead of creating new ones.
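As a rough sketch of the first suggestion (res, records and toPublicShape are hypothetical names for a response stream, a result set and a mapping function), writing each record out as it is serialized avoids building one large combined object per request:

// before: builds a large intermediate array plus one big string per request
// res.end(JSON.stringify(records.map(toPublicShape)))

// after: write each record out as it is serialized, so no large
// combined object needs to survive long enough to reach old space
res.write('[')
records.forEach((record, i) => {
  if (i > 0) res.write(',')
  res.write(JSON.stringify(toPublicShape(record)))
})
res.end(']')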
Another possibility is that a very high number of short-lived objects is created, filling up the
"young" space and triggering frequent GC sweeps – if this isn't an unintended memory leak,
then an object pooling strategy may be necessary.
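A minimal object pool sketch (the names and the pooled object's shape are illustrative; a real pool would also cap its size and reset any state the application adds to pooled objects):

const pool = []

function acquire () {
  // reuse a previously released object when available, otherwise allocate a new one
  return pool.pop() || { payload: null, timestamp: 0 }
}

function release (obj) {
  obj.payload = null
  obj.timestamp = 0
  pool.push(obj)
}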
To solve Garbage Collection issues we have to analyse the state of our process in order to track down the
root cause behind the high memory consumption.
Next Steps
If the system is already deployed, mitigate the issue immediately by implementing
HTTP 503 Service Unavailable functionality (see Load Shedding in Reference)
Use clinic heapprofiler to create a flamegraph of memory allocations in the application.
See clinic heapprofiler --help for how to generate the memory profile (an example invocation is sketched after this list).
Look for "hot" blocks, those are functions that are observed to be at the top the stack per memory allocation sample – in other words, such functions are allocation more memory in the heap
(In the case of a distributed bottleneck, start by looking for lots of wide tips at the top of the Flamegraph)
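As referenced above, a sketch of one way to capture the memory profile under load, assuming the server lives in app.js and listens on port 3000 (both placeholders) and that clinic heapprofiler accepts the same --on-port flag as clinic doctor:

clinic heapprofiler --on-port="autocannon localhost:3000" -- node app.js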
Diagnose: Use clinic flame to discover CPU-intensive function calls – run clinic flame -h
Understanding the analysis
JavaScript is a single-threaded, event-driven, non-blocking language.
In Node.js, I/O tasks are delegated to the Operating System, and JavaScript functions (callbacks)
are invoked once the related I/O operation is complete. At a rudimentary level, the process of
queueing events and later handling results in-thread is conceptually achieved with the
"Event Loop" abstraction.
At a (very) basic level the following pseudo-code demonstrates the Event Loop:
while (event) handle(event)
The Event Loop paradigm leads to an ergonomic development experience for high concurrency programming
(relative to the multi-threaded paradigm).
However, since the Event Loop operates on a single thread this is essentially a shared
execution environment for every potentially concurrent action. This means that if the
execution time of any line of code exceeds an acceptable threshold it interferes with
processing of future events (for instance, an incoming HTTP request); new events cannot
be processed because the same thread that would be processing the event is currently
blocked by a long-running synchronous operation.
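A tiny, purely illustrative demonstration of this effect: the timer below is due after 10 ms, but it cannot fire until the busy loop releases the thread.

setTimeout(() => console.log('tick'), 10)

// a synchronous busy loop: nothing else can run on the thread until it ends
const end = Date.now() + 1000
while (Date.now() < end) {}
// 'tick' is only logged after this point, roughly a second late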
Asynchronous operations are those which queue an event for later handling; they tend to be
identified by an API that requires a callback, or uses promises (or async/await), whereas
synchronous operations simply return a value. Long-running synchronous operations are either
functions that perform blocking I/O (such as fs.readFileSync) or potentially resource intensive
algorithms (such as JSON.stringify or react.renderToString).
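For instance, here is a sketch of the same route handler written both ways (the file path is a placeholder); the first variant blocks the event loop for the duration of the read, the second does not:

const fs = require('fs')

// blocking: no other event can be processed while the file is read
function handleSync (req, res) {
  res.end(fs.readFileSync('/tmp/report.json'))
}

// non-blocking: the read is delegated and the event loop stays free
async function handleAsync (req, res) {
  res.end(await fs.promises.readFile('/tmp/report.json'))
}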
To solve the Event Loop issue, we need to find out where the synchronous bottleneck is.
This may (commonly) be identified as a single long-running synchronous function, or
the bottleneck may be distributed, which would take rather more detective work.
Next Steps
If the system is already deployed, mitigate the issue immediately by implementing
HTTP 503 Service Unavailable functionality (see Load Shedding in Reference, and the sketch after this list)
This should allow the deployment's load balancer to route traffic to a different service instance
In the worst case the user receives the 503 and must retry (this is still preferable to waiting for a timeout)
Use clinic flame to generate a flamegraph
Run clinic flame --help to get started
see "Understanding Flamegraphs and how to use 0x" article in the Reference section for more information
Look for "hot" blocks, these are functions that are observed (at a higher relative frequency) to be at the top the stack per CPU sample – in other words, such functions are blocking the event loop
(In the case of a distributed bottleneck, start by looking for lots of wide tips at the top of the Flamegraph)
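As referenced above, a minimal load-shedding sketch (the threshold and the use of the event loop delay monitor are illustrative assumptions; see Load Shedding in Reference for a fuller treatment):

const http = require('http')
const { monitorEventLoopDelay } = require('perf_hooks')

const delay = monitorEventLoopDelay()
delay.enable()

const server = http.createServer((req, res) => {
  // if the event loop is badly delayed, shed load rather than queueing more work
  if (delay.mean / 1e6 > 100) { // mean delay above ~100 ms (illustrative threshold)
    res.writeHead(503, { 'Retry-After': '2' })
    return res.end()
  }
  res.end('ok')
})

server.listen(3000)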
There is a performance issue but Clinic.js could not accurately categorize it.
Clinic.js may inadvertently be picking up system noise.
Disable any CPU- or Memory-intensive applications.
Use the --on-port flag to begin load testing immediately; this is known to reduce the chances of interference.
If an unknown result is consistently produced it is still very likely that there is a performance issue.
Undertake additional diagnosis with the clinic bubbleprof and/or clinic flame commands.
For memory analysis use the --inspect flag with the Chrome Devtools Memory tab.
Understanding the analysis
An unknown issue occurs when Clinic.js' analysis algorithms are unable to categorize the sampling results but nevertheless an issue of some kind has been detected.
This outcome can be attributed to one of two scenarios:
Ambient noise – for instance, other applications using the CPU or memory – during the sampling period has polluted the results.
There is a genuine performance issue but clinic doctor doesn't recognize it.
In the case of ambient noise, there may still be a specific, categorizable performance issue.
We can eliminate the possibility of ambient noise and make it easier for Clinic.js to definitively recognize the issue by:
Closing down as many applications as possible, especially applications that are CPU- or Memory-intensive.
Using the --on-port flag. This can reduce the chances of unknown issues because there is no time gap nor additional system activity between the server starting and the load test beginning.
By way of example, instead of running clinic doctor -- node app.js in one terminal and autocannon localhost:3000 in another, it is preferable to trigger both with a single command:
clinic doctor --on-port="autocannon localhost:3000" -- node app.js
An even simpler form of this is to use the --autocannon flag:
clinic doctor --autocannon / -- node app.js
If, after taking these steps, an unknown categorization continues to occur, then we can instead attempt to infer the nature of the performance issue using specialist diagnostic tooling, such
as clinic flame, clinic bubbleprof or Node Inspector.
Next Steps
First eliminate the possibility of ambient noise
Reduce noise by closing down as many other applications running on the system as possible – especially CPU- or Memory-intensive applications
Ensure that the --on-port flag is being used to trigger load testing instead of initiating load testing independently
Use clinic bubbleprof to create a diagram of the application's asynchronous flow (see clinic bubbleprof --help)
Explore the Bubbleprof diagram. Look for long lines and large circles representing persistent delays, then drill down to reveal the lines of code responsible
A common problem is the overuse or misuse of promises. clinic bubbleprof will visualize promise activity; make a point of looking out for it in the diagram.
Use clinic flame to generate a flamegraph
Run clinic flame --help to get started
Look for "hot" blocks, these are functions that are observed (at a higher relative frequency) to be at the top the stack per CPU sample – in other words, such functions are blocking the event loop
For memory analysis use the --inspect flag with the Chrome Devtools Memory tab.