In this blog post
Creating trace diffs: OverviewCreating trace diffs: OverviewPrerequisitesPrerequisites1) Setup and run MongoDB1) Setup and run MongoDB2) Initialize a NodeJS server2) Initialize a NodeJS server3) Build a Snapshot directory3) Build a Snapshot directory4. Diff two Snapshots4. Diff two Snapshots5. Create a scheduler for snapshot diffs5. Create a scheduler for snapshot diffsSee the trace differ in actionSee the trace differ in actionExperimental UIExperimental UINext StepsNext StepsLightstep recently added several new APIsnew APIs to help developers access the high value data being sent to Lightstep from their systems and integrate the rich analysis built on that data into their existing workflows.
Lightstep customers send trillions of spans from their applications to Lightstep MicrosatellitesLightstep Microsatellites hosted and managed in their own environments. Data going to the Microsatellites is not sampled at the clients. This ensures that when a user is querying for this data in Lightstep, 100% of the un-sampled data is available for real-time analysis. Lightstep Microsatellites then use intelligent, dynamic sampling to ensure behavior outliers and anomalies are always well represented, as well as the RED metrics and histograms are created with the full representative data set. The outcome of this query in ExplorerExplorer is saved in a "Snapshot""Snapshot" in our SaaS.
A Snapshot contains rich analysis data that is very useful for investigating issues, so why not supercharge it with building more custom functionality on top of it? We can use the new Snapshot APIsSnapshot APIs to programmatically diff two snapshots in time. This is the Trace Diffing use case, where we can define critical request paths (traces) in our system and automatically track when deviations or anomalies occur.
Behrooz Badi, Senior Engineering Manager at Lyft, had this to say about our new diff tooling, "With Lightstep's Trace Differ tool now available, we are able to root cause structural discrepancies in our critical traffic patterns across our architecture. During a deployment or, worse, an outage, we are able to check the difference between traces before and after the event to aid in root cause analysis. This means that during critical business-facing outages, we can significantly reduce the scope of investigation and reduce mean-time-to-resolution (MTTR)."
Creating trace diffs: Overview
In this tutorial, we will setup a the following simple workflow:
1. Define important queries that we want to track - The Explorer in Lightstep lets you choose any arbitrary set of service
, operation
, and attributes
to create a targeted and dynamic query. This helps to narrow down your analysis to a specific set of data during your investigation, reducing MTTR. For demo purposes, we will choose a simple query, service IN ("android")
from our SandboxSandbox demo data.
2. Set up automatic snapshot creation - Once the query is defined, we will create snapshots of the data every 10 minutes (configurable), this will ensure that we are pulling the relevant analysis and underlying data back to the SaaS periodically and we can run further analyses on it.
3. Find the differences between two snapshots - For two given snapshots, and a set of keys to group by (think attributes like customer
, region
, http.status_code
, etc.), we will take the entire underlying trace datasets from the snapshots and find whether any values are missing, if there are any new values, or for ones that exist in both snapshots, if there are significant changes in RED metrics.
Prerequisites
You will need a Lightstep account (they’re freethey’re free) and a Lightstep API key. Get it hereGet it here.
Docker and
docker-compose
installed on your machinenode
andnpm
installed on your machine
NOTE: The entire code for this application is available on GitHubis available on GitHub if you'd like to follow along.
We will setup a simple MEVN stack for this application:
MongoDB - To store the queries, snapshot data, and diffs
ExpressJS - REST API to interact with the application
VueJS - An experimental client UI to visualize results
$ mkdir trace-differ
1) Setup and run MongoDB
Create a docker-compose.yml
for MongoDB
$ cd trace-differ && touch docker-compose.yml
# docker-compose.yml
version: "3"
services:
database:
image: "mongo"
container_name: "mongo-db-container"
environment:
- MONGO_INITDB_DATABASE=lightstep
- MONGO_INITDB_ROOT_USERNAME=lightstep
- MONGO_INITDB_ROOT_PASSWORD=lightstep
volumes:
- ./db/init-mongo.js:/docker-entrypoint-initdb.d/init-mongo.js:ro
- ./db/mongo-volume:/data/db
ports:
- "27017-27019:27017-27019"
Create a db
directory to hold the mongo volume data and the initialization script.
trace-differ$ mkdir db && touch db/init-mongo.js
/* init-mongo.js */
db.createUser({
user: "lightstep",
pwd: "lightstep",
roles: [
{
role: "readWrite",
db: "lightstep",
},
],
});
Now, let's start MongoDB
trace-differ$ docker-compose up -d
2) Initialize a NodeJS server
Let's initalize a NodeJS application with a REST server with Express, and a data model with Mongoose to access MongoDB.
trace-differ$ mkdir server && cd server
server$ npm init # Go through the standards directions of npm
Install the required packages
server$ npm install --save async axios body-parser cors express moment mongoose node-schedule
server$ npm install --save-dev winston prettier nodemon eslint eslint-plugin-prettier babel-eslint
Explanation of packages:
async
- asynchronous operations we will be running including accessing Lightstep APIsaxios
- client for making HTTP callsbody-parser
,cors
,express
- for setting up our servermongoose
- MongoDB data modelingmoment
- time related operationsnode-schedule
- chron based scheduler
The others are utilities I use in local development and are optional.
Set up a constants file to store values like DB connection string, API keys, etc.
server$ touch constants.js
/* constants.js */
// MongoDB
const DATABASE = "mongodb://lightstep:lightstep@127.0.0.1:27017/lightstep";
// Lightstep API Values
// Set these here or as environment variables
const HOST = "https://api.lightstep.com";
const ORG = process.env.LIGHTSTEP_ORG || "";
const PROJECT = process.env.LIGHTSTEP_PROJECT || "";
const API_KEY = process.env.LIGHTSTEP_API_KEY || "";
module.exports = {
DATABASE: DATABASE,
HOST: HOST,
ORG: ORG,
PROJECT: PROJECT,
API_KEY: API_KEY,
};
Define our data models with Mongoose.
server$ mkdir models && cd models
models$ touch diff.js query.js snapshot.js
Here is the example for query.js
, the rest are available in the code repocode repo.
/* query.js */
const mongoose = require("mongoose");
const Schema = mongoose.Schema;
let querySchema = new Schema(
{
query: {
type: String,
},
name: {
type: String,
},
createdAt: {
type: Number,
},
groupByKeys: {
type: Array,
},
},
{
collection: "queries",
}
);
module.exports = mongoose.model("Query", querySchema);
Create the Express routes to handle each entity in our data model.
server$ mkdir routes && cd routes
routes$ touch diff.route.js snapshot.route.js query.route.js
Here is query.route.js
for example. A simple CRUD scaffold using Mongoose. The rest are available in the code repocode repo.
/* query.route.js */
const express = require("express");
const QueryRoute = express.Router();
let QueryModel = require("../models/query");
QueryRoute.route("/queries").get((req, res) => {
QueryModel.find((error, data) => {
if (error) {
return next(error);
} else {
res.json(data);
}
});
});
QueryRoute.route("/queries").post((req, res, next) => {
QueryModel.create(req.body, (error, data) => {
if (error) {
return next(error);
} else {
res.json(data);
}
});
});
QueryRoute.route("/queries/:id").get((req, res) => {
QueryModel.findById(req.params.id, (error, data) => {
if (error) {
return next(error);
} else {
res.json(data);
}
});
});
QueryRoute.route("/queries/:id").delete((req, res, next) => {
QueryModel.findByIdAndRemove(req.params.id, (error, data) => {
if (error) {
return next(error);
} else {
res.status(200).json({
msg: data,
});
}
});
});
module.exports = QueryRoute;
Put it all together in a server.js
. This file will start a connection to our MongoDB instance, import our server/routes
and initialize an ExpressJS server.
I like using
winston
for logging, so I usually write thelogger.js
in my applications and use that instead of the regular console.
server$ touch server.js
/* server.js */
let express = require("express"),
cors = require("cors"),
mongoose = require("mongoose"),
bodyParser = require("body-parser"),
constants = require("./constants"),
scheduler = require("./scheduler"),
logger = require("./logger");
const snapshotAPI = require("./routes/snapshot.route");
const diffAPI = require("./routes/diff.route");
const queryAPI = require("./routes/query.route");
mongoose.Promise = global.Promise;
mongoose
.connect(constants.DATABASE, {
useNewUrlParser: true,
useUnifiedTopology: true,
useFindAndModify: false,
})
.then(
() => {
logger.info("MongoDB Connected");
runApp();
},
(error) => {
logger.error("MongoDB could not be connected due to: " + error);
}
);
// We wait for DB connection to initialize
function runApp() {
const app = express();
app.use(bodyParser.json());
app.use(
bodyParser.urlencoded({
extended: false,
})
);
app.use(cors());
app.use("/api", snapshotAPI);
app.use("/api", diffAPI);
app.use("/api", queryAPI);
const port = process.env.APP_PORT || 4000;
app.listen(port, () => {
logger.info("Trace Differ running on port: " + port);
});
app.use((req, res, next) => {
return res.status(404).send({ message: "Route" + req.url + " Not found." });
});
app.use((err, req, res, next) => {
logger.error(err.message);
if (!err.statusCode) err.statusCode = 500;
res.status(err.statusCode).send(err.message);
});
scheduler.startScheduler();
}
You will notice that the last line of the code starts our scheduler, which we will get to later in this tutorial.
3) Build a Snapshot directory
By default, the GET /snapshot/:idGET /snapshot/:id API returns a list of exemplar spans, however, there is a lot more data lurking underneath the surface. When a Snapshot is created, Lightstep first tracks all the spans that match the query explicitly, and in the background starts trace assembly for all the traces that first set of spans were a part of. This data can be fetched separately using the GET /stored-traces
GET /stored-traces
API.
To this end, let's build a snapshot directory that will be responsible for creating snapshots, and fetching the related underlying data and storing it in MongoDB.
First, we will create a Lightstep API client using axios
. We will need to call a few APIs: POST /snapshots
, GET /snapshot/:id
, and GET /stored-traces
Note: This is optional, I like to keep it separate in case we want to swap this out for a different API in the future. This is also why we keep our data models separate.
server$ touch api.js
/* api.js */
let axios = require("axios");
let constants = require("./constants");
if (constants.ORG == "" || constants.PROJECT == "" || constants.API_KEY == "") {
console.error(
"Please set the environment variables: LIGHTSTEP_ORG, LIGHTSTEP_PROJECT, LIGHTSTEP_API_KEY"
);
process.exit(1); // exit if the variables have not been set
}
// Create the main client
const api = axios.create({
baseURL: `${constants.HOST}/public/v0.2/${constants.ORG}/projects/${constants.PROJECT}`,
headers: {
Accept: "application/json",
"Content-Type": "application/json",
Authorization: `Bearer ${constants.API_KEY}`,
},
});
// API Methods (https://api-docs.lightstep.com)
function createSnapshot(query) {
var body = JSON.stringify({
data: {
attributes: {
query: query,
},
},
});
return new Promise((resolve, reject) => {
api
.post("/snapshots", body)
.then(function (response) {
resolve(response.data);
})
.catch(function (error) {
reject(error);
});
});
}
function getSnapshot(id, params) {
let url = `/snapshots/${id}`;
if (Object.keys(params).length > 0) {
url = `${url}?`;
Object.keys(params).forEach((p) => {
url = `${url}${p}=${params[p]}`;
});
}
return new Promise((resolve, reject) => {
api
.get(url)
.then((response) => {
resolve(response.data);
})
.catch(function (error) {
reject(error);
});
});
}
function getStoredTrace(spanId) {
let url = `/stored-traces?span-id=${spanId}`;
return new Promise((resolve, reject) => {
api
.get(url)
.then((response) => {
resolve(response.data);
})
.catch(function (error) {
reject(error);
});
});
}
module.exports = {
api,
createSnapshot,
getSnapshot,
getStoredTrace,
};
Next, create a directory.js
file that includes these functions:
createSnapshotForQuery
- Given a query object, create a Snapshot in LightstepfetchSnapshotData
- Once the Snapshot is ready, fetch all the underlying datagetExemplarsForSnapshot
andgetTracesForExemplars
- Helper functions to further call the APIs needed to fetch all traces comprising of a Snapshot
server$ touch directory.js
/* directory.js */
let moment = require("moment");
let async = require("async");
let api = require("./api");
let logger = require("./logger");
let SnapshotModel = require("./models/snapshot");
function createSnapshotForQuery(q) {
return new Promise((resolve, reject) => {
api
.createSnapshot(q.query)
.then((res) => {
snapshot = {
snapshotId: res.data.id,
query: q.query,
createdAt: moment().unix(),
completeTime: moment(res.data.attributes["complete-time"]).unix(),
link: "", // TODO: create link to UI
};
// Save the Snapshot in our DB
SnapshotModel.create(snapshot, (error, data) => {
if (error) {
reject(error);
} else {
logger.info(
`Created snapshot ${res.data.id} for query: ${q.query} `
);
resolve(snapshot);
}
});
})
.catch((error) => {
reject(error);
});
});
}
function fetchSnapshotData(snapshotId) {
getExemplarsForSnapshot(snapshotId)
.then((exemplars) => {
getTracesForExemplars(exemplars)
.then((spans) => {
// Save snapshot data to DB
SnapshotModel.findOneAndUpdate(
{ snapshotId: snapshotId },
{ $set: { spans: spans } },
(error, data) => {
if (error) {
logger.error(error);
} else {
logger.info(
`Added ${spans.length} spans to snapshot ${snapshotId}`
);
}
}
);
})
.catch((err) => logger.error(err));
})
.catch((err) => logger.error(err));
}
function getExemplarsForSnapshot(snapshotId) {
logger.info(`Getting snapshot ${snapshotId} exemplars`);
return new Promise((resolve, reject) => {
api
.getSnapshot(snapshotId, { "include-exemplars": 1 })
.then((res) => {
if (res.data.attributes.exemplars) {
resolve(res.data.attributes.exemplars);
} else {
reject(new Error("No snapshot data returned"));
}
})
.catch((err) => {
reject(err);
});
});
}
function getTracesForExemplars(exemplars) {
exemplars = exemplars.map((s) => {
return s["span-guid"];
});
return new Promise((resolve, reject) => {
// TODO: Implement Retry to catch the Rate Limiting
// We get the traces related to each exemplar span and
// concatenate into one array
async.reduce(
exemplars,
[],
(memo, item, cb) => {
api
.getStoredTrace(item)
.then((res) => {
let spans = res.data[0].attributes.spans;
// Service name is stored separately in the response
let reporters = res.data[0].relationships.reporters;
spans.forEach((s) => {
let r = reporters.find(
(obj) => obj["reporter-id"] == s["reporter-id"]
);
// Add service metadata to the span itself
s.reporter = {
id: r["reporter-id"],
name: r.attributes["lightstep.component_name"],
hostname: r.attributes["lightstep.hostname"],
};
delete s["reporter-id"];
});
cb(null, memo.concat(spans));
})
.catch((err) => {
logger.warn(err);
cb(null, memo.concat([]));
});
},
(err, result) => {
if (err) {
reject(err);
} else {
// TODO: Remove duplicate entries if they exist
resolve(result);
}
}
);
});
}
module.exports = {
createSnapshotForQuery,
fetchSnapshotData,
};
4. Diff two Snapshots
Now that we have a scaffold ready for creating snapshots and storing snapshot data, let's implement the trace diffing logic. Given a key (or multiple) to group by, for example http.status_code
, we want to know how many different values of that attribute exist in each snapshot, if any are missing, if any are new, and for values that exist in both, what are the differences in their RED metrics.
We will walk through this file step by step.
server$ touch differ.js
/* Import the required modules */
let moment = require("moment");
let async = require("async");
let logger = require("./logger");
let DiffModel = require("./models/diff");
let SnapshotModel = require("./models/snapshot");
/* A function to diff two snapshots based on the above logic */
function diffSnapshots(query, aId, bId, groupByKeys) {
// Get the span data for both snapshots from local DB
async.mapValues(
{ a: aId, b: bId },
function (id, key, cb) {
SnapshotModel.find({ snapshotId: id }, (err, data) => {
if (err) {
cb(err, null);
} else {
cb(null, data);
}
});
},
(err, snapshots) => {
if (err) {
logger.error(err);
} else {
// Diff Snapshots by each group by key
let groups = [];
let aResults, bResults, diff;
groupByKeys.forEach((k) => {
// Calculate the group by analysis for both snapshots
aResults = calculateGroupByAnalysis(snapshots.a[0].spans, k);
bResults = calculateGroupByAnalysis(snapshots.b[0].spans, k);
// Compare the group bys and create a diff object per key
diff = compareAnalysis(aResults.groups, bResults.groups);
diff.key = k;
groups.push(diff);
});
// TODO: Set custom thresholds
// Create and save a diff
if (groups.length > 0) {
let diff = {
query: query,
linkA: snapshots.a.link,
linkB: snapshots.b.link,
calculatedAt: moment().unix(),
diffs: groups,
};
logger.info(
`diff calculated for query {${query}} between snapshot ${aId} and ${bId}`
);
DiffModel.create(diff, (error, data) => {
if (error) {
logger.error(error);
}
});
}
}
}
);
}
/* Do a group by key on the spans of a snapshot */
function calculateGroupByAnalysis(spans, groupByKey) {
let groups = {};
let groupLabels = new Set();
spans.forEach((s) => {
let label = "";
if (groupByKey == "service") {
label = s["reporter"]["name"];
} else if (groupByKey == "operation") {
label = s["span-name"];
} else {
label = s["tags"][groupByKey];
}
if (!groupLabels.has(label)) {
// initialize the group
groups[label] = {
count: 0,
latency_sum: 0,
error_count: 0,
};
groupLabels.add(label);
}
// Update group statistics
groups[label].count += 1;
groups[label].error_count += s["is-error"] ? 1 : 0;
groups[label].latency_sum += s["end-time-micros"] - s["start-time-micros"];
});
// Aggregate statistics
let agg = [];
Object.keys(groups).forEach((k) => {
agg.push({
value: k,
occurrence: groups[k].count,
avg_latency: Math.floor(groups[k].latency_sum / groups[k].count),
error_ratio: groups[k].error_count / groups[k].count,
});
});
let groupByResults = {
"group-by-key": groupByKey,
groups: agg,
};
return groupByResults;
}
/* Compare two group by analyses and find the differences */
function compareAnalysis(a, b) {
let aKeys = a.map((o) => {
return o.value;
});
let bKeys = b.map((o) => {
return o.value;
});
// Check for existence of keys in the groups
let inBnotA = bKeys.filter((k) => {
return !aKeys.includes(k);
});
let inAnotB = aKeys.filter((k) => {
return !bKeys.includes(k);
});
let inBoth = bKeys.filter((k) => {
return aKeys.includes(k);
});
// for values in both snapshots, calculate differences
let diffs = inBoth.map((k) => {
let x = a.find((o) => o.value == k);
let y = b.find((o) => o.value == k);
return {
value: k,
occurrence: y.occurrence - x.occurrence,
avg_latency: y.avg_latency - x.avg_latency,
error_ratio: Number(
(
(y.error_ratio ? y.error_ratio : 0) -
(x.error_ratio ? x.error_ratio : 0)
).toFixed(2)
),
};
});
return {
new: inBnotA, // A is the older snapshot
missing: inAnotB, // B is the newer snapshot
diffs: diffs,
};
}
5. Create a scheduler for snapshot diffs
Now, we can schedule creation and diffing of snapshots for a given query. When a snapshot is created, it takes some time to assemble the underlying traces and make them available for consumption. During this time, the pipe remains open to receive more data matching that quey. The API response from a Snapshot creation returns when the Snapshot will be completed. We will use that along with node-schedule
to schedule fetching snapshot data when it is done.
The pseudocode for our scheduler is as follows:
Every 10 minutes, for all queries in DB
Create a new Snapshot
Use the response from the API call to schedule fetching data
Compare the two latest snapshots for a query that have data and create a diff report
server$ touch scheduler.js
/* scheduler.js */
let schedule = require("node-schedule");
let moment = require("moment");
let differ = require("./differ");
let directory = require("./directory");
let logger = require("./logger");
let QueryModel = require("./models/query");
const SNAPSHOT_DIFF_INTERVAL_MINUTES = 10;
const SNAPSHOT_CREATE_TIMEOUT_SECONDS = 10;
const SNAPSHOT_FETCH_TIMEOUT_SECONDS = 5;
function startScheduler() {
let rule = new schedule.RecurrenceRule();
rule.minute = new schedule.Range(0, 59, SNAPSHOT_DIFF_INTERVAL_MINUTES);
QueryModel.find((err, queries) => {
if (err) {
logger.error(err);
} else {
queries.forEach((q) => {
// Schedule checking of snapshots every 10 minutes and then create a new snapshot
schedule.scheduleJob(rule, () => {
differ.diffLatestSnapshotsForQuery(q);
setTimeout(() => {
directory
.createSnapshotForQuery(q)
.then((snapshot) => {
let s = snapshot.completeTime + SNAPSHOT_FETCH_TIMEOUT_SECONDS;
let t = moment.unix(s).toDate();
schedule.scheduleJob(t, () => {
directory.fetchSnapshotData(snapshot.snapshotId);
});
logger.info(
`Scheduled fetching of Snapshot ${snapshot.snapshotId} at ${t}`
);
})
.catch((err) => {
logger.error(err);
});
}, SNAPSHOT_CREATE_TIMEOUT_SECONDS);
// Timeout to create the new snapshot after diffing for two
// latests snapshots has already started so that we don't
// get stuck in a loop of only checking stale data
});
});
}
});
}
module.exports = {
startScheduler,
};
See the trace differ in action
After setting the relevant Lightstep specific variables in server/constants.js
, run the server:
server$ node server.js
By default, no queries exist, so nothing will happen.
Since we set up a REST API for, we can use Postman or curl to create a query matching the Lightstep Query SyntaxLightstep Query Syntax.
$ curl --location --request POST 'localhost:4000/api/queries' \
--header 'Content-Type: application/json' \
--data-raw '{"query":"service IN (\"android\")","name":"Hello World","createdAt":1605061500,"groupByKeys":["operation","http.status_code"]}'
The query should now be visible at localhost:4000/api/queries
.
Once snapshots and diffs start getting created, they will also be visible at localhost:4000/api/snapshots
and localhost:4000/api/diffs
respectively.
Experimental UI
In order to visualize the diffs, I created an experimental UI in Vue using a template provided by Creative TimCreative Tim. The code for this UI is available hereThe code for this UI is available here and can be run with the following:
trace-differ$ cd client && npm run serve
The above photo is an example diff report on the “operation” attribute. This shows that for all traces that the “android” service was involved in, the operations in those traces were reported in both snapshots with minor differences.
The above is an example diff on the “http.status_code” attribute. This shows that for all traces that the “android” service was involved in, status codes 503 and 429 were not included in the newer snapshot, while 403 was included in the newer snapshot.
Next Steps
In this tutorial, we built a simple snapshot directory and implemented scheduled trace diffing on important queries. This base can be extended to many different use cases, including but not limited to:
Setting custom thresholds for diffing
Implement alerting to get paged when a significant diff is encountered
Create your own system diagram between services, operations, frequency of trace paths, etc.
Create an "evolution of system architecture" visualization for all your services
Interested in joining our team? See our open positions herehere.
In this blog post
Creating trace diffs: OverviewCreating trace diffs: OverviewPrerequisitesPrerequisites1) Setup and run MongoDB1) Setup and run MongoDB2) Initialize a NodeJS server2) Initialize a NodeJS server3) Build a Snapshot directory3) Build a Snapshot directory4. Diff two Snapshots4. Diff two Snapshots5. Create a scheduler for snapshot diffs5. Create a scheduler for snapshot diffsSee the trace differ in actionSee the trace differ in actionExperimental UIExperimental UINext StepsNext StepsExplore more articles
How Queries at Lightstep Work
Brian Lamb | Oct 24, 2022In this post, we’ll explore how the different stages of a query interact with each other to take your raw data from our data storage layer and aggregate it into useful visualizations.
Learn moreLearn more
Let's Talk About Psychological Safety
Adriana Villela | Jul 12, 2022System outages are rough for all involved, INCLUDING those who are scrambling to get things up and running again as quickly as possible. Psychological safety is crucial, ensuring that employees are at their best & don't burn out. Read on for more on this.
Learn moreLearn more
OpenTelemetry Python Metrics approaching General Availability
Alex Boten | Apr 1, 2022OpenTelemetry Metrics is moving towards general availability across the project, including in Python. Learn about how to get started with it today!
Learn moreLearn moreLightstep sounds like a lovely idea
Monitoring and observability for the world’s most reliable systems