Before I became a Security Engineer at JupiterOne, I was the sole security automation and cloud compliance engineer at LifeOmic. We built the JupiterOne platform to support the needs we had at LifeOmic, eventually spinning it off as its own company. Here's some of the backstory.
LifeOmic is a decidedly "cloud-first" startup company. We went all-in on AWS services, delivering most of our applications on serverless technology stacks. We leveraged SaaS services whenever we could and federated identity and access to these services via Okta. We had almost zero on-premises technology footprint. This freed us to focus our security time and effort where it mattered most. In our case it was the data – in the cloud – that was truly important.
LifeOmic's evolving Zero-Trust approach to securing cloud data was informed by our Top Ten security principles. In particular, we chose to air-gap our production environments: our engineers could not directly access production data under normal operating conditions. Doing so required following a heavyweight and highly audited "break glass" emergency procedure. Also, we believed that security that isn't usable is worthless. We worked hard to make security invisible, automatic, and where possible, pleasant. This meant aligning natural incentives with our security policies, so that people *wanted* to do the secure thing.
As an emerging leader in the precision health space, LifeOmic had specific health care regulations to follow. One of the mandatory regulatory controls was the ability to provide evidence that our endpoint device configurations complied with certain screensaver, firewall, disk encryption, and security patch settings.
LifeOmic is a developer-focused company in many ways, and the developers value the freedom to use the hardware, operating system and tools of their choice to get their work done. Those choices have meant almost 100% laptops: mostly Macs, with a growing number of Linux devices, and very few Windows devices. LifeOmic is decidedly "Bring Your Own Device" (BYOD) when it comes to IT. Aside from a few one-time settings changes and a few applications we ask employees to install at onboarding, we're hands-off when it comes to our employees' devices. Since we've put so many other controls in place around securing access to our data, granting our employees this much freedom isn't such a crazy choice. In fact, it's liberating! It does, however, demand a lightweight strategy for endpoint compliance, which could easily be in a natural state of tension with our decentralized provisioning and minimal management approach.
This might be summarized as "How do we flexibly demonstrate compliance for user-controlled devices in a pleasant and usable way?"
Iteration One: Chef InSpec
Our first iteration of this strategy in late 2017 sought to leverage Chef InSpec, which provides a concise language for describing security and compliance rules, and a mechanism for checking to ensure they are followed. We really liked the concise and expressive DSL it provided, and so began to look for ways to execute InSpec profiles on our endpoints. A quick proof-of-concept seemed to indicate that we could indeed leverage the inspec CLI tool to generate evidence. Where, then, to store the output? The natural choice at the time seemed to be a Chef Compliance Server, which is now deprecated in favor of Chef Automate. It provided an API for storing and retrieving InSpec profiles, as well as storing and retrieving results from individual InSpec audit runs, and some nice eye-candy in the form of compliance dashboards and graphs. At the time, there was no hosted SaaS offering for this service.
Standing up Chef Compliance required a dedicated EC2 instance that was not auto-scalable. This introduced a single point of failure (SPOF), and cut against the grain of LifeOmic's otherwise nearly instance-free, serverless architecture. Once deployed, however, we discovered that certain technical details at the time prevented inspec from reporting directly to Compliance Server. They could be made to talk to each other, but it was necessary to add chef-client as a dependency in order to run the Audit cookbook, which itself required further dependencies in the form of Ruby gems.
Since this was the company's only use of Chef, there was no need or desire to use Chef Server. We still needed a lightweight update mechanism for our agents, however, so we created a "cookbook-updating-cookbook", and inserted that into the chef-client runlist alongside the Audit cookbook. It would ensure the latest cookbooks were retrieved from S3 prior to running Audit.
With these mechanisms in place, we set out to implement the compliance checks needed for our regulatory assessment in early 2018. We quickly found that OS resources available via the InSpec DSL did not have exactly what we needed. This forced us to frequently "shell out" to system binaries with InSpec's command syntax, do string comparisons on the command output, and include many conditional OS platform and version checks. This meant that we lost most of the benefits of the beautiful DSL that InSpec offered.
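To make that pattern concrete, here is a minimal sketch of what "shelling out" looks like. It is written in Python rather than the InSpec DSL, and it is not the check code we actually shipped. The two commands (`fdesetup status` on macOS, `lsblk` on Linux) are examples chosen for illustration, and the function names are hypothetical.

```python
import platform
import subprocess

# Example commands only; assumptions for illustration, not the original checks.
COMMANDS = {
    "Darwin": ["fdesetup", "status"],
    "Linux": ["lsblk", "-ln", "-o", "TYPE"],
}


def run(command: list[str]) -> str:
    """Run a system binary and return its stdout as text."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.stdout


def disk_encrypted(system: str, output: str) -> bool:
    """Interpret raw command output with per-platform string comparisons."""
    if system == "Darwin":
        # `fdesetup status` prints "FileVault is On." when FileVault is enabled.
        return "FileVault is On" in output
    if system == "Linux":
        # lsblk reports the device type "crypt" for dm-crypt/LUKS mappings.
        return "crypt" in output.split()
    raise NotImplementedError(f"no check written for {system}")


def check_disk_encryption() -> bool:
    """Pick the command for the current OS, run it, and parse the result."""
    system = platform.system()
    if system not in COMMANDS:
        raise NotImplementedError(f"no check written for {system}")
    return disk_encrypted(system, run(COMMANDS[system]))
```

Every rule ends up carrying its own platform branch and its own brittle string match, which is the overhead a declarative DSL is supposed to remove.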
Once the rules were developed, Compliance Server was configured properly, and chef-client, inspec, and all the Audit cookbook dependencies were installed on all our laptops, we were able to get compliance data flowing and generate some pretty charts.
This mechanism worked – indeed we passed our Q1 2018 assessment with flying colors – but we knew immediately that we wanted to overhaul our approach. Like fixing the proverbial "hole in the bucket", every step along the journey of this first iteration seemed to add additional complexity and overhead, and the end result did not justify the weight of its parts. This is in no way a knock against Chef or any of their products. The "Chef Way" just wasn't a good fit for our environment (BYOD) and appetite for operational overhead (averse). In particular, we weren't happy with the SPOF introduced by Compliance Server, and didn't want or need full administrative control over the configuration of our endpoint devices via chef-client.
Iteration Two: Osquery + SumoLogic
Our next idea was to leverage Osquery, which allows you to query your host device like a SQL database. We liked the cross-platform nature of the tool, and while using SQL to query for Operating System details might seem a bit arbitrary, it is easy to reason about and is a tested and sensible choice for a "lingua franca".
Osquery supported most of our evidence requirements out-of-the-box, and it was pretty easy to construct query packages for the tool. While we were considering where the query results would be stored and analyzed, we evaluated two Osquery-related SaaS offerings: Kolide Fleet and Uptycs. These were both powerful security analytics tools, but since we were hard at work building out our own JupiterOne security platform, we weren't really in the market for another security analysis tool. We decided to take a lightweight approach and simply have osqueryd periodically execute our query packages and log the results to disk. We would then forward these logs to SumoLogic, a SaaS log-monitoring and SIEM tool we were already using.
The technical implementation was straightforward, and only required the installation and configuration of Osquery and one other dependency, Fluentd, for log forwarding.
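As a rough illustration of what consuming those logs involves, the sketch below reads osqueryd-style result lines (one JSON object per line, with column values logged as strings) and picks out hosts reporting an unencrypted disk. In our setup this analysis happened in SumoLogic, not in Python. The query name, host names, and sample rows are all hypothetical, and the `encrypted` column is assumed to come from a query against Osquery's `disk_encryption` table.

```python
import json
from typing import Iterable, Iterator

# Hypothetical name for a scheduled pack query; not taken from our real packs.
DISK_QUERY = "pack_compliance_disk_encryption"

# Hypothetical sample of results-log lines, for illustration only.
SAMPLE_LOG = [
    '{"name": "pack_compliance_disk_encryption", "hostIdentifier": "laptop-a",'
    ' "action": "added", "columns": {"name": "/dev/disk1", "encrypted": "1"}}',
    '{"name": "pack_compliance_disk_encryption", "hostIdentifier": "laptop-b",'
    ' "action": "added", "columns": {"name": "/dev/sda1", "encrypted": "0"}}',
    '{"name": "pack_compliance_firewall", "hostIdentifier": "laptop-b",'
    ' "action": "added", "columns": {"global_state": "1"}}',
]


def unencrypted_hosts(lines: Iterable[str], query_name: str) -> Iterator[str]:
    """Yield the host identifier for each added row that is not encrypted."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the log file
        record = json.loads(line)
        if record.get("name") != query_name or record.get("action") != "added":
            continue  # a different query, or a row that was removed
        if record.get("columns", {}).get("encrypted") != "1":
            yield record["hostIdentifier"]


if __name__ == "__main__":
    for host in unencrypted_hosts(SAMPLE_LOG, DISK_QUERY):
        print(f"disk not encrypted: {host}")
```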
Once we had log data flowing to SumoLogic, it was pretty simple to create dashboards for realtime monitoring of the log results.
This worked fairly well as a lightweight approach to endpoint compliance monitoring, but there were some sharp edges we weren't satisfied with:
– Screensaver settings seemed broken for some versions of macOS.
– osqueryd required root/Administrator access to run as a system service, and to access almost anything of interest on Linux.
– Query performance was hard to troubleshoot, and extending Osquery was non-trivial.
– Steps needed to remediate individual endpoints were opaque to the end-user.
– We wanted this compliance data to flow through our JupiterOne platform.
What we really wanted was a lightweight, user-friendly way to notify the user when their machine was out-of-compliance, empowering and informing them with specific instructions for remediating just those bits that need attention. This tool shouldn't need excessive privileges in order to do that, and it should report its findings to our own JupiterOne platform for asset inventory, compliance reporting and security analysis use.
Iteration Three: Stethoscope + JupiterOne
Enter Netflix's Stethoscope-app. This project's goals and features aligned *perfectly* with our own needs.
Stethoscope empowers the user to maintain their own compliance readiness in a transparent and non-invasive way. It does not need administrative privileges to run, and won't accidentally launch CPU-consuming queries. In addition to the clean UI, it provides a GraphQL API running on localhost to interface with.
Stethoscope targets just the security compliance checks needed for common frameworks, and accepts a simple policy configuration as YAML or JSON that specifies what configuration values, application and OS versions, etc, should pass compliance.
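The idea of a declarative policy can be sketched in a few lines. The example below is a simplified model for illustration only: the setting names and the ALWAYS / NEVER / SUGGESTED keywords are assumptions made for this sketch, not a copy of Stethoscope's actual policy schema, and the evaluation function is hypothetical.

```python
from typing import Dict, List, Optional

# Illustrative policy; keys and requirement keywords are assumptions.
POLICY: Dict[str, str] = {
    "diskEncryption": "ALWAYS",
    "firewall": "ALWAYS",
    "remoteLogin": "NEVER",
    "automaticUpdates": "SUGGESTED",
}


def failed_checks(policy: Dict[str, str], device: Dict[str, Optional[bool]]) -> List[str]:
    """Return the settings on which the device does not meet the policy.

    ALWAYS passes only if the setting is known to be on, NEVER passes only if
    it is known to be off, and SUGGESTED never fails. A setting the device did
    not report therefore fails both ALWAYS and NEVER.
    """
    failures = []
    for setting, requirement in policy.items():
        value = device.get(setting)
        if requirement == "ALWAYS" and value is not True:
            failures.append(setting)
        elif requirement == "NEVER" and value is not False:
            failures.append(setting)
    return failures


if __name__ == "__main__":
    device = {
        "diskEncryption": True,
        "firewall": True,
        "remoteLogin": True,  # violates the NEVER requirement
        "automaticUpdates": False,
    }
    print(failed_checks(POLICY, device))
```

Keeping the policy as data rather than code is what lets the same checks be tuned per account without shipping a new build.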
When we started using Stethoscope in earnest, Linux was not supported. It was very easy to contribute to this open-source project and add the support we needed for our use. Big shout-out to Rob McVey at Netflix for being so responsive and providing such great feedback and help!
Rather than fork Stethoscope to add custom integration logic for JupiterOne, we chose to ship a sidecar agent, written in Golang for easy cross-platform distribution. This agent is responsible for:
– performing initial one-time activation/registration with JupiterOne
– retrieving the Stethoscope policy configuration associated with that endpoint's JupiterOne account
– daemonizing as a background system service
– periodically hitting the localhost GraphQL endpoint exposed by Stethoscope to scan the device
– reporting those GraphQL results back to JupiterOne
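The last two responsibilities above amount to a small polling loop. The real agent is written in Go; the sketch below uses Python only to show the shape of that loop, and it leaves out activation, policy retrieval, and daemonizing. The local URL, the reporting URL, and the GraphQL query are placeholders assumed for illustration, not the values the agent uses.

```python
import json
import time
import urllib.request
from typing import Callable

# Placeholder values: assumptions for this sketch, not the agent's real settings.
STETHOSCOPE_URL = "http://127.0.0.1:37370/graphql"
REPORT_URL = "https://example.invalid/endpoint-compliance"
SCAN_QUERY = "{ policy { validate { status } } }"

Poster = Callable[[str, dict], dict]


def post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def scan_and_report(post: Poster = post_json) -> dict:
    """Query the local GraphQL endpoint once and forward the result."""
    scan = post(STETHOSCOPE_URL, {"query": SCAN_QUERY})
    post(REPORT_URL, {"scan": scan.get("data", {})})
    return scan


def run_forever(interval_seconds: float, post: Poster = post_json) -> None:
    """Repeat the scan on a fixed interval, surviving transient failures."""
    while True:
        try:
            scan_and_report(post)
        except (OSError, ValueError) as error:
            # Network errors are OSError subclasses; bad JSON raises ValueError.
            print(f"scan failed, will retry: {error}")
        time.sleep(interval_seconds)
```

Passing the transport in as a function keeps the loop testable without a running Stethoscope instance. A scheduler would call `run_forever(3600)` or similar, with the interval taken from the policy configuration.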
On the administrative side, JupiterOne provides a simple configuration pane for specifying the policy each device should be bound to, how often it should execute, and which email addresses you'd like to send invite/activation emails to.
Stethoscope (via Electron) provides an easy auto-upgrade mechanism, and since the minimum required stethoscopeVersion may be configured via policy, users will be notified via Stethoscope's native OS notification mechanism when their agent requires an update. This minimizes hassle should a security update become necessary in Stethoscope itself.
Endpoint Security Analysis with JupiterOne
As a SecOps engineer, it is easy to ask JupiterOne for our current compliance status (names changed to protect the 'innocent'):
Find Person that OWNS Device that MONITORS HostAgent with compliant=false return tree
Looks like we have four non-compliant users. Drilling down into any of those graph nodes gives more detail, showing individual compliance check results. Let's pick host Stormbreaker:
Inspecting these results, we find the host above seems to allow remote login, which violates our policy. That might be interesting, but let's ask JupiterOne for all non-compliant users who also have access to AWS, which is definitely worth digging into:
Find HostAgent with compliant=false that MONITORS Device that OWNS Person that IS User that ASSIGNED Application with shortName = 'aws' return tree
Only one result now, and it's the same Stormbreaker host. Time to have a quick chat with this user, and perhaps set up an alert based on this query.
Of course, JupiterOne's powerful and expressive query language enables us to ask lots of other useful questions related to real-time endpoint compliance, and see those results in graph, table, or JSON format.
Conclusion
As a cloud-first, developer-heavy startup, finding the right tradeoffs for endpoint compliance has been an iterative process. We're quite pleased with the results so far, and excited about using and contributing back to the active Stethoscope open-source project going forward.
If you'd like to maximize your security efforts and reduce security operations complexity, start using JupiterOne. And if you'd like to work on security that matters, solving for real-world security, compliance, and operations issues – we're hiring!