If the best way to learn something is to teach it, then I’ve learned a surprising amount by guiding customers through on-premises deployments of our SaaS application.
Dealing with flaky DB migrations on Windows machines is a far cry from operating our fine-tuned AWS VPC. I’m glad we bit the bullet and deployed on-prem early on—it’s shaped our application and system design decisions in ways I couldn’t have imagined.
We launched our integrations platform, Kloudless, in 2014 as a multi-tenant web app on AWS. As we began to work with our first enterprise customers, we quickly realized that our SaaS application didn’t come close to meeting the various expected regulatory compliance and audit requirements. Several organizations simply couldn’t use Kloudless until we certified compliance with PCI and HIPAA, or underwent a SOC 2 audit.
As CTO, I challenged my team to find a solution. We soon introduced Kloudless Enterprise, the on-premises version of our cloud solution, to allow our customers to run Kloudless as part of their on-prem stack. Today, Kloudless Enterprise Docker containers, AMIs, and OVAs power applications used by Fortune 50 companies and some of the nation’s largest banks. This year, we crossed 500M upstream API requests a month.
Here are some of the major lessons I’ve learned along the way.
Kloudless Enterprise (<a href="http://www.chris.com/ascii/index.php?art=television/star%20trek" target="_blank" rel="noopener">source</a>)
\_ \ \----._________.----/
Kloudless Enterprise is installed and updated often, so our customers require this process to be completed quickly. Limiting the number of exposed moving parts exposed enables us to provide updates that are as simple as deploying new Docker containers. Developers can start out using local PostgreSQL and Redis services and configure external ones later for scalable production deployments. This structure is similar to GitLab’s Omnibus packages, which are a great example.
We tied licenses to individual release channels for images and containers since a patch release is sometimes required for specific customers or a specific version. Secret: We also provide Kloudless operators the ability to update individual application code packages to the next patch release within a container, immutability be damned. While this is an anti-pattern and is definitely not the recommended option, it’s occasionally proven to be the path of least resistance in dealing with necessary change management review processes.
Preserving all application state outside of the container, like licenses, encryption keys, and meta-configuration, allows updates to be seamless. Using external storage, such as Docker volumes and network-attached storage like EBS, preserves configuration during updates.
SLAs for Enterprise and on-prem deployments often require the ability to quickly diagnose issues, ideally with self-serve tools like command-line utilities and monitoring integrations. Even so, our engineering team has had to dive into remotely deployed Kloudless development clusters to walk operators through configuration and diagnose hard-to-reproduce issues. Making this as easy as troubleshooting any cloud environment is a major benefit. When things don’t go as planned, support staff need tools available to assist customers with.
We built command line utilities to do everything from reading and sending log data, to managing users, maintenance windows, and configuration updates. Our most frequently used tool authenticates an outbound connection from Kloudless Enterprise instances to a secure remote bastion in our network for engineers to assist via a reverse SSH tunnel. Definitely use autossh for this kind of stuff. This utility lets us access our customers’ instances behind firewalls at their request while also maintaining their compliance requirements.
Private Kloudless deployments reach upwards of several machines running 100+ cores combined. You can prevent unnecessary stress down the line by providing canonical scaling guides and items to consider.
Here are some issues we’ve seen while deploying privately hosted instances:
- Postgres running out of disk due to insufficient capacity planning for activity stream data.
- 10+ small AWS instances instead of 5 larger ones. Fun fact: AWS network throughput is based on instance vCPU, so adding more instances isn’t the answer to network bottlenecks!
- Unhandled network partition issues. We quickly switched from a stateful, primary/secondary clustering format using etcd to a masterless model. This was not only simpler for our customers’ Ops teams, but turned out to be more robust and better able to handle large deployments.
- Diverse deployment environments and tooling.
Our initial version had an interactive setup process that required less preparation, but turned out to be a stumbling block for large production environments. Non-interactive deployment improves automation as well as integration with existing tools like AWS CloudFormation and Docker Compose.
While on the topic of writing, I’ve noticed an entirely distinct set of documentation we’ve had to provide our on-prem customers. We expected to contribute API docs and information around setup, but we’ve also had to create far more documentation, including:
- SOPs and security policy docs to provide security and legal teams
- HA, scalability, and monitoring guides for operations
- Solution walkthroughs and examples of configuration combinations for engineering
- Detailed release notes and update paths
- System troubleshooting guides
It’s important to be upfront about all data transmissions and to ensure tools initiating outbound connections are opt-in by default to not raise the ire of a customer’s security team.
The worst deployment strategy would be to fork the cloud application for on-prem deployments. In addition to increased maintenance costs, this results in bugs that dogfooding would have otherwise caught. Introducing a feature flag that indicates whether the application is run on-prem helps to easily toggle functionality like license restrictions and alternate web application UI.
Cloud to On-prem is an upgrade away
But wait, there’s more: using the same application and data stack makes it easy to transition cloud customers to private deployments or on-prem solutions without a complex data migration. While environment-specific items like encryption keys necessitate a documented upgrade procedure, it is possible to simplify the transition by limiting the deviation in application behavior. Plus, sales teams love the frictionless path to increasing revenue.
Ask the front-end design team to sit this one out. A friend of mine at GitHub contributes to GitHub Enterprise, which inspired a lot of our design choices when building out our own solution. We heeded one valuable bit of advice to reduce the overhead of configuration. We scrapped fancy settings dashboard web UIs and just provided a YAML file to edit. This saved us an inordinate amount of time shipping features. We’ve also:
- had fewer questions than expected since the point-and-click folks stay away
- accessed the full suite of YAML data structures to represent state
- inherited configuration from org-wide, license-specific, and deployment-specific settings into each machine’s configuration
- been able to access and manage configuration easily via utilities like Salt
If you’d like a fun conversation with engineering and procurement teams at enterprises, then tell them they need to purchase even more on-prem software to actually view analytics, logs, monitoring, and more. Using third-party SaaS vendors works great for a cloud product, but on-prem software needs to come with batteries included for core functionality.
Fortunately, allowing a variety of open-source tools to be configured helps package functionality and prevent these concerns. For instance, Kloudless provides options to send system and application metrics to StatsD and InfluxDB as well as ship log data externally via standard protocols. This allows Kloudless to integrate with existing monitoring infrastructure with minimal effort.
Open source utilities also help users easily configure application security. For example, take advantage of Let’s Encrypt to provision SSL/TLS certificates for custom domain names quickly.
Security is like a sine wave that peaks throughout this blog post at periodic intervals.
Application security goes beyond technical solutions that satisfy engineering. For example, summaries of compliance with certifications and items like pen test results are equally valuable. Security posture is relevant at every stage since it is one of the primary reasons for deploying on-prem. Providing a well-written, thorough information security policy covering everything from SDLC processes like code reviews, to information on internal controls, helps satisfy customers’ voracious need for security-related collateral.
We’ve had multiple requests for items like SSO, access controls for team members, activity logs, IP whitelists, and more. Making these easily configurable allows for frictionless deployments. However, we’ve noticed that these haven’t been a deal-breaker in most cases—we released Kloudless Enterprise without several features popular with larger customers. This could be specific to our type of application, but I’m glad we chose to ship our product ASAP instead of building features we assumed would be required. Check out enterpriseready.io for more details on features frequently included in software for enterprise customers.
Side-note: We also adjusted our build pipeline to compile and obfuscate application code from the very beginning to protect a different kind of IP, but in hindsight, it is debatable as to whether this was even necessary.
At the end of the day, a customer is likely purchasing some concept of a license “key” to run the application privately. We’ve found that the minimum effort required here is to build in validation checks and detail what data they transmit.
Avoid technical solutions for payments early on
Let’s say you create a self-hosted version of your product and you have a great billing system running on your cloud version—even using the latest Stripe Billing API—with a clear licensing model. I’ve found it’s still best to invoice customers manually until you’ve identified a repeatable process. We avoided prematurely optimizing payments with click-through solutions that would get in the way of customized pricing schemes, especially since the volume of enterprise deals is relatively low early on. Unlike in SaaS apps, it usually isn’t an option to disable accounts due to a Net 30 invoice delayed a few days. License terms also vary from quarterly and annual subscriptions to multi-year contracts, so I recommend setting aside the monthly cron job or equivalent solution.
We built an internal dashboard to update licenses entirely independently of payment, and we avoided directing customers purchasing on-prem solutions to the billing options provided to cloud customers. It’s hard enough to build in feature flags to deal with varied license requirements; requiring a single payment flow will simply slow things down.
Kloudless integrates APIs, but also provides a Meta API to manage the API integration platform. My goal this year is to convince our engineering team to integrate Kloudless with itself. In all seriousness, programmatic manipulation of the application is a massive benefit for engineering teams deploying the solution on-prem. It enables workflows that the customer may not have even bothered bringing up earlier, like requiring key rotation every 90 days.
I highly recommend an API that manages all data in the application, especially provisioning and deprovisioning workflows.
Would we do it again?
Yes, but I’d probably charge more right off the bat. 🙂
Ultimately, our GTM strategy incorporated deploying our application where the customer felt comfortable using it; either our multi-tenant cloud, a private deployment managed by us, or entirely self-hosted VMs and containers. A key decision to make for vendors considering on-prem deployments is simply whether the overhead described above is worth the benefits it brings.