In the last few years, the engineering group at New Relic has grown quickly as we develop technologies to better help you monitor your applications and infrastructure. As we grow and add more infrastructure of our own, we face many of the same challenges and hurdles as you do. Some are more challenging than others (e.g., containerizing all the things); and some are, seemingly, as simple as migrating network time protocol (NTP) servers. Here’s the story of how the New Relic Cloud and Core Services team hunted down 300 rogue devices during an NTP migration earlier this year.
A brief history
For quite a while, New Relic’s NTP infrastructure, which maintains clock synchronization across the servers, load balancers, routers, and switches in our network, was run out of one datacenter on one virtual machine. But over time, as we added more production datacenters, it no longer made sense to point every device at that same VM. Additionally, VMs are notoriously inaccurate timekeepers, so we built three new NTP servers from bare metal and created a distributed NTP infrastructure.
When it was time to migrate, we decided each engineering team would be responsible for migrating its own devices. That way, each team could directly monitor its own migration and more easily respond to any issues that might arise. We gave everyone a set of instructions, but when the deadline hit, as many as 300 devices were still syncing their time from the old NTP server.
All these devices had to be found and pointed in the right direction.
Rejecting some options
So, we had to find 300 devices as quickly and efficiently as possible. We use New Relic Infrastructure to monitor our infrastructure: it gathers server statistics such as CPU, disk, and network usage, and it creates alerts when those statistics exceed their expected thresholds. For this migration, we wanted to take advantage of Infrastructure’s conditional reporting, but first we needed to figure out how to identify the devices that hadn’t been migrated. We considered and rejected several options before settling on our solution.
Here’s a quick rundown of the options we rejected:
- Rejected option 1: Run an Ansible task. Our first thought was to run an Ansible task across our infrastructure to read the contents of /etc/ntp.conf on each system. Unfortunately, to run the task, we’d have to log into each individual system, and that would take a considerable amount of time and manual work.
- Rejected option 2: Query sFlow data. During a previous project to change the DNS resolvers our infrastructure used, we wrote a custom integration for New Relic Insights to query sFlow data. We thought we could use this to see if any devices in our infrastructure were sending traffic to the old NTP server. However, NTP traffic is so infrequent and so small that sFlow doesn’t sample it. Even if we set the query to a point six months before the migration, we’d still receive inaccurate traffic reports.
- Rejected option 3: Run Tcpdump. We could run Tcpdump on the old NTP server to see which devices were still syncing from it. This would give us absolute certainty as to which devices had not been migrated, but then we’d have a long project of analyzing the output. To get the information we needed, we’d have to use Wireshark to analyze the PCAP (packet capture) file produced by Tcpdump and then compare an exported list of IPs against a list of hostnames. Even then, we’d still have to manually confirm each device’s configuration.
Our solution: Use Infrastructure’s Puppet integration to seek out rogue devices
Since we used Puppet to configure this part of our infrastructure, we suspected we’d be able to use this integration to our advantage. New Relic Infrastructure can gather data from Facter, which gathers various details (or “facts”) about any device that has a Puppet agent on it. When Infrastructure runs Facter, these facts are collected by Infrastructure and stored as inventory. So to extract a list of rogue devices all we had to do was write the following custom fact:
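    # Custom fact: report the NTP servers configured in /etc/ntp.conf
    Facter.add('ntp_servers') do
      setcode do
        # Collect every line that starts with "server"
        output = Facter::Util::Resolution.exec("grep -e '^server' /etc/ntp.conf")
        servers = []
        output.each_line do |line|
          # The second field of a "server <host> [options]" line is the host
          servers << line.split(' ')[1]
        end
        # Return the hosts as one comma-separated string
        servers.join(',')
      end
    end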
The fact read each device’s /etc/ntp.conf file and returned the NTP servers that system was pointed at. As soon as we deployed the fact to our Puppet infrastructure, devices started collecting the information we wanted. New Relic Infrastructure then gathered that data, and we knew which device was synced with which NTP server.
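The extraction step inside the custom fact can also be sketched in plain Ruby, without shelling out to grep. This is an illustrative sketch only, not the deployed fact: the helper name `ntp_servers_from`, the sample configuration, and the hostnames are all hypothetical.

```ruby
# Hypothetical helper (not part of Facter): returns the hosts named on
# "server" lines of ntp.conf-style text as a comma-separated string.
def ntp_servers_from(conf_text)
  servers = []
  conf_text.each_line do |line|
    # Match lines that begin with "server" followed by whitespace and a host,
    # mirroring the grep pattern '^server' used in the fact
    match = line.match(/\Aserver\s+(\S+)/)
    servers << match[1] if match
  end
  servers.join(',')
end

# Made-up sample configuration for illustration
sample = <<~CONF
  # Sample ntp.conf
  driftfile /var/lib/ntp/drift
  server ntp1.example.com iburst
  server ntp2.example.com iburst
CONF

puts ntp_servers_from(sample)
```

Keeping the parsing in a plain method like this makes it possible to exercise the logic against sample text before wrapping it in a fact.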
Finishing the migration
Once we had the final list of devices that had not been migrated, and knew which teams they belonged to, we wrote some simple logic into our Puppet code that let each team opt in by putting a fact on its rogue devices. This fact triggered Puppet to change the device’s NTP configuration to point at the new NTP servers. And, of course, we gave the teams another deadline.
When this deadline passed, we checked New Relic Infrastructure to see which devices had still not been migrated, and there were a few. We then used Puppet to force the new NTP configuration onto each of these rogue devices.
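Checking the collected data for stragglers amounts to filtering hosts by the value of the `ntp_servers` fact. Here is a minimal sketch, assuming the inventory has been exported as a hash of hostname to fact value. The export format, the hostnames, and the `rogue_hosts` helper are hypothetical and are not part of New Relic Infrastructure’s API.

```ruby
# Hypothetical helper: returns the sorted hostnames whose ntp_servers
# fact value still lists the old NTP server.
def rogue_hosts(inventory, old_server)
  inventory
    .select { |_host, servers| servers.to_s.split(',').include?(old_server) }
    .keys
    .sort
end

# Made-up inventory: hostname => comma-separated ntp_servers fact value
inventory = {
  'web-01.example.com' => 'ntp-old.example.com',
  'db-01.example.com'  => 'ntp1.example.com,ntp2.example.com',
  'lb-01.example.com'  => 'ntp1.example.com,ntp-old.example.com'
}

puts rogue_hosts(inventory, 'ntp-old.example.com')
```

Splitting the fact value on commas, rather than doing a substring match, avoids flagging a host whose server name merely contains the old server’s name.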
In the grand scheme of things, performing an NTP migration isn’t the most difficult task a site reliability engineer (SRE) will ever face. But the difficult jobs we face are also made easier with New Relic Infrastructure. For instance, it’s been especially valuable for inspecting packages installed on our systems when security vulnerabilities are announced. In such cases, we’re able to quickly query our infrastructure to discover any affected systems or devices. We can move straight into patching affected systems because we’re not spending time determining where we’re exposed.
Check out New Relic Infrastructure: it has plenty of out-of-the-box integrations and a full SDK to get you started monitoring your infrastructure faster and deploying with total confidence.