"User tracking" is generally contentious in free-software communities—even if the "tracking" is not really intended to do so. It is often distributions that have the most interest in counting their users, but Linux users tend to be more privacy conscious than users of more mainstream desktop operating systems. The Fedora project recently discussed how to count its users and ways to preserve their privacy while doing so.
Ben Cotton brought up the topic in the context of a proposal for Fedora 30. Instead of the current method of counting unique IP addresses that request updates from the DNF mirrors, which is an unreliable estimator of Fedora usage, the proposal would create a universally unique identifier (UUID) for each installed system that would be sent with DNF mirror-list requests. It explicitly calls out privacy concerns: "We don't want to track; just count."
The proposal outlines the kind of information that the project would like to count, including the version of Fedora, the Fedora variant (or spin), and the architecture of the machine. It would also be useful to have some way to distinguish long-lived installations from one-off test systems in virtual machines. Currently, variants cannot be distinguished and the unique IP counting method both undercounts systems behind network address translation (NAT) and overcounts systems that change IP addresses frequently. The UUID is similar to what openSUSE uses, so "this is ground already traveled".
Using the machine ID (stored in /etc/machine-id) as the UUID is not part of the plan, since it may be used in other ways that would facilitate tracking. So some kind of random UUID would be generated for this purpose. But, as Lennart Poettering pointed out, sending a UUID makes tracking possible even if the project doesn't want to do that tracking. Essentially, users would need to trust that the project isn't doing the tracking because it says it isn't. While he was skeptical that Fedora really wanted to use a UUID that way, he did suggest using an application-specific machine ID, like those calculated by sd_id128_get_machine_app_specific(). That way, Fedora would be using an existing mechanism that generates a UUID using the machine ID and an ID specific to the counting application.
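To illustrate the idea of an application-specific ID, here is a minimal sketch in Python. It assumes an HMAC-SHA256 construction keyed by the machine ID and applied to the application ID, which is how the systemd documentation describes sd_id128_get_machine_app_specific(); it is not the systemd code, and the machine ID and application IDs below are made-up values.

```python
import hashlib
import hmac
import uuid


def app_specific_id(machine_id_hex: str, app_id: uuid.UUID) -> uuid.UUID:
    """Derive a per-application ID from a machine ID (illustrative only)."""
    key = bytes.fromhex(machine_id_hex.strip())
    # Keyed hash: without the machine ID, the result cannot be linked to
    # IDs derived for other applications on the same machine.
    digest = hmac.new(key, app_id.bytes, hashlib.sha256).digest()
    # Keep 128 bits and set the version/variant bits so the result is a
    # well-formed version-4-style UUID.
    return uuid.UUID(bytes=digest[:16], version=4)


# Hypothetical inputs; a real system would read /etc/machine-id.
EXAMPLE_MACHINE_ID = "0123456789abcdef0123456789abcdef"
COUNTING_APP_ID = uuid.UUID("6f1c2b9e-3d4a-4c5b-8e7f-1a2b3c4d5e6f")
OTHER_APP_ID = uuid.UUID("a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d")

if __name__ == "__main__":
    print(app_specific_id(EXAMPLE_MACHINE_ID, COUNTING_APP_ID))
```

The derived value is stable for a given machine and application, but two applications on the same machine get unrelated IDs, so the counting ID would not double as a general-purpose machine identifier.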
Poettering also mentioned that Ubuntu counts installations via NTP, which might be an option if Fedora wanted to run its own NTP servers. Both Ubuntu and Fedora configure their systems to regularly ping the NTP servers. Another possibility would be to send a "countme" flag once a day as part of the captive-portal and connectivity detection that is already installed with Fedora, but that did not sit well with Kevin Kofler. He called the existing NetworkManager-config-connectivity-fedora package "spyware" and does not install it on his systems. Fedora project leader Matthew Miller (who is also the owner of the feature proposal) said that the connectivity check could be used but it would only count a subset of desktops and not other types of installations, such as server, cloud, or container. In addition, setting up NTP servers would be much more work than hosting a UUID-counting service, he said.
Miller said that the intention is to rotate the logs "fairly frequently", but that is not really visible to users so there is still a trust factor present. But Tom Gundersen suggested another approach:
Then you make sure that all UUIDs submitted by a given machine during a given time window are the same, but UUIDs submitted in different windows are not related, and you don't have to trust the server to respect your privacy.
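One way to get the property described in that quote can be sketched as follows. The construction is an assumption for illustration, not necessarily what was proposed on the list: each reported ID is derived from a secret that never leaves the machine and from the index of the current time window. The window length and all names here are hypothetical.

```python
import hashlib
import hmac
import uuid

WINDOW_SECONDS = 7 * 24 * 3600  # assumed window length: one week


def window_id(local_secret: bytes, now: float,
              window: int = WINDOW_SECONDS) -> uuid.UUID:
    """Return the ID this machine would report during the current window."""
    index = int(now // window)
    # The secret stays on the machine; only the derived value is sent.
    digest = hmac.new(local_secret, str(index).encode(), hashlib.sha256).digest()
    return uuid.UUID(bytes=digest[:16], version=4)


if __name__ == "__main__":
    secret = bytes(range(32))  # placeholder; a real secret would be random
    print(window_id(secret, 1_000_000_000))
```

Within a window the server sees the same ID on every request and can deduplicate; across windows the IDs cannot be linked without the local secret.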
That approach would "make sense", Poettering said, though he still advocated using NTP or the "HTTP ping" that is done as part of the captive-portal detection. Others, such as Bruno Wolff III, are worried that even if the UUIDs are changed frequently, users still have to trust Fedora (or someone who gained access to the logs) not to correlate UUIDs, IP addresses, and other information to track users that way. Beyond that, Nicolas Mailhot is concerned about interaction with the EU General Data Protection Regulation (GDPR), which requires a shift in thinking about how data can be misused:
That's what the GDPR is about. It's *your* responsibility as data collector to think about how data could be used, it's *your* problem to protect it, it's *your* problem if it's misused, you can not make it available on a platter for others to do evil things with and claim it's those people's problem.
Wolff also pointed out that attackers may try to send UUIDs that are unexpected. Those could be generated to try to attack the system in some way or they could simply be strings containing profanity or other "not safe for work" (NSFW) content. He wants to ensure that the actual UUID strings don't end up in reports or require review by humans. Even ensuring that the strings are valid hexadecimal doesn't preclude inventive usage that could embarrass the project or offend people. Beyond that, UUIDs could be changed more frequently to try to inflate the statistics.
As these privacy and other problems with the UUID scheme were being discussed, Poettering came up with a scheme that alleviated most of the problems that were identified. He proposed that a "countme" flag simply be added to a single mirror-list query each week. The sum of all such queries over a week's time should provide an accurate estimate of the number of Fedora systems. That way, UUIDs need not be stored, which removes much of the concern—data that is not stored cannot be misused.
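The weekly-flag scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not DNF's implementation: the Client class, the parameter names ("repo", "arch", "countme"), and the use of epoch-aligned weeks are all hypothetical.

```python
from typing import Dict, Iterable, Optional

WEEK_SECONDS = 7 * 24 * 3600


def week_index(now: float) -> int:
    """Whole weeks since the epoch; used only as a local marker."""
    return int(now // WEEK_SECONDS)


class Client:
    """Hypothetical client that remembers only the last week it was counted."""

    def __init__(self) -> None:
        self.last_counted_week: Optional[int] = None

    def mirror_list_params(self, now: float) -> Dict[str, str]:
        # Parameter names and values are illustrative assumptions.
        params = {"repo": "fedora-30", "arch": "x86_64"}
        current = week_index(now)
        if self.last_counted_week != current:
            params["countme"] = "1"  # flag exactly one query per week
            self.last_counted_week = current
        return params


def count_systems(queries: Iterable[Dict[str, str]]) -> int:
    """Server side: sum the flagged queries seen during one week."""
    return sum(1 for q in queries if q.get("countme") == "1")
```

The only state lives on the client, and the server needs to keep nothing beyond a running total per week.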
Poettering followed up by noting that avoiding even the appearance of tracking will likely result in fewer users disabling the counting mechanism. Miller was enthusiastic about the idea; he suggested that since there would be no UUID associated with the information, the "countme" flag could increment once per week, which would give some additional information about the longevity of systems—without providing much information that could be used for tracking.
It would not even be necessary for every machine to report, Roberto Ragusa suggested. Machines could decide whether to report based on some property of their machine ID (e.g. whether it divides evenly by 1000) or by combining the machine ID and the date so that the counted systems would change over time. Then the counts could simply be multiplied by the modulus to provide the actual estimate.
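A sampling rule of that kind might look like the sketch below. The hash-based selection, the function names, and the date format are assumptions made for illustration; the thread only describes the general idea.

```python
import hashlib

MODULUS = 1000  # roughly one machine in a thousand reports


def reports_by_id(machine_id_hex: str, modulus: int = MODULUS) -> bool:
    """Fixed sample: report only if the machine ID divides evenly."""
    return int(machine_id_hex, 16) % modulus == 0


def reports_on(machine_id_hex: str, date_iso: str,
               modulus: int = MODULUS) -> bool:
    """Rotating sample: mix in the date so the selected set changes."""
    digest = hashlib.sha256(f"{machine_id_hex}:{date_iso}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus == 0


def estimate_total(flagged_count: int, modulus: int = MODULUS) -> int:
    """Scale the sampled count back up by the modulus."""
    return flagged_count * modulus
```

The decision is made entirely on the client, so the machine ID itself is never transmitted; the server sees only the flagged requests and multiplies.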
Overall, there were few complaints about the simpler counting mechanism. Miller has updated the proposal using Poettering's method; it should be posted to the mailing list soon, once he receives some feedback from the DNF developers. It seems likely that Fedora 30 will have the feature when it is released, which is currently scheduled for the end of April.
We have looked at other user-counting initiatives and proposals along the way. In 2010, there was a proposal to add UUID tracking to Yum, but Fedora has been trying to figure out how to unobtrusively count users for longer than that. A 2006 scheme involving a tracking image was proposed for Fedora Core 7. More recently, the Django web-framework project discussed adding analytics that would report to Google servers, which was not popular with Debian (at least).
There is a certain amount of tension between the needs of a distribution or software project and the needs of users—especially when it comes to privacy issues. Being able to show the existence of more project users will generally lead to a higher profile and potentially more funding for development and other activities. Counting variants can also help projects make better decisions about where to allocate their scarce resources. But many users do not want to be tracked, though they may be willing to be counted. This Fedora proposal seems like it finds a reasonable balance by reusing an existing mechanism without adding something that could be tracked. It will be interesting to see what Fedora finds once it rolls out this counting feature to users.