Guys, we're here to talk about Octobox today, but let's talk about Libraries.io and where you guys have been over the last couple of years, just to catch everybody up. Andrew, we have had you on the show a couple of times, I think The Changelog once, talking about 24 Pull Requests and Libraries.io… Long time ago, I think episode #188. We'll link that up. Also, you were on Request for Commits back in episode #3, all about measuring success, with Arfon Smith, and of course, Libraries was/is all about measuring things… So we've had you on – Ben, we haven't had you on the show before, but happy to have you… Tell us, before we get into Octobox, what's up with you guys, what's up with Libraries… Give us just the recent history of what you all have been up to.
Oh yeah, I totally forgot about that.
Forgot about Libraries.io, forgot about being on the show before…? What.
No, sorry; I totally remember being on the show… It's great to be back. It's been a bit of a crazy year in terms of Libraries. I think the last time I spoke to you I had been working on Libraries on my own, kind of in my spare time, building it up from scratch. And I met Ben, actually I think during a 24 Pull Requests event… Was it 24 – or, definitely related to 24 Pull Requests.
Yeah, it was at a local Ignite night, a lightning talk session… I think you did a talk about 24 Pull Requests, and at the time I was working on another project that kind of became the Core Infrastructure Initiative, and it was like "Ah, that seems perfect!" and then we kind of started talking at that point.
Tell everyone real quick what 24 Pull Requests is… Since it is the season right now, go ahead and just give that.
It is the season. 24 Pull Requests is in its seventh year now. 24 Pull Requests is basically trying to encourage other developers and open source users to contribute back and give little gifts to the maintainers of those projects that they've been benefitting from all year round, and kind of trying to get that swarm of people working together to bring in the holiday spirit with software.
This year we've actually kind of changed [unintelligible 00:03:45.04] a little bit. In previous years it was literally "Try and send 24 pull requests during the 24 days in December, on the run-up to Christmas." This year we've opened it up to all kinds of contributions. It's still called 24 Pull Requests, but it's really 24 contributions to open source software in any way that you consider to be a contribution; that might be writing a blog post, or answering Stack Overflow questions, or running an event, or speaking at a conference, or even doing a podcast episode on a particular bit of open source would be considered a contribution. So you can record those alongside your pull request, as different ways of showing how you've contributed back to those open source projects that you have benefitted from all year.
[00:04:32.16] That's awesome. I'm happy to hear that you've made it more inclusive. Something we talk about often and stress is the importance of non-code contributions to the open source community; they're paramount, and they're so valuable, so that's pretty cool to see it moving beyond pull requests to more things. Very cool. 24pullrequests.com, check that out. You have a couple more weeks to get involved before the month is out, so check that out.
Ben, you were saying that you all met at one of these 24 Pull Requests events back in the day. Andrew, you were working on Libraries by yourself… Take us from there, guys. What happened next?
So at the time I was working at a civic tech company called mySociety, and I'd got pulled by Ben Laurie into a group of workshops that became the Core Infrastructure Initiative, which was built out of the fall-out of the Heartbleed vulnerability. I was compering, effectively, an event, and 24 Pull Requests was featured, and Andrew started talking about Libraries and how he was mapping the relationships between projects and their dependency trees. At the same time I was trying to work out how to highlight projects that were like OpenSSL, that were part of this digital infrastructure concept that hadn't really had much thought contributed to it, so there wasn't really a standard for that… And I thought "Well, actually, that's pretty much the same thing." Andrew was coming at it from the angle of exploring open source for projects to use and contribute to, and I was looking at it from the perspective of "What's the next thing that could potentially blow up and cause a massive problem for users of open source software, i.e. everyone?" And yeah, we just got talking after that, and it kind of built from there. Two aspects of the same technology, brought together.
I recall you guys got a grant or some sort of funding to work on libraries for maybe a year or 18 months, and you guys worked on it together… We met at the Sustain event; Sustain 1… Sadly, I didn't make it out to Sustain 2 this fall… But we met in San Francisco at GitHub headquarters, and at that time the grant was just about running out, or the time period for that was just about running out. You had been doing Libraries.io (both of you) for a while, and you were looking at what was next and it seemed like what happened next was this move into Tidelift, and working with them… And now you're working at Octobox. Give us just the 30-second version of that history, and… I just wanna highlight your guys' path together and through these different projects, landing us on what we're talking about today, which is Octobox, which seems like it's a different thing altogether than what you've been working on for the last couple years… Which is interesting to see how you got here.
[00:07:39.04] The long story short is that we were excited by Tidelift's high-level mission, but when it came down to actually trying to get work done and move towards achieving some of those goals, we kind of clashed with the founders quite heavily. There was a lot of frustration, and we ended up being pushed out of the company… Which has left us in a situation where we can't work on Libraries for, at this point, another six months. Libraries still remains open source. It's AGPL licensed for exactly that kind of reason, as a protection to ensure that it always remains open… And the Open Data release that went out just before we left, in – I can't remember the exact date, but we can always get that added to the show notes – was also under a Creative Commons license. So we kind of have that sat, waiting for the ability to pick that up and work with it again, possibly fork it off, so that we can continue to do that… But we kind of had our hands tied a little bit, of what can we do, to still help the developer community, and try and look at sustainability from a slightly different angle - rather than directly the financial sustainability, also kind of thinking about developer burnout as an important part of that sustainability.
Octobox had been around for a year and a half, nicely ticking away, and it felt like a good place to jump to, where we can try out some of our ideas and get back to actually shipping software, and then try and also making Octobox itself a sustainable open source project.
So Octobox (Octobox.io) is a project "Untangle your GitHub notifications." So this is a separate client/tool/resource in order to do a better job of dealing specifically with the overwhelming amount of notifications that many maintainers get via this open source project that, Andrew, you began, like you said, maybe a year or two back. Was this just something that you were scratching your own itch, and doing on the side? Because Libraries was your main thing, but Octobox - you put it out there and it seems like it's really been attractive to people.
Everything is always connected. Octobox was inspired by the need to do something to keep on top of 24 Pull Requests, which every December comes around and kind of punches me in the face with the amount of people that jump on it… Right now I think it's 22,000 developers active on the 24 Pull Requests website this year. It's always a flood of issues and pull requests to the project itself, as well as pull requests to my other open source projects… And I was just left feeling like I had no way of knowing what I was supposed to be doing.
Also, the way that GitHub notifications have worked for the past few years is once you look at a notification, it's gone, and you can't get it back, unless you drive it entirely from e-mail, which really doesn't work for me as someone who is a terrible e-mail manager. I kind of wanted to separate those two pieces, but… You kind of get the fear then. If you know that when you look at an issue you might not be able to find it again, or remember that you didn't solve it right away, then you kind of don't wanna look at it, and you leave those things there, like "Oh, eventually I'll get to this important issue, but if I look at it, I'll forget that it's something I need to do."
[00:11:43.22] So Octobox started off as a simple idea to be like "Let's have a kind of archived state for notifications." It pulls in your notifications over the API, and then it basically says "Everything is unarchived", or is in an inbox, and then you choose when you are done with those issues, or pull requests, or release notifications, or all of the kinds of things that you could get notified on GitHub… Then that instantly gives you back a level of control, of going like "Okay, now I know that none of my notifications are ever gonna disappear. Even after I've read them, I've still got a full list", and then we started to layer on different ways of slicing and dicing those notifications. Because now you're looking at a list that never goes away; you're like "Okay, I've got 1,000 different things here that are all incoming towards me. Let me filter by pull request, let me then filter that by pull requests that have already been merged, or have been closed, and I can throw those away; I'm fairly confident they're not needing any further action from me, especially if I haven't been mentioned on them since."
We get some information from the GitHub API that says the last reason that you got notified, which could be you were subscribed to this repo, or you got mentioned, or you were assigned to solve that particular issue… And so you basically end up with every possible different way that you can filter down those notifications, to really triage and drive through the list in an effective way, and actually leave you with the things that you haven't done yet, but still need to do… Almost like a to-do list, which then you can work from; new things come in at the top, or an existing issue that you've marked as Done, archived, will actually pop back up, in the same way that an e-mail would in your e-mail client. And the whole thing is heavily inspired by Gmail's interface.
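The archive-and-filter model Andrew describes could be sketched in plain Ruby. This is purely illustrative - the `Notification` struct, the field names, and the `Inbox` class are invented for this sketch and are not Octobox's actual schema:

```ruby
# Hypothetical model of the Octobox idea: notifications are never
# deleted, only archived, and the inbox is just a set of filters.
Notification = Struct.new(:subject_type, :reason, :state, :archived,
                          keyword_init: true)

class Inbox
  def initialize(notifications)
    @notifications = notifications
  end

  # The inbox is simply everything not yet archived.
  def unarchived
    @notifications.reject(&:archived)
  end

  # Slice and dice: e.g. merged or closed pull requests you were never
  # mentioned on are probably safe to archive in bulk.
  def filter(subject_type: nil, reason: nil, state: nil)
    unarchived.select do |n|
      (subject_type.nil? || n.subject_type == subject_type) &&
        (reason.nil? || n.reason == reason) &&
        (state.nil? || n.state == state)
    end
  end

  # Marking something done never loses data - it stays queryable.
  def archive(notification)
    notification.archived = true
  end
end
```

The key design point is that `archive` only flips a flag; nothing is ever removed, which is exactly the "notifications never disappear" guarantee described above.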
Yeah, it looks very much like Gmail. It's funny that it happened because of 24 Pull Requests, because I just recently had such a scenario during Hacktoberfest… You know, I've maintained small open source libraries for years and I've always had kind of a trickle of things, or other people's issues that I'm involved in, or pull requests of my own on other people's projects, but it's never been an overwhelming thing with GitHub's notifications for me, and I am an Inbox (0) kind of person, so I've just always managed it via just another thing in my inbox. However, I've made the "mistake" (air quotes around mistake, because it was awesome), I accidentally promoted our transcripts repo as a really easy way of getting those Hacktoberfest pull requests opened up. I even said we had the fastest Merge button in the West… Just joking around, but setting myself up to do a lot of work in October.
So during the month of October - and just thank you to our audience, everybody who got involved, because I only kid, it was awesome how many people came out and helped us make our transcripts more awesome. I think I merged over 300 pull requests in that month alone, and my inbox was just completely overrun by notification e-mails.
So I can definitely commiserate with you with regard to 24 Pull Requests. Now, I'm not quite as industrious as you are; I didn't say "Okay, I'm gonna solve this problem." Of course, I did know Octobox was a thing… But I knew also that October only had 31 days, and after all those Hacktoberfest T-shirts got sent out, everybody would stop contributing to our transcripts repo… To the point where it's back to a trickle, which is kind of how I like it.
But it's hilarious - you have this problem; for you it's gonna happen once a year, and it inspires you to create a tool that now is helping out lots and lots of people, so that's pretty cool.
[00:15:47.08] Yeah, I mean, I've kind of put up with it for 2-3 years, and then as Libraries started to take off, Libraries spread across (I wanna say) something like 20 different repositories as well… So there was a certain amount of automated actions to tell me that there were updated versions of dependencies as well. For lots of people who have Dependabot or similar services plugged in to their repositories, they're gonna get regular amounts of updates - either pull requests or issues - telling them that there's something to update…
So the number of projects you have kind of multiplies the amount of notifications you get, and it can quickly kind of – the thing I didn't want to do was to stop working on those projects. That would be the other way to solve my notification problems - it'd be like "I can't do this much work." But it felt like a solvable problem to actually enable me to effectively handle more things, rather than it being like "I'm overwhelmed by the amount of humans incoming." Actually, the tooling was failing me, so I kind of just tackled it in a similar way that I tackled 24 Pull Requests or Libraries, which is "Let me see if I can spin up a basic Rails app that does just enough to get by, open-source it", and encourage other people, if they have the same problem, to dive in and add their take to it, or to drive it in a way that they feel would improve it.
We've had something like 80 different people contribute some very significant features and design work to Octobox over the past couple of years, that have made it into this kind of really nice, well-rounded, solid tool, that people actually really depend on to get work done now.
Were you surprised by how many people shared in this problem space in terms of needing Octobox the way that you did, considering how prolific you are with the open source work, and then 24 Pull Requests just getting so much attention, as well as Libraries being spread [unintelligible 00:18:01.01] but that's another reason why it hasn't been too much of an issue for me - we have very few repos in terms of number count… But when you have a single project that has maybe 20 repos, it's very hard to track everything.
One thing that surprised me - maybe it doesn't surprise me, but… Well, it does, but it shouldn't. I'm live-thinking this through… It's just the amount of people that have been like "This is THE thing that I've been waiting for. This solves a huge problem for me", which goes to show how many maintainers out there are really drowning in our inboxes, or drowning in our notifications.
The thing that surprised me the most is that the amount of notifications I get is tiny compared to people who are maintaining projects as big as VS Code, or Electron. They get absolutely drowned in notifications, and have already had a number of systems in place. Mike McQuaid, who is the maintainer of Homebrew - Homebrew is one of the most active repositories on GitHub, or at least the Formula repository, with updates coming in for new versions of different formulae multiple times a day… And then actually having the codebase move forward and having to keep up with all the macOS updates…
He's had to build up sets of clever Gmail filters just to be able to compartmentalize all of those different things, so that it doesn't just overwhelm him. That felt like such a – it's a clever hack, but it's such a hack to have to use Gmail to augment the features in GitHub; especially as someone who's kind of allergic to e-mail, that was not particularly useful for me, and it felt like having something that was specific to the developer problem was kind of what (I guess) kept my interest in the project past that simple "Oh, I've got a basic thing working here."
[00:20:13.20] 4.74 million notifications managed, and counting. That's quite a few. Ben, you were gonna say something? Go ahead…
I was gonna say as well - in my opinion, Mike has also got a lot of procedure and process around how he deals with people that sometimes he gets some flack for, but it also kind of works as the maintainer of one of the most active projects on GitHub. So it's not just the tooling, but there's also an attitude and a way of thinking about a project that means that he can get effectively quite a lot done.
I think, to be fair, he gets a lot of flack for that, but it's part and parcel of the problem, and this is one of the extra things that people don't necessarily think about… It's not just the tools - the tools are there to support - but also it's how you deal with the project and how you deal with the people in that project, as well.
I thought I'd say that just because I know Mike gets a lot of flack for that sometimes, but it's difficult for him, and you have to–
Yeah, exactly. You just have to deal with it.
He's got a number of ways of almost shielding himself from the onslaught.
Yeah. I mean, the thing is, if you wanna take the conversation back just a minute - you started off at the top by saying that compared to Libraries, Octobox seems like a project that's quite separate in terms of what it's trying to achieve, but the thread [unintelligible 00:21:41.27] is for a month your life was full of pull requests with Hacktoberfest, right? And imagine if that was your every day. The thread that pulls through this whole thing is we wanna try and help solve the problem that exists today for maintainers of popular open source packages… And Octobox is part of that. Octobox is one of the tools that helps solve people's problems today… So I would say there's a pretty strong thread that links from Libraries through Tidelift into Octobox, in that respect.
Yeah, definitely in the spirit of what y'all are up to, for sure. I agree with that. I think I was referring to it in terms of functionality it just seems like a different thing altogether, but in terms of what you and Andrew are doing with your life's work these days, I absolutely see those ties, for sure.
Speaking of people who do this every day, as opposed to just during October, we went to Microsoft Build last spring, and we were speaking with the VS Code team. Andrew, you mentioned how they have to deal with a lot of issues… And I'm not sure if this conversation made the final show or not, or if we just had it after we finished recording with them, but they were sharing with us some of the lengths that they go through just to triage their issues… Not even to deal with them necessarily, but to be the incoming person who labels, and assigns for code review, or closes things that are off-topic and whatnot… They have a full-time employee there. They transfer ownership of this triage position on the VS Code issues, similar to how you'd be on like PagerDuty. This is a full-time thing that they're adding to the other work that they're doing, just to maintain the status quo. That doesn't mean to get down to zero issues, it means just not to let it explode into thousands and thousands, so… Yeah, there's lots of people with these problems.
[00:23:41.17] Yeah, the Microsoft team - especially the Microsoft Open team - are one of the biggest users of the hosted version of Octobox.io. The amount of stuff they get coming in is just overwhelming just to look at it, and I'm not even involved.
Talking about that triage and the way that they manage that - we were actually talking about potentially a feature to enable that within Octobox, to kind of have the ability for a team who have a lot of incoming support requests, or activity from external people – rather than, say, an internal team that are mostly making progress on their own work, with their issues managed in something more like a project management tool. You get kind of an interesting shared inbox, despite Octobox looking like it's literally just a tool for the individual developer, entirely based on their context and their view of all of the work that they're involved in. You could actually get to the point where a team could almost triage and work through a certain amount of the other team members' inboxes for them.
You can imagine, like, "Oh, I've seen these five new issues come in, so I label them up, I close the ones that didn't make any sense, and I commented to ask for more details on these ones." Actually then, for the other team members, they could filter their inbox in Octobox to go "Well, lower the priority of everything that has had someone else on the team go through and touch these things", thereby leaving me with things that haven't been touched yet, or things that I'm involved in the conversation. Effectively then, the team has the ability to share the work and pre-filter for each other, so especially if you're distributed across timezones, you're able to essentially treat it almost like a help desk, without needing to literally use a help desk piece of software, because everyone still reports their issues via GitHub issues.
So let's talk about how Octobox works, and that will lead us into what y'all are doing with the Octobox app, and the GitHub marketplace, and trying to make a real go at this. Andrew, you said you started up a Rails application… I'm on Octobox.io, I can sign in right here; I assume you can also run this on your own server, maybe you've got a "Deploy to Heroku" button… Tell us about the way it works and the way people use it, and then we'll go into where it's going from there.
Yeah, so… It's interesting that you said it's slightly different from the other projects that I've worked on in the past, like Libraries.io and 24 Pull Requests… Octobox has a different challenge: Libraries.io and 24 Pull Requests are each kind of the one instance that's running online. You can run it yourself, but it's not designed for that – it works best when there's multiple people all using the same thing, the network effect, and all of the data is in that one place.
[00:28:06.22] With Octobox, it was kind of designed from the start for anyone to be able to spin up their own version, and that's mostly from a privacy point of view, that you just might not want to give me access to your notifications, or to have them stored on Heroku - you might just wanna keep those things to yourself - as well as enabling people to use it for their GitHub Enterprise installation. So you can actually point your version of Octobox at your own company's internal GitHub Enterprise, or GitHub.com, and then suck down all of your notifications from there.
Shopify are big users of their own hosted instance. I think they had something like seven million notifications across their internal team on their GitHub Enterprise installation, which was like "Whoa…"
That's actually bigger than Octobox.io's installation.
Yeah. That's a big company.
And another big user is GitHub itself. They run their own internal Octobox instance, which gets used a lot… Which is really surprising, because it's kind of an admission that notifications aren't as good as they could be.
Yeah… We'll definitely get into that when we get to the business side, because I've got questions there… But yeah, continue with this instance thing that you were telling us about.
So the main way that most people would deploy Octobox is using Docker. It has a docker-compose file that will basically group everything up - Postgres, Redis, the Rails app - and stand it up in basically one command, in a similar way to deploying to Heroku, and that can be configured as a GitHub App, OR… So this is where it gets – it's kind of perhaps why no one else has built this before, because the GitHub permissions get really, really weird around notifications. A lot of the GitHub permissions APIs are designed around individual repositories; the GitHub App - it's not the GraphQL in particular, but the new GitHub App style setup specifically for the Marketplace - is designed entirely around "You install this integration into a single repository or multiple repositories within an org."
The notifications API is different to that, because it's based entirely on the user that enabled it… So it spans across every repository that that user has access to. So to be able to download a user's notifications and then also be able to pull in extra information - the status of an issue, or the labels on a pull request, or whether your CI is passing or failing on a pull request - we actually then need to go and hit the individual endpoints for each of those bits of data. The notifications API is not available via the GraphQL API, so we can't batch those lookups into a single query; we end up with an n+1 set of requests.
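To make that fan-out concrete, here is a sketch in plain Ruby. The payload shape loosely mirrors the REST notifications response (each item carries a `subject` with a `type` and an API `url`), but the helper names and the `fetcher` callable are invented for illustration:

```ruby
# Hypothetical sketch of the n+1 problem: one call returns the
# notification list, but each subject (issue, PR, release...) must be
# fetched separately to get its state, labels, CI status, etc.
def subject_requests(notifications)
  notifications.map do |n|
    { type: n["subject"]["type"], url: n["subject"]["url"] }
  end
end

# One request per notification on top of the list request - the extra
# round-trips a batched GraphQL query would normally collapse, if the
# notifications API were exposed there.
def enrich(notifications, fetcher)
  subject_requests(notifications).map { |req| fetcher.call(req[:url]) }
end
```

In a real client, `fetcher` would be an authenticated HTTP call; here it's just anything that responds to `call` with a URL, which also makes the fan-out easy to see and test in isolation.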
And you also then kind of have to work out, like, "For each of these notifications, do I have the ability to pull in the extra data for each one of these subject types?" So it gets slightly complicated in the different ways that you can configure it. And actually, if you run it yourself, the simple way is to plug in your own personal access token, and that enables everything, because the personal access–
That's always the easiest way, isn't it?
[00:31:54.04] Yeah, it gives you full permissions and doesn't require permission from the owners of the organizations that would be the gatekeepers to installing the GitHub App. But the GitHub App does come with the nice benefit of webhooks, so we can actually listen for any changes to issues or pull requests and instantly react by syncing that data in and updating your notifications, as well as using it as a hint to update other people's notifications that may be affected by the same event… So it kind of speculatively syncs people's notifications as it hears each webhook event, and that has made everything seem like it all just happens magically.
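The speculative sync Andrew describes could be sketched like this - all names are invented for illustration and this is not Octobox's actual code:

```ruby
# Invented sketch: a webhook delivery about one repository becomes a
# hint that every user watching that repo may have a stale
# notification list, so we enqueue a sync for each of them.
class SpeculativeSync
  def initialize(watchers_by_repo, sync_queue)
    @watchers_by_repo = watchers_by_repo  # repo full name => [user ids]
    @sync_queue = sync_queue              # anything responding to #push
  end

  # event is a parsed webhook payload (issue, PR, comment, ...);
  # we only need to know which repository it concerns.
  def handle_webhook(event)
    repo = event.fetch("repository").fetch("full_name")
    (@watchers_by_repo[repo] || []).each do |user_id|
      @sync_queue.push(user_id)
    end
  end
end
```

In a real deployment the queue would be something like a Sidekiq job queue rather than an array, but the idea is the same: one webhook fans out into background syncs for everyone it might affect, which is why the hosted app feels like it updates "magically".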
I was gonna say - in the absence of those webhooks, are you polling on an interval then, if you don't have the webhook capabilities?
We basically have a sync button, which is very much like Gmail or Apple Mail's kind of check for e-mail.
You don't wanna exacerbate the problem with people who feel like they're overwhelmed by notifications already, right? You don't want that…
Just keep piling them in…
…new notifications popping in at the top when you're in the middle of triaging them, so…
I see, so this is by design so that you – you have to go check your notifications by your own agency, and you have to say "Okay, sync my notifications, because I'm ready for the flood", as opposed to them just popping in, "You have a new e-mail, you have a new e-mail…" all day long.
Yeah. And that was pushed, again, by 24 Pull Requests, because I was refreshing the page and seeing more things come in; I was like "I can't keep up with this. I need to chunk this up into at least something that can fit in my brain until I can [unintelligible 00:33:40.29] done with all these, and now I can check to see if there's some more.
So I'm just sitting here, staring at the UI as we talk, and I think – and I think this about Gmail all the time, too… I'm like, "This seems like it's better as a desktop app", not as a hosted thing with either your own instance, or the shared instance… Was that anything you ever considered, or did you just reach for your most trusty tool, which is Rails, and build a web app right away?
That was definitely – kind of the start was "Oh, I can solve this with Rails. It's using a lot of the same libraries that Libraries.io and 24 Pull Requests are using to interact with the GitHub API"… But building a web app that is mobile-friendly also means keeping data in sync across multiple devices. Octobox works really nicely on your phone, so I can look at a notification, and then when I get back to my laptop, it's still there, and I can continue to work on it… Which then means you don't need to build different clients. But you can wrap it up - there's a nice selection of tools to put a website inside of an Electron desktop wrapper. There's one that – I forget what it's called; I'm sure we can find one and stick it in the show notes - where you give it a URL to a website on the command line, and it will hook up everything, generate a Mac app or a Windows app, and then it'll pick up the nice icon from – I guess it's the Apple Touch icon… So everything kind of happens automatically, and then you have it sat there when you're ready.
But personally, I'm kind of against the always-on, constant stream of notifications. I like to actually choose and be thoughtful about when I'm gonna check my e-mail, or when I'm gonna get notified about new pieces of work, because it can just be distracting if you're in the middle of some fairly complicated bit of code… You don't wanna have someone reporting a bug on one of your other projects pop up and kind of distract you. So it wasn't kind of a "I'm imagining this as being something that would always be on, and always be able to tell you when there's new things."
[00:36:12.10] Well, I can get that sense from the Sync button, and I also am somewhat intentional with the way I use specific communications. I make rules for myself… For instance, on my phone I will not set up my mail client to pull in new e-mails and show me the Unread count, because I'm a completionist, and I can't deal with an Unread count; I have to go read it. So I always check my mail. My old saying was "Don't let your e-mail check you. You check your e-mail." And then I can only check Twitter on my phone. I actually break this all the time, but I had this rule; it was hard set for a while - I would not check Twitter from my laptop, it had to be phone only… Just to kind of silo things and feel like I'm in control and the devices aren't in control. Ben, do you have any sort of things like that? You seem like you're keen on that Sync button being intentional… Is this something that you work through?
I don't mind having an inbox… I use the Apple Mail client, and my process has always been for the last decade like I'll read something and flag it if I need to do it later, and I might panic if I have too many flagged e-mails. So I would in Gmail use the equivalent of a star, in Octobox we have stars as well, and we go by that.
I'm not the Inbox (0) kind of person, but I completely use the same batching process, and I think maybe the three of us are the same type of person, and it turns out there are a fair few other people like us out there, who prefer to batch those kinds of tasks up. That's what Octobox is - Octobox is a paradigm shift from what is effectively an activity feed style notifications experience to something that's more like an inbox.
You also don't wanna carry around that kind of mental baggage of knowing that you've got things… I don't know if you have this - if you copy something and you're going to paste it, can you kind of feel that it's on your hand? You've pressed Cmd+C and you can almost feel it until you put it down with Cmd+V. It's there, like, mentally… [laughter]
You know it's still there, right?
…and I have that with some important messages, or issues, or notifications, that I'm like "I can't put this down until I know it's somewhere that I'll be able to make sure it's there…" And yeah, a clipboard manager changed my life. Once I – I was like, "Oh, I don't have to worry about this thing going away or not", and then I became a lot less scared of having something I've copied that I haven't yet pasted…
Do you have a good clipboard manager you'd recommend for me? Because I've always looked at them and I've never found one that I actually liked.
I use Alfred, which is a combined search, and it has a clipboard manager built-in, with kind of a nice ability to search through those clipboard things, and I use that hundreds of times a day… It's wonderful. As well as just being a much better Spotlight replacement.
Well, lots of Alfred fans out there… I'm [unintelligible 00:39:09.16] with my software. If the operating system provides it, I'll tend to use that, so I just use Spotlight, and it's just like good enough, because I don't wanna have to install yet another thing and manage it… But that's because I'm particular in my ways, as we all find out we are.
That's funny that you think that when you have the copy, but not the paste, you can almost sense it on your finger… It's like part of you until you can put it down. I definitely have that in my head, but I don't know about how strong the sensation is as it is for you, Andrew.
I think that's possibly the greatest compliment you could give to the UI developers of desktop software in general, is that you actually feel that it's in your hand and have to drop it. [laughter]
Literally walking around with…
[00:39:52.03] It's similar to like if I've not put my kids somewhere down, that I know I've got this – "I need to go put my keys on the hook, because otherwise I'll never find them again."
Well, that's my problem - I can't find my keys, but they're my hand the entire time, and I'm walking around the house looking for them, holding them in my hand, like a fool…
I think you've basically got PTSD for skeuomorphism, right? That's basically what it is.
[laughs] Well, I like that. That needs to be like a Tumblr, or something… Is Tumblr still a website…? Anyways. Alright, back to Octobox - so that's a little bit about how it works. I guess this Rails-by-default in your brain, Andrew, worked out pretty well, because it allowed you to create a centralized service for people who don't wanna manage their own instances and to do all the heavy lifting for them, so hence the GitHub app.
Let's talk about the move from a side project really to something that you guys are trying to give a go as a sustainable open source business kind of thing, with the GitHub Marketplace. What are your plans with Octobox in terms of generating revenue?
Yeah, sure. Octobox is two things for us - it's a tool for ourselves and for maintainers to solve one of the problems that we feel maintainers have, which is they have lots and lots of notifications, whether it's from many repos, or a single monolithic repo with lots and lots of activity… But for us it's also a trial by fire experience of putting ourselves into this situation that a lot of open source maintainers are in, where they have a project which is popular, which is used, and they would like to work on it more, and they need to find solutions to be able to enable them to do that, whether it's gonna be sponsorship, whether it's gonna be donations, whether it's gonna be paid support…
These are some of the models that have started gaining a little bit of headroom in terms of popularity, and have had kind of dribs and drabs of experience from people who have tried [unintelligible 00:41:50.22] One of the things that we wanted to do is just expose ourselves to that as a microcosm of the types of people that we're trying to help; we want to throw ourselves in it.
The main goal for us is to make Octobox sustainable for ourselves and our community, which we're kind of at a bit of advantage - we can play both sides; we happen to be the maintainers of Octobox.io, and there are certain things that we can do with Octobox.io, and we are also members of the community. We're members of the 80+ developers who have contributed to Octobox as a service, that people are running themselves, that people are running in Docker containers, and so on… And we want also to experiment with some of the questions that we've been unable to answer in the past around sustainability of open source software.
Questions like "Do people care more about supporting a commercial entity that provides for a project, or do they care more about supporting the community directly, and being able to ask and answer those questions with data?" We want to just prove it to ourselves and to other people who are in a similar situation to us that it is possible, how to do it, and that it is repeatable. And that extends for the community, as well.
We as the maintainers of Octobox and the operators of Octobox.io are in a certain position when it comes to having that shared instance that people go to, and having a point of focus for users, that some open source maintainers do not have… The guys who work on the key core components that you just include as a library; the people who write API interfaces for things like Redis, and so on. It's very difficult to do some of the things that we can do with Octobox to make money, effectively. So what we wanna do is also demonstrate some of the things that we can do as a collective community to support one another, to support those maintainers that don't have the opportunities to expose the service effectively and make money from that kind of service, so that we can support one another. At this point it's a big experiment in terms of what a new kind of small, but through a certain measure successful open source project can do.
[00:44:33.05] We're also trying not to be too restrictive. There's been a spate of open source projects recently that have kind of swung very far the other way in the ways that they try to monetize their projects, some literally forgetting where their roots are in the open source movement, in the free software movement, and kind of going back on those basic freedoms of free software.
It feels like there are ways to get around that without needing to literally put up big roadblocks, and that kind of works out really nicely with the way that Octobox works. Because it's open source and because it's AGPL-licensed, and we don't ask for the copyright of people who are contributing, we can't change the license without literally reaching out to everyone, or without ripping out all of their contributions, which we're definitely not going to do.
So actually, if people don't like the service that we offer via Octobox.io, they can always go and run their own version. That keeps us honest to a certain degree, and kind of avoids us poisoning our own well, or kind of trying to take everything for ourselves. It forces us to act as good actors within the community, whilst providing a service which people see as valuable enough to pay for.
What's pretty cool here - you talk about the experiment that you are doing, Ben, and you guys have Octobox free for open source projects, with basic notifications for private projects. Of course, it's also open source, you run your own instance, so if you wanna do it that way, that's all good. We're talking about Octobox.io. And then it says "To add enhanced notifications for private repositories to your organization's account, there's two ways to pay", and I think this speaks to what you were talking about, Ben, where they'd rather provide funding to a community versus to a commercial enterprise that's really supporting that community, but is somehow distinct from it in terms of limited participation… So you have an Open Collective donation, and then you also have the GitHub Marketplace option.
Yeah. And then we have another page underneath our Pricing page that explains "Wait… What?", which is kind of like… [laughter] It's kind of like, "You probably need a little bit of a back-story in order to understand why we're telling you that you can pay for this exact same thing two different ways." It just explains what we're trying to do, and it's just proving to ourselves - if other people are gonna do something like Octobox, should they create a company, should they have that company set up as a shell that basically provides for themselves and their community, or should they just go whole hog for the community, and which one is more successful?
I mean, sample size of one, but this isn't the only experiment that we're gonna do… But I do think it needs some explanation. And I'd also shout out for the whole "What you're paying for enhanced notifications" - a lot of that is to do with the dance around the notifications API versus the way in which GitHub apps work as well, so…
Yeah… I think we're gonna try and simplify that a little bit in the new year…
[00:47:52.09] It often confuses people, and they think either "Oh, is this gonna be like having to pay $100/user/month?" Or sometimes they think "Oh, I don't need to pay for anything", because everything continues to work with the basic notifications… So I think we've exposed the levels of permissions a little bit too much, with kind of my developer hat on, compared to – someone who comes to it without understanding the complications involved often won't appreciate that, and we should do a better job of explaining it, or hiding the complexity.
Initially it was actually just a lie. It's easier to explain to people that the standard [unintelligible 00:48:44.09] "Free for open source, paid for private projects", but then with the notifications API you get some small amount of private repo notification information as well, and people are like "Hang on, there's something wrong… I've got some private data here." Like, "No, no… That's the way it works. We were lying to you effectively, because we were trying to help you understand it easier." So it's an experiment, and it's still very early on.
I think – when did we go live on the GitHub Marketplace, Andrew? It was probably about four, five, six weeks ago maybe?
Six weeks, yeah… So we're still kind of finding our feet on how to talk about the service and what you're paying for, but we'll keep experimenting with that. Going back to the business model though - one of the other things that we're trying to find our feet on and experiment with, and kind of demonstrate to ourselves and others that this might be a responsible way of running an open source company, is doing things like saying "We're gonna share 15% of our revenue as a commercial company with the community as well"… Because we don't want the experiment about "Hey, do you wanna pay a commercial entity, or do you wanna pay the community?" to end up with "People wanna pay the commercial company, which means Ben and Andrew are gonna win." We want to make sure that the company is tied to the community in such a way that it has to provide for its community, as well.
So there are various things that we're doing as an experiment that aren't just overtly user-facing in terms of pricing, but they're also kind of behind the scenes in terms of how we set the company up, how we commit the company to the community… And in the future, the goal is if the commercial company holds more revenue than people have donated to the community, to pull people from the community into the commercial company as contractors, employees and so on, and work out how that relationship would work most effectively, so that people have the protection of a commercial company that owns Octobox.io in this particular instance, and they also have the freedom to come in and do paid pieces of work on not only Octobox as an open source project, which has benefits for the whole community, but maybe even specifically for Octobox.io if they see something that needs to be done on that particular instance.
So it's all an experiment, it's all gonna be as well documented and publicized as we possibly can when we have that evidence. Andrew and I - we both live and die by the evidence of the data that we collect, that comes back from working with things like libraries, where we have this vast wealth of data that we can pull from.
Talking of data - we literally turned off Google Analytics on Octobox.io last week… To give ourselves just a little bit more of a challenge.
I was gonna say, because you don't want data…? I thought you guys were evidence-based.
Yeah, but also, my background is in computer security and I'm a massive privacy wonk. [laughs]
[00:51:42.01] It turns out that quite a lot of developers actually block Google Analytics, as well… Comparing the data that we were seeing from usage on Octobox.io from Google Analytics, compared to coming through Cloudflare and compared to our server logs - it was wildly different, so we couldn't really rely on it to make many good decisions anyway… So we figured we'll just cut Google out of it, make the page more secure, make it faster, and stop exposing people's data out there, while still being able to have a good indication of how people are using it via Cloudflare and our own server logs.
The other area we don't have much visibility on is how much it gets used on individual instances… It's been downloaded almost 600,000 times from Docker Hub. We have little visibility, but I get the feeling that there is a lot more going on with Octobox outside of our control. People want Octobox to continue to work and be developed, as well as kind of integrating with what Microsoft are pushing forward, with the changes they've recently come in and started to encourage… You've just seen bookmarks show up in notifications, so you can bookmark an individual notification - we'd like to be able to feed that back and kind of sync it up with the stars in Octobox, so that you can have your data the same in both places.
So it's not like Octobox is gonna be a done project any time soon and we won't need ongoing maintenance. There's a lot of work to keep it moving forward, with this moving target that is the GitHub universe.
Andrew, one thing that you mentioned is how GitHub has recently put some efforts into notifications with the bookmark feature. We know GitHub under new management, new CEO, Nat Friedman - he seems to be very focused on small polish and improvements to areas that maybe have been neglected over the last few years, that power users such as maintainers care about… And of course, Octobox is all about the power user.
Any concerns with GitHub basically building what you guys have built internally, as a first-party thing? Of course, anytime you build on a platform you don't wanna get sherlocked, as the Mac community well knows… Apple is known to sherlock their platform vendors. And here you are, you're on the GitHub Marketplace… What are your concerns about GitHub and replicating some of these features and making Octobox not quite so intriguing to folks?
I kind of swing backwards and forwards on this, but I have spoken to a few people internally at GitHub since the acquisition, and I'm fairly confident that Octobox isn't in their line of sight right now. Octobox really works well for a particular kind of user - the users who are not wanting to use e-mail, but are getting a load of notifications and are working across a number of different repositories, which is actually quite a particular set of users… And trying to solve those problems while still enabling all of the other kinds of users that are on GitHub actually becomes incredibly difficult.
We can kind of lean on the fact that we're only solving problems for a particular kind of power user, so we can be like "Well, actually, Octobox might not be for you if you're okay using e-mail." Or "If you only get a few notifications or you're not working with many other people on a private repository, that's okay. Octobox isn't for you. But you can use it if you like." But we're really gonna focus on the people who are doing huge amounts of work managing a lot of communication, to make them feel like they've got super-powers, because they've kind of been struggling for a long time… And a lot of the testimonials we get are people kind of suddenly feeling like they're back in control, or they're able to actually start to take on more things. You're kind of like, "Oh, I could never possibly watch this repo, because I already get too many notifications." And then suddenly, you're like, "Actually, maybe I can."
I personally started watching the Ruby on Rails repository again, after I guess five years of not doing it.
I used to do it back when I was a junior, learning Ruby on Rails and kind of wanting to see what the masters were doing. I would watch for what was happening on the pull requests, and of course, then as work happens, you're kind of like "Oh, this is too much…" But Octobox has actually let me compartmentalize that enough to be like "Oh, let's see what's going on over here" and then put it away again, so you can context-switch out to just the Octobox stuff, but when December comes around, I just wanna look at the 24 Pull Requests stuff. Then once I've gone through that, then let's go see what else is still there, if I have the bandwidth for it. Otherwise, it will be there tomorrow.
[01:00:05.11] That's pretty cool. I used to follow that repository as well, and I think I lasted maybe two or three weeks, and I had to just – I just couldn't. Even just like "Meh, I'm not that interested." Do you ever do that? You star a repository or you subscribe, and then a few e-mails come in and you're like "I'm just not gonna do this."
Oh, yeah… You wouldn't believe the amount of – I must have starred over five thousand repositories or more, and my actual activity feed on GitHub is the most useless thing. The homepage tells me nothing…
Oh, it's never been useful, yeah…
…recommendations are like "Oh, here's EVERYTHING."
So I can't get much out of that page, because I just broke it from starring too many things.
I mean, the question is what does a star mean, right? It's something different to everyone. We could definitely [unintelligible 01:00:50.06] on that, so… Maybe not.
I think we literally covered that in the last time I was on Request for Commits.
That's right… That's right. And the answer is a star doesn't really mean very much at all. Because it means something different to so many different people, it makes it very difficult to mean anything at all that's useful.
But it's an interesting point that you say about Nat and GitHub… Notifications was one of the three things that he said in his opening gambit as the new CEO that he wanted to focus on… And we've seen a lot of those improvements, but as Andrew says, I think personally that there are so many more casual users of GitHub than there are the power users that we're catering for. It would be difficult, or at least it wouldn't be a problem that I personally would wanna solve, to bridge between the two in one interface.
Having something on the marketplace, even from GitHub's point of view - I think could be very positive for them, because it can allow them to refocus their efforts on what might be their core group of users, which are the maybe more casual, medium-level, while they still have something like Octobox to offer people.
Yeah. And as the platform, that's what you want - you wanna provide the 80% solution, and then you want that marketplace, which you're still getting cuts out of; it has this cottage industry around your platform, filling in all those gaps that you don't wanna fill in yourself, or aren't worth it for you, but are worth it for somebody who's smaller than you. So hopefully that symbiotic relationship will just continue on forward.
So you're focused on the power users… Octobox is in a place now where it has a nice core set of functionality; it seems like you've got a good 1.0, or I don't know if you consider this 2.0 or what version it's at… But it's there, it's available, it does what it's supposed to do. But for power users, we always want more, better, faster, deeper, more power… So what are you guys thinking about Octobox moving forward? Some things that maybe current users can look down the road at and maybe hop in and help out with, or even give a thumbs-up to, like "Yes, I want this feature"? Where are you gonna take Octobox in the next 6-12 months for the power users?
I love this. There's so many different ways. As you say, we're in a nice place where we can kind of take stock, listen to users, feel their pain… Because we solved a lot of our own pains that now it's like "Okay, now let's go out into the world a little bit more, interview some people and see how they use it."
I regularly watch Suz Hinton when she's streaming on Twitch, and she starts every Twitch coding session by looking at her Octobox and going "Okay, what do I need to work on today?" That's a great way to see it - it's interesting; she's using it like this… It looks like maybe she could do something to help her shunt a few things away, like "Oh, this isn't ready to work on until next week", so perhaps features that are maybe a little bit more like a to-do list. I can imagine having the ability to snooze notifications, to say "Put this away until next week, because I need to go check on it again" or "I'm totally not ready to deal with that right now."
[01:04:06.29] Or you're waiting on, say, an API change somewhere else, and similarly, maybe having a due date or some other way of highlighting the importance of particular kinds of notifications… That would allow you to really focus down on the things you need to do today.
The other really interesting area is trying to get into some more automation, or – I don't wanna say intelligence, because I really don't wanna add some opaque machine learning, with no real clear explanation of why it did something, or having to train it up, because what we don't wanna do is have to share behaviors across different users. It's very much like "This is your data, that's only used for you." So I can imagine allowing users to say "Well, if an issue comes in and it's been labeled as, say, a bug, then can we automatically assign that to this person, if it's on this repo?" Or "If it's a notification from a bot user, then I just wanna automatically archive that… Potentially even for people who have usernames like me."
I'm "andrew" on GitHub, and I get a lot of actual spam through my GitHub notifications where people have mentioned "and Andrew", and it comes up in my Octobox, in my GitHub notifications, and it's completely irrelevant to me. So being able to look at it and go "Well, you've never interacted with this repository before. You got mentioned, but there's no indication that you would ever want to do that." Maybe that is actual spam, and that could just be automatically moved away.
But rather than try and do it in a one-size-fits-all, I feel like doing Gmail-style filters and automated actions would probably be the best way to allow developers to build the "if this, then that" that they need, rather than try and make a set of simple actions that would be very easy to do, because the power users are gonna be like, "Well, I've got these very specific sets of things I wanna do when these things happen", and Octobox already has a really powerful search that can let you filter down by every different kind of state to get exactly what you need; every time that there's a state change, fire off, see if there's any search results that match that, and run that particular action on them.
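The "if this, then that" rules Andrew describes could be sketched roughly like this. This is an illustrative sketch only - none of the class, field, or filter names below come from Octobox's actual codebase; the idea is simply pairing a search-style predicate with an action and re-running the filters whenever a notification's state changes:

```ruby
# Hypothetical shapes for a notification and a filter rule.
Notification = Struct.new(:repo, :reason, :author, :labels, :archived, keyword_init: true)
Filter       = Struct.new(:predicate, :action, keyword_init: true)

filters = [
  # Auto-archive anything from a bot user.
  Filter.new(
    predicate: ->(n) { n.author.end_with?("[bot]") },
    action:    ->(n) { n.archived = true }
  ),
  # Flag bugs on a specific repo for triage (hypothetical action).
  Filter.new(
    predicate: ->(n) { n.repo == "octobox/octobox" && n.labels.include?("bug") },
    action:    ->(n) { puts "route #{n.repo} notification to triage" }
  )
]

# On every state change, run each filter over the notification.
def apply_filters(notification, filters)
  filters.each do |f|
    f.action.call(notification) if f.predicate.call(notification)
  end
  notification
end

n = Notification.new(repo: "octobox/octobox", reason: "mention",
                     author: "dependabot[bot]", labels: [], archived: false)
apply_filters(n, filters)
puts n.archived   # the bot notification was auto-archived
```

In practice the predicates would be expressed through Octobox's existing search syntax rather than Ruby lambdas, which is exactly the reuse Andrew is suggesting.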
I think there's also some of the stuff that we might see coming in the more medium-term, as well. At the moment it's just in beta, but we have the thread view, which will show you the content of the thread of that notification within Octobox - rolling that out to users and putting more time into that, so that people don't have to jump out of Octobox as much as they currently do… And then that can kind of build into some of the potential team discussion stuff in the future, as well.
I've been using that feature quite a lot. You can enable it from the settings, if you're in Octobox.io, and basically it will give you a three-pane view on a regular laptop size screen… So you can jump through all the different conversations and catch up with them without needing to open many different tabs to GitHub. Then potentially that also opens the door to being able to comment directly from Octobox, or even label and close issues without needing to jump backwards and forwards between the different tabs, that is the current behavior if you're gonna have a lot of things to work through.
[01:08:06.13] There's a balance for us there, between – you know, we talked about GitHub sherlocking Octobox, but also, we don't wanna rebuild GitHub, so it's finding that fine line between where do people want Octobox to be a part of their workflow, and where do they want GitHub or their other existing tools; talking about contribution to open source in the greater, more whole sense… We need to find that line, and one of the things, Andrew, I think you did recently was just kind of reach out on Twitter and say "Hey, we're really interested in talking to people about their current workflow in their tools, whether they use Octobox or not", because we're getting to that point now where, you're right, we do have a reasonable 1.0, and now it's kind of finding your way as a built product, to add and take away things that are gonna make the users that we're building for as productive as possible.
That's the key, really - we wanna solve for people like us, who have the same problems as us… We wanna kind of do that, between Octobox and the other tools that people use.
Yeah. For example, the one thing that Ben and I actually do quite a lot is we have a backchannel for Octobox. We don't set it up as a separate, private repo, but instead we're – depending on where we're chatting at the time, it might be text message, it might be in Slack DM, but often it's about Octobox. Potentially, with this thread view, we actually have the ability to then allow Octobox users to message each other directly. And I don't know if you were on GitHub back nine years ago, but actually GitHub used to have the ability to send messages to other users, and they yanked it, and it has never come back since…
Right. I can remember that.
But the potential to have that kind of data, that is not just a mirror of GitHub data, but other data in Octobox, or potentially even having an API that allows users to push data from other platforms in… So you can imagine, take your notifications from Stack Overflow and feed them into Octobox, so that you can actually drive multiple different kinds of developer-focused events that maybe act as to-do's, or things that I will need to check on and confirm that I have done something with them - in one nice, unified UI, that doesn't just fall down to the lowest common denominator of e-mail.
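The ingestion API Andrew imagines here does not exist today; purely as a hypothetical sketch of what normalizing an external event (say, from Stack Overflow) into a unified inbox item could look like - every function name and field below is invented:

```ruby
require "json"

# Normalize a non-GitHub event into a common inbox-item shape
# (hypothetical schema, not Octobox's real data model).
def to_inbox_item(source:, title:, url:, updated_at:)
  {
    "source"        => source,      # which platform the event came from
    "subject_title" => title,
    "web_url"       => url,
    "updated_at"    => updated_at,
    "unread"        => true         # new items land unread in the inbox
  }
end

# An imagined Stack Overflow notification payload.
stack_overflow_event = {
  "question" => "How do I batch GitHub notifications?",
  "link"     => "https://stackoverflow.com/q/000000",
  "time"     => "2018-12-01T09:00:00Z"
}

item = to_inbox_item(
  source:     "stackoverflow",
  title:      stack_overflow_event["question"],
  url:        stack_overflow_event["link"],
  updated_at: stack_overflow_event["time"]
)

puts JSON.pretty_generate(item)
```

The point is only that once everything is reduced to one common shape, the same batching, starring, and filtering workflow applies regardless of where the event originated.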
Very cool. I like that idea quite a bit. Actually, a lot of these are good ideas, so you guys have a lot of work ahead of you. So Octobox.io is, of course, the website… How do people get involved from a community perspective? Maybe they love that messages idea; they wanna let you know that you should build that, and they will come… Or they wanna get involved and sling some code - what are the waypoints for people to get into the Octobox community and maybe become a user, but also, hopefully a contributor?
We drive most of the development from an issue tracker, encouraging people to propose new features or report bugs. We also have a roadmap document within the repository, so actually proposals to add things to that roadmap are very cool.
The other thing that we use is Gitter, which is similar to Slack, I guess, but focused on GitHub repositories, or GitLab as well, now that they – I think they got purchased last year… So that's kind of the more real-time chat area, and it's completely open. You can just drop in and someone will probably be hanging out there.
[01:11:58.27] And then we also try to make the project really friendly for new people to get in, so one of the design or architecture decisions was to try and stick with Rails conventions as much as possible. So if you have any experience with Ruby on Rails, this will feel right at home. You'll be able to find exactly where you would expect the logic to be, because we don't try and do custom bits of code that stick outside of it; it's literally like models, controllers, views, Turbolinks, and Postgres with Sidekiq as the queue… So it's very easy to get involved.
We've had people build whole features, because it was so easy for them to dip in and go like "Oh, I recognize this… This is a Unix system." [laughter]
Jurassic Park reference? Nice…
…and literally be able to fix bugs. I'm probably pretty bad, because I'm checking Octobox so often that I will often merge a pull request and roll it out within minutes of seeing it. Then other times they'll get merged, but maybe not rolled out straight away. We don't have continuous deployment, mostly because we are the only people who are responsible, so kind of 24-hour ops makes me slightly more cautious… But we're absolutely open to different ways of contributing, and people can even spin up their own forks. We have a lot of people that fork the project, make changes, see how they work for their own instance, and then see if they can suggest them as features after they've kind of kicked the tires on them a little bit.
Very cool. Well, guys, thanks so much for coming on the show. Thanks for all the work you've done on Libraries, 24 Pull Requests, which was super-cool… Octobox - I hope you guys have great success with this. Hey, come back after the experiment has some facts that we can get some evidence back, find out what people are up to with regards to do they wanna support a community, do they wanna support a commercial enterprise? Do they wanna do both, or maybe do they wanna do neither? Maybe that's what we'll find out… Or hopefully we won't find that one out.
And then also, one thing we didn't talk about - and maybe we'll just tease it now and have you back for another show later - all about how you also have this desire to divvy out revenue to your downstream (or is it upstream? I don't know) dependencies that help Octobox be what it is, because that is something that I definitely wanna talk about. But we're out of time for now, so we'll have you back on later to talk about that and get a follow-up to find out how you all are doing.
For now, that's our show. Thanks so much for joining us. Ben, Andrew, it's been a joy.
On today’s show, Mikeal and I talked with Andrew Nesbitt, creator of Libraries.io, and Arfon Smith, who heads up open source data at GitHub. Andrew’s project, Libraries.io, helps people discover and track open source libraries, which was informed by his work on GitHub Explore. Arfon works to make GitHub data more accessible to the public. Previously, he worked on science initiatives at GitHub and elsewhere, including a popular citizen science platform called Zooniverse.
Our focus on today’s episode with Andrew and Arfon was around open source metrics and how to interpret data around dependencies and usage. We talked about what we currently can and cannot measure in today’s open source ecosystem.
We also got into individual project metrics. We talked with Andrew and Arfon about how we can measure success, what maintainers should be paying attention to and whether stars really matter.
Andrew, I’ll start with you. What made you wanna build Libraries.io? How was that informed by your GitHub Explore experiences, if at all?
I got a little bit frustrated working at GitHub on the Explore stuff. It was kind of deprioritized whilst I was there, and my approach with Libraries, rather than just building the same thing again outside of GitHub, was to use a different data source, which started at the package management level, and it turns out that’s actually a really good source of metric data, especially when you start looking into dependencies. If I had taken the approach of “Let me look at GitHub repositories”, I would have gone down a very different path, I think.
Right. So tell me a little bit about that. So you pull out the whole dependency graph data - do you go into the kind of deep dependencies, or do you sort of stay at more of a top layer of first-order dependency data?
So for each project, it only pulls out the direct dependencies. But it picks up every project, because every time it finds anything that depends on anything else, it will go and investigate that as well. It ends up having the full dependency tree, but right now I don’t have it stored in a way that makes it very easy to query in a transitive way, if that makes sense. I’ve been looking into putting the whole dataset into Neo4j - a graph database - to be able to do that easy transitive query, and to be able to give you the whole picture of any one library’s dependencies and their transitive dependencies, but it’s not quite at that point. But I do have all the data to be able to do it.
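The crawl Andrew describes (store only direct dependencies, then follow every edge you find) is enough to recover the full transitive set without a graph database. A minimal sketch in JavaScript, with made-up package names and dependency data standing in for the Libraries.io dataset:

```javascript
// Hypothetical direct-dependency data: each package maps to the
// packages it depends on directly, as collected per-project.
const directDeps = {
  "my-app": ["request"],
  "request": ["form-data", "qs"],
  "form-data": ["async"],
  "qs": [],
  "async": [],
};

// Compute the full transitive dependency set for one package by
// repeatedly following direct edges (a breadth-first walk).
function transitiveDeps(pkg, graph) {
  const seen = new Set();
  const queue = [...(graph[pkg] || [])];
  while (queue.length > 0) {
    const dep = queue.shift();
    if (seen.has(dep)) continue; // already visited; also breaks cycles
    seen.add(dep);
    queue.push(...(graph[dep] || []));
  }
  return [...seen].sort();
}

console.log(transitiveDeps("my-app", directDeps));
// → [ 'async', 'form-data', 'qs', 'request' ]
```

At the scale of millions of packages this walk is why a graph store like Neo4j becomes attractive: the transitive query above becomes a single path expression instead of many round-trips.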
Interesting. Okay. So you said that this is a much more interesting way to go about this than the GitHub data. What’s something that you found when you started working with the dependency data that you never had in GitHub Explore, or just with the GitHub data?
GitHub stars don’t really give you a good indication of actual usage, and GitHub download data is only really accessible if you are a maintainer of a project, rather than just someone who’s looking at the project from a regular browser’s perspective. If you actually look at the dependency data - not just the other libraries that depend on that particular library, but the whole ecosystem, and how many, say, GitHub projects depend upon this particular package - it gives you a fairly good idea of how many people are still using that, still need that thing to be around so that their code continues to work. And if there was a security vulnerability, you can see exactly how many projects may be affected. So you actually end up connecting the dots between… And I’ve only looked at GitHub data so far; I haven’t got around to doing Bitbucket or arbitrary Git repositories.
[00:04:21.05] But you can actually use package management data to connect the dots between GitHub repositories as well. You can say, “Oh, well given this GitHub repository, how many other GitHub repositories depend on it through NPM or through RubyGems?”
It’s good to hear that stars are useless, because I’ve also thought that. [laughs] That’s my assessment, as well.
Yeah, I’ve [unintelligible 00:04:46.03] over how you shouldn’t judge a project by its GitHub stars. There’s one particular project that’s a great example of that, it’s called Volkswagen. It is essentially a monkey patch for your CI to make sure it always passes. I think it’s got something like 5,000 GitHub stars, and it’s maybe downloaded 50 times on NPM; it has zero usage.
Yeah, that’s by Thomas Watson. It was a joke when VW had that scandal where they were just passing all their tests, so he wrote a module called Volkswagen that just made all your tests pass, no matter what. [laughs] It’s brilliant… But yeah, utterly useless in terms of actual usage.
Yeah, and if you actually look at the stars… Of course, people have contributed to it, but even looking at contributor data doesn’t give you a good indication of “Is this actually a useful thing, a real thing, and should I care about it?” I always look at GitHub stars as a way of… It’s kind of like a Hacker News upvote or a Reddit upvote, or a Facebook like. It just means “Oh, that’s neat!”, rather than “I’m actually using this” or “I used to use this five years ago.” No one ever un-stars anything either, whereas if people stop using a dependency, you actually see the number of people that depend on a thing go down.
I think stars are an indication of attention at some point in time, and that is all we can say about them. So if you look at stars versus pageviews on a given repo, they correlate very well. So in defense of stars, we shouldn’t use them as “This is what people are using”, but they’re a good measure of some popularity, some metric. And I think that’s exactly what you just said, Andrew. Consider it like a Facebook like, or something like that. It’s got very little to do with how many people are actually using something at any point in time.
Yeah. I saw someone actually build a package manager; I think it was only a prototype, but I really hope it never actually became a thing, where it would pick the right GitHub repository if you just gave it the name rather than the owner and the name, by the thing that had the most stars, which sounded like a terrible idea at the time and completely gameable.
Yeah, that doesn’t sound like a good idea. You mentioned something interesting, which was that you can understand how people use it in terms of just it being depended on. Recently GitHub did this new BigQuery thing, and one of the results is that you can do RegEx on the actual file content of a lot of this stuff, so you can start to look at which methods of a module people might use, or how they might use it. Could you get into that a little bit?
Yeah, so just to refresh - the data that we put into BigQuery is basically not only the event data that comes out of the GitHub API, which is just “Something happened on this public repo” - and that’s what the GitHub Archive has been collecting for a long time - but in addition to that, the contents of the files and all the paths of the files for about 2.8 million repos; so basically anything with an open source license on GitHub that’s in a public repo.
[00:08:15.07] So that allows you to do things like if there’s a particularly - maybe a method call in your public API that you wanna try and measure the use of, then you can now actually go and look for people using that explicitly. So currently really complex kind of RegEx stuff on GitHub searches is pretty hard; in fact, I’m not sure you can do a RegEx query on GitHub search, so that’s one of the strengths of BigQuery, that you can actually construct these really complex, expensive queries, but then of course that gets distributed across the BigQuery framework, so it comes back in a reasonable amount of time.
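In practice those are SQL queries run against the BigQuery dataset, but the core idea - a regex over raw file contents to count usage of one specific API call - can be sketched locally. The file contents and the method name here (`fs.exists`, a long-deprecated Node API) are invented for illustration:

```javascript
// Toy corpus standing in for the BigQuery file-contents table.
const files = [
  { path: "a/index.js", content: "fs.exists(p, cb); doWork();" },
  { path: "b/util.js",  content: "if (fs.existsSync(p)) load(p);" },
  { path: "c/main.js",  content: "console.log('no fs here');" },
];

// Match `fs.exists(` calls but not `fs.existsSync(` - the pattern
// requires an opening paren (after optional whitespace) right after
// the method name.
const pattern = /\bfs\.exists\s*\(/;

// Return the paths of files whose contents match the pattern.
function filesUsing(pattern, files) {
  return files.filter(f => pattern.test(f.content)).map(f => f.path);
}

console.log(filesUsing(pattern, files));
// → [ 'a/index.js' ]
```

The same pattern dropped into a BigQuery `REGEXP_CONTAINS` clause over the public GitHub dataset is what lets you estimate, ecosystem-wide, how many repos would break if that method were removed.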
Yeah, for languages like C, that’s pretty much the only way to do it. There’s just no convention there, other than the language itself. And then for some other package managers, you actually have to execute a file to be able to work out the things that it depends upon, which I avoid doing because I don’t really wanna run other people’s code just arbitrarily.
Well, in the NodeJS project we’ve been trying forever to really figure out how are people using some of these methods, because if we wanna, say, deprecate something, we’d really like to know how many people are using that in the wild and to which level is it depended on. But we’ve had several projects where we tried to pull all of the actual sources out of NPM and create some kind of parse graph and then figure out how that gets used… It’s just such a big undertaking that it hasn’t really happened. When this BigQuery stuff got released we were like, “Oh my god, how far can we get with the RegEx to figure out some of the stuff that’s used?” because that’d be really useful.
Yeah, it kind of makes me sad that we’ve made everyone write crazy RegExes, but sorry about that. Hopefully, that will be useful. [laughs] Hopefully a bunch of good stuff can be done; people are gonna have to level up their RegEx skills, I think.
Just for people who are newer to the metrics world - why should they care, to be blunt, about this dataset being open and being on BigQuery? What are some things that you expect the world to be able to do with this data? Even outside of people like Mikeal with Node - policy makers, or researchers, or anyone else.
One of the things I think is incredibly difficult right now for some people is to measure how much people are using their stuff. For a maintainer of an open source project maybe that’s not a huge problem, because you can go and look at things like Libraries.io and see how many people are including your library as a dependency, or maybe you can just see how many forks and stars you’ve got of your project on GitHub. But I think there are some producers of software where actually reporting these numbers is incredibly important, and Nadia, you mentioned researchers. If I get money as an academic researcher from a federal agency like the National Science Foundation or the National Institutes of Health, one of the really important things about getting money from these funders is you need to be able to report the impact of your work.
[00:12:26.16] It’s currently kind of hard to do that if you have your software only on GitHub and you don’t have any other way of measuring when people use the library. You don’t have any direct ways of doing that, other than just looking at the graphs that you have as the owner of the software on GitHub. So I’m excited about the possibility of people being able to just construct queries to go and look… Of course, only open source, public stuff is in this BigQuery dataset, but I think it offers at least a place where people can go and try and get some further insight into usage.
I think it’s actually a hard problem to solve, but I know there are some environments - I’m trying to think of some large institutional compute facilities, big HPC centers… People have done some work, doing some reporting on when something’s being installed or run, and actually Homebrew I think have started doing that recently as well, starting to capture these metrics. Because it’s really tough to know; not everything that people produce is open source, so it’s not even clear that everything’s out there and measurable and available. It’s really tough if you need good numbers to actually say, “Who’s using my stuff? Where are they?”, and there’s lots of very legitimate privacy concerns for collecting all of that data. So yeah, it’s a hard problem.
So for you coming from the academia world, have you gotten requests from people from the scientific community around using this type of data? Did those experiences help inform the genesis of this project at all?
Yeah, a little bit. Very early on when I joined GitHub I got some enquiries from people saying, “We’d love to get really, really rich metrics on how much stuff is being downloaded, where people are downloading from…” - all this stuff that you’d need if you had to report and you wanted really rich metrics. Some of that data we just can’t serve in a responsible fashion. There’s no way we can tell you the username of every GitHub user of your software; that would be a gross violation of users’ privacy on our part. So there are things that we just can’t do.
The other thing is - and I think this is a pretty sane standpoint for us to take - we take user support very seriously. So if somebody comes to me with a data request, it may be ethically possible for me to service that, and it might be technically possible for me to service that, but if it takes two weeks of my time to pull that data, then we’re not gonna help them with that problem. And that’s because we believe we should be able to service a thousand requests that come in like that; we should be able to give uniformly the same level of quality support to people, so we generally try and avoid doing special favors, if that makes sense, in terms of pulling data. So this is why making it a self-service thing, getting more data out in the community, making it possible for people to answer their own questions is a much more scalable approach to this problem.
[00:15:58.08] I think the next step for me personally with this data being published is to start to kind of show some examples of how it can be used to answer common support questions that we see. I think that’s kind of the obvious next step from my standpoint.
And Andrew, you’re in a position where you’re actually taking a bunch of public data that’s out there in all these different public ecosystems and then kind of mashing it together, so you’re like your own customer for this data. What are some of the interesting things that you’ve been looking at? What are some of the most interesting questions that you’ve been able to answer?
Unfortunately I didn’t have access to BigQuery earlier, so I’ve been collecting it manually via the GitHub API for the past year and a bit, which takes a lot longer, but it also picks up all of the repositories that don’t have a license - although I guess it’s often probably best not to pull people’s code out if they have not given permission to do that.
Some of the things that I’ve been able to pull out and have been quite interesting is looking at not only the usage of package managers across different repositories, but the amount of repositories that use more than one package manager, or that use Bower and NPM, or RubyGems and NPM, and then looking at the total counts of those usages, as well as the number of lockfiles, which I found really interesting.
Coming from a time working with Rails before Bundler, it was incredibly painful sharing projects or coming back to projects and trying to reinstall the set of dependencies that all worked, given the transitive dependencies that move around all the time with new versions. And it looks like the Ruby community is pretty much… For every Gemfile there was a Gemfile.lock, whereas for the Node community, there’s maybe five or ten thousand shrinkwrap files that I’ve found on GitHub on public projects, compared to the nine hundred thousand package.jsons. In the short term that won’t be a problem, but it could potentially cause Node projects to be very hard to bring back to life if they’ve not been used in over a year, because trying to rebuild that transitive dependency graph may be impossible - or it may be really easy; it’s hard to know. But it’s quite interesting to look at how different communities approach “How reproducible can I make my software?”
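A toy sketch of why a manifest without a lockfile is not reproducible: a semver range like `^1.2.0` resolves against whatever the registry holds at install time, so the same manifest installs different code on different days. The version numbers are invented, and this naive caret check only handles major versions >= 1 (no pre-releases, no 0.x rules):

```javascript
// Pick the highest available version satisfying a caret range,
// the way an installer would at install time.
function maxSatisfyingCaret(range, available) {
  const [maj, min, pat] = range.replace("^", "").split(".").map(Number);
  const ok = available
    .map(v => v.split(".").map(Number))
    // ^maj.min.pat matches the same major, at or above min.pat
    .filter(([M, m, p]) =>
      M === maj && (m > min || (m === min && p >= pat)));
  if (ok.length === 0) return null;
  ok.sort((a, b) => a[0] - b[0] || a[1] - b[1] || a[2] - b[2]);
  return ok[ok.length - 1].join(".");
}

// Registry state when the project was first installed…
console.log(maxSatisfyingCaret("^1.2.0", ["1.2.0", "1.2.3"])); // → 1.2.3
// …and a year later, after more releases: same range, different result.
console.log(maxSatisfyingCaret("^1.2.0", ["1.2.0", "1.2.3", "1.9.0"])); // → 1.9.0
```

A lockfile sidesteps this entirely by recording the exact resolved version of every transitive dependency, which is what Gemfile.lock and npm-shrinkwrap.json do.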
I think we’re heading into the break right now… When we come back we’ll talk about the open source ecosystem.
[00:19:38.23] We’re back with Andrew from Libraries.io and Arfon from GitHub. In this segment I wanna talk about the broader open source ecosystem and the types of metrics that are and aren’t available to people, because I’ve heard a lot of confusion about “Well, why can’t we measure what is being measured right now?” and I think both of you together probably have a good handle on that. I want to start with talking about GitHub data, since that was mentioned earlier, around download data and stars and things like that. Are there any sort of myths that you wanna address around the types of things that GitHub actually does measure or doesn’t measure?
I don’t think so. I mean, I don’t know what myths there might be. I would love to hear things that you’ve heard that you would love to know if they’re true. I don’t know of any kind of whisperings of what GitHub might be doing, so I’m happy to respond to questions.
I hear a lot around just download data, and whether GitHub actually has the data and isn’t sharing enough of it, why not use download data in addition to stars as something that people can see…
Sure… Yeah, okay. So there is a difference between what you as a project owner can see about a GitHub project and you as a potential user of that software. So there are graphs with things like number of clones of the software, which is I think a good metric, there are graphs for showing how many pageviews your project got actually, like a mini Google Analytics. So anybody who owns a GitHub repository can see those graphs. They’re not exposed to the general public, and I would like them to be; I think they’re useful. I think we were kind of cautious initially when rolling those out, thinking that was the kind of information that is something maybe that’s only relevant or appropriate for the repository owner to see… I don’t know, I think that data is generally useful for people to be able to see if… Andrew, you’ve mentioned before just the idea there’s a package manager that tries to suggest the correct GitHub repository based on just a name, and it does that based on stars - that’s not great, but at the same time when you are looking for a piece of software to use, if it has a bunch of forks and a bunch of stars and a bunch of contributors, then that helps you inform your decision about what to use, even if you haven’t even looked at the code yet, right? Personally, I use that information to help inform my decision.
I seem to remember the metrics weren’t exposed because of some of the referrer data potentially leaking people’s internal CI systems.
Yeah, that might be possible. I’m not hugely familiar with exactly why the data isn’t exposed right now. I think it’s important to remember that we take user privacy very seriously, so the thing here is you wanna be on the right side of people’s expectations of privacy. There are things that GitHub could do that would surprise people - and not in a good way - and we don’t want that to happen. So you’re always gonna see us on the side of reducing the scope of who could see a particular thing. That said, I think consumption metrics, fork events - we used to expose downloads. I think one reason we don’t expose downloads anymore is we actually just changed the way that we capture that metric, and it’s not captured in a way that is designed to be served through like a production service. It’s in our analytics pipeline, but it’s not in a place where we could build an API around it, it’s just not performant enough to build those kind of endpoints.
[00:23:47.15] So yeah, we capture more information than we expose, but that’s just a routine part of running a web application and having a good engineering culture around measuring lots of things. The decision about what to further expose to the broad open source community or the public at large is largely one based on making sure that we’re in line with people’s expectations of privacy, but also just based on user feedback. So if the stuff that you would like to see presented more clearly, you should definitely get in touch with us about that, because we are responsive to things that come up as common feature requests. That’s a good way of giving us feedback.
I think also any metric has to be qualified, right? A lot of this talk about stars is that stars is not an indication of quality, it’s an indication of popularity at a point in time, like you said, but people take it as that because it’s the only data that they have.
An example is in NodeJS we have metrics for which operating system people are using, so we always put out two data points. One is the operating systems that have pulled downloads of Node, either the tarballs or the installers of some kind, and then we also have the actual market share for the NodeJS website, visitors to the website. And those are two ends of a very large spectrum in terms of machines that are running Node and people that are using Node.
One metric that is huge on the people end is Windows, and incredibly small on the actual computer end is Windows. But we do a lot to qualify those before we put them out, to set people’s expectations about them.
Yeah, and there’s another thing… I think the Python package index has a similar - like a badge you can put on your profile. And you see this, people will put it, the number of downloads last month from the Python package index, and it’s exactly the same problem. For a fast-moving project where they’re doing lots of CI builds it might be 50,000 downloads last month, or something, and you’re like, “Whoa, that’s crazy!” and then actually there’s not that many users, it’s actually the CI tools that are responsible for most of those.
Yeah, the problem with download metrics on packages too is that you also get into the dependency graph stuff, right? Downloads are really good at looking at the difference in popularity between something like Lodash and Request. They’re both very popular, but the difference in downloads gives you some kind of indication of the difference. But there’s also a dependency of Request that’s only depended on by three other packages, that has amazing download numbers because it’s depended on by Request, right?
Yeah, I have one of those, base62. I don’t think there are many projects that use it, but it gets like one and a half million downloads a month because React transitively depends upon it, so it’s downloaded by everyone all the time. But it never changes, it’s never really used. Lots of people reimplement it themselves.
That’s funny. There’s a lot of packages like that. The whole Leftpad debacle was people did not know that this was used by a thing that used a thing that used a thing. It wasn’t that popular of like a first-order dependency, it just happened to be in the graph of a couple really popular things.
That’s one reason why I haven’t started pulling download stats for Libraries - you can’t compare across different package managers either, because the client may cache really aggressively. RubyGems really aggressively caches every package, whereas if people are blasting away their node_modules folder whenever they want to reinstall things, then the numbers… You can’t even try to compare them across different package managers. If you’re looking for “I wanna find the best library to work with Redis”, then download counts just muddy the waters, really.
[00:28:01.09] I think a lot of the metrics fall into that, though. When you start looking at them across ecosystems, they really don’t match up. The one that I think of comparing a lot is Go and NPM. GoDoc is actually a documentation resource, it’s not really a package manager, but people essentially use its index as an indication of the count of total packages. But that’s really about four times what the actual unique packages are, which is an interesting way to go, and it’s one thing that just doesn’t match up with the way that NPM or pip do it. Not that it’s invalid, it’s just measuring something different.
Yeah, the Go package manager is slightly strange because it’s so distributed. You just give it a URL and that is the package that it will install, so basically every nested file inside that package could be considered to be a separate thing, because it’s just a URL that points to a file on the internet, as opposed to something that has been explicitly published as a package to a repository somewhere.
I’d like to get into the human side of this, too. You mentioned this a little bit earlier when you were talking about the difference between NPM and Ruby in terms of locking down your dependencies. That’s not enforced by the package manager, it’s just now a cultural norm to use Bundler and not NPM. Are there some other people differences that you see between Go and NPM because of those huge differences? Or any other package manager, for that matter.
I’ve tried not to look too much into the people yet, partly because I didn’t wanna end up pulling a lot of data that could be used by recruiters, and make Libraries a source of kind of horrible data that would abuse people’s privacy.
I didn’t mean like individuals, I meant like culturally. I didn’t mean like, “Be creepy.” [laughs]
[inaudible 00:29:55.07] all kinds of horrible things. Nothing springs to mind… I guess you can look at the average number of packages that a developer in a particular language or package manager would potentially publish, or the size of different packages. Node obviously tends towards smaller things, or a lot more smaller things. There are still some big projects as well, but it’s a bit more spread around, whereas something like Java tends to have really large packages that do a lot of things.
I haven’t done too much in comparing the different package managers from that perspective, because it felt like… As you said, you don’t get much mileage from going, “What’s this thing compared to that thing?” It’s much better to look at what packages we can highlight as interesting or important within a particular package manager and see if we can do something to support those and the people behind them; so looking at who the key people inside the community are, and then “Are they well supported? What can we do to encourage them or to help them out more?”, as opposed to trying to compare people across different languages.
You definitely see a certain amount of people who live in more than one language as well. It’s not often that there’s people that are just only doing one particular language.
I’m curious whether there’s - I don’t know a whole lot about this, but if there’s any way to standardize how package managers work across languages, or just standardize behavior somehow. Because I just sort of think for people that are coming to this from outside of open source, but are really curious about, for example, what the most depended-on libraries are that we should be looking at, and trying to support those people. It seems like it’s just really hard to count… Every language is different, every package manager is different.
[00:32:14.20] Yeah. I’ve standardized as much as possible with Libraries. The only way I could possibly collect so many things is to kind of go, “Let’s treat every package manager as basically the same, and if they don’t have a particular feature then that’s just ‘no’ for that particular package manager.” If you ignore the clients and the way the clients install things, and just look at the central repositories that are storing essentially names of tarballs and versions, then it’s fairly easy to compare across them when there is a central repository. Things like Bower and Go are a little bit more tricky because they don’t have that… You end up going, “Well, we’ll assume the GitHub repo is the central repository for this package manager”, which for Bower it is, but for Go it’s kind of spread all over the internet; it’s mostly GitHub, but there are things all over the place.
But you can then kind of go, “Okay, within a given package manager, show me the things that are highly depended on but only have one contributor, or have no license”, which is easy to pull out in Go, but then “Order by the number of people that depend on it or the number of releases that it’s had” to try and find the potential problems or the superstars inside of that particular community.
Right. I can see you kind of standardizing the data and some of the people work, but the actual technology - or even the encapsulation - you eventually hit the barrier of the actual module system itself, right? One of the reasons why Node is really good at this is because NPM was built and the Node module system was essentially rewritten in order to work better for NPM and better for packaging. So a lot of the enablement of the small modules is that two modules can depend on two conflicting versions of the same module, which you can’t do if you have a global namespace around the module system, which is the problem in Python, for instance.
So there’s a general trend I think towards everything getting smaller and packages are getting smaller, but some module systems actually don’t support that very well, and you’re hitting kind of a bottleneck there.
Yeah, I don’t think there are many other package managers besides NPM that allow you to run multiple versions of a package at the same time, partly because of the danger of doing that - you introduce potentially really subtle bugs in the process. But most of the package managers in the languages that I at least have experience with will load the thing into a global namespace, or the resolver will make sure that it either resolves correctly to only have one particular version of a thing, or it will just throw its hands up and go, “I can’t resolve this dependency tree.”
Yeah, it’s important to note that’s not part of NPM, it’s part of Node. Node’s resolution semantics enable you to do that; it’s not actually in NPM. NPM is just the vehicle by which these things get published and put together.
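A rough sketch of the lookup order Mikeal is describing: Node resolves `require("x")` by walking up from the requiring file's directory, checking each `node_modules` folder along the way. Because every package can carry its own nested `node_modules`, two packages can each resolve the "same" dependency name to different installed versions. This is a simplification of the real algorithm (which also handles core modules, file extensions and `package.json`):

```javascript
// Build the list of node_modules directories Node would search,
// from the requiring file's directory up to the filesystem root.
function nodeModulesPaths(fromDir) {
  const parts = fromDir.split("/").filter(Boolean);
  const paths = [];
  for (let i = parts.length; i > 0; i--) {
    const dir = "/" + parts.slice(0, i).join("/");
    // Node skips directories that are themselves named node_modules.
    if (!dir.endsWith("/node_modules")) {
      paths.push(dir + "/node_modules");
    }
  }
  paths.push("/node_modules"); // finally, the root
  return paths;
}

// A file inside the (hypothetical) installed "request" package:
console.log(nodeModulesPaths("/app/node_modules/request/lib"));
// → [ '/app/node_modules/request/lib/node_modules',
//     '/app/node_modules/request/node_modules',
//     '/app/node_modules',
//     '/node_modules' ]
```

The second entry is the key one: a copy of a dependency nested under `request` shadows the top-level copy, which is what lets conflicting versions coexist in one process.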
I think there’s been valiant efforts to make an installer and an NPM-like thing in Python, and they eventually hit this problem where you actually need to change the module system a bit.
Yeah, I made a shim for RubyGems once that essentially did that - it made a module of the name and the version, and then kind of hijacked the require in Ruby. It was a fun little experiment, but it ends up being… You’re just fighting against everything else that already exists in the community. So you kind of wanna get in early, before the community really gets going and starts building things, because once all that code is there it’s really hard to change.
[00:36:00.08] In that vein, have you seen any changes across these module systems as they’ve gone along? Have any really spiked in popularity or fallen? Are there changes that actually happen in these ecosystems once they get established?
Not so much. Elixir is making a few small changes, but it’s more around how they lock down their dependencies. Usually not once there are a few hundred packages. And often, I guess, it’s because there just aren’t many maintainers actually working directly on the package managers; often they’re completely overwhelmed just trying to keep up, let alone be forward-thinking with a lot of this stuff. And I get the feeling that a lot of people are building their package manager for the first time and kind of don’t really learn the lessons of previous package managers. CPAN and Perl solved almost every problem a long time ago…
…and these package managers go round and eventually run into the same problems and solve the same things over again.
Related to that - I’m curious for both Andrew and Arfon - when we talked about looking at stars versus looking at downloads, and looking at projects that are trending or popular versus ones that are actually being used: for someone who’s trying to look through available projects and ferret out which ones they should be using, how should they balance those two ideas? Because it sounds like once an ecosystem gets established, nothing really changes a whole lot, so you could make the argument that just because a lot of people are using a certain project doesn’t mean that you should also be using it. It could also encourage a different kind of behavior - if you’re telling people only to look at the popular ones, then that encourages a behavior of picking something that, I don’t know, maybe isn’t the best project. So how do you balance it - should we be looking at what’s trending or new or flashy, versus something that is older but everybody is using?
Yes, tricky one. I’ve been kind of intentionally avoiding highlighting the new, shiny things in package managers for the moment, and kind of not doing any newsletters of “Here are the latest and greatest things that have been published.” I think this mirrors my approach to software at the moment, which is to focus on actually shipping useful things to solve a problem, as opposed to following whatever the latest technology is.
But that’s just my point of view. There are lots of people who are looking for employment and want to be able to keep on top of whatever is currently the most likely to get them a job, which is a very different view of “What should I look at? What should I use?”
Something I really struggle with software in general, you often hear people saying, “Oh, this project should just die, because it’s not following modern development practices, or it’s just kind of hopeless and we should just focus on whatever is new.” I think it’s because it’s comparatively easier to do that with software infrastructure than it is with physical infrastructure; they can kind of just throw something away. But there’s a part of me that’s also like, “Well, maybe we should reinvest in things that are older but that everybody is still using.”
Yeah, and sometimes it’s a case of people very loudly saying, “I’m not gonna use this anymore”, whereas there are a number of people that are just using it and not telling anyone, just getting on with what they’re doing. They still require that stuff. Often you see companies will have their own private fork, or they’ll just keep their internal changes and improvements and never contribute them back, because they’re just solving their own particular problem.
I relatively recently started doing some Node stuff and I wanted to find a testing framework; I just wanted to write some tests, and I ended up going through about six in about five hours. By my assessment of what’s going on, the community was moving so quickly - three of the frameworks were all written by the same person. They had clearly changed their opinion about the way they were now going to work, but I literally couldn’t get… It wasn’t a very satisfactory experience, because things were moving so fast.
I consider myself reasonably technical and pretty good at using GitHub hopefully, and I found it hard to find a good set of defaults. I don’t know, I think finding the right thing, it’s…
It’s very similar in the browser at the moment. It’s hard to know - is this library the right thing anymore? I find myself going to DotCom to work out, “Is this mirroring an API that is now a standard, or has it moved on?”, because browsers being evergreen makes everything really hard to… And you can’t freeze anything in time anymore with anything that’s delivered to a browser, because Chrome is updating almost every day.
Yeah, I don’t know… The other thing is, if you actually went out and stuck your neck out and said “You should use these things”, then somebody’s obviously gonna shout at you on the internet and say “You’re an idiot. You should use this thing.” I think it’s hard for the individual to have a strong preference and be public about that. It’s an unsolved problem, I think.
The scary thing to me is that there is no correlation that I can find between the health of a project and the popularity of a project.
It’s totally fine if it’s not the coolest thing, but people are still working on it and it’s still maintained. But things actually die off - the maintainer leaves and it’s still popular and still out there, still being heavily used, because it’s that thing that people find. But as you said, that maintainer has already moved on to a new project, didn’t hand it over to anybody, has a new testing framework that they’re working on and doesn’t really care about this thing. So we don’t have a great way to surface that data or to embed it into the culture - like, when you’re looking for something, look for health. And what does health mean for a project?
And making that argument to someone that… They might not care about the health, because they’re like, “Well, it’s popular and everyone’s using it.” I struggle with sort of like what is a good argument for saying “You should care about this” to a user.
Yeah, it’s a very long-term thing as well, because if you get an instant result and you can ship it and be done, you’re like “Oh, that’s fine, I don’t need to come back and look at this again”, whereas in six months, a year’s time you might come back to it and be like “Oh, I wish I didn’t do this.” But you have to be quite forward-thinking; especially as a beginner, that can be something that you just don’t consider, the long-term implications of bit-rot on your software.
Yeah, I feel like there was a thing relatively recently on Hacker News, like “Commiserations, you’ve now got a popular open source project”, or something like that. It was this really well-articulated overview of, so you publish something on GitHub; now a bunch of people are using it, and now you’ve got the overhead of maintaining it for all of these people that maybe you don’t really wanna help.
[00:44:06.17] For me that’s just a good demonstration of, you know, lots of people publish open source code, and they’re doing that because that’s just normal, or maybe they’re doing that because that’s the free side of GitHub, or whatever the reason is they’re doing that; or they’re solving probably their own problems - they were working on something because they were trying to solve a problem for themselves. If that then happens, to become incredibly popular, because that’s a useful thing and lots of people wanna use it, there’s no contract of “It’s my job now to help you.” There’s just conventions and social norms around what it looks like to be a good maintainer, but there’s no…
I think a lot of people who publish something that then becomes popular maybe don’t want to maintain it, or maybe don’t have the time to maintain it. Money helps, I think, but I think funding open source is hard; for lots of people it isn’t their day job to work on these things, and I think there’s not a good way yet - apart from the very large open source projects - of handing something off to a different bunch of people. I think that’s actually not very well solved for. You see Twitter do it with some of their large open source projects, they put them in the Apache Software Foundation, but that’s a whole different kind of scale of what it looks like to look after an open source project.
Nadia, you’ve written a bunch about this, I’m sure you’ve got a bunch of opinions on this as well.
I think that you’ve really highlighted the basis for the shift in open source, which is that we’ve gone to a more traditional peer production model. If you read anything from Clay Shirky about peer production, it’s like you publish first and then you filter, and the culture around how you filter and how you figure that out is actually the culture that defines what that peer production system looks like.
And in older open source, in order to get involved at all it was so hard, that you basically internalized all of that culture and then basically became a maintainer waiting in the wings, and that’s just not the world anymore.
People publish things they have no interest in maintaining at all, because everybody just publishes - that’s the culture now. I think we’re actually gonna come into a break now, but when we get back we’re gonna dive into what are those metrics of success, what are those metrics of health, and how can we better define this.
[00:48:49.29] And we’re back. Alright, so let’s dive right into this. What are the metrics that we can use for success? How can we use this data to show what the health of an open source project might be and expose that to people? Let’s start with Arfon, since we have so many new metrics coming out of this new GitHub data.
Yeah, so I’ll start by not answering your question directly, if you don’t mind. One thing I would love to see is… There are things that I can do, and anybody who’s looked at enough open source software can do. If you give somebody ten minutes - “Tell me if this project is doing well” - you can answer that question as a human, right? You can go and look at the repo; maybe you find out they have a Slack channel or discussion board, and you go and see how active that is; you maybe go and look at how many releases there were, how many open issues there are, how many pull requests ended up being responded to in the last three or four months… You can take a look at a project and get a reasonable feeling for whether it’s doing well or not, and that I think is the project’s health. That’s what we can do with an experienced eye.
What that actually means in terms of heuristics - the ways in which we could codify that in terms of hard metrics - I think that’s a reasonably tough problem. I don’t think it’s impossible by any stretch, but we could make some up right now. Are there commits landing in master? Are pull requests being merged? Are issues being responded to and closed? Another one I’m particularly interested in, because I think it’s pretty important for the story we tell ourselves about open source - the kind where anyone can contribute - is “Are all the contributions coming from the core team, or are they coming from outside the core team?”
There’s one quote that calls this the ‘democracy of the project’. Is it actually - ‘meritocracy’ is a dirty word these days, but is it the community that’s contributing to this thing, or is it just three people who are actually rejecting the community’s contributions and are just working on their own stuff?
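As a rough illustration, heuristics like the ones Arfon lists could be codified along these lines. This is a hypothetical sketch - the field names, thresholds, and the 90-day window are all invented, and nothing here reflects what GitHub or Libraries.io actually computes:

```python
from dataclasses import dataclass

@dataclass
class RepoActivity:
    # Counts over the last 90 days; all field names are hypothetical.
    commits_to_default: int
    prs_merged: int
    issues_closed: int
    core_team_contributions: int
    outside_contributions: int

def health_signals(repo: RepoActivity) -> dict:
    """Turn raw activity counts into the rough signals discussed above."""
    total = repo.core_team_contributions + repo.outside_contributions
    outside_share = repo.outside_contributions / total if total else 0.0
    return {
        "active": repo.commits_to_default > 0,  # commits landing in master?
        "responsive": repo.prs_merged > 0 or repo.issues_closed > 0,
        # The "democracy of the project": how much comes from outside the core team?
        "outside_share": round(outside_share, 2),
    }

signals = health_signals(RepoActivity(42, 10, 25, 30, 20))
# For this made-up repo, 40% of contributions come from outside the core team.
```

None of these numbers is a verdict on its own; the point is only that an experienced reader’s ten-minute judgment can be approximated by a handful of explicit signals.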
Is it participatory, right? Can people participate? That’s the question.
Yeah. How open is this collaboration, is the way I like to think of it. Because I think that’s the thing we tell ourselves, and that’s one of the reasons that I think open source is both a collaboration model and a set of licenses and ways to think about IP. For me, the most exciting thing about open source - and actually about GitHub - is that I think the way in which collaboration can happen is very exciting. You have permission to take a copy, do some work and propose a change, and then have that conversation happen in the open.
A lot of people do that, but they’re actually working in a very small team, or working together. Actually, a while ago I tried to measure some of this stuff on a few projects that I use, and you can see quite clearly that some projects are terrible at merging community contributions. They’re absolutely appalling at it. I can’t name names; some of them are incredibly popular languages.
I totally won’t, I’ll absolutely not. Some of them are very poor. But then actually, just to counter that, okay, so what does it mean if you are very bad at merging contributions? Maybe that means your API is really robust and your software is really stable, right? It’s not clear that being very conservative about merging pull requests is wrong, but it does mean that the community feels different. It does mean that the collaboration experience is [unintelligible 00:52:44.16]
That’s exactly what I wanted to tease apart a little bit. I gave a talk recently where I was looking at Rust versus Clojure and how both of those communities function, and they’re really different. Rust is super participatory and Clojure is more BDFL, but one can make the argument that both are still working; Clojure really prioritizes stability over anything else, so that’s why they’re really careful about what they actually accept as contributions.
[00:53:10.28] So we talked about popularity of projects and then we’re talking now about health of projects, and it feels like two parts of it. One is around “Is this project active? Is it being actively worked on and being kept up to date?”, and you can look at contribution activity there. The other part is “Is it participatory or is it collaborative? Does the community itself look positive, healthy, welcoming?” But those are two pretty separate areas in my opinion.
yum, which has an even smaller number of people on the project who could actually publish whatever changes were merged in - unless everyone is literally pulling from GitHub directly, which I don’t think is how most published software works yet.
My prediction here is that the people and the organizations that are gonna solve this are gonna be the ones that are paying the most attention to business users of open source. Because if you are a CIO and you’re thinking about starting to use open source more extensively in your organization, then you’re assessing the risk of that in terms of maintenance and service agreements, and understanding whether, if a project does have a security vulnerability, that’s likely to be patched… It’s useful to know in open source generally - “Should I use this library? Is it likely to see updates when Rails 5 is released?” or “When something happens, can I use my favorite framework or my favorite tool with this? Is that likely to happen?” That’s useful to know, but it’s not business-critical. I think the people who really want a hard answer to this are more likely to be business consumers. That’s my prediction. I think there’s actually a lot of opportunity to do good stuff in this space.
[00:57:12.00] The Linux Foundation are doing a little bit of that with the Core Infrastructure Initiative, where they’re trying to see, “Has this project had a security review? When was the last time the people behind the project were checked in with?”, which I think is a harder thing to do automatically. You end up having to have a set of humans go and contact other humans, and if those people are anything like me on e-mail, it may take ages to get a response.
There’s a fair number of metrics that we can pull in automatically to give you a light indication of whether a project is healthy. I guess you have to split it in half again and go, “Well, what do I care about in this project? Is this thing that I’m doing a throw-away fun experiment or a learning exercise, or is it something I’m gonna be putting into production?” Then you have to look at things with two very different sets of metrics.
I think the methodology they used is somewhat applicable here, though. I know a lot about the CII thing because I’m at the Linux Foundation. The NodeJS project was one of the first to get a security badge. Essentially, they started by asking “How do we do a really good survey of projects that are problematic - do they have a security problem?” They asked some of the same questions that we’re asking, like “What makes a project healthy? How do we define that?” Then they went out and did this huge survey to identify all the projects that were having problems. Later, they turned all of those things into basically a badging program: there’s a set of recommendations, and if you do all of these things, then you get the security badge.
The Node project was one of the launch partners of this. It’s really simple stuff, like have a private security list, have a documented disclosure policy, have that on a website somewhere. It sounds really basic, but the number of projects that are heavily depended on that don’t do that is surprisingly big. And just having a really basic set of things that people can go do that make people feel better about their software and are actually good for the health of the projects is like a really good set of recommendations that we can come up with, that would actually be based on metrics and some really good methodology.
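The spirit of that checklist is simple enough to sketch as a toy checker over a local checkout. The criteria names and candidate file paths below are paraphrased guesses for illustration, not the actual CII Best Practices criteria:

```python
from pathlib import Path

# Toy badging checklist: each criterion passes if any candidate file exists.
CHECKS = {
    "security_policy": ["SECURITY.md", "docs/security.md"],
    "license": ["LICENSE", "LICENSE.md", "COPYING"],
    "contributing_guide": ["CONTRIBUTING.md", "docs/contributing.md"],
}

def badge_report(repo_root: str) -> dict:
    """Report which basic-hygiene criteria a checkout satisfies."""
    root = Path(repo_root)
    return {
        check: any((root / candidate).is_file() for candidate in candidates)
        for check, candidates in CHECKS.items()
    }
```

The real program also covers things a file check can’t see - a private security list, a documented disclosure process - but even this much would flag a surprising number of heavily-depended-on projects.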
I’m curious to kind of move this a little bit to thinking about analytics from a maintainer’s point of view. So if you’re a maintainer and you have a project, the project gets popular, what should they be measuring for their projects? What do you think they should be paying attention to at a high level?
Someone asked me a question the other day on Twitter… They were wondering, for a given library that they were maintaining, which versions of that library people depended on. They wanted to see, for the 500 other projects that depended on it, what versions they were using, because they wanted to get an idea of which things they could deprecate. As Mikeal said earlier, we wanna know the actual pain points - if people are stuck on an old version, how can we move them forward, so that we can drop some old code or clean up something that we don’t like anymore? That data is fairly easy to get, although trying to lump it together with SemVer ranges means you end up with “Oh, they depend on something around this version”, as opposed to something very specific.
[01:00:59.03] But having that actual usage data around versions is valuable. Some package managers actually give you download data per version as well, so you can see, “Oh, this thing looks completely dead - no one is downloading it anymore”, as opposed to the last two releases, which are really heavily downloaded. You can get that data from RubyGems. I don’t think NPM has download data on a per-version basis, at least not publicly available. For other, smaller package managers it’s kind of all over the place, whereas at least on GitHub you can assume everyone is looking at the default branch.
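A minimal sketch of the declared-dependency side of that analysis: bucket the requirement strings of a library’s dependents by major version to see which release lines still have users. The parsing is deliberately naive - real SemVer ranges like “~> 1.4” or “>= 2.0, < 3” need a proper resolver - and the sample requirements are made up:

```python
from collections import Counter

def version_usage(requirements: list[str]) -> Counter:
    """Count dependents per major version, given their requirement strings.

    Deliberately naive: we just grab the first version number that appears,
    which is enough to see which major release lines are still depended on.
    """
    majors = Counter()
    for req in requirements:
        # Keep digits and dots, blank out operators like "~>", ">=", "^".
        tokens = "".join(
            ch if ch.isdigit() or ch == "." else " " for ch in req
        ).split()
        if tokens:
            majors[tokens[0].split(".")[0]] += 1
    return majors

usage = version_usage(["~> 1.4", ">= 2.0", "1.9.3", "^2.1.0", "~> 1.0"])
# Here the 1.x line still has three dependents, so it can't be dropped yet.
```

Pairing this with the per-version download counts that RubyGems publishes would give you both halves of the question: who declares a dependency, and who actually installs it.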
Then also looking into the forks is something that maintainers might wanna do to be able to kind of go, “Oh, people are forking this off and changing things manually. They haven’t wanted to contribute back? Why didn’t they contribute back?” It definitely seems to me to come down to very human questions, as opposed to kind of like “What versions of Node are people running when they’re using my library?” It’s more kind of like, “How can I help these people either move forward onto a newer version, or what are the exceptions that they’re having that I never see?”
I was talking to the guys at Bugsnag, who do exception tracking. They collect a lot of exception data that’s actually thrown up by open source libraries, and they see it in the stack trace, like “Oh, this error has come from Rack”, for example. They were investigating whether they could ask users for permission to report that exception data back - like “This line of your source code is causing lots of people lots of exceptions, for whatever reason” - which I thought was quite interesting. I don’t think they’ve actually gotten around to doing that yet, though.
Yeah, I’m also interested in the types of roles people have on your project, as well. One of the projects I maintain for GitHub is called Linguist, which is actually one of our more popular open source projects; it does the language detection on GitHub. It’s kind of a self-service project - if a new language springs up in the community and you want GitHub to recognize it and maybe syntax-highlight it, then you need to come along and add that to Linguist. For the longest time it’s been myself and one other GitHubber merging pull requests, and we realized that the rate at which the project was able to move and be responsive was actually really severely limited by our attention. So I went and looked at who made the best pull requests and was most responsive on the project in the past 6-12 months, and I just gave a couple of those people commit rights to master.
We’ve still got a little bit of policy around who gets to do releases, just because it’s kind of coupled to our production environment, but doing that has just breathed new life into the project. One of the things you can get - it’s not completely straightforward, but you can get it from the Pulse page - is who’s got the most commits to master in the last year or two… Paying attention to who’s active on your project and then thinking about their role - it’s not a hard metric, but thinking about who’s around, who actually really understands and cares about the project, who has been contributing… I’m just reflecting on that; it’s only been a few weeks that we’ve been doing it, but it’s been really successful so far, and has really put a shot in the arm in terms of the energy of the project.
[01:05:02.19] My approach with open source projects I maintain is based off Felix Geisendörfer’s blog post from, I guess, a couple of years ago. He basically just goes, “If someone sends me a pull request, I’m just gonna add them as a contributor. Because what’s the worst that could happen? If they merge something I don’t like, I can just back it out.” And later on, maybe give them release rights once they’ve proved themselves a little bit and shown they’re not gonna go crazy… Which seems to work really well - you get a lot more initial contributions, and those people might not stay around very long, but you see a spike in activity.
And that really developed in the Node community, too. Eventually, that turned into open-open source and more liberal contribution agreements. It’s really the basis now for Node’s core policies as well. There’s been a lot of iteration there on how you liberalize access and commit rights and stuff like that.
It’s been quite interesting to have GitHub actually go like, “Oh, this is the third pull request you’ve received from this person. You should consider adding them as a collaborator so they can do this themselves.”
In the Node project we do a roll-up every month just to show, “Okay, these are the people that merge a lot of stuff”, and then there’s a note next to them if they’re a committer or not, so that they can get onboarded if they’re not. That’s how we base the nominations.
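That monthly roll-up is a small amount of code. This is a hypothetical illustration of the idea rather than the Node project’s actual tooling - in practice the author logins and the committer set would come from the GitHub API:

```python
from collections import Counter

def monthly_rollup(
    pr_authors: list[str], committers: set[str]
) -> list[tuple[str, int, bool]]:
    """Rank this month's merged-PR authors, flagging who is already a committer.

    Authors with a False flag and a healthy count are nomination candidates.
    """
    counts = Counter(pr_authors)
    return [
        (author, merged, author in committers)
        for author, merged in counts.most_common()
    ]

rollup = monthly_rollup(
    ["alice", "bob", "alice", "carol", "alice", "bob"],
    committers={"alice"},
)
# bob (2 merges) and carol (1) are not yet committers - onboarding candidates.
```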
If that was automatically integrated in GitHub it would save me so much time… To run those scripts and [unintelligible 01:06:33.08] those issues, it would be fantastic.
I think Ruby on Rails runs a leaderboard as well, of the total number of commits into any of the Rails projects, and you can see a little star next to the ones who are currently Rails core. It kind of gamifies it a little bit, which - I don’t know if that’s a good thing or not. I guess as long as people are actually doing it for the contributions rather than just to get up the leaderboard…
I think it’d be cool to see that for other types of contributions too, like people that are really active in issues or people that are doing a lot of triaging work, or whatever. I hear that from people, of “Well, I also wanna recognize all these other people that are falling through the cracks or that we don’t always see.”
Right, yeah. We did this blog post recently called “The Shape Of Open Source” that shows really clearly the difference between the types of activities around a project as the contributor pool grows. You can see that the lion’s share of the activity goes from commits, if it’s just a solo project, to comments on code and pull requests, to actual code review, and then to just comments on pull requests and issues, and replies to those issues. It demonstrates that the project has kind of transitioned - a lot of it becomes user support, and that’s a ton of work, and I think that’s what that contributor role is. There’s been some nice thinking going on around that, but I don’t think it’s yet baked itself into changes in the way products like GitHub actually work.
Well, to wind this down a little bit and look more towards the future - are there any trends like that that you see actually growing over time? I’ll ask this to both of you… We’ve talked a lot about what the data looks like right now. If you look at the data now, compared to last year or compared to the year before, what are the biggest growth areas in terms of what this data looks like?
[01:08:47.15] Well, for me there’s an accelerating number of packages everywhere, across every package manager in a language that is still very active. Perl has slowed down a little bit, but most package managers seem to continue to gain more and more code. There’s just more choice and more software to keep track of and to choose from. There’s never just, “Oh, there’s the one obvious choice for this thing.” It feels like it’s reaching the point the internet reached 10-15 years ago, when the Yahoo! curated homepage was no longer useful because they couldn’t keep up with the amount of things being published. We have the equivalent in awesome lists, where people are manually adding stuff - it’s kind of like the Yahoo! Directory of the internet - whereas you need something like Google to come along and go, “Actually, here are the things that are gonna solve your problems.”
The dependency graph does give you something like a PageRank. If you used a combination of links to the package - either the GitHub page or the NPM page - and dependencies from actual software projects, you would have a good picture of the things that are considered most useful. Which is something I’ve tried to build in, but it’s a huge amount of work to keep on top of - essentially building Google again, but for software.
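The analogy translates fairly directly: treat “A depends on B” as a link from A to B and run PageRank-style power iteration, so heavily-depended-on packages accumulate rank. A minimal sketch over a made-up four-package graph (the package names are just examples):

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Plain power-iteration PageRank, where graph[a] lists what a depends on."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, deps in graph.items():
            if deps:
                # Each dependency gets an equal share of the dependent's rank.
                share = rank[src] / len(deps)
                for dep in deps:
                    incoming[dep] += share
            else:
                # Leaf package with no dependencies: spread its rank evenly.
                for n in nodes:
                    incoming[n] += rank[src] / len(nodes)
        rank = {
            n: (1 - damping) / len(nodes) + damping * incoming[n] for n in nodes
        }
    return rank

scores = pagerank({
    "app-a": ["left-pad", "request"],
    "app-b": ["left-pad"],
    "left-pad": [],
    "request": ["left-pad"],
})
# left-pad, which everything depends on, ends up with the highest score.
```

Combining a score like this with hyperlinks to the GitHub or NPM pages, as Andrew suggests, is where the real work starts - a full ecosystem’s graph has millions of edges and changes daily.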
Right. Clay Shirky has been mentioned once already on this today, but let’s mention him again - he’s like, “The problem is filter failure, not information overload.” I think currently a lot of what we’ve talked about today, it’s like it’s hard to find the right thing, because the volume of open source software is growing exponentially.
I think it’s almost becoming standard to hear some of these conversations happen now. People are like, “Yeah, but how can we measure health? How can we know whether a project is doing well?” How is the data changing? I don’t know that the data itself is changing that much; I think Homebrew adding those metrics to capture usage is a really good step in the right direction.
Part of it is that there’s data missing that we don’t have, and it would be better to have a more explicit measure of the consumption and use of open source.
I think the other part of it, the biggest change that I’m seeing is that the conversation is moving pretty fast, and that to me speaks of a demand and a better understanding of the problem generally in the community, and I think that means that we’re likely to see product changes and improvements that help solve some of the really common issues for people.
That’s great, I’m excited!
There’s a lot of people [unintelligible 01:12:06.20] that area as well. Did you see the Software Heritage project that was released yesterday?
So far they’re just collecting stuff, but building those kinds of tools on top of all of that - like an internet archive of software - could be a really powerful way of collecting those metrics, distributing them out, and allowing people to do interesting things on top of them.
I think we’ll leave it there. Thank you all for coming on, this was amazing.
Thanks for the conversation.