by Patrick DeVivo
Kubernetes is a big project. Not only because it’s a big deal, but also in terms of its source code. At the time of writing, there are 86k+ commits, 2k+ contributors, 2k+ open issues, 1k+ open PRs, and 61k+ stars. These figures are all visible on the project’s GitHub page.
scc counts 4.3M+ lines of Go source code (5.2M+ total lines): 3M+ lines of “actual” code vs. 700k+ lines of comments, in 16k+ files in total. This includes the vendor/ directory.
We’ve been working on a project that surfaces TODO comments in a codebase to help developers do basic project management workflows within that codebase.
We decided to point our little TODO finder at the Kubernetes source code to see what would turn up. Here are some of the results.
We ran tickgit against source code from commit 9bf52c2. We then imported the CSV output into SQLite and ran queries against it. Note that the tool only finds TODOs in the tree of the checked-out commit; it will not account for TODOs that were added and subsequently removed. Therefore, the numbers reflect only the TODOs still “live” in the code at that commit.
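The import step can be sketched as follows. This is a minimal illustration, not the exact process we used: the load_csv helper, the table name, and the sample column names are all hypothetical, and the real column names come from whatever header the tool's CSV output carries.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header and insert every row as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote column names so arbitrary header text is a valid identifier.
    columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
        (row for row in reader if row),  # skip blank lines
    )
    conn.commit()
    return len(header)


# Hypothetical sample data; these column names are assumptions.
SAMPLE = """file_path,line,author,text
cluster/gce/util.sh,10,alice,TODO: tidy this up
pkg/kubelet/kubelet.go,42,bob,TODO(alice): handle error
"""

conn = sqlite3.connect(":memory:")
load_csv(conn, "todos", io.StringIO(SAMPLE))
print(conn.execute("SELECT COUNT(*) FROM todos").fetchone()[0])
```

With a real export, the same helper could be pointed at an opened CSV file instead of the in-memory sample.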
Totals (for 9bf52c2)
- 2,380 TODOs across 1,230 files from 363 distinct authors
- 460 TODOs with an assignee, e.g. // TODO (patrickdevivo) Fix the ...
- 489 TODOs were added in 2019 so far
- 860 days (or 2.3 years) is the average age of a TODO
- The oldest TODO is from Jun 6, 2014 (from “First commit”)
- The most recent TODO is from Dec 9, 2019
- cluster/gce/util.sh has the most TODOs at 33
- deads2k has added the most (current) TODOs (per git blame) at 147
- Commit 6a4d5cd7cc58e28c20ca133dab7b0e9e56192fe3 added the most TODOs (that are still in the source) at 64
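To illustrate what “a TODO with an assignee” means in the list above, here is a rough sketch that pulls the name out of a comment like the example. The regular expression and the assignee function are assumptions made for illustration; the tool's actual matching rules may differ.

```python
import re

# Matches "TODO(name)" or "TODO (name)" and captures the name.
# This pattern is an assumption, not the tool's real rule.
ASSIGNEE = re.compile(r"\bTODO\s*\(([^)\s]+)\)")


def assignee(comment):
    """Return the assignee named in a TODO comment, or None."""
    match = ASSIGNEE.search(comment)
    return match.group(1) if match else None


print(assignee("// TODO (patrickdevivo) Fix the ..."))  # patrickdevivo
print(assignee("// TODO: no owner here"))               # None
```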
Summaries
count,file_path
33,cluster/gce/util.sh
25,pkg/apis/core/types.go
23,staging/src/k8s.io/api/core/v1/types.go
21,staging/src/k8s.io/legacy-cloud-providers/aws/aws.go
20,staging/src/k8s.io/code-generator/cmd/conversion-gen/generators/conversion.go
20,pkg/apis/core/validation/validation.go
16,test/e2e/network/service.go
16,pkg/kubelet/kubelet.go
14,test/e2e/framework/util.go
14,pkg/kubelet/kubelet_pods.go
author,count
deads2k,147
Clayton Coleman,105
Chao Xu,99
Dr. Stefan Schimanski,93
Jordan Liggitt,81
David Eads,60
Random-Liu,54
Wojciech Tyczynski,50
Yu-Ju Hong,43
Prashanth Balasubramanian,38
count,sha
64,6a4d5cd7cc58e28c20ca133dab7b0e9e56192fe3
19,e01ff1641c7321ac81fe5775f6ccb21aa6775c04
19,4fb28dafad121e163fa86dc90067ce3d14415811
18,adb75e1fd17b11e6a0256a4984ef9b18957d94ce
14,963c85e1c807efcdbb82dd44439dc3c55f6a0bfd
14,8b17db7e0c4431cd5fd9a5d9a3ab11b04e2f0a7e
13,f0f78299348afcf770d4e8d89dcea82f80811b28
11,d0b94538b9744d0c06df6ddec2604be168568f9d
10,f1248b9c829e225138ab6d6234221c63092f7592
10,cd663d7ad00937cffa8a09e4761acb95d34c89a3
count,year
34,2014
249,2015
523,2016
650,2017
435,2018
489,2019
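As a quick consistency check, the per-year counts can be summed and compared with the 2,380 total reported earlier. The numbers below are copied from the count,year table; nothing else is assumed.

```python
# Per-year counts copied from the count,year table.
BY_YEAR = {2014: 34, 2015: 249, 2016: 523, 2017: 650, 2018: 435, 2019: 489}

total = sum(BY_YEAR.values())
before_2019 = sum(n for year, n in BY_YEAR.items() if year < 2019)

print(total)        # 2380
print(before_2019)  # 1891
```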
To produce similar results, try tickgit todos --csv-output to get raw TODO data. We used SQLite to query for the above summaries.
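Queries of roughly the following shape would produce summaries like the ones above. This is a sketch under assumptions: the todos table and its columns (file_path, author, sha, authored_at) are hypothetical names, and the rows are made-up sample data rather than Kubernetes results.

```python
import sqlite3

# Columns that may be grouped on; names are interpolated into SQL,
# so only this fixed set is accepted.
GROUPABLE = {"file_path", "author", "sha"}


def sample_db():
    """Build an in-memory table with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE todos (file_path TEXT, author TEXT, sha TEXT, authored_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO todos VALUES (?, ?, ?, ?)",
        [
            ("cluster/gce/util.sh", "alice", "aaa111", "2016-03-01"),
            ("cluster/gce/util.sh", "bob", "bbb222", "2017-07-15"),
            ("pkg/kubelet/kubelet.go", "alice", "aaa111", "2016-03-01"),
            ("pkg/kubelet/kubelet.go", "alice", "ccc333", "2019-12-09"),
            ("cluster/gce/util.sh", "alice", "ccc333", "2019-12-09"),
        ],
    )
    return conn


def top(conn, column, limit=10):
    """Count TODOs per distinct value of `column`, largest first."""
    if column not in GROUPABLE:
        raise ValueError("unsupported column: " + column)
    query = (
        "SELECT COUNT(*) AS n, {col} FROM todos "
        "GROUP BY {col} ORDER BY n DESC, {col} LIMIT ?"
    ).format(col=column)
    return conn.execute(query, (limit,)).fetchall()


def by_year(conn):
    """Count TODOs per year in which they were authored."""
    query = (
        "SELECT COUNT(*) AS n, strftime('%Y', authored_at) AS year "
        "FROM todos GROUP BY year ORDER BY year"
    )
    return conn.execute(query).fetchall()


db = sample_db()
print(top(db, "file_path"))
print(top(db, "author"))
print(by_year(db))
```

The same top function covers the per-file, per-author, and per-commit tables, since they differ only in the grouping column.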
Conclusions and Questions
These results are from a fairly off-the-cuff look at what TODO comments in the Kubernetes source code look like. We get a sense of the top TODO creators, which tracks more or less with the top contributors to the project.
We also see that for “large” source code, developer behavior around TODO comments doesn’t seem to be out of the norm; there’s just more of it.
An important observation is that there are more TODO comments (2,380) than open GitHub issues (2k+). This is interesting because it indicates a significant amount of latent “work,” or to-do items, that is not easily accessible unless you spend time in the source code itself.
Core contributors likely have a good idea of their area of the codebase and strong intuitions about their own TODOs and “latent work.” This is fairly opaque to outside observers, though. GitHub issues (or other public ticket trackers) are more easily accessible to those not “in the weeds” of the project.
As most developers understand, software projects “live and breathe.” There’s frequent change, continuous improvement, constant imperfection and lots of discussion. Workflow and process are very important because good code requires continual reflection. We see a part of this in action through the use of TODO comments in the Kubernetes source. We have no benchmark to compare against, but an average TODO age of 2.3 years does seem quite high. Those closer to the code will be much better able to pass judgment; it would be interesting to see how this source code compares to that of other big open source projects.
A more in-depth analysis of a codebase’s TODOs might involve a look at all of the TODOs in the history, not just the ones currently in the source code.
- What’s the rate at which TODOs are closed over time?
- What’s the average lifetime of a TODO comment?
- How do popular codebases compare to one another?
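Given a history of TODOs with the date each was added and, where applicable, removed, the first two questions could be approached along these lines. This is only a sketch: the (added, removed) record shape and the sample dates are hypothetical, and collecting that history from git is not shown.

```python
from collections import Counter
from datetime import date


def average_lifetime_days(todos):
    """Mean days between add and removal, over TODOs that were removed."""
    lifetimes = [(removed - added).days for added, removed in todos if removed]
    return sum(lifetimes) / len(lifetimes) if lifetimes else None


def closed_per_year(todos):
    """Count removed TODOs by the year in which they were removed."""
    return Counter(removed.year for _, removed in todos if removed)


# Hypothetical history: (date added, date removed or None if still open).
HISTORY = [
    (date(2016, 1, 1), date(2016, 1, 31)),  # lived 30 days
    (date(2017, 6, 1), date(2018, 6, 1)),   # lived 365 days
    (date(2018, 3, 1), None),               # still in the source
]

print(average_lifetime_days(HISTORY))  # 197.5
print(dict(closed_per_year(HISTORY)))  # {2016: 1, 2018: 1}
```

Note that averaging only over removed TODOs understates lifetimes when many long-lived TODOs are still open, so any real comparison would need to account for those as well.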
Does it Matter?
TODO comments typically cover the type of work that might be too small for a ticket, but important enough to note and describe in a code comment (though plenty of TODOs will reference issues/tickets). Since they are part of the code, they are often “closer” to the work that needs to get done. They are easy to add but, it seems, just as easy to lose (there are 1.8k+ TODOs added prior to 2019 still in the Kubernetes source).
We hope that by creating a tool that surfaces metadata about code, we can make it easier for software developers to get work done, in projects of any size. Surfacing TODOs is just one piece of that.