Finding duplicated code in huge codebases

This is a WIP article

In the spring of 2023, I finished my master thesis at The University of Oslo in Norway. Ever since I discovered the field of programming languages and compiler technology, I have been interested in leveraging such technologies in order to improve the quality of my own code. As a Neovim user, I have become familiar with technologies such as LSP and tree-sitter, which is a pleasure to use in my daily programming work. For my thesis I wanted to explore the idea of adding more static analysis tools while editing, leveraging the editor-agnostic aspects of LSP, and the language agnostic (grammar pluggable) aspects of tree-sitter. The problem I decided to tackle was detection of duplicate code, otherwise known as code clone detection.

The problem

To set the stage, this is the problem we are trying to solve:

We are building a static analysis tool which runs as an LSP server, giving indications to the programmer that there are code duplicates in their codebase, at a given set of locations. As the programmer writes, changes or deletes code, we need to be able to react to those changes, finding or removing code clones from the programmers view as they appear/disappear.

The real challenge with this problem is the algorithmic aspect. We need to process potentially many MLOC (million lines of code) and we need to do it quickly. This is impossible to do in real-time, but we probably don't have to reprocess everything whenever the programmer does an edit, right? As the programmer edits source code, there is only a small portion of code in a single file which is edited at a time. This means that 99%+ of our codebase and list of clones will probably stay the same after each edit. If we can incrementally update our list of clones given only the small edit of the source code, this problem goes from impossible to do quickly, to a realistic, but tough challenge.

Preferrably we would have a solution capable of calculating the initial clones in O(N) time, and from there we are able to efficiently update the clone list given a smaller edit, in O(|edit|) time.

This problem could be seen as a variation of the "maximal repeat" problem. Given a string, we want to find all the maximal repeat of at least size k, where k is the minimum length of a pair of code clones. However, we wouldn't want to plug our entire code-base directly into a maximal-repeat algorithm, the string would be too large to process in a reasonable amount of time. We might also want to only consider certain sections of source code (for example all methods). Therefore we need to pre-process our source-code.

Pre-processing timeline

Parsing
Fingerprinting

Memory optimizations

Fingerprint size compared to source-code size
Source-map
Keeping only open files and their AST's in memory

The algorithm

TODO

Suffix array for maximal-repeat (LCP)
Fetching code clones from LCP array back to source code
Dynamic suffix array update