Using Artificial Intelligence in Open Source Audits

Open source auditing is a tedious process in which the auditor must produce a “bill of materials” (BoM) listing all the open source components used within a software product. Depending on the size and complexity of the software, the codebase being audited may contain thousands or tens of thousands of source code files. The BoM must identify not only all the open source components and snippets contained in the codebase but also the open source license associated with each component. This tedious process, which some might describe as dull, is extremely error-prone. Furthermore, licenses are often sloppily declared, and many open source projects make minor modifications to existing open source licenses, which can be difficult for an auditor to spot.

In many industries, automation is seen as a way to replace or reduce dull, repetitive, and error-prone work currently performed by people. Open source auditing is no different in this respect, and several open source compliance solutions claim to automate the auditing process. Automation, however, can be a double-edged sword. When an automated system makes an error, it can easily go unnoticed, which can lead to big problems further down the road.

AI in Open Source Audits

At first glance, automating the open source auditing process may seem straightforward: scan the codebase and identify “matches” against a database of open source components, then mark the matches with the corresponding open source component and license, and finally generate a BoM listing all the open source components and their corresponding licenses. In reality, the process is quite a bit more challenging.
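To make the "at first glance" version concrete, here is a minimal sketch of that naive scan, match, and report pipeline. The class name `NaiveBomBuilder`, the in-memory "database," and the whole-file matching are assumptions made purely for illustration; a real scanner would match fingerprints of files and snippets against a far larger knowledge base.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the naive pipeline: scan, match, generate a BoM.
class NaiveBomBuilder {

    // Toy "database": normalized file content -> {component, license}.
    // Assumption for illustration only; not how any real scanner stores data.
    private final Map<String, String[]> knownComponents = new HashMap<>();

    void register(String content, String component, String license) {
        knownComponents.put(normalize(content), new String[] {component, license});
    }

    // Collapse whitespace so trivial formatting changes do not defeat matching.
    static String normalize(String content) {
        return content.trim().replaceAll("\\s+", " ");
    }

    // Takes path -> file content, returns component -> license (sorted by component).
    Map<String, String> buildBom(Map<String, String> codebase) {
        Map<String, String> bom = new TreeMap<>();
        for (Map.Entry<String, String> file : codebase.entrySet()) {
            String[] match = knownComponents.get(normalize(file.getValue()));
            if (match != null) {
                bom.put(match[0], match[1]);
            }
        }
        return bom;
    }

    public static void main(String[] args) {
        NaiveBomBuilder builder = new NaiveBomBuilder();
        builder.register("int add(int a, int b) { return a + b; }", "libfoo", "MIT");

        Map<String, String> codebase = new HashMap<>();
        codebase.put("src/Add.java", "int add(int a, int b) {\n    return a + b;\n}");
        codebase.put("src/Own.java", "int ownCode() { return 42; }");

        System.out.println(builder.buildBom(codebase));
    }
}
```

Note what the sketch takes for granted: that every match is meaningful and that every component carries one cleanly declared license. Those are exactly the assumptions that break down in practice.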

License declarations are nearly always expressed in human language, and essentially an automated system must get into the minds of open source developers and try to understand their intent when writing a particular license declaration. Computers are fundamentally logical, whereas people have been known to defy logic from time to time. Take, for example, this license declaration:

* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as
* published by the Free Software Foundation, either version 3 of the
* License.

The presence of the word “either” indicates a choice among two or more licenses. The full stop after “version 3 of the License,” however, indicates a choice of exactly one. Those familiar with open source licenses will recognize this text as part of the standard header for the “AGPL-3.0-or-later” license, except that the developer (or someone else) has removed the phrase “or (at your option) any later version.” So the question is, does it become “AGPL-3.0-only” or “AGPL-3.0-or-later”? An automated system must know what to do in this scenario.
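As a rough illustration of the minimum an automated system needs here, the sketch below recognizes the AGPL header and flags the truncated variant for review rather than guessing. The class `AgplHeaderCheck`, its `Verdict` values, and the plain phrase matching are hypothetical; this is a simple rule, not an AI model and not a description of any particular product.

```java
import java.util.regex.Pattern;

// Hypothetical rule-based check for the truncated AGPL header discussed above.
class AgplHeaderCheck {

    enum Verdict { AGPL_3_0_OR_LATER, NEEDS_REVIEW, NOT_AGPL_HEADER }

    private static final Pattern AGPL = Pattern.compile(
            "GNU Affero General Public License", Pattern.CASE_INSENSITIVE);
    private static final Pattern OR_LATER = Pattern.compile(
            "or \\(at your option\\) any later version", Pattern.CASE_INSENSITIVE);

    // Strip comment decoration and collapse whitespace so that phrases
    // split across comment lines can still be matched.
    static String normalize(String header) {
        return header.replaceAll("[*/#]", " ").replaceAll("\\s+", " ").trim();
    }

    static Verdict classify(String header) {
        String text = normalize(header);
        if (!AGPL.matcher(text).find()) {
            return Verdict.NOT_AGPL_HEADER;
        }
        if (OR_LATER.matcher(text).find()) {
            return Verdict.AGPL_3_0_OR_LATER;
        }
        // The "or later" clause is missing: the author's intent is ambiguous,
        // so escalate instead of silently picking "only" or "or later".
        return Verdict.NEEDS_REVIEW;
    }

    public static void main(String[] args) {
        String truncated =
                " * This program is free software: you can redistribute it and/or modify\n"
              + " * it under the terms of the GNU Affero General Public License as\n"
              + " * published by the Free Software Foundation, either version 3 of the\n"
              + " * License.\n";
        System.out.println(classify(truncated));
    }
}
```

The design choice worth noting is the third verdict: when the text contradicts itself, the sketch reports the ambiguity instead of resolving it.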

This is just one of many thousands of examples where “the devil is in the details.” AI can be used to identify such situations and even to decide what to do in many such situations.

The problem of false positives

Another challenge in automating open source auditing is that the scanning process can generate a large number of so-called “false positives”. A false positive in this context means a match to an open source component that does not really contain any code subject to copyright, due to it being overly generic. The following Java code snippet demonstrates an example:

    private String name;
    private int id;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

This code is so generic (and probably auto-generated) that nobody is likely to argue it falls under copyright protection, so an auditor would simply discard this match. The problem is that in certain codebases, there might be thousands of such “false positives,” and clearing them all is a time-consuming nuisance for auditors. Here again, AI can be used to distinguish between a bona fide match that should be included in the BoM and a false positive that can be safely ignored.
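To give a feel for the problem, the following sketch scores a matched snippet by the share of its lines that are pure accessor boilerplate. It is a hand-written heuristic standing in for a trained classifier, not an AI system; the class `BoilerplateFilter`, the patterns, and the threshold are all assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical heuristic: flag matches that consist almost entirely of
// field declarations and trivial getters/setters.
class BoilerplateFilter {

    private static final Pattern[] TRIVIAL = {
        // Field declaration, e.g. "private String name;"
        Pattern.compile("(private|protected|public)\\s+[\\w<>\\[\\]]+\\s+\\w+;"),
        // Getter signature, e.g. "public String getName() {"
        Pattern.compile("public\\s+[\\w<>\\[\\]]+\\s+(get|is)\\w+\\(\\)\\s*\\{"),
        // Setter signature, e.g. "public void setName(String name) {"
        Pattern.compile("public\\s+void\\s+set\\w+\\([\\w<>\\[\\]]+\\s+\\w+\\)\\s*\\{"),
        // Trivial return, e.g. "return name;"
        Pattern.compile("return\\s+(this\\.)?\\w+;"),
        // Trivial assignment, e.g. "this.name = name;"
        Pattern.compile("(this\\.)?\\w+\\s*=\\s*\\w+;"),
        // Closing brace
        Pattern.compile("\\}")
    };

    // Fraction of non-blank lines that match one of the trivial patterns.
    static double trivialRatio(String snippet) {
        int total = 0;
        int trivial = 0;
        for (String raw : snippet.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            total++;
            for (Pattern p : TRIVIAL) {
                if (p.matcher(line).matches()) {
                    trivial++;
                    break;
                }
            }
        }
        return total == 0 ? 0.0 : (double) trivial / total;
    }

    static boolean isLikelyFalsePositive(String snippet, double threshold) {
        return trivialRatio(snippet) >= threshold;
    }

    public static void main(String[] args) {
        String accessors = "private int id;\npublic int getId() {\nreturn id;\n}\n";
        System.out.println(isLikelyFalsePositive(accessors, 0.9));
    }
}
```

A fixed rule like this only covers one family of generic code in one language, which is part of why a learned approach is attractive for the general case.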

Can AI really automate the whole open source auditing process?

It is perhaps not surprising that some are skeptical about the ability of AI to automate the entire open source auditing process, especially those familiar with the intricacies of open source auditing. Some skepticism may arise out of fear for the replacement of human jobs with computers. In other cases, the skepticism derives from an authentic understanding of how challenging it is to always get it right when it comes to open source auditing.

At FossID, we are focused on building an AI system that errs on the side of caution. We don’t want our solution to make any mistakes at all, so when it is unsure about how to classify a particular code snippet or license declaration, it alerts a human auditor to review the case. In this way, we see AI as a tool to help the auditor. Because the clear-cut cases are handled automatically, the auditor can focus more time and attention on the problematic cases. This capability can greatly reduce the error rate because inevitably the time and resources available to audit a particular software product are limited. Instead of spending X hours going through thousands of repetitive trivial matches, the auditor can give the challenging cases the attention that they deserve. This makes the task of an auditor more intellectually stimulating and ultimately elevates his or her role as an expert in software legal compliance.
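The "err on the side of caution" idea can be sketched as a simple confidence gate: findings the classifier is sure about are resolved automatically, and everything else is queued for a human. The class `AuditTriage`, the `Finding` fields, and the threshold value are hypothetical and are not a description of how FossID's system is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: route each finding by classifier confidence.
class AuditTriage {

    static class Finding {
        final String id;
        final String proposedLabel;
        final double confidence; // assumed to lie in [0, 1]

        Finding(String id, String proposedLabel, double confidence) {
            this.id = id;
            this.proposedLabel = proposedLabel;
            this.confidence = confidence;
        }
    }

    final List<Finding> autoResolved = new ArrayList<>();
    final List<Finding> needsReview = new ArrayList<>();
    private final double threshold;

    AuditTriage(double threshold) {
        this.threshold = threshold;
    }

    // Clear-cut cases are handled automatically; uncertain ones go to the auditor.
    void submit(Finding finding) {
        if (finding.confidence >= threshold) {
            autoResolved.add(finding);
        } else {
            needsReview.add(finding);
        }
    }

    public static void main(String[] args) {
        AuditTriage triage = new AuditTriage(0.95);
        triage.submit(new Finding("accessors", "false positive", 0.99));
        triage.submit(new Finding("truncated AGPL header", "AGPL-3.0-only", 0.60));
        System.out.println("auto: " + triage.autoResolved.size()
                + ", review: " + triage.needsReview.size());
    }
}
```

Raising the threshold sends more cases to the auditor and lowers the risk of an unnoticed automated error; lowering it does the opposite.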

At the same time, as AI technology advances, the breadth and diversity of situations that it can handle will certainly increase, and the number of problematic cases may start to dwindle. At FossID, we aim to be on the leading edge of such developments. Nonetheless, we believe the job of open source auditor is safe for the foreseeable future.
