Ownership of AI-Generated Code Hotly Disputed

GitHub Copilot bills itself as an “AI pair programmer” for software developers, automatically suggesting code in real time. According to GitHub, Copilot is “powered by Codex, a generative pretrained AI model created by OpenAI” and has been trained on “natural language text and source code from publicly available sources, including code in public repositories on GitHub.”

However, a class-action lawsuit filed against GitHub, its parent company Microsoft, and OpenAI alleges open-source software piracy and violations of open-source licenses. Specifically, the suit claims that code generated by Copilot does not include attribution of the original author, a copyright notice, or a copy of the license, all of which most open-source licenses require.

“The spirit of open source is not just a space where people want to keep it open,” says Sal Kimmich, an open-source developer advocate at Sonatype, machine learning engineer, and open-source contributor and maintainer. “We have developed processes in order to keep open source secure, and that requires traceability, observability, and verification. Copilot is obscuring the original provenance of those [code] snippets.”

“I very much hope that what comes out of this lawsuit will be something I can rely on when making decisions about training models in the future.”
—Stella Biderman, EleutherAI

To address the open-source licensing issues, GitHub plans to introduce a new Copilot feature that will “provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code,” including “providing attribution where appropriate.” GitHub also offers a configurable filter that blocks suggestions matching public code.

The onus, however, still falls on developers, as GitHub states in Copilot’s terms and conditions: “GitHub does not claim any rights in Suggestions, and you retain ownership of and responsibility for Your Code, including Suggestions you include in Your Code.”

Beyond open-source licensing, Copilot raises concerns about the legality of training the system on publicly available code, as well as whether the code it generates could itself infringe copyright.

Kimmich points to the Google v. Oracle case, in which “taking the names of methods, but not the functional implementation is OK. You’re replacing the functional content but still keeping some of the template.” Copilot, by contrast, can generate copyrighted code verbatim. (Tim Davis, a computer science professor at Texas A&M University, has tweeted an illustration of Copilot generating copyrighted code.)
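To make that distinction concrete, here is a minimal hypothetical sketch in Java, the language at issue in Google v. Oracle: the method name and signature mirror a familiar API, but the body is an independent reimplementation rather than a verbatim copy.

    // Hypothetical sketch of the Google v. Oracle distinction: the
    // declaration mirrors the shape of java.lang.Math.max (the "template"
    // Kimmich mentions), while the body is an original reimplementation.
    public final class MyMath {
        private MyMath() {}  // utility class, not meant to be instantiated

        // Same method name and signature as the familiar API...
        public static int max(int a, int b) {
            // ...but independently written functional content.
            return (a >= b) ? a : b;
        }
    }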

Kit Walsh, a senior staff attorney at the Electronic Frontier Foundation, argues that training Copilot on public repositories is fair use. “Fair use protects analytical uses of copyrighted work. Copilot is ingesting code and creating associations in its own neural net about what tends to follow and appear in what contexts, and that factual analysis of the underlying works is the kind of fair use that cases involving video game consoles, search engines, and APIs have supported.”
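As a loose analogy, and emphatically not a description of Copilot’s actual neural network, the kind of factual analysis Walsh describes can be pictured as tallying which code token tends to follow which, so that what is stored is statistical associations rather than any one source file:

    import java.util.HashMap;
    import java.util.Map;

    // Toy analogy for Walsh's point: tally which token follows which
    // across a corpus. The resulting table holds associations (for
    // example, "'int' is often followed by an identifier"), not
    // verbatim copies of any source file.
    public final class TokenAssociations {
        public static Map<String, Map<String, Integer>> countFollowers(String[] tokens) {
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i + 1 < tokens.length; i++) {
                counts.computeIfAbsent(tokens[i], k -> new HashMap<>())
                      .merge(tokens[i + 1], 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            String[] code = {"for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"};
            System.out.println(countFollowers(code));
        }
    }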

But when it comes to generated code, Walsh says, it boils down to “how much [Copilot] is reproducing from any given element of the training data” and whether what it reproduces encompasses copyrightable creative expression. “If so, there could be infringement happening,” she says.

The lawsuit against GitHub Copilot is the first of its kind to challenge generative AI. “It’s setting a legal precedent that has implications for other generative tools,” Walsh says. “It’s the type of work that, if a person authored [it, they] could qualify for copyright protection, and it could embody someone else’s copyrighted work, like snippets of code.”

“If I as an engineer would like to use Copilot, I will need to be able to restrict what it provides me to code that’s attributed to the license.”
—Sal Kimmich, Sonatype

For Stella Biderman, an AI researcher at Booz Allen Hamilton and EleutherAI, the lawsuit is a welcome development. “It’s going to, I hope, provide clarity and guidance as to what is actually legal, which is one of the big issues for those working on open-source AI,” she says. “I very much hope that what comes out of this lawsuit will be something I can rely on when making decisions about training models in the future.”

The open-source community seems divided on the lawsuit and on GitHub Copilot itself. For instance, the Software Freedom Conservancy has been vocal about its concerns with Copilot—even calling for a boycott of GitHub—but is cautious about joining the class-action lawsuit. Kimmich says they know of open-source developers taking an ethical stance in choosing not to use Copilot, but also of others who are enjoying it: “They’re learning while developing and executing code on the fly.”

Kimmich themself is on a waitlist for Copilot and recognizes the benefits it offers developers. “The neural network behind it is using more than just code to help you—it’s providing much more contextual information,” they say. “It means I as a developer now have an extended intelligence, which is giving me a contextualized recommendation. I think that’s excellent. It’s the most powerful generative intelligence that we’ve had so far for this application.”

Yet unless the open-source licensing issue is solved, Kimmich envisions using GitHub Copilot only for pet projects and for exploring new packages. “It stops short of production code because of the licensing issue,” they say. “If I as an engineer would like to use Copilot, I will need to be able to restrict what it provides me to code that’s attributed to the license, or have a license which states that it was codeveloped. If I can’t locate the provenance of the original licenses or the original intellectual property, then I need to be able to know if I want to avoid it.”

Another solution would be for GitHub to modify Copilot’s AI model so that it traces attribution and credits the original authors of the code, adding the associated copyright notices and license terms in the process, which Biderman says is technologically feasible. “The position that OpenAI and Microsoft [seem] to have taken is that it is unduly onerous on them to filter by license when other models successfully do it.” She points to academic models such as InCoder, which was trained on code it has a license to use. “There are other options and other models that are both more ethical and more likely to be legal,” Biderman says.
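In principle, the license filtering Biderman describes can be sketched as an allowlist applied when the training corpus is assembled. The following minimal example is a hypothetical illustration, with an invented SourceFile record and an illustrative set of SPDX identifiers, not any vendor’s actual pipeline:

    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch of license-aware corpus filtering, in the
    // spirit of what Biderman says is feasible; not a real pipeline.
    public final class LicenseFilter {
        // A source file paired with its declared SPDX license identifier.
        record SourceFile(String path, String spdxLicenseId, String contents) {}

        // Illustrative allowlist of licenses the (hypothetical) operators
        // have decided they can comply with end to end.
        private static final Set<String> ALLOWED =
                Set.of("MIT", "Apache-2.0", "BSD-3-Clause");

        // Keep only files whose declared license is on the allowlist;
        // everything else is excluded before training ever begins.
        static List<SourceFile> filterByLicense(List<SourceFile> corpus) {
            return corpus.stream()
                         .filter(f -> ALLOWED.contains(f.spdxLicenseId()))
                         .toList();
        }
    }

A production system would also need to carry the retained copyright notices and license texts forward with any suggestion, as Kimmich describes.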
Source: IEEE Spectrum Computing