GenLaw ’23: Accepted Papers

The Restatement (Artificial) of Torts

by Colin Doyle [spotlight] [pdf]

This article explores different processes for using large language models to construct parts of a Restatement of Torts based on the models’ understanding of leading torts cases. The performance of these models is evaluated by comparing the results with the existing, human-written Restatements of Torts. By tracking the discrepancies between the two restatements, we can gain insights into both the machine and human processes for understanding the law. Where the two restatements converge would tend to indicate the reliability of both sources on a particular subject. Where the restatements diverge may reflect a language model’s limitations, human authors’ preferences, or meaningful differences in how humans and machines process information. This analysis has implications for the potential of large language models as tools for legal research and writing, the function and authority of Restatements of Law, the future of human-machine collaboration in legal practice, and the potential for machine learning to reshape the law itself.

The Data Provenance Project

by Shayne Longpre; Robert Mahari; Anthony Chen; Niklas Muennighoff; Kartik Perisetla; Will Brannon; Jad Kabbara; Luis Villa; Sara Hooker [spotlight] [pdf]

A wave of recent language models has been powered by large collections of natural language datasets. The sudden race to train models on these disparate collections of incorrectly documented, ambiguously documented, or under-documented datasets has left practitioners unsure of the legal and qualitative characteristics of the models they train. To remedy this crisis in data transparency and understanding, in a joint effort between experts in machine learning and the law, we’ve compiled the most detailed and reliable metadata available for data licenses, sources, and provenance, as well as fine-grained characteristics like language, text domains, topics, usage, collection time, and task compositions. Beginning with nearly 40 popular instruction (or “alignment”) tuning collections, we release a suite of open source tools for downloading, filtering, and examining this training data. Our analysis sheds light on the fractured state of data transparency, particularly with data licensing, and we hope our tools will empower more informed and responsible data-centric development of future language models.

by Rui-Jie Yew; Dylan Hadfield-Menell [spotlight] [pdf]

In this paper, we consider the impacts of a pre-training regime on the enforcement of copyright law for generative AI systems. We identify a gap between the proposed judicial tests for copyright liability and the evolving market of deployed generative models. Specifically, proposed judicial tests assume a tight integration between model training and deployment: the ultimate purpose of a model plays a central role in determining if a training procedure’s use of copyrighted data infringes on the author’s rights. In practice, modern models are built and deployed under a pre-training paradigm: large models are trained for general-purpose applications and to be specialized to different applications, often by third parties. This creates an opportunity for developers of pre-trained models to avoid direct liability under these tests. Based on this, we argue that copyright’s secondary liability doctrine will play a central role in the practical effect of copyright regulation on the development and deployment of AI systems. From this insight, we draw on similarities and dissimilarities between generative AI and the regulation of peer-to-peer file sharing through secondary copyright liability to understand how companies may manage their copyright liability in practice. We discuss how developers of pre-trained models can, through a similar combination of technical and developmental strategies, also subvert regulatory goals. Our paper emphasizes the importance of a systems-level analysis to effectively regulate AI systems. We conclude with a brief discussion of regulatory strategies to close these loopholes and propose duties of care for developers of ML models to evaluate and mitigate their models’ downstream effects on the authors of the copyrighted works that are used in training.

Diffusion Art or Digital Forgery? Investigating Data Replication in Stable Diffusion

by Gowthami Somepalli; Vasu Singla; Micah Goldblum; Jonas A. Geiping; Tom Goldstein [spotlight] [pdf]

The emergence of diffusion models has revolutionized generative tools for commercial art and graphic design. These models leverage denoising networks trained on massive web-scale datasets, enabling the creation of powerful commercial models like DALL·E and Stable Diffusion. However, the use of these mega-datasets raises legal and ethical concerns due to unknown data sources and the potential for models to memorize training data. We investigate the occurrence and extent of content replication in state-of-the-art diffusion models. Our study focuses on the identification of replication in the Stable Diffusion model, trained on millions of images. Our findings likely underestimate the actual rate of replication due to search limitations. The definition of replication can vary, and we present our results without imposing a rigid definition, allowing stakeholders to draw their own conclusions based on their involvement in generative AI.

Measuring the Success of Diffusion Models at Imitating Human Artists

by Stephen Casper; Zifan Guo; Shreya Mogulothu; Zachary Marinov; Chinmay Deshpande; Rui-Jie Yew; Zheng Dai; Dylan Hadfield-Menell [spotlight] [pdf]

Modern diffusion models have set the state-of-the-art in AI image generation. Their success is due, in part, to training on Internet-scale data which often includes copyrighted work. This prompts questions about the extent to which these models learn from, imitate, or copy the work of human artists. This work suggests that tying copyright liability to the capabilities of the model may be useful given the evolving ecosystem of generative models. Specifically, much of the legal analysis of copyright and generative systems focuses on the use of protected data for training. However, generative systems are often the result of multiple training processes. As a result, the connections between data, training, and the system are often obscured. In our approach, we consider simple image classification techniques to measure a model’s ability to imitate specific artists. Specifically, we use Contrastive Language-Image Pretrained (CLIP) encoders to classify images in a zero-shot fashion. Our process first prompts a model to imitate a specific artist. Then, we test whether CLIP can be used to reclassify the artist (or the artist’s work) from the imitation. If these tests match the imitation back to the original artist, this suggests the model can imitate that artist’s expression. Our approach is simple and quantitative. Furthermore, it uses standard techniques and does not require additional training. We demonstrate our approach with an audit of Stable Diffusion’s capacity to imitate 70 professional digital artists with copyrighted work online. When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%. Finally, we also show that a sample of the artist’s work can be matched to these imitation images with a high degree of statistical reliability. Overall, these results suggest that Stable Diffusion is broadly successful at imitating individual human artists.
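
The zero-shot matching step described in this abstract can be illustrated with a minimal sketch. It assumes that an image encoder and a text encoder (such as CLIP's) have already produced embeddings; the three-dimensional vectors, the artist labels, and the function names below are hypothetical stand-ins, not the authors' code or data. An imitation is assigned to the artist whose label embedding has the highest cosine similarity with it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the image embedding."""
    return max(label_embeddings,
               key=lambda name: cosine(image_embedding, label_embeddings[name]))

# Hypothetical text embeddings for three artist labels (toy values).
label_embeddings = {
    "artist_a": [1.0, 0.1, 0.0],
    "artist_b": [0.0, 1.0, 0.2],
    "artist_c": [0.1, 0.0, 1.0],
}

# Hypothetical image embeddings of imitations, paired with the prompted artist.
imitations = [
    ("artist_a", [0.9, 0.2, 0.1]),
    ("artist_b", [0.1, 0.8, 0.3]),
    ("artist_c", [0.7, 0.1, 0.6]),  # drifts toward artist_a, so it is misclassified
]

predictions = [zero_shot_classify(emb, label_embeddings) for _, emb in imitations]
accuracy = sum(p == target for p, (target, _) in zip(predictions, imitations)) / len(imitations)
print(predictions, accuracy)
```

With real encoders, the accuracy over many prompted imitations would play the role of the identification rate reported in the abstract; the toy numbers here carry no empirical meaning.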

Machine Learning Has A Fixation Problem

by Katrina Geddes [pdf]

This extended abstract explores the consequences of digital fixation both for the copyright infringement liability of generative art models and for the capacity of machine learning models to interfere with identity formation through automated gender recognition.

by Ram Shankar Siva Kumar; Jonathon Penney

We outline the legal complexities and questions concerning prompt injection attacks through the lens of the Computer Fraud and Abuse Act (CFAA).

From Algorithmic Destruction to Algorithmic Imprint: Generative AI and Privacy Risks Linked to Potential Traces of Personal Data in Trained Models

by Lydia Belkadi; Catherine Jasserand [pdf]

This contribution discusses the ‘algorithmic disgorgement’ tool used by the FTC in four settlement cases relating to unfair or deceptive data practices, where the FTC ordered the deletion of not only the unlawfully processed data but also the resulting models. According to some scholars, this measure could imply, and be justified by, the fact that some models contain traces or shadows of training data. Reflecting on this tool from a US and EU legal perspective, we question the opportunity to design more granular legal assessments and contend that regulators and scholars should consider, if evidenced, whether (traces or fragments of) personal information can be contained in or disclosed by models when defining deletion or retraining obligations. This issue has received limited interdisciplinary attention, hindering ongoing discussions on generative AI regulation.

Developing Methods for Identifying and Removing Copyrighted Content from Generative AI Models

by Krishna Sri Ipsit Mantri; Nevasini NA Sasikumar [pdf]

Recent progress in generative AI has enabled the automatic generation of human-like content, but models are often trained on data containing copyrighted information, raising legal questions. This abstract proposes developing methods to identify copyrighted content memorized by generative models systematically. By evaluating how closely generated content matches copyrighted training data, we could highlight potential copyright issues. We also propose techniques to target and remove memorized copyrighted information directly, potentially enabling the “copyright-free” use of pre-trained generative models.
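
One simple way to quantify how closely generated text matches a reference text is n-gram overlap. The sketch below is an illustrative assumption, not the method proposed in the abstract: the function names, the whitespace tokenization, and the choice of trigrams are all hypothetical.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(generated, reference, n=3):
    """Fraction of the generated text's n-grams that also occur in the reference."""
    gen = ngrams(generated.lower().split(), n)
    if not gen:
        return 0.0  # text shorter than n tokens: nothing to compare
    ref = ngrams(reference.lower().split(), n)
    return len(gen & ref) / len(gen)

reference = "the quick brown fox jumps over the lazy dog"
generated = "a quick brown fox jumps over a fence"
score = overlap_fraction(generated, reference)
print(score)  # 3 of the 6 generated trigrams appear in the reference: 0.5
```

A high score flags output that may deserve closer review; it does not by itself establish memorization or infringement.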

What can we learn from Data Leakage and Unlearning for Law?

by Jaydeep Borkar [pdf]

Large Language Models (LLMs) raise privacy concerns because they memorize training data (including personally identifiable information (PII) like emails and phone numbers) and leak it during inference. A company can train an LLM on its domain-customized data, which can potentially also include its users’ PII. In order to comply with privacy laws such as the “right to be forgotten”, the data points of users that are most vulnerable to extraction could be deleted. We find that once the most vulnerable points are deleted, a new set of points becomes vulnerable to extraction. So far, little attention has been given to understanding memorization for fine-tuned models. In this work, we also show that not only do fine-tuned models leak their training data but they also leak the pre-training data (and PII) memorized during the pre-training phase. The property of new data points becoming vulnerable to extraction after unlearning and leakage of pre-training data through fine-tuned models can pose significant privacy and legal concerns for companies that use LLMs to offer services. We hope this work will start an interdisciplinary discussion within AI and law communities regarding the need for policies to tackle these issues.

AI and the EU Digital Markets Act: Addressing the Risks of Bigness in Generative AI

by Ayse Gizem Yasar; Andrew Chong; Evan Dong; Thomas Krendl Gilbert; Sarah Hladikova; Roland Maio; Carlos Mougan; Xudong Shen; Shubham Singh; Ana-Andreea Stoica; Savannah Thais; Miri Zilka [pdf]

As AI technology advances rapidly, concerns over the risks of bigness in digital markets are also growing. The EU’s Digital Markets Act (DMA) aims to address these risks. Still, the current framework may not adequately cover generative AI systems that could become gateways for AI-based services. This paper argues for integrating certain AI software as core platform services and classifying certain developers as gatekeepers under the DMA. We also propose an assessment of gatekeeper obligations to ensure they cover generative AI services. As the EU considers generative AI-specific rules and possible DMA amendments, this paper provides insights towards diversity and openness in generative AI services.

by Argyri Panezi; John O Shea [pdf]

The paper analyses the case of legal translation through the lens of legal liability, also touching upon copyright and professional rules. We explore what it means to advance legal translation in a legal and ethical manner with the aim of supporting, not suffocating, the human expert in the centre of the process.

Generative AI and the Future of Financial Advice Regulation

by Talia Gillis; Sarith Felber; Itamar Caspi [pdf]

This paper explores the complex regulatory issues arising from the use of generative AI tools, such as ChatGPT, in regulated professions, with a focus on the financial advisory sector. Despite the ability of such technologies to provide potentially transformative and cost-effective services, they present unique challenges for traditional regulatory structures. Specifically, the paper examines the uncertainties associated with AI-powered chatbots providing financial advice, a role traditionally undertaken by broker-dealers and investment advisors bound by regulations like the suitability standard or fiduciary duty. These AI technologies disrupt the conventional advisor-client relationship and blur the distinction between regulated financial advice and “advice-neutral” content found on blogs or offered by robo-advisors. Unlike robo-advisors, AI tools like ChatGPT are not tailored for financial advice provision, raising questions about liability for advice given. Meanwhile, their personalized and persuasive outputs contrast with the generic nature of advice-neutral content, suggesting the inadequacy of current disclaimers as risk mitigators. The paper proposes a forward-looking approach to these regulatory challenges, advocating for the integration of AI’s conversational capabilities with the infrastructure of robo-advisors or traditional financial services. Such an approach balances consumer protection with affordability and opens the way for regulatory adaptations to account for the emergence of AI in financial advice services. This necessitates revisiting and modifying the current licensure model to better accommodate the varied services offered by AI-assisted technologies, considering their potential benefits and impacts.

Exploring Antitrust and Platform Power in Generative AI

by Konrad Kollnig; Qian Li [pdf]

The concentration of power in a few digital technology companies has become a subject of increasing interest in both academic and non-academic discussions. One of the most noteworthy contributions to the debate is Lina Khan’s Amazon’s Antitrust Paradox. In this work, Khan contends that Amazon has systematically exerted its dominance in online retail to eliminate competitors and subsequently charge above-market prices. This work contributed to Khan’s appointment as the chair of the US Federal Trade Commission (FTC), one of the most influential antitrust organizations. Today, several ongoing antitrust lawsuits in the US and Europe involve major technology companies like Apple, Google/Alphabet, and Facebook/Meta. In the realm of generative AI, we are once again witnessing the same companies taking the lead in technological advancements, leaving little room for others to compete. This article examines the market dominance of these corporations in the technology stack behind generative AI from an antitrust law perspective.

PoT: Securely Proving Legitimacy of Training Data and Logic for AI Regulation

by Haochen Sun; Hongyang Zhang [pdf]

The widespread use of generative models has raised concerns about the legitimacy of training data and algorithms in the training phase. In response to privacy legislation, we propose Proof of Training (PoT), a provably secure protocol that allows model developers to prove to the public that they have used legitimate data and algorithms in the training phase, while also preserving the privacy of the model, such as its weights and training dataset. Unlike previous works on verifiable (un)learning, PoT emphasizes the legitimacy of training data and provides a proof of (non-)membership to testify whether a specific data point is included in or excluded from the training set. By combining cryptographic primitives like zk-SNARK, PoT enables the model owner to prove that the training dataset is free from poisoning attacks and that the model and data were called following the logic of the training algorithm (e.g., no backdoor is implanted), without leaking sensitive information to the verifiers. PoT is applicable in federated learning settings through new multi-party computation (MPC) protocols that accommodate its additional security requirements, such as robustness to Byzantine attacks.

When Synthetic Data Met Regulation

by Georgi Ganev [pdf]

We argue that synthetic data produced by differentially private generative models can be sufficiently anonymized and can therefore be considered anonymous data and regulatory compliant.

Provably Confidential Language Modelling

by Xuandong Zhao; Lei Li; Yu-Xiang Wang [pdf]

Large language models have been shown to memorize private information, such as social security numbers, in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter this private data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.

The Extractive-Abstractive Axis: Measuring Content “Borrowing” in Generative Language Models

by Nedelina Teneva [pdf]

Generative language models produce highly abstractive outputs by design, in contrast to extractive responses in search engines. Given this characteristic of LLMs and the resulting implications for content licensing and attribution, we propose the so-called Extractive-Abstractive axis for benchmarking generative models and highlight the need for developing corresponding metrics, datasets and annotation guidelines. We limit our discussion to the text modality.

by Inyoung Cheong; Aylin Caliskan; Tadayoshi Kohno [pdf]

Large language models (LLMs) have the potential for significant benefits, but they also pose risks such as privacy infringement, discrimination propagation, and virtual abuse. By developing and examining “worst-case” scenarios that illustrate LLM-based harms, this paper finds that U.S. law may not be adequate in addressing threats to fundamental human rights posed by LLMs. The shortcomings arise from the primary focus of U.S. laws on governmental intrusion rather than market injustices, the complexities of LLM-related harms, and the intangible nature of these harms. As Section 230 protections for online intermediaries may not extend to AI-generated content, LLM developers must demonstrate due diligence (alignment efforts) to defend themselves against potential claims. Moving forward, we should consider ex-ante safety regulations adapted to LLMs to give clearer guidelines for fast-paced AI development. Innovative interpretations or amendments to the Bill of Rights may be necessary to prevent the perpetuation of bias and uphold socio-economic rights.

Reclaiming the Digital Commons: A Public Data Trust for Training Data

by Alan Chan; Herbie Bradley; Nitarshan Rajkumar [pdf]

Democratization of AI means not only that people can freely use AI, but also that people can collectively decide how AI is to be used. The rapid pace of AI development and deployment currently leaves little room for collective control. Monopolized in the hands of private corporations, the development of the most capable foundation models has proceeded largely without public input. There is currently no implemented mechanism to account for their negative externalities like unemployment and the decay of the digital commons. In this work, we propose that a public data trust assert control over training data for foundation models. First, we argue in detail for the existence of such a trust. We also discuss feasibility and potential risks. Second, we detail a number of ways for a data trust to incentivize model developers to use training data only from the trust. We propose a mix of verification mechanisms, potential regulatory action, and positive incentives. We conclude by highlighting other potential benefits of our proposed data trust and connecting our work to ongoing efforts in data and compute governance.

Chain Of Reference prompting helps LLM to think like a lawyer

by Nikon Rasumov-Rahe; Aditya Kuppa; Marc Voses [pdf]

Legal professionals answer legal questions based on established reasoning frameworks, e.g., Issue, Rule, Rule Explanation, Application, Conclusion (IRREAC). We propose a novel technique named chain of reference (CoR), in which legal questions are pre-prompted with legal frameworks, thus decomposing the legal task into simple steps. We find that large language models like GPT-3 improve zero-shot performance by up to 12% when using the chain of reference.

Compute and Antitrust: Regulatory implications of the AI hardware supply chain, from chip design to foundation model APIs

by Haydn Belfield; Anonymous Author [pdf]

We argue that the antitrust and regulatory literature to date has failed to pay sufficient attention to compute, despite compute being a key input to AI progress and services (especially with the advent of powerful new generative AI systems), the potentially substantial market power of companies in the supply chain, and the advantages of compute as a ‘unit’ of regulation in terms of detection and remedies. We explore potential topics of interest to competition law under merger control, abuse of dominance, state aid, and anti-competitive agreements (cartels and collusion). Major companies and states increasingly view the development of AI over the coming decades as core to their interests, due to its profound impact on economies, societies, and balance of power. If the rapid pace of AI progress is sustained over the long-term, these impacts could be transformative in scale. This potential market power and policy importance, particularly in the generative AI field, should make compute an area of significant interest to antitrust and other regulators.

by Daphne E Ippolito; Yun William Yu [pdf]

Modern machine learning systems for generative AI depend on large-scale scrapes of the internet. There are currently few mechanisms for well-intentioned ML practitioners to pre-emptively exclude data that website owners and content creators do not want generative models trained on. We propose two mechanisms to address this issue. First, building off of the existing protocol, we recommend a protocol which enables a website owner to specify which pages of their website are appropriate for ML models to train on. Second, we propose a standardized tag which can be added to the metadata of image files to indicate that they should not be trained on.

by Brian L Zhou; Lakshmi Sritan R Motati [pdf]

Issues over the copyright of Large Language Models (LLMs) have emerged on two fronts: using copyrighted Intellectual Property (IP) in training data, and the ownership of generated content from LLMs. We propose adopting an opt-in system for IP owners with fair compensation determined by tagging metadata. We first suggest the development of new, multimodal approaches for calculating substantial similarity within generated derivative works by using tags for both content and style. From here, compensation and attribution can be calculated and determined, allowing for a generated work to be licensed and copyrighted while providing a financial incentive to opt-in. This system can allow for the ethical usage of IP and resolve copyright disputes over generated content.

by Noorjahan Rahman; Eduardo Santacana [pdf]

A question that has been top of mind for many proprietors of large language models (LLMs) is whether training the models on copyrighted text qualifies as “fair use.” The fair use doctrine allows limited use of copyrighted material without obtaining permission from the copyright owners. The doctrine applies to uses such as criticism, comment, news reporting, teaching, scholarship, or research. To determine if using copyrighted material qualifies as fair use, courts consider factors such as the purpose of the use, the nature of the work, the amount used, and the effect of the use upon the material’s commercial value. This paper evaluates the relevance of the fair use doctrine in determining the legal risks that may arise for organizations that provide LLMs for use by the public in exchange for fees. The authors argue that while the fair use doctrine has some limited relevance in evaluating the risks associated with selling LLM services, other legal doctrines and devices will have an equal impact, if not more. These doctrines include the registration requirement for copyright infringement suits, the terms of service imposed by website distributors of copyrighted materials, the challenges of certifying a copyright infringement class action, and the absence of copyright protection for facts and ideas. The authors analyze 1) the legal risks arising from training LLMs using copyrighted text, 2) the challenges that authors of copyrighted text have in enforcing copyrights, and 3) what stakeholders and users of LLMs can do to respect copyright laws in a way that achieves the policy goals of copyright law and also permits the public to benefit from services provided by LLMs.

Applying Torts to Juridical Persons: Corporate and AI Governance

by Aaron Tucker [pdf]

In theory, tort law provides a general structure for identifying and addressing a wide variety of concrete harms in society, and could provide a mechanism to address the harms of deployed AI systems. However, even in contemporary non-AI contexts, remedies for many torts, such as those involving privacy violations, are often difficult to obtain in practice. In other domains, specially crafted legislation with specific liabilities and rules succeeds at compelling companies to implement specific procedures. This essay draws parallels between problems in corporate governance and AI governance.

by Saba Amiri; Eric Nalisnick; Adam Belloum; Sander Klous; Leon Gommans [pdf]

Generative Models (GMs) are becoming increasingly popular for synthesizing data for a diverse range of modalities. One common aspect of training GMs is that they need large amounts of training data. The era of big data has made such huge troves of data available to us. However, utilizing such a diverse range of datasets to train GMs has its own technical and ethical challenges, one of the most prominent being privacy. Data used to train GMs could potentially contain personal and sensitive information. More nuanced issues could also arise, such as a model learning the specific style of an artist or memorizing copyrighted material such as books. Thus, we need to make sure the GMs are learning as much as possible about the population to generate high quality results while learning as little as possible about discriminating and/or protected features of individuals in the training data, where the definitions of both “discriminating” and “protected” are subject to the specifics of the problem. In this work, we present early results for a new method for making Normalizing Flows, a powerful family of GMs, differentially private without adding noise. Based on the definition of pure ε-DP, we show that using the ability of flow-based models for exact density evaluation we can add differential privacy to flows by limiting their expressivity instead of adding noise to them. We show, through this methodology, that enforcing privacy can lead to the obfuscation of private material (denoted as such with a watermark). This implies the benefits of privacy preserving methods for removing discriminant features. But it could have negative side effects: preserving privacy can potentially hide the fact that a model was trained on protected material. We show that when features identifying copyrighted materials are non-discriminative and prevalent in the dataset, the DP model is still able to capture them.
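
For reference, the standard definition of pure differential privacy that the abstract builds on can be stated as follows. The symbols are the conventional ones and are not taken from the paper: a randomized mechanism $\mathcal{M}$ satisfies pure $\varepsilon$-DP if, for all datasets $D$ and $D'$ differing in a single record and for every measurable set of outputs $S$,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

A smaller $\varepsilon$ forces the two output distributions closer together, so any single record has less influence on what the mechanism produces.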

Gradient Surgery for One-shot Unlearning on Generative Model

by Seohui Bae; Seoyoon Kim; Hyemin Jung; Woohyung Lim [pdf]

Recent regulation on the right to be forgotten has generated considerable interest in unlearning for pre-trained machine learning models. To approximate the straightforward yet expensive approach of retraining from scratch, recent machine unlearning methods unlearn a sample by updating the weights to remove its influence on the weight parameters. In this paper, we introduce a simple yet effective approach to remove the influence of data on a deep generative model. Inspired by works in multi-task learning, we propose to manipulate gradients to regularize the interplay of influence among samples by projecting gradients onto the normal plane of the gradients to be retained. Our work is agnostic to the statistics of the removal samples and outperforms existing baselines, while providing, for the first time, theoretical analysis for unlearning a generative model.
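
The projection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single flattened gradient for the samples to forget and a single flattened gradient for the samples to retain; the function and variable names are hypothetical, and the authors' actual procedure may differ (for example, in how multiple retained gradients are handled). The forget gradient has its component along the retain gradient removed, so the resulting update is orthogonal to the retain direction.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_normal_plane(g_forget, g_retain):
    """Remove from g_forget its component along g_retain."""
    denom = dot(g_retain, g_retain)
    if denom == 0.0:
        return list(g_forget)  # no retain direction to protect
    scale = dot(g_forget, g_retain) / denom
    return [f - scale * r for f, r in zip(g_forget, g_retain)]

# Toy two-parameter example (values are illustrative only).
g_forget = [3.0, 4.0]
g_retain = [1.0, 0.0]
update = project_onto_normal_plane(g_forget, g_retain)
print(update)                 # [0.0, 4.0]
print(dot(update, g_retain))  # 0.0, i.e. orthogonal to the retain gradient
```

To first order, a step along `update` leaves the loss on the retained samples unchanged, which is the intuition behind borrowing gradient surgery from multi-task learning.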

Protecting Visual Artists from Generative AI: An Interdisciplinary Perspective

by Eunseo Choi [pdf]

Generative AI undeniably poses threats to visual artists’ livelihoods. Technical intricacies of the model and challenges in proving market substitution make it difficult for creators to establish strong cases for copyright infringement in the U.S. Defending human authorship and the creative arts will require effective design and use of legal and technical solutions grounded in behaviors, concerns, and needs of those impacted by the model. To do this, this paper calls for interdisciplinary collaboration among social scientists, legal scholars, and technologists.