GenLaw ’23: Resources

Also see the glossary

Intellectual Property

Understanding Generative Artificial Intelligence and Its Relationship to Copyright by Christopher Callison-Burch (May 2023) [video]

Foundation Models and Fair Use by Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, and Percy Liang (2023)

Artificial Intelligence’s Fair Use Crisis by Ben Sobel (2017)

Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence by the U.S. Copyright Office, Library of Congress, 37 CFR Part 202 (2023)

Legally Speaking: Text and Data Mining of In-Copyright Works: Is It Legal? by Pamela Samuelson (2021)

Fair Learning by Mark A. Lemley and Bryan Casey (2021)

There’s No Such Thing as a Computer-Authored Work—And It’s a Good Thing, Too by James Grimmelmann (2016)

Resilient open commons by Luis Villa (2022)

Formalizing Human Ingenuity: A Quantitative Framework for Copyright Law’s Substantial Similarity by Sarah Scheffler, Eran Tromer, and Mayank Varia (2022)

Provable Copyright Protection for Generative Models by Nikhil Vyas, Sham Kakade, and Boaz Barak (2023)

Authors and Machines by Jane Ginsburg and Luke Ali Budiarjo (2019)

Privacy

A taxonomy of privacy by Daniel Solove (2006)

Privacy Harms by Danielle Citron and Daniel Solove (2022)

Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security by Bobby Chesney and Danielle Citron (2019)

Privacy in Context: Technology, Policy, and the Integrity of Social Life by Helen Nissenbaum (2009)

Information Privacy and the Inference Economy by Alicia Solow-Niederman (2022)

Privacy As Intellectual Property? by Pamela Samuelson (2000)

What Does it Mean for a Language Model to Preserve Privacy? by Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr (2022)
Language models are trained on unstructured text, so the private information contained in that text is nebulous and unstructured as well.

Language Models

Eight Things to Know about Large Language Models by Sam Bowman (2023)
Good introduction to LMs.

A Very Gentle Introduction to Large Language Models without the Hype by Mark Riedl (2023)

HuggingFace NLP Course
A more detailed set of tutorials on what NLP is, what Transformers are, and some more details about training/using language models.
Prompting Guide by DAIR.AI (2022)
Great guide on prompting.
A Primer in BERTology: What we know about how BERT works by Anna Rogers, Olga Kovaleva, and Anna Rumshisky (2020)
A fairly comprehensive review of BERT, a language model that creates embeddings of text.
BERT for Humanists by Matt Wilkens, David Mimno, Melanie Walsh, Rosamond Thalken, and Maria Antoniak (2022)
A fantastic tutorial on language models, geared toward those who do not have a background in the area.

The Practical Guides for Large Language Models

Diffusion Models

Tutorial on Denoising Diffusion-based Generative Modeling: Foundations and Applications by Karsten Kreis, Ruiqi Gao, and Arash Vahdat (2022)
Video tutorial on diffusion models.
Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli (2015)
The paper that introduced diffusion models.

Generative Modeling by Estimating Gradients of the Data Distribution by Yang Song and Stefano Ermon (2019)

Denoising Diffusion Probabilistic Models by Jonathan Ho, Ajay Jain, and Pieter Abbeel (2020)

Variational Diffusion Models by Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho (2021)

Training data extraction

Extracting Training Data from Large Language Models by Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel (2021), blog, video
Original paper showcasing extracting training data from large language models.
Deduplicating Training Data Makes Language Models Better by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini (2022)
Duplicates in the data continue to be the most easily identifiable reason for memorization.
Quantifying Memorization Across Neural Language Models by Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, and Chiyuan Zhang (2022)
A more careful and comprehensive study showing larger models memorize more.
Extracting Training Data from Diffusion Models by Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace (2023)
It’s also possible to extract training data from diffusion models.

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models by Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein (2022)

Membership Inference

Membership Inference Attacks against Machine Learning Models by Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov (2017)
First paper introducing the idea of membership inference.
Membership Inference Attacks From First Principles by Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr (2021)
Current best membership inference attack.
Label-Only Membership Inference Attacks by Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, and Nicolas Papernot (2021)
Masking outputs does not prevent membership inference.
Understanding Membership Inferences on Well-Generalized Learning Models by Yunhui Long, Vincent Bindschaedler, Lei Wang, Diyue Bu, Xiaofeng Wang, Haixu Tang, Carl A. Gunter, and Kai Chen (2018)
Outliers can be more vulnerable to membership inference.
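
As a toy illustration of the idea behind these attacks (our own sketch, not drawn from any of the papers above): a simple membership inference attack exploits the fact that models tend to fit their training examples more closely than unseen examples, so an attacker can guess membership by thresholding the model's loss on each example. The function name, threshold, and loss values below are all hypothetical.

```python
# Toy loss-threshold membership inference attack (illustrative sketch).
# Models typically have lower loss on examples they were trained on, so
# "loss below threshold" is used as a guess that the example was a member.

def loss_threshold_attack(losses, threshold):
    """Predict membership: True means 'likely in the training set'."""
    return [loss < threshold for loss in losses]

# Hypothetical per-example losses: members (seen in training) vs. non-members.
member_losses = [0.05, 0.10, 0.20]
non_member_losses = [0.90, 1.20, 0.75]

predictions = loss_threshold_attack(member_losses + non_member_losses,
                                    threshold=0.5)
# First three predictions are True (members), last three False (non-members).
```

Real attacks calibrate this threshold per example (for instance, with shadow models), which is what makes outliers especially vulnerable.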

Differential Privacy

Auditing Differentially Private Machine Learning: How Private is Private SGD? by Matthew Jagielski, Jonathan Ullman, and Alina Oprea (2020)

Considerations for Differentially Private Learning with Large-Scale Public Pretraining by Florian Tramèr, Gautam Kamath, and Nicholas Carlini (2022)
Privacy is hard: publicly accessible data is not the same as public data, differential privacy has limitations, and public data differs from private data in meaningful ways that our benchmarks sometimes miss.

Training Text-to-Text Transformers with Privacy Guarantees by Natalia Ponomareva, Jasmijn Bastings, and Sergei Vassilvitskii (2022)

How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy by Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Thakurta (2023)
Practical guide to implementing DP.
Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning by Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, and Abhradeep Thakurta (2022)
State-of-the-art privacy mechanism.
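
The clipping-plus-noise step at the heart of DP-SGD, which the practical guides above build on, can be sketched as follows (a toy one-dimensional illustration under our own assumptions; the function names and parameters are hypothetical, not any cited library's API):

```python
import random

# Toy sketch of the core DP-SGD step: clip each per-example gradient to a
# maximum magnitude C, sum the clipped gradients, add Gaussian noise scaled
# to C, and average. Shown in one dimension for clarity.

def clip(grad, c):
    """Scale a gradient down so its magnitude is at most c."""
    norm = abs(grad)
    if norm == 0.0:
        return 0.0
    return grad * min(1.0, c / norm)

def dp_sgd_average(per_example_grads, c=1.0, noise_multiplier=1.0, seed=0):
    """Noisy, clipped gradient average, as in DP-SGD's core update."""
    rng = random.Random(seed)
    clipped = [clip(g, c) for g in per_example_grads]
    noisy_sum = sum(clipped) + rng.gauss(0.0, noise_multiplier * c)
    return noisy_sum / len(per_example_grads)
```

Clipping bounds any single example's influence on the update, and the added noise is what yields the formal differential privacy guarantee; the privacy/utility trade-off is governed by the noise multiplier.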

Harms

Generating Harms: Generative AI’s Impact and Path Forwards by the Electronic Privacy Information Center (May 2023)

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (2021)
One of the earliest published works on possible harms of large language models.

Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models by Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli (2021)

Ethical and social risks of harm from Language Models by Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel (2021)

Accountability in an Algorithmic Society: Relationality, Responsibility, and Robustness in Machine Learning by A. Feder Cooper, Emanuel Moss, Benjamin Laufer and Helen Nissenbaum (2022)
A synthesized analysis of why accountability is so elusive for AI/ML systems.

Contestability in Algorithmic Systems by Kristen Vaccaro, Karrie Karahalios, Deirdre K. Mulligan, Daniel Kluttz, and Tad Hirsch (2019)

Ongoing litigation

Doe 1 et al. v. GitHub, Inc. et al. GitHub Copilot class action lawsuit

Andersen et al. v. Stability AI Ltd. et al. Stable Diffusion class action lawsuit

In the news

AI Ethics & Policy News compiled by Casey Fiesler

Samsung workers made a major error by using ChatGPT by Lewis Maddison for TechRadar (2023)

OpenAI threatened with landmark defamation lawsuit over ChatGPT false claims by Ashley Belanger for Ars Technica (2023)

Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content by James Vincent for The Verge (2023)

UK Government axes plans to broaden existing text and data mining exception by Eleonora Rosati for IPKat (2023)

The UK government moves forward with a text and data mining exception for all purposes by Alina Trapova and João Pedro Quintais for Kluwer Copyright Blog (2022)

Israel Ministry of Justice Issues Opinion Supporting the Use of Copyrighted Works for Machine Learning by Jonathan Band for The Disruptive Competition Project (2023)

Copyright in generative deep learning by Giorgio Franceschelli and Mirco Musolesi for Data & Policy (2022)

‘AI’ at Bologna: The Hair-Raising Topic of 2023? by Porter Anderson for Publishing Perspectives (2023)

Large Libel Models? Liability for AI Output by Eugene Volokh (draft 2023)

Section 230 Won’t Protect ChatGPT by Matt Perault (2023)

OpenAI’s hunger for data is coming back to bite it by Melissa Heikkilä (2023)