
The news has broken that Meta, parent company of social media giants Facebook and Instagram, trained its Llama AI models on Library Genesis – an online trove of pirated books and research papers.
Also known as LibGen, the database currently contains over 7.5 million books, including works by authors such as J.K. Rowling and Stephen King, in multiple languages – thanks to The Atlantic, you can browse the repository here. It seems like the perfect place to get enough quality content to power an artificial intelligence competing with the likes of ChatGPT, right? Especially if you would rather skip the money, time, and hassle of individually licensing the copyrighted content you want to use.
Oh wait. The intentional use of pirated, copyrighted materials is exactly what a group of American authors, including Ta-Nehisi Coates, Junot Díaz and Sarah Silverman, are taking Meta to court over. Meta is defending its decision by claiming that its AI training is a form of fair use, stating that the process was ‘transformative’ and that the AI would not replicate or substitute the books in any way.
The court case is still very much underway and may take a long time yet to resolve. While fair use could potentially save Meta legally, the case raises questions about whether big tech companies are being governed adequately. As technology develops ever faster, laws and policy are struggling to keep up. Whether copyright law adequately serves modern creatives, and internet users in general, against new technologies such as machine learning is now up for debate. Is all data available online, whether personal, creative or work-related, at risk of extraction and exploitation for somebody else’s gain?
Breaking down AI
The issue at the heart of this battle is information, the lifeblood of all algorithms and machine learning systems. To say that these machines are actually learning at all, though, is quite generous. Kate Crawford makes the argument that AI is not truly artificial, nor is it actually intelligent. In fact, it is the result of vast amounts of human labour, material resources, and of course, data. Without data these collections of algorithms can offer very little. In the same way, when powered by incomplete or poorly constructed datasets, these systems yield questionable results.
While no AI system can be perfect, it stands to reason that algorithms fed extensive amounts of well-written text may be able to string a few good lines together. Lost, however, is the love, time, and monetary investment that authors, editors, publishers, printers, distributors, salespeople and more have poured into the data driving Llama.
Meta’s Llama has been claimed to be the largest open-source AI model, though its openness is contested because of the secrecy surrounding the dataset used to create it. The system is powerful, promising general knowledge, mathematical reasoning, and superior multilingual translation abilities. At the end of 2024, Meta reported huge demand for Llama, with 650 million downloads of the product. What’s more, the Meta AI Assistant embedded across the Facebook, Messenger, WhatsApp and Instagram platforms is built on Llama. Hidden in search bars, feeds, and chats, Meta’s AI is now deeply intertwined with the architecture of the company’s platform products. Offering such a vast tool to make information and connection more accessible, all free of charge, could almost seem to justify a little piracy.
Except Meta is actually making money off Llama. New court filings reveal that Meta receives a share of the revenue that companies hosting Llama generate from their users, and while the precise entities contributing to this paycheque haven’t been named, Llama hosts include the likes of Dell, Amazon Web Services, and Google Cloud. In contrast, writing is already a notoriously low-paid profession, with Australian authors earning on average $18,200 a year. That Meta is making money on the appropriated content of underpaid creatives is not a good look. But with the data already taken and fed into the system, can anything actually be done? Consent cannot be granted after the fact, and any retrospective licensing fee would be ludicrously high. The fair use justification in this case is murky, so it might be worth taking a step back through Meta’s history of regulatory mishaps to get a better idea of who we are dealing with.

A history of heists
Meta has come a long way since Mark Zuckerberg first launched Facebook back in 2004. From a single social network serving a local community, the company has grown and evolved alongside the internet and its users to become something of an empire today. However, Zuckerberg’s rise to power has not been without hiccups. Among the most notable is the infamous Cambridge Analytica scandal of 2018, which sparked what Terry Flew has described as ‘a crisis in trust’. The scandal involved a third-party app harvesting the data of 87 million unwitting Facebook users. That information was then sold on and used by consulting firm Cambridge Analytica to target voters in Donald Trump’s successful 2016 presidential campaign. Beyond the clear legal and ethical breach, the influence this episode enabled sheds some light on the power of data.
Facebook is fuelled by user data. Most users are only vaguely aware of the trade-off they make when accessing the “free” platform, as the value their data holds for technology companies is not immediately clear.
Scholar Shoshana Zuboff coined the term ‘surveillance capitalism’ to describe a form of information capitalism that collects information about users in order to predict and alter their behaviour in profitable ways. Whether through targeted advertising, personalised shopping suggestions, or a finely tuned algorithm that entices users to stay online longer, personal data sits at the heart of the platform business model today. All of this is made explicit in Facebook’s Terms of Service, which users agree to when they sign up to use the site. What isn’t covered in that agreement, however, is the sale of data to a third party.
Ultimately, Facebook’s misuse of data cost it some bad press and a US$725 million settlement. While little more than a slap on the wrist for a company that has continued to grow and dominate the social media space, it was punishment nonetheless. The dodgy handling of user data is a very different story from the appropriation of copyrighted material to power AI, but it points to a common problem. The ‘crisis of trust’ Flew flagged back in 2018 is still very much relevant. Governments and other regulators continue to be called on to manage these platforms, but the pace at which new technologies are developed is running laps around the progress of law and policy.

Data ownership and AI
“Generative AI opens the door for the creation of derivative works that could potentially overshadow or undermine the market for the original content” – Nicola Lucchi
Like all the other algorithms behind Meta’s platforms and systems, Llama needs data. In this case, however, it isn’t so much data about users that is required, but information about the world: computation, translation, comprehension, and the ability to compose a robust written response to prompts. As established, vast amounts of quality data are therefore required to create a functional generative AI system.
Inappropriate use of copyrighted material is not without precedent in this sector. At least 25 copyright lawsuits against AI companies are currently in progress in the US alone, including the case against Meta as well as a number against OpenAI, the creator of ChatGPT. In the book industry specifically, concerns are certainly not unfounded: AI-generated books summarising or even parroting human authors’ works have already been found on Amazon.
Law professor Nicola Lucchi addresses the legal questions which accompany generative AI using the case study of popular LLM ChatGPT, breaking down copyright and intellectual property concerns. Along with questions as to the ethics and legality of using copyrighted material as input, Lucchi further investigates the potential issues of generative AI output. As AI systems, LLMs such as ChatGPT and Llama cannot claim copyright for their outputs. Furthermore, without significant contribution and effort on their part, the user of the AI also cannot claim ownership. With no clear-cut answer, in law or in ethics, further questions arise. As Lucchi posits, does this mean copyright should belong to the programmers, or to the company owning the AI system? Or are all outputs copyright-free?
Returning to the case at hand: stripping all acknowledgement from, and refusing any revenue to, the authors whose works have been used to create AI outputs appears deeply unjust. This very blog post is littered with credits to the authors, researchers, and journalists whose work allowed me to synthesise and support my arguments. Every other blog post on this page is the same. If any of us writers failed to give that credit, we would lose our legitimacy; we would be blatantly plagiarising, and it would be considered universally immoral. So why should AI get away with it?

Data hungry
Data is at the heart of all of Meta’s technologies, whether it’s personal data, data about the world, data about interactions, or data in the form of words on a page. The personal data that Meta has such a fraught history of appropriating is to our social media feeds as published books and papers are to Llama AI. Where the collection and use of personal data to power algorithms is protected by user agreements and privacy policies, the torrenting of pirated published material may very well be protected in court under the claim of fair use.
Digital technology giants such as Meta have long had the privilege of governing themselves, while laws have failed to keep up with society’s growing dependence on private companies for our work, leisure, social and family interaction. This case, while not yet closed, is a chance to reconsider the way we see these technology companies. In examining the immense AI offerings of Meta, the way they have been integrated into the company’s social media platforms, and the deeply extractive and exploitative nature of their creation, maybe it’s time the company was treated as more than a mere provider or intermediary.
Proper governance of big tech companies such as Meta should acknowledge all of their faces, and the reality of technology today. Meta’s act of torrenting is plainly unlawful, yet its use of that material to create a new product is somehow still under consideration. This case is not unique, and the sheer number of copyright lawsuits against AI developers clearly demonstrates that governments and policymakers are being called on to act. Now more than ever we need to ask whether internet governance is keeping pace with the velocity of technological development.
References:
Brittain, B. (2025, March 26). Meta says copying books was ‘fair use’ in authors’ AI lawsuit.
Reuters. https://www.reuters.com/legal/litigation/meta-says-copying-books-was-fair-use-use-authors-ai-lawsuit-2025-03-25/
Crawford, K. (2021). The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial
Intelligence (1st ed.). Yale University Press. https://doi.org/10.2307/j.ctv1ghv45t
Creamer, E. (2025, April 3). ‘Meta has stolen books’: authors to protest in London against AI
trained using ‘shadow library’. The Guardian. https://www.theguardian.com/books/2025/apr/03/meta-has-stolen-books-authors-to-protest-in-london-against-ai-trained-using-shadow-library
dxlmedia.hu. (2022). [Online image of Facebook logo on phone screen]. Unsplash.
https://unsplash.com/photos/a-cell-phone-on-a-table-Xh3k8-vfl8s
Eccles, B. (2024). Jolly Roger flag flying in Swanage harbour. [Online image]. Unsplash. https://unsplash.com/photos/a-pirate-flag-with-a-skull-and-crossbones-on-it-RvLyITzf4NU
Flew, T. (2018). Platforms on trial. InterMEDIA, 46(2), 24-29.
https://eprints.qut.edu.au/120461/
Knibbs, K. (2024a, January 10). Scammy AI-Generated Book Rewrites Are Flooding Amazon.
Wired. https://www.wired.com/story/scammy-ai-generated-books-flooding-amazon/
Knibbs, K. (2024b, December 19). Every AI Copyright Lawsuit in the US, Visualized. Wired.
https://www.wired.com/story/ai-copyright-case-tracker/
Lucchi, N. (2024). ChatGPT: A Case Study on Copyright Challenges for
Generative Artificial Intelligence Systems. European Journal of Risk Regulation, 15,
602–624. https://doi.org/10.1017/err.2023.59
McCallum, S. (2022, December 24). Meta settles Cambridge Analytica scandal case for
$725m. BBC. https://www.bbc.com/news/technology-64075067
Meta. (2024a, April 18). Meet Your New Assistant: Meta AI, Built With Llama 3.
https://about.fb.com/news/2024/04/meta-ai-assistant-built-with-llama-3/
Meta. (2024b, July 23). Introducing Llama 3.1: Our most capable models to date.
https://ai.meta.com/blog/meta-llama-3-1/
Meta. (2024c, December 19). The future of AI: Built with Llama.
https://ai.meta.com/blog/future-of-ai-built-with-llama/
Meta. (2025, January 1). Terms of Service. Retrieved April 7, 2025, from
https://www.facebook.com/terms/
Mirjalili, S. (2024, August 2). Meta just launched the largest ‘open’ AI model in history. Here’s
why it matters. The Conversation. https://theconversation.com/meta-just-launched-
the-largest-open-ai-model-in-history-heres-why-it-matters-235689
Mrva-Montoya, A. (2025, April 1). Meta allegedly used pirated books to train AI. Australian
authors have objected, but US courts may decide if this is ‘fair use’. The Conversation.
https://theconversation.com/meta-allegedly-used-pirated-books-to-train-ai-
australian-authors-have-objected-but-us-courts-may-decide-if-this-is-fair-use-253105
Reisner, A. (2025a, March 20). Search LibGen, the Pirated-Books Database That Meta Used
to Train AI. The Atlantic. https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-
set/682094/
Reisner, A. (2025b, March 20). The Unbelievable Scale of AI’s Pirated-Books Problem. The
Atlantic. https://www.theatlantic.com/technology/archive/2025/03/libgen-meta-
openai/682093/
Tomasso, P. (2016). [Online image of overlapping open books]. Unsplash.
https://unsplash.com/photos/open-book-lot-Oaqk7qqNh_c
Tullius, T. (2020). A painting on a wall warning visitors about video surveillance [Online
image]. Unsplash. https://unsplash.com/photos/black-and-white-wall-mounted-light-Q2-EQDwxFtw
Wiggers, K. (2025, March 21). Meta has revenue sharing agreements with Llama AI model
hosts, filing reveals. TechCrunch. https://techcrunch.com/2025/03/21/meta-has-
revenue-sharing-agreements-with-llama-ai-model-hosts-filing-reveals/
Zuboff, S. (2015). Big other: Surveillance Capitalism and the Prospects of an Information
Civilization. Journal of Information Technology, 30(1), 75-89.
https://doi.org/10.1057/jit.2015.5