Text and Data Mining – Decoding Copyright Challenges in India

Authored by Anju Jain Kumar, Gunjan Jadiya, Hriday Chokshi

Outline

An increasing number of copyright claims in the United States and other countries challenge the use of in-copyright works in machine learning and in AI generated output.

Increasing trend of countries introducing exceptions in their copyright laws enabling machine learning / text and data mining.

Indian copyright law does not include specific safeguards for machine learning or text and data mining and the general exceptions under the law are limited.

Due diligence on the terms of the platforms/databases where the content resides becomes important for participants in the value chain, whether they are content creators or AI product developers.

Introduction

With the increasing use of AI and generative AI, human ingenuity is being challenged by these rapidly advancing technologies. In December 2023, the New York Times[1] sued OpenAI and Microsoft in the United States for the alleged infringing use of its copyright works.[2] The primary claim made by the NY Times is that millions of its articles were used to train chatbots that now compete with it. This legal battle is one among many copyright claims against OpenAI, including actions brought by numerous authors and artists.[3] While the law on the use of in-copyright works in training data continues to develop in the United States, jurisdictions like Japan, Singapore and the EU have included limited exceptions under their copyright laws to enable text and data mining.[4] Closer to home, a pivotal question looms: how does copyright law in India balance the interests of copyright owners on the one hand and the enabling of machine learning and AI on the other? According to a recent press release, the GOI[5] has expressed confidence in the adequacy of the copyright laws to address concerns surrounding AI generated works and related innovations. This write-up looks at this question under Indian law.

TDM

Assembling training data through TDM is the foundational step for any AI model. TDM is the systematic collection of extensive digitized material, coupled with the use of software to analyze and extract valuable information from this corpus.[6] This involves web scraping, web crawling, and web archiving, amongst other things. The EU Directive on Copyright describes TDM as “New technologies that enable the automated computational analysis of information in digital form, such as text, sounds, images or data.”[7] According to the EU Directive, TDM makes possible the processing of large amounts of information to gain new knowledge and discover new trends. While TDM finds application in several non-AI[8] contexts, this write-up focuses on TDM employed for training AI models.

Use Cases

A. Machine learning, deep learning, pattern recognition without reproducing in-copyright works in generative output

TDM involves collecting and cleaning data for analysis, pattern recognition, deep learning, etc. This includes making a copy of the data to be studied and subsequently transferring it to a tool for examination.[9] Making a copy of, or reproducing, an in-copyright work is an exclusive right vested in the copyright owner. Doing so without authorization is therefore infringement unless it is permitted by the copyright owner or by an exception under Indian copyright law.

One of the claims made in the NYT Complaint is that the act of making an unauthorized copy of in-copyright works for machine learning amounts to copyright infringement.[10] In previous non-AI related cases[11], US courts have supported the view that copying of in-copyright texts in TDM for research purposes is fair use. These claims are yet to be determined in the context of AI. Countries like Singapore[12] and Japan[13] have reduced the uncertainty by introducing exceptions in their copyright laws that permit copying of in-copyright works for machine learning, pattern recognition, and data verification, subject to conditions.[14] The EU Directive on Copyright directs the member states to allow reproduction of in-copyright works for TDM (i) by research organizations and cultural heritage institutions, for the purpose of scientific research; and (ii) for all other purposes, on the condition that the right holder has not opted out of such use of their work.[15]

In India, there are no specific exceptions for copying or reproduction of in-copyright works for machine learning purposes. Therefore, the use cases need to fall within the ambit of existing exceptions under the copyright law, such as fair dealing.[16] The scope of fair dealing in India is narrow and applies to literary, dramatic, musical, or artistic works.[17] Sound recordings and cinematograph films fall outside the scope of fair dealing.[18] Only use cases such as private or personal use, including research, criticism, or review, that satisfy the test of fair dealing are not considered infringement.[19] The courts have traditionally looked at the following three factors in deciding what is fair dealing of an in-copyright work: (i) the amount and substantiality of the portion used; (ii) the purpose and character of the use; and (iii) the effect on the potential market.[20] Courts have held that if the purpose of the use is commercial in nature, then it is not considered private or personal use and thus falls outside the scope of fair dealing.[21]

Keeping in mind the narrow exceptions under Indian copyright law, it would be prudent to evaluate certain aspects of the TDM activity, for example (i) the purpose or use of the TDM and whether any such purpose falls within the exceptions; and (ii) the terms and conditions of the databases/datasets that are used for machine learning or TDM.

B. Machine learning, deep learning, pattern recognition with use of in-copyright works in generative output

In generative AI, the training data may also be reproduced while generating responses, solutions, or services. Such reproductions in output could trigger rights of copyright holders such as reproduction rights, communication rights and adaptation rights. In the NYT Complaint, the NY Times claims that “the current GPT-4 LLM will output near-verbatim copies of significant portions of Times Works when prompted to do so. Such memorized examples constitute unauthorized copies or derivative works of the Times Works used to train the model.”[22] Other cases in the United States have made similar claims. In October 2023, Universal Music Group filed a copyright infringement lawsuit against Anthropic AI alleging that the AI is “copying and distributing lyrics from over 500 songs by renowned artists such as Katy Perry, the Rolling Stones, and Beyoncé.”[23]

Under Indian law, the analysis would hinge on where the generative output falls on the spectrum of copyright: full reproduction, adaptation/derivative work, or new original work. The test commonly used by Indian courts has been whether the work is substantially similar to the in-copyright training data. If there is substantial similarity, it would be considered infringement unless it falls within the statutory exceptions, which, as we observed in A, are limited. Courts have looked at (i) the quality of the content copied as opposed to its quantity[24]; (ii) the ‘total concept and feel test’, where the determination is based on whether a reader, spectator, or viewer, after experiencing both works, unmistakably perceives the subsequent work as a copy of the original;[25] and (iii) the abstraction-filtration-comparison test, which involves analyzing works by abstracting their core ideas, filtering out unprotectable elements, and comparing the remaining protected elements to assess if infringement has occurred.[26]

Our Lens

We are seeing an increasing number of countries amending their copyright laws to include TDM-related exceptions, some wider than others. These changes are being made so that those countries can participate and stay ahead in the building and adoption of AI models. In India, there has been a history of exceptions being carved out to balance the rights of copyright holders and technological advancements.[27] The government’s current stance, as articulated in the press release, indicates the absence of immediate plans to modify existing laws in the context of training data and AI. Would the existing exceptions support the increasing use of in-copyright works in training AI models for commercial use? Unlikely. Currently, individual participants in the value chain are left to determine how their works and databases are used and the commercial terms associated with such use.[28]

[1] Hereinafter “NY Times”.

[2] The New York Times Vs. Microsoft Corporation, OpenAI & Ors. (2023) (Hereinafter “NYT Complaint”).

[3] Authors Guild Vs. OpenAI and Ors. (2023); Andersen Vs. Stability AI (2023).

[4] (Hereinafter “TDM”) ‘Factsheet on Copyright Act 2021’ (Intellectual Property Office of Singapore, 24 November 2022); ‘EU Directive on Copyright’ (European Parliament, 2019); Articles 30-4, 47-4, 47-5 of the Japanese Copyright Law, 1970. See also: ‘Japan Amends its Copyright Legislation to Meet Future Demands in AI and Big Data’ (European Alliance for Research Excellence, 3 September 2018).

[5] Government of India. See also: the Press Release.

[6] ‘Text and Data Mining – What is TDM?’ (University of Cambridge); ‘Text Data Mining: A Proposed Framework and Future Perspectives’ (International Journal of Business Information Systems, 2015).

[7] ‘EU Directive on Copyright’ (European Parliament, 2019).

[8] For example, TDM finds application in scientific research for efficient literature analysis and in business intelligence for market trend identification and legal compliance research. The use of AI is not necessary for such analysis. The term “TDM” was coined in 1999 by Marti A. Hearst.

[9] ‘Text and Data Mining’ (University of Birmingham).

[10] (n 2).

[11] Authors Guild, Inc. Vs. Google, Inc. (804 F.3d 202); Authors Guild, Inc. Vs. HathiTrust (902 F.Supp.2d 445).

[12] (n 4).

[13] (n 4).

[16] Section 52, Indian Copyright Act.

[17] Super Cassettes Industries Limited and Ors. Vs. Chintamani Rao and Ors. 2011 SCC OnLine Del 4712.

[18] Ibid.

[19] Section 52, Indian Copyright Act.

[20] Civic Chandran Vs. C. Ammini Amma, 16 PTC 329 Madras; Blackwood and Sons Vs. A.N. Parsuraman, AIR 1959 Madras 410.

[21] Super Cassettes Industries Ltd. Vs. Hamar Television Network Pvt. Ltd. and Ors. 2011(45) PTC 70(Del); Tips Industries Ltd. Vs. Wynk Music Ltd. and Ors. 2019 SCC OnLine Bom 13087.

[22] (n 2).

[23] ‘Universal Music files $75 million lawsuit against AI firm Anthropic for copying Rolling Stones, Beyonce lyrics’ (The Economic Times, 20 October 2023).

[24] R.G. Anand Vs. M/S Deluxe Films and Ors., AIR 1978 SC 1613.

[25] Ibid.

[26] Shamoil Ahmad Khan Vs. Falguni Shah & Ors. 2020 SCC OnLine Bom 665; See also: The “Abstraction, Filtration, Comparison” Test (Ladas & Perry LLP).

[27] An example of the same is the software related amendments made to the Copyright Act in 1994.

[28] YouTube, TikTok, and Instagram.