What does the American Privacy Rights Act mean for AI Governance?
Plus: Non-English AI, AI training AI, and why AI Governance is getting harder
Hi there! Stay tuned. We’ve got some exciting new research that we’ll be publishing over the coming weeks around model transparency evaluations. More to come.
In today’s edition of The Trustible Newsletter (a 6-minute read):
What does the American Privacy Rights Act mean for AI Governance?
The Rise of Non-English LLMs
AI-Ception – When models are trained on AI content
Why AI Governance is going to get a lot harder
Stanford HAI Summarizes the World of AI
—
1. What does the American Privacy Rights Act mean for AI Governance?
On April 7, 2024, Senator Maria Cantwell (D-WA) and Representative Cathy McMorris Rodgers (R-WA) announced the American Privacy Rights Act (APRA), a proposed bill that would enact a comprehensive federal privacy law for the U.S. APRA is the second bipartisan federal privacy bill proposed within the last two years, after the American Data Privacy and Protection Act failed to advance in the House of Representatives in 2022.
APRA contains several provisions related to regulating AI. First, APRA would require ‘large data holders’ to conduct annual impact assessments on covered algorithms that pose a risk of harm to individuals or groups of people based on characteristics such as age (e.g., minors), race, gender, or political affiliation. Covered entities and service providers would also be responsible for evaluating the design of any covered algorithm before deployment to reduce the risk of such harm. Finally, the proposed bill would mandate that individuals are notified about, and can opt out of, the use of covered algorithms to make or facilitate ‘consequential decisions,’ such as access to healthcare, employment opportunities, housing, and credit.
The proposed bill empowers both the Federal Trade Commission and states’ attorneys general to enforce the law. Interestingly, the proposed legislation also allows individuals to sue covered entities that violate certain provisions, including the sections related to AI. In the past, such private rights of action have been a barrier to passing federal privacy legislation.
APRA will also face headwinds because it overrides state privacy laws (known as preemption). Lawmakers from California have opposed federal privacy legislation that preempts the California Consumer Privacy Act, arguing that any federal privacy law should be a floor, not a ceiling. As more states enact their own privacy laws, lawmakers from those states are likely to push back on proposed legislation that preempts them.
Our Take: APRA advances the conversation on AI governance as it attempts to address the lack of comprehensive privacy protections and regulations. The impact assessment and algorithmic design provisions underscore what policymakers are concerned about regarding AI and should push organizations to think about how they design, deploy, and monitor their AI systems. However, the proposed bill presents some contradictions as it seeks to balance data privacy protections with its AI-related provisions. Specifically, the data minimization requirements are at odds with how AI systems operate, since those systems require large amounts of data to function accurately. Moreover, much as with state privacy laws, as more states look to rein in AI they may resist ceding authority over AI regulation to Congress.
2. The Rise of Non-English LLMs
Earlier this week, OpenAI announced a new office in Tokyo and the development of a GPT-4 variant focused on better Japanese language support. This investment in Japan, which is seeking to be an AI leader, is likely the beginning of broader investments from ‘Big AI’ into non-Western countries. One key limitation of many top large language models is that they work best in English. While few developers disclose the language distribution of their training datasets, the Common Crawl dataset, a massive corpus of billions of web pages used to train most LLMs, is roughly 46% English. While many leading LLMs show impressive machine translation abilities and can support non-English languages to varying extents, driving adoption can still be a challenge without localized support staff, internationalized documentation, or licensing deals with leading media and publishing companies across the globe.
Numerous other LLM creators are heavily focused on building better support for non-English languages. Microsoft recently invested in Abu Dhabi-based G42 as part of broader efforts to forge closer ties between US and Middle Eastern tech companies. Earlier this year, Cohere for AI released Aya, along with several supporting datasets, covering 101 languages, many of them under-resourced. Similarly, Mistral and 01.ai have been focusing on better support for French and Mandarin, respectively, in an effort to create localized AI ecosystems.
Adding new language support to an LLM isn’t as simple as just feeding it the right data, however. Many of the steps leading up to the ‘deep learning’ work, such as labeling, deduplicating, and tokenizing data sources, can differ greatly depending on the character set, grammatical rules, or even reading direction of the languages involved.
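As a rough illustration of how unevenly today’s tokenizers treat different languages, the sketch below uses OpenAI’s open-source tiktoken library to count tokens for roughly equivalent sentences. The sentences and the exact counts are our own illustrative assumptions, not figures from any of the announcements above, and results will vary by tokenizer.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the open-source tokenizer used by recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences (illustrative examples only).
samples = {
    "English": "The weather is very nice today.",
    "Japanese": "今日はとてもいい天気ですね。",
    "Thai": "วันนี้อากาศดีมาก",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    print(f"{language:9s} {len(text):3d} chars -> {len(tokens):3d} tokens")

# Languages under-represented in the training data typically need more tokens
# per character, which means higher cost and a shorter effective context window.
```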
Key Takeaways: According to Edelman, people in developing economies such as Saudi Arabia, India, China, Kenya, Nigeria, and Thailand are more likely to embrace AI than their peers in developed countries, who are more likely to reject it. It’s no wonder AI developers are expanding their services to non-English languages.
3. AI-Ception – When models are trained on AI content
Adobe’s generative AI feature, Firefly, was apparently trained on content that was itself generated by competitors’ AI tools. According to a recent Bloomberg report, up to 5% of the training data used for Firefly may have come from Midjourney or similar image generation tools. This is despite a moderation process intended to vet all content included in the ‘Adobe Stock’ dataset for copyright issues and toxic or otherwise inappropriate material. The difficulty is that even with strict filtering, generated content remains extremely hard to identify, and no watermarking or content verification schemes are consistently enforced. Adobe has been leading efforts to change that with the Content Authenticity Initiative, although adoption of that standard is not yet widespread.
While using a small amount of generated content may not be illegal, it can degrade the quality of the resulting models. A 2023 paper by researchers at Rice and Stanford University tested an image generation model trained on its own outputs. They found that, across multiple variants, image quality quickly degraded. They called this rapid degradation ‘Model Autophagy Disorder’ (MAD) and offered a range of explanations for the effect.
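The effect is easy to reproduce in miniature. The toy sketch below is our own numerical analogue, not the paper’s actual experiment: it repeatedly fits a simple Gaussian “model” to a dataset, rebuilds the dataset entirely from the model’s own samples while keeping only the most typical ones, and watches the data’s diversity collapse within a few generations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

for generation in range(1, 6):
    # "Train" a toy generative model: fit a Gaussian to the current dataset.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: fitted std = {sigma:.4f}")

    # Build the next dataset entirely from the model's own samples, keeping
    # only the most "typical" half (a stand-in for cherry-picking good outputs).
    samples = rng.normal(loc=mu, scale=sigma, size=20_000)
    data = samples[np.argsort(np.abs(samples - mu))[:10_000]]

# The fitted std shrinks by roughly two-thirds each generation: the "model"
# rapidly loses the diversity of the original data.
```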
Authentic, human-generated content is essential for the performance of generative AI models. This may explain why so many AI companies are hungry for licensing deals with platforms full of authenticated human users, like Reddit. Collecting and preserving human-generated data now (perhaps before the majority of the internet becomes AI-generated content?) will be vital for their continued growth.
Key Takeaways: Detecting AI-generated content is difficult even for the best-resourced and best-intentioned organizations. But AI developers may have the biggest incentive to enforce watermarking or verification schemes, since models trained on AI-generated content appear to degrade in quality.
4. Why AI Governance is going to get a lot harder
One of the things that makes AI Governance hard is that it involves collaboration across multiple teams and an understanding of a highly complex technology and its supply chains. It's about to get even harder.
The complexity of AI governance is growing along two dimensions at the same time, and both are poised to accelerate in the coming months:
Internal Complexity
Higher number of AI use cases
More AI-enabled vendors/tools
More stakeholders involved with AI
External Complexity
Customer demands around trustworthy AI
Accelerating & uncertain regulatory activity
Evolving AI risk research & best practices
The challenge: organizations that don't get ahead of this complexity will 1) slow down their AI adoption, 2) introduce risks and harms to stakeholders, and 3) increase compliance costs over time. Read more in our latest blog post.
5. Stanford HAI Summarizes the World of AI
The Stanford Institute for Human-Centered Artificial Intelligence (HAI) released its 2024 AI Index Report. Here’s the link to the 502-page report. Here are some key takeaways as they relate to AI governance:
AI means productivity, and it’s here to stay: AI now beats humans on some tasks, and studies cited in the report indicate that AI makes workers more productive and leads to higher-quality work. However, AI without proper oversight can lead to diminished performance.
Regulations advance: Last year alone, the total number of AI-related regulations in the US grew by 56.3%, with over 20 federal agencies issuing AI regulations. Legislatively, twice as many AI bills were introduced in 2023 as in 2022. Around the world, AI was mentioned in legislative proceedings in 49 countries, with the EU playing a leading role through its landmark AI Act.
Responsible AI is top of mind: Recent research highlights gaps in the standardized evaluation of LLM responsibility (more to come from Trustible on this!), with political deepfakes impacting elections and complex vulnerabilities emerging in LLMs. Globally, businesses express increasing concern about AI risks, including privacy and data security, which is slowing their AI adoption. AI developers' lack of transparency hinders our understanding of model safety, while AI incidents, including biased content and misuse of copyrighted material, are surging, intensifying debates over how to balance immediate and existential risks.
*********
As always, we welcome your feedback on content and how to improve this newsletter!
AI Responsibly,
- Trustible team