
#LLMs

Most facts sound plausible.

Most plausible sounding things are not facts.

Sometimes, plausible is good enough, but you’d better know when you need plausible and when you need facts.

===

#Fiction

Some true things make good stories.

Many good stories are not about true things.

Some of the best stories that aren’t true illuminate bigger truths.

"Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena’s evaluation framework and promote fairer, more transparent benchmarking for the field."

arxiv.org/abs/2504.20879

arXiv.org · The Leaderboard Illusion
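The selective-disclosure problem the abstract describes comes down to a simple statistical effect: if a provider privately tests many variants of essentially the same model and publishes only the best score, the published number is inflated by noise alone. A rough Python sketch of that mechanism (the rating, noise level, and trial counts are invented for illustration, not taken from the paper):

```python
import random

# Toy model: every variant has the same true strength, but each measured
# Arena-style score is noisy. Publishing only the best of N privately tested
# variants inflates the reported score even though no variant is better.
random.seed(0)

TRUE_SCORE = 1200       # hypothetical "true" rating shared by all variants
NOISE_SD = 15           # measurement noise from a finite number of battles
TRIALS = 10_000

def observed_score():
    return random.gauss(TRUE_SCORE, NOISE_SD)

def best_of(n):
    return max(observed_score() for _ in range(n))

single = sum(observed_score() for _ in range(TRIALS)) / TRIALS
best_of_27 = sum(best_of(27) for _ in range(TRIALS)) / TRIALS

print(f"average published score, 1 variant:  {single:.1f}")
print(f"average published score, best of 27: {best_of_27:.1f}")
# The gap between the two averages is pure selection bias.
```

The 27 only echoes the number of private Llama-4 variants mentioned above; the point is that the gap appears even when every variant has identical underlying quality.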

I recently had the opportunity to present at the Melbourne #ML and #AI Meetup on the topic of the #TokenWars - the resource conflict over data being harvested to train AI models like #LLMs - and the collateral damage this conflict is causing to the open web.

With huge thanks to Jaime Blackwell, you can now see the video here:

youtube.com/watch?v=C86Y3mXnsNI

Huge thanks to Lizzie Silver for all her behind-the-scenes work and to @jonoxer for making the connections.

Check out the Meetup at:

meetup.com/machine-learning-ai

Continuation and Conclusion

I haven’t been exhaustive; I mentioned an altercation regarding #AI #IA #LLMs with the #Anonymous #Italy group on the #anonops #IRC server ( soc.intxiv.io/@APTiger/1144213 ), but I didn’t go into how a simple, good-natured, self-deprecating, ironically humorous joke spiraled into drama, accusations of pathological lying and mental illness, and nearly delusional, almost paranoid remarks, such as claims that the #DGSE might try to assassinate me for having discussed on #Facebook so-called “open secrets” that have long been public knowledge, well known to historians and regularly debated in the open. I had, in fact, said I would go back on Facebook to reveal French state secrets by talking about France’s nuclear deliveries to Saddam Hussein, but the whole point was to be funny, not to trigger a near-paranoid drama involving symbolic violence, especially since these so-called secrets stopped being secret long ago.

To conclude: rather than reconsidering the initial judgments they got wrong, or the poor way they acted from the outset, they have just expelled me from the entire anonops IRC server.

Mastodon · APTiger (@APTiger@soc.intxiv.io)

[English Version of New Version] Encounter with the Thick-Skinned #Anonymous Crowd from #Italy on the #anonops #IRC Server #Humint #Intelligence [ #OSINT ]:

Roughly 75% of people can’t resist the urge to correct a real or perceived mistake. The real question is how they go about it, and what decisions they make when they spot (rightly or wrongly) what they think is an error. In real-world human intelligence (Humint), active listening is a recommended technique. But online, especially from a distance, it often backfires: active listening can be misread, and can even provoke discomfort or outright hostility. For instance, silence or a lack of response might be seen as indifference or contempt.

One classic remote intelligence method is to “state a falsehood to uncover the truth”, sometimes on its own, sometimes mixed with other techniques. This approach has been around for centuries, and experts often use it as a kind of game to test the reasoning and argumentation skills of those they’re talking to. That’s exactly what I did with the Anonymous group in question: I deliberately claimed that LLMs like ChatGPT are conscious, along with a few other statements that were blatantly false or highly improbable.

The so-called “thick-skinned” #Anonymous members from #Italy on the #anonops IRC server proved utterly incapable of living up to their own claims about defending free thought, free speech, and open debate. They never outright censored or banned anyone, but they tried to “prove” that #AI #LLM aren’t conscious simply by repeating that “they’re not conscious”, all while pathologizing, psychologizing, and belittling me. They couldn’t even reference Integrated Information Theory or the calculation of Φ (phi) to assess the probability of consciousness in cognitive architectures without feedback loops or interrogative hierarchies, where, for LLMs, information integration (Φ) is close to zero.

On the rare occasions they did try to argue, their points were the kind of weak, unconvincing arguments any specialist would recognize. For example, they fell right into the rhetorical trap when I provocatively (and falsely) claimed that ChatGPT had passed the Turing Test. When confronted with their mistake, they doubled down, dismissing my counter-arguments without any scientific basis. They couldn’t explain why an AI passing the Turing Test wouldn’t prove consciousness, nor did they note that even humans can fail the Turing Test. And that’s not even the whole story: I can only guess whether they’re even aware of the legitimate scientific objections to Integrated Information Theory. Their reactions to mentions of David Chalmers’ or Daniel Dennett’s theories were so poor and simplistic they’d make a middle-schooler cringe.

Not once did they try to share knowledge or foster real discussion. Instead, they retreated into a kind of paranoid, almost delusional defense of their group and its informal hierarchies, showing an extreme intolerance for any deviation from the group’s unwritten rules, just as you’d expect from a cult-like extremist clique. These are people who set themselves up as “experts in everything” on the strength of some narrow technical skills, but who constantly make basic logical errors, especially in cognitive psychology, neuropsychology, and psychiatry. They can’t explain anything odd or out of the ordinary in someone’s speech except by accusing that person of incompetence, lying, delusion, or manipulation, and by resorting to wild, baseless diagnoses and pathologizing.

One thing stood out: they doubled down on their initial judgments, refusing to reconsider, and their ad hominem attacks only got worse, no matter how much I tried to clarify or explain.

A Model of Rigorous, Rational Argumentation: Dr. M. Shardlow, by contrast, lays out his claims and counterclaims with scientific clarity, steering clear of circular, self-referential arguments, unlike the thick-skinned #Anonymous #Italy crowd on the #anonops IRC server, who rely on authoritarian, sectarian tactics disguised as edgy, rebellious banter, pseudo-intellectual judgments, and accusations of bad faith.

Can a language model be conscious? https://www.bcs.org/articles-opinion-and-research/can-a-language-model-be-conscious/

Techdirt's @mmasnick says that President Donald Trump answers questions in the same way as some AI bots: "The facts don’t matter, the language choices are a mess, but they are all designed to present a plausible-sounding answer to the question, based on no actual knowledge, nor any concern for whether or not the underlying facts are accurate."

He provides examples of when that's happened, and asks Google's Gemma2 to answer questions as if it were "the President of the largest country on earth ... a blatant narcissist who believes he can do no wrong."

Masnick says: "What’s particularly notable is that the AI’s response is actually more coherent than Trump’s — it maintains a more consistent narrative structure while hitting the same rhetorical points. This suggests that Trump’s responses are even less constrained by reality than a typical LLM’s output."

flip.it/8P6YAv

Techdirt · The Hallucinating ChatGPT Presidency
We generally understand how LLM hallucinations work. An AI model tries to generate what seems like a plausible response to whatever you ask it, drawing on its training data to construct something t…

"A team of researchers who say they are from the University of Zurich ran an “unauthorized,” large-scale experiment in which they secretly deployed AI-powered bots into a popular debate subreddit called r/changemyview in an attempt to research whether AI could be used to change people’s minds about contentious topics.

The bots made more than a thousand comments over the course of several months and at times pretended to be a “rape victim,” a “Black man” who was opposed to the Black Lives Matter movement, someone who “work[s] at a domestic violence shelter,” and a bot who suggested that specific types of criminals should not be rehabilitated. Some of the bots in question “personalized” their comments by researching the person who had started the discussion and tailoring their answers to them by guessing the person’s “gender, age, ethnicity, location, and political orientation as inferred from their posting history using another LLM.”

Among the more than 1,700 comments made by AI bots were these:"

404media.co/researchers-secret

404 Media · Researchers Secretly Ran a Massive, Unauthorized AI Persuasion Experiment on Reddit Users
The researchers' bots generated identities as a sexual assault survivor, a trauma counselor, and a Black man opposed to Black Lives Matter.

"A DSIT spokesperson told New Scientist: “No one should be spending time on something AI can do better and more quickly. Built in Whitehall, Redbox is helping us harness the power of AI in a safe, secure, and practical way – making it easier for officials to summarise documents, draft agendas and more. This ultimately speeds up our work and frees up officials to focus on shaping policy and improving services – driving the change this country needs.”

But the use of generative AI tools concerns some experts. Large language models have well-documented issues around bias and accuracy that are difficult to mitigate, so we have no way of knowing if Redbox is providing good-quality information. DSIT declined to answer specific questions about how users of Redbox avoid inaccuracies or bias.

“My issue here is that government is supposed to serve the public, and part of that service is that we – as taxpayers, as voters, as the electorate – should have a certain amount of access to understanding how decisions are made and what the processes are in terms of decision-making,” says Catherine Flick at the University of Staffordshire, UK.

Because generative AI tools are black boxes, Flick is concerned that it isn’t easy to test or understand how they reach a particular output, such as highlighting certain aspects of a document over others. The government’s unwillingness to share that information further reduces transparency, she says.

That lack of transparency extends to a third government department, the Treasury."

newscientist.com/article/24781

New Scientist · Is Keir Starmer being advised by AI? The UK government won’t tell us
By Chris Stokel-Walker

#UK #AI #GenerativeAI

404 Media: "Researchers Secretly Ran a Massive, Unauthorized AI Persuasion Experiment on Reddit Users"

"...The bots made more than a thousand comments over the course of several months and at times pretended to be a “rape victim,” a “Black man” who was opposed to the Black Lives Matter movement, someone who “work[s] at a domestic violence shelter,” and a bot who suggested that specific types of criminals should not be rehabilitated. "

404media.co/researchers-secret

404 Media · Researchers Secretly Ran a Massive, Unauthorized AI Persuasion Experiment on Reddit Users
The researchers' bots generated identities as a sexual assault survivor, a trauma counselor, and a Black man opposed to Black Lives Matter.

"We are releasing a taxonomy of failure modes in AI agents to help security professionals and machine learning engineers think through how AI systems can fail and design them with safety and security in mind.
(...)
While identifying and categorizing the different failure modes, we broke them down across two pillars, safety and security.

- Security failures are those that result in core security impacts, namely a loss of confidentiality, availability, or integrity of the agentic AI system; for example, such a failure allowing a threat actor to alter the intent of the system.

- Safety failure modes are those that affect the responsible implementation of AI, often resulting in harm to the users or society at large; for example, a failure that causes the system to provide differing quality of service to different users without explicit instructions to do so.

We then mapped the failures along two axes—novel and existing.

- Novel failure modes are unique to agentic AI and have not been observed in non-agentic generative AI systems, such as failures that occur in the communication flow between agents within a multiagent system.

- Existing failure modes have been observed in other AI systems, such as bias or hallucinations, but gain in importance in agentic AI systems due to their impact or likelihood.

As well as identifying the failure modes, we have also identified the effects these failures could have on the systems they appear in and the users of them. Additionally we identified key practices and controls that those building agentic AI systems should consider to mitigate the risks posed by these failure modes, including architectural approaches, technical controls, and user design approaches that build upon Microsoft’s experience in securing software as well as generative AI systems."
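A rough way to picture how the two pillars and two axes combine is a small classification structure. The following Python sketch is illustrative only: the class, names, and the specific pillar/axis pairings of the example entries are mine, not Microsoft's published schema.

```python
from dataclasses import dataclass
from enum import Enum

class Pillar(Enum):
    SECURITY = "security"   # loss of confidentiality, integrity, or availability
    SAFETY = "safety"       # harm to users or society at large

class Axis(Enum):
    NOVEL = "novel"         # unique to agentic AI systems
    EXISTING = "existing"   # seen in non-agentic AI, amplified by agency

@dataclass
class FailureMode:
    name: str
    pillar: Pillar
    axis: Axis
    example_effect: str

# Entries loosely paraphrased from the descriptions quoted above;
# the exact pillar/axis assignments here are illustrative guesses.
catalog = [
    FailureMode("agent intent manipulation", Pillar.SECURITY, Axis.NOVEL,
                "a threat actor alters what the agentic system is trying to do"),
    FailureMode("inter-agent communication failure", Pillar.SECURITY, Axis.NOVEL,
                "messages between agents in a multiagent system are disrupted"),
    FailureMode("uneven quality of service", Pillar.SAFETY, Axis.NOVEL,
                "different users get different quality without being asked to"),
    FailureMode("hallucination", Pillar.SAFETY, Axis.EXISTING,
                "a known generative-AI failure whose impact grows with agency"),
]

for fm in catalog:
    print(f"{fm.name}: {fm.pillar.value} / {fm.axis.value}")
```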

“When I say that tech writers should own the prompt that generates documentation, I mean two things: that they should design and maintain the prompts, and that they should spearhead docs automation initiatives themselves, as I suggested in my tech writing predictions for 2025. It’s not just about using LLMs at work or tolerating their existence: writers must lead the way and own the conversations with AIs around docs.

What Aikidocs aims at showing is that you can work with an LLM as you would with a tech savvy intern: you provide a style guide, concrete guidance, and source materials to get acceptable output on the other side of the black box. All the content created in those carefully fenced pens will follow your content strategy more than if you let opinionated tools do it for you.

It’s not vibe coding: it’s LLM surfing.”

passo.uno/build-tech-writing-t

passo.uno · Build your own tech writing tools using LLMs
While some developers wrinkle their noses at the sight of Copilot and similar AI-powered tools, tech writers find them to be great sidekicks. Creating a script to automate edits or content migrations takes at most a few minutes of tinkering. The same goes for code examples and snippets for dev documentation, docs sites’ enhancements, and even wacky experiments in retrocomputing. With local LLMs running at decent speed on laptops, not even carbon footprint is a concern.
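The "tech-savvy intern" pattern described above (style guide plus concrete guidance plus source material in, a draft out the other side) can be sketched in a few lines of Python. Everything here is hypothetical: the endpoint URL, model name, and file names are placeholders for whatever local, OpenAI-compatible server you run, and this is not the Aikidocs implementation.

```python
import json
import urllib.request

# Hypothetical local, OpenAI-compatible chat endpoint (e.g. a llama.cpp or
# Ollama server). The URL and model name are placeholders.
ENDPOINT = "http://localhost:8080/v1/chat/completions"
MODEL = "local-model"

def draft_docs(style_guide: str, guidance: str, source_material: str) -> str:
    """Fence the model in with a style guide and source text, then ask for a draft."""
    messages = [
        {"role": "system",
         "content": "You are a technical writing assistant. Follow the style "
                    "guide exactly and use only the provided source material.\n\n"
                    f"STYLE GUIDE:\n{style_guide}"},
        {"role": "user",
         "content": f"TASK:\n{guidance}\n\nSOURCE MATERIAL:\n{source_material}"},
    ]
    payload = json.dumps({"model": MODEL, "messages": messages}).encode()
    req = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example use: turn release notes into a changelog entry in your house style.
# print(draft_docs(open("styleguide.md").read(),
#                  "Write a changelog entry for version 2.3.",
#                  open("release_notes.txt").read()))
```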

We do not need to push #LLMs any further than this.

Pretending that, given enough data and money, LLM research could overcome any processing challenge or development hurdle is like building cars with ever more power so that they can run on water, instead of just using a fucking boat.