Back in the day when all of our servers and apps were on prem (ha ha, the 00s...those were the days), the saying "garbage in/garbage out" was religion. At my first tech gig in the 2000s, everyone just took that statement to heart whenever we considered dumping new data sources into any of our on-prem DBs, especially our CRM.
I used to refer to this practice as "data hygiene": the idea that if you infect your most crucial resource (in this case, the database of potential customers for our sales folks to qualify and call, to whom we would blast emails, and so forth) with absolute garbage, then what you get out is burned-out salespeople, diminishing revenue, and a database that's basically worthless.
This particular article just reminds me of why this is so important, and drills down into the specifics of the "WHY" of that. I know a bunch of us saw this coming, and I think that (utterly speculative on my part) the enterprise applications for next-gen AI, especially anything built on ChatGPT, will have a layer of impatience that may end this current wave sooner than anyone thinks.
No single free platform ever gives a shit about the quality of the end user experience. However, B2B environments are wildly different, and those who run them give all the shits about the quality of data when it impacts their customers and, eventually, their revenue.
Do we need data hygiene classes for the folks that came into professional and enterprise environments after cloud computing became the standard?
Did it have a rib removed?
We are quickly on a path where all search results and AI responses will be complete garbage. I don't see how you roll this trend back.
I think if Google et al are dumb enough to head towards the blinding lights in the tunnel without swerving, we will start using different search applications that don't serve us literal nonsense. It's already bad enough. If it gets worse, there will be a critical mass of folks who will patch together other behaviors. I mean...it wasn't that long ago when we used to do this all the time.
My worry is that it’s normal citizens who garbage up the web. If everyone has open access to generative LLMs at low cost and uses them to write blog posts, articles, and research papers, all we get is AI stuff. The web will have two generations: before AI and after AI.
“If Google consistently gives you garbage results in search, for example, you might be more inclined to pay for sources you trust and visit them directly.”
If the AI content apocalypse spreads like a virus across the internet, people will take refuge in smaller, quieter, higher-signal spaces -- many of which will be paid. I could see the tech giants who have the major LLMs paying for content. There was a story today about media orgs potentially banding together on this.
The media is already moving towards smaller, niche outlets. Perhaps AI infecting the "open" internet/social internet beyond use, with enough junk and misinformation, will lead to a tipping point that -- while nightmarish (2024 elections) at first -- ultimately speeds the path to a better internet. Rather than an internet defined by nation-state platforms, we'll have villages and archipelagos of good information -- and yes, it will balkanize too. But that means fewer people yelling at each other with no path to common ground. This may be overly hopeful -- which is not my forte -- but the trend lines are already there. Google's generative search makes SEO a waste of time. The content marketers will double down on social media for distribution, and the platforms will become literally nothing but marketing distribution platforms, much of it AI content. The platforms, already in a much-reported decline, become that much more uninhabitable for people trying to interact.
AI looks like it could be the accelerant that pushes it over the edge. And as others have remarked, the new internet maybe looks like a modernized version of the old internet.
Yes, Brandon. If web search results become useless, and if one cares about quality search results, then they will seek out and search in those arenas, like here on Substack. What Google did, and where it delivered insane value, was de-silo all the information across the web and make it searchable. If the web search experience becomes total garbage, then we go back to silos: search Substack, search LinkedIn, search beehive, search each of my favorite blogs, search Spotify, search my email, and on and on. I got a little baby company solving for this; it just turned 2 years old. Who wins in this garbage search world? Well, bloggers that add value on whatever platform they choose, podcasts, your personal content database, your org’s content database. I think LinkedIn has the best shot, maybe Reddit, maybe Meta, but the winners are all on a gradient scale where quality content > noise. The key theme here is that searching for content curated by a human is better than searching for content written by a generative model. Twitter seems to have too much noise. And the web has too much noise and is getting noisier. These two both lose on their current path.
Hi - newbie here. Casey, thank you for your work here and on Hard Fork. Your coverage from the intersection of media/comms/AI/tech got me to hit that "subscribe" button.
Big Tech is marketing the idea that it doesn't need datasets, partly, I think, in response to flak from media/publishers over content mining/extraction. There's also the broader tension in the information economy: media/publishers have been a steady stream of information (some better verified than others), forming much of the internet's text-based corpus.
It feels like everyone's trying to define the value of training data -- from Reddit to my world (I work in public media in Canada).
Thanks again for your coverage of this space -- it's going to be fascinating to see how it all evolves.