Bias in the Age of Data
I was recently diagnosed with a neurodegenerative disorder, not one of the nightmarish ones you dread, but still one that will have a serious impact on my life as it progresses. When I received the diagnosis, I did what any self-respecting adult would do: I asked the expert in front of me zero questions, thanked them for their time, and then spent the next 72 hours researching every progression study, every paper, and every article I could find on it. I learnt a lot about the disease, but I'm sure not much more than I would have learnt had I just asked relevant questions of the expert I had waited so long to see.
Where did this reliance on search engines over experts come from, and at what point did it reach this level? I'm not the only one: a majority of Americans trust Google for their news over actual news outlets. That makes sense to me, but I think it's also part of the problem. There's an overarching view that because a search engine has been coded, it removes the bias that news outlets have become famous for. But data is inherently biased, so how can search results be anything but?
While bias in data sources is hardly a new or undiscovered topic, with the historic, juggernaut rise of GenAI and the use of large language models (LLMs) to create these source-of-truth magic machines, data bias is an issue I don't think we're putting enough emphasis on. GenAI has already become such a hot topic that you can't say GPT three times in a mirror without an elderly relative appearing to ask if ChatGPT has feelings. But depending on what stage the models you're training are at, there is serious importance in keeping an open, well-catalogued collection of the data they're being trained on.
Take a GPT model like the one Bloomberg has spearheaded: how do you guarantee there is no bias, or even accidental agenda, in the tool based on how it was fine-tuned? Will it suggest tools or services that benefit you, the user? Or ones that give the appearance of aid while directly benefiting the organization that built the tool?
I think this openness about what data has been used to build these tools is a must. With the rise of industries trying to patent their own models, I believe a database of all the data a model was trained on should be a prerequisite for a patent to be awarded. While I'm not an avid conspiracy theorist by any means, the secrecy organizations will inevitably adopt to protect their models' unique selling points gives rise to situations that hark back to the demonization of fat in diets across the western world. The Sugar Research Foundation paying researchers to publish a review on fat and heart disease had huge implications for people's health for years to come, and caused sugar consumption to skyrocket as the emphasis shifted to fat-free foods loaded with sugar to compensate for the flavor.
What stops this same interference and bias in data from continuing to happen, if not open datasets and transparency in model training? Would you trust a health tool if you knew it was built with an LLM sponsored by Coca-Cola? What about a technical advisory tool built by AWS? Is the advice you receive to migrate to the cloud unbiased? Does it actually make sense for where your organization is on the maturity index, or is the tool's primary goal value generation, not for the user, but for the creator?
Living in the ‘Age of Data’ has a lot of pros. It has taught me to be skeptical of everything I read and to try, because there is never a guarantee, to base my daily decisions only on unbiased information. But I fear that while the public enjoys the sorcery of these GenAI tools, and the majority don't want to know how they work, it's this very bliss in the ignorance around them that will allow industries to take advantage of the public.