Editor’s Note: Today’s post is by Minhaj Rais. As Senior Manager, Strategy & Corporate Development at CACTUS, Minhaj collates industry insights and conducts in-depth analyses to help shape global organizational strategy and product incubation. In his previous roles, Minhaj has worked in various functions related to editing and reviewing research manuscripts, freelance hiring, marketplace operations, and social media management.
Big data has permeated society, and there is barely any aspect of our lives that is not being mined for data. This data is then fed into machines contributing to a detailed profile that assigns us to various cohorts. Data mining has become so widely prevalent that most of us have become resigned to the fact that privacy has long ago been sacrificed at the altar of a data-driven world.
But there might be a glimmer of hope — advances in AI are actually opening up possibilities for providing solutions without infringing on user privacy. The “post-cookie era”, wherein the use of third-party cookies on websites is being gradually tapered off, is encouraging the emergence of AI-powered solutions that can provide rich, contextualized information through means that aren’t as overtly invasive. In the paragraphs that follow, we first cite a few general examples before delving into some that are more relevant to the scholarly publishing industry.
As one example, in early 2021, Apple and Google devised a system to notify their users of potential exposure to COVID without the need to install any data-mining apps. The system uses Bluetooth short-range wireless communication technology. Virginia was among the first states to launch COVIDWISE Express — an app-less service that provides functionality similar to the COVIDWISE app while allowing users to anonymously share their COVID-positive test results. It’s worth noting that the adoption of these applications was largely unsuccessful because of general mistrust toward tech providers, especially in terms of their privacy protection claims.
On a similar note, while Firefox, Safari, and Brave have been blocking third-party cookies by default, Google Chrome has announced that it will soon join them in deploying technology that is less invasive than tracking cookies. Once in effect, this will drastically change the way privacy and cross-site tracking work on the web, though these services still track users through cohorts and semi-persistent IDs and might not be as “privacy protecting” as generally perceived.
Apple is now giving users more control over how their data is used for online advertising, much to Facebook’s chagrin. Facebook saw good reason to be outraged at this decision because without being able to mine user data, they wouldn’t be able to present highly personalized ads to users. Facebook is most worried about the change in the default setting that requires users to consent to mining their personal data — Apple users now have to opt in to allow advertisers to track their data and use it for personalized advertising. You can imagine the effect this move has had on advertisers — a survey among 2.5 million active daily mobile users showed that only about 4% were allowing apps access to their Identifier for Advertisers (IDFA) tag.
What these three examples highlight is that it is very much possible to serve customers without brazenly infringing on their privacy. The brighter side is that technology leaders such as Google and Apple have now been forced to relent and take some steps in this direction. It has taken several years of effort by rights groups to bring concerns regarding privacy and user rights to the fore, and it’s important that this momentum is maintained — largely because these efforts by big tech are not entirely altruistic. They may instead be preemptive steps to avoid regulation, attempts to dispel a “privacy invading” image, or a search for alternative means to continue business as usual while appearing more “privacy conscious.”
Big data and AI have begun to play an increasingly important role in our lives, and while the bias and prejudice inherent in AI-driven systems are being called out, there is a strong need to emphasize the importance of designing AI systems that work without infringing on user privacy. Experts have pointed out that once your anonymized data becomes part of a large dataset, AI can de-anonymize it based on inferences from other devices. A substantial body of literature firmly establishes that de-anonymization is rather easy in large datasets.
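To make the de-anonymization risk concrete, here is a toy sketch of a classic linkage attack: an “anonymized” dataset is joined to a public one on shared quasi-identifiers such as ZIP code, birth year, and gender. All records below are made up for illustration; real attacks operate at far larger scale, but the mechanism is the same.

```python
# Toy linkage attack: re-identify "anonymized" records by joining them
# to a public dataset on quasi-identifiers. All data here is fictional.

anonymized_records = [
    {"zip": "22901", "birth_year": 1985, "gender": "F", "diagnosis": "flu"},
    {"zip": "10001", "birth_year": 1990, "gender": "M", "diagnosis": "asthma"},
]

# A second, publicly available dataset (e.g., a voter roll) with names.
public_records = [
    {"name": "Alice", "zip": "22901", "birth_year": 1985, "gender": "F"},
    {"name": "Bob", "zip": "10001", "birth_year": 1990, "gender": "M"},
]

def reidentify(anon, public):
    """Join the two datasets on shared quasi-identifiers."""
    matches = []
    for a in anon:
        for p in public:
            if (a["zip"], a["birth_year"], a["gender"]) == (
                p["zip"], p["birth_year"], p["gender"]
            ):
                matches.append({"name": p["name"], "diagnosis": a["diagnosis"]})
    return matches

print(reidentify(anonymized_records, public_records))
```

Even though neither dataset alone reveals who has which diagnosis, the join does — which is why stripping names from a dataset is not, on its own, meaningful anonymization.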
Consider our customer profiles on Amazon. Everything we browse on the site is recorded against our profile data and enriched with behavior recorded from other sites before we’re shown recommendations on Amazon based on our interests. When we stick to just one website or app like Amazon for the majority of our purchases, we’re actually limiting our choices to what’s available on that platform rather than what’s actually available in the big world beyond it.
The important point here is that as long as we stay within its ecosystem, Amazon is able to dictate our purchase choices through predictive algorithms that draw on the thoroughly detailed user profile built from our past purchases and browsing history. We don’t realize this until we actively explore other options. In other words, tech giants use our data to exploit our inherent inertia and our inclination toward certain products and services.
Hence, a paradigm shift is required in the manner in which algorithms are designed. The key lies in figuring out alternatives that can work without a strong dependence on mining personally identifiable information. While this might seem difficult in certain applications, some out-of-the-box thinking can help change prevailing trends. Several approaches can solve this problem without requiring personally identifiable information. One is to use AI-powered technologies that can do the job well without mining personal data, provided privacy protection is designed in while the models are being built.
One such example is UNSILO’s Recommend solution, which provides readers with recommendations based on concept extraction technology without mining any user-identifying data (full disclosure: UNSILO is a product from Cactus, my employer). The tool is designed to provide extremely relevant results in line with users’ interests, which in turn are identified from the concepts related to their search string rather than from third-party cookies or personally identifiable information.
The key aspect here is that the tool illustrates how search results and recommendations can remain highly relevant and accurate without mining personal data. Such tools open the doors to new possibilities and provide direction for futuristic tools that are independent of highly pervasive cookies.
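As an illustration of the general idea (this is not UNSILO’s actual algorithm, just a minimal sketch of content-based recommendation), articles can be ranked purely by how much their concepts overlap with the search string, with no user profile or cookie involved. The article texts, stopword list, and scoring below are invented for the example:

```python
# Minimal content-based recommender: rank articles by concept overlap
# with the query. No user data is stored or consulted at any point.

import re
from collections import Counter
from math import sqrt

ARTICLES = {
    "A1": "machine learning models for protein structure prediction",
    "A2": "privacy preserving federated learning on mobile devices",
    "A3": "citation analysis of machine learning literature",
}

STOPWORDS = {"for", "on", "of", "the", "a", "an"}

def concepts(text):
    """Crude 'concept extraction': lowercase tokens minus stopwords."""
    return Counter(t for t in re.findall(r"[a-z]+", text.lower())
                   if t not in STOPWORDS)

def cosine(c1, c2):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    norm = (sqrt(sum(v * v for v in c1.values()))
            * sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def recommend(query, articles, k=2):
    """Return the k article IDs most similar to the query text."""
    q = concepts(query)
    ranked = sorted(articles,
                    key=lambda a: cosine(q, concepts(articles[a])),
                    reverse=True)
    return ranked[:k]

print(recommend("federated learning privacy", ARTICLES))
```

The ranking depends only on the query and the article texts themselves, which is precisely why such approaches need no third-party cookies to produce relevant results.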
More recently, scite has been making waves with their proposition to change the way research is conducted. Their vision for Smart Citations would allow users to see how a publication has been cited by providing the context of the citation and a classification describing whether it provides supporting or contrasting evidence for the cited claim. Their Citation Statement Search tool could help users find expert analyses and opinions on any topic by directly searching how anything has been cited.
The scite algorithm is meant to help users discover research related to their own and to potentially open a new paradigm in serendipitous discovery and research collaboration opportunities. They are working to build a system that uses natural language processing combined with a network of human experts to classify citations as supporting, contradicting, or simply mentioning.
There are two important aspects here: first, scite shifts the focus to how research has been cited, not just how many times it has been cited; second, scite hopes to offer these results without tapping into personally identifiable user data. The latter aspect is more relevant to our current topic of discussion.
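As a toy illustration of the classification task itself (scite’s production system uses trained NLP models combined with human experts; the cue phrases below are invented for this example), a crude cue-phrase baseline for labeling citation statements might look like this:

```python
# Toy cue-phrase baseline for citation-statement classification
# (supporting / contrasting / mentioning). The cue lists are made up
# and far simpler than any trained model would be.

SUPPORT_CUES = {"confirms", "consistent with", "replicates",
                "in agreement with"}
CONTRAST_CUES = {"contradicts", "in contrast to", "fails to replicate",
                 "disagrees with"}

def classify_citation(statement):
    """Label a citation statement using simple cue-phrase matching."""
    s = statement.lower()
    if any(cue in s for cue in CONTRAST_CUES):
        return "contrasting"
    if any(cue in s for cue in SUPPORT_CUES):
        return "supporting"
    return "mentioning"

print(classify_citation("Our results are consistent with Smith et al. (2019)."))
print(classify_citation("This finding contradicts earlier work [12]."))
print(classify_citation("See [7] for a survey of related methods."))
```

Note that nothing in this pipeline touches the reader at all — the input is the published text, not user behavior, which is what makes the approach compatible with a cookie-free future.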
Another example is the concept of federated learning, deployed by Google in its Gboard smart keyboard to predict which word to type next. Federated learning functions by building a final deep neural network from data stored on many different devices, such as cellphones, rather than in one central data repository. Federated learning helps protect privacy because the original data never leaves the local devices. The flipside, though, is that the aggregated data can still record personally identifiable information and has the potential for privacy abuse.
That said, federated learning drives home the point that even very highly customized and personalized services such as word prediction can be built without compromising on user privacy when the core technology functions around the concept of maintaining privacy.
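A minimal sketch of the federated-averaging idea, using a deliberately simple one-parameter model rather than a real neural network, shows how devices can contribute to a shared model while their raw data never leaves them:

```python
# Minimal federated-averaging sketch. Each "device" trains on its own
# private (x, y) pairs and shares only the model parameter (one weight);
# the server averages the weights. The toy model is y = w * x.

def local_train(w, data, lr=0.01, epochs=50):
    """Run gradient descent on one device's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, device_datasets):
    """One round: devices train locally, server averages the weights."""
    local_weights = [local_train(global_w, d) for d in device_datasets]
    return sum(local_weights) / len(local_weights)

# Each device holds data drawn from y = 3x; none of it is ever uploaded.
devices = [[(1, 3), (2, 6)], [(3, 9)], [(1, 3), (4, 12)]]
w = 0.0
for _ in range(10):
    w = federated_round(w, devices)
print(round(w, 2))  # converges toward 3.0
```

The server never sees a single (x, y) pair, only weights — though, as noted above, shared model updates can themselves leak information, which is why federated learning reduces rather than eliminates privacy risk.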
In sum, it is high time more attention was paid to developing AI and machine learning technologies that focus on providing solutions without infringing on user privacy. Such solutions will also help avoid the imposition of further regulation. Thus, it’s more important now than ever to try to influence policy decisions while we’re still in a position to determine the direction and design of future technologies.
The future must be designed to maintain human dignity and respect user privacy without compromising on the quality of product or service. A few tools have set the right precedent in heading toward a cookie-free future, and we must ensure that the momentum is not lost.
1 Thought on "Guest Post — Can Technology in the Post-cookie World be Designed to Respect User Privacy?"
There’s another very important implication of the ‘post-cookie’ changes that browser vendors are making – the core authentication technologies the scholarly community relies on for resource access and (increasingly) editorial workflows depend on these same features, e.g. IP addresses, third-party cookies, and ‘link decoration’ (the strings added to the end of URLs).
Browsers can’t distinguish between invasive tracking and the legitimate use of these features to manage identity and access. We’re already starting to see this impact users e.g. Safari’s new Private Relay feature assigns users a random IP address that protects their privacy while also ensuring they fail IP authentication …
The SeamlessAccess project has been tracking this issue since last year and published a great blog post with more detail on how this impacts access at https://seamlessaccess.org/posts/2021-07-06-browserchanges/
And NISO has a dedicated session on this topic at their upcoming conference (‘Access Apocalypse? Be Prepared for Anything’, 17 February), where I’ll be joined by Heather Flanagan and Jason Griffey to discuss what’s going to happen and what we can do about it. Join us to learn about this important issue that is increasingly going to impact access across our community … More at https://np22.niso.plus/Category/5483096b-26e6-481c-a651-6dce721cbc68.