Every once in a while, I either receive or am forwarded an email solicitation from a new journal or book program. Sometimes, the email solicitation even makes a stab — often clumsy — at a personal greeting, an approach that can backfire and reveal the deeply impersonal nature of the communication (Dear Kanderson or similar is likely to appear).

I’m often left wondering, “Where did they get my email address?”

This turns out to be an interesting question, and journal publishers may be a major if unwitting supplier.

Academic email addresses are easy to scrape. University sites list them next to names. Rosters of professional meetings often publish them in attendee lists. But journal articles post them openly and prominently for corresponding authors.

Some journal publishers have wised up, and are hiding their authors’ email addresses behind layers of quasi-security, but even these tricks can be overcome with relative ease.

Getting email addresses is simple — scrapers go after the HTML of sites, so if the email addresses are obtainable in the HTML, they’re easily scraped. The most distinctive part of an email address, of course, is the “@” sign. Some publishers try to obfuscate email addresses by wrapping them in some HTML spans and getting rid of the @ sign. In these cases, the markup might look like this:

E-mail: <span><span class="em-addr">john.doe{at}foo.com</span></span>

Sites that take this approach can have a JavaScript function that replaces the {at} with “@” for humans to see and then adds a mailto: link so users can click on it, all of which happens when the page is rendered in the browser. But the HTML source doesn’t have any “@” symbols in it, and that’s the attempt at security.

This approach makes it harder for crawlers programmed to look for “@” to get an email address. However, once a spammer knows the substitution pattern, they can modify their scripts accordingly. Still, each approach like this probably stops some email harvesters from getting to the addresses.
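
To make the weakness concrete, here is a minimal Python sketch. The two patterns are my own simplified assumptions, not what any particular harvester actually runs, and they don't cover every valid email address. The first only recognizes a literal “@”; the second is the same pattern adjusted to accept the {at} placeholder as well.

```python
import re

# Markup like the example above: the address is there, but "{at}" stands in for "@".
html = 'E-mail: <span><span class="em-addr">john.doe{at}foo.com</span></span>'

# Assumed naive pattern: only recognizes addresses containing a literal "@".
NAIVE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Assumed adapted pattern: accepts either "@" or the "{at}" placeholder.
ADAPTED = re.compile(r"([\w.+-]+)(?:@|\{at\})([\w-]+(?:\.[\w-]+)+)")


def find_naive(text):
    """Return addresses found by the '@'-only pattern."""
    return NAIVE.findall(text)


def find_adapted(text):
    """Return addresses found by the adapted pattern, normalized to use '@'."""
    return [f"{local}@{domain}" for local, domain in ADAPTED.findall(text)]


print(find_naive(html))    # []
print(find_adapted(html))  # ['john.doe@foo.com']
```

The change amounts to a single alternation in the pattern, which is why this kind of substitution only deters harvesters that never look at the site's markup.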

Another approach to keeping users’ email addresses from harvesters is to create proxy email addresses or mailboxes, so that “jane.doe@foo.com” becomes “customersupport@foo.com”. This way, individuals aren’t spammed, and harvesters may choose to drop those emails from their indexes because of their generic nature. Some sites make their users fill out contact forms to get in touch, routing the email internally and not exposing any email addresses to the outside world.
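
The contact-form idea can be sketched in a few lines of Python. Everything named here (the `PRIVATE_ADDRESSES` table, the `/contact` URL, the `author-001` ID, and both functions) is hypothetical and for illustration only. The point is that the page exposes an opaque ID, and the real address is looked up only on the server.

```python
# Hypothetical server-side table; these addresses never appear in page HTML.
PRIVATE_ADDRESSES = {"author-001": "jane.doe@foo.com"}


def render_contact_link(author_id):
    """Return the only thing the public page exposes: a form link keyed by an opaque ID."""
    return f'<a href="/contact?to={author_id}">Contact the author</a>'


def route_message(author_id, sender, body):
    """Resolve the opaque ID to the real address on the server and build the outgoing message."""
    recipient = PRIVATE_ADDRESSES.get(author_id)
    if recipient is None:
        raise KeyError(f"unknown author id: {author_id}")
    return {"to": recipient, "reply_to": sender, "body": body}


link = render_contact_link("author-001")
message = route_message("author-001", "reader@example.org", "A question about your paper")
```

A scraper reading the page source sees only the link, which contains no address to harvest. The author still receives the message and can decide whether to reply.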

But scholarly publishers are pretty free with their publication of email addresses. In a quick survey of some journals that publish thousands of articles in total, email addresses were clearly available both visually and in the page source (HTML). For scrapers, getting email addresses from journals is a walk in the park.

One firm selling these addresses is SciData, which substitutes the “@” for “a” in its logo, indicating that email is @ the heart of their business. I won’t use that here because it’s just too visually annoying. Jeffrey Beall recently wrote about SciData on his Scholarly OA blog.

SciData offers a sample of its file of 8.1 million entries covering more than 30 bioscience and medical science names. The sample suggests that journals are the source for a lot of the data — after all, article title and abstract are two of the eight fields offered. If SciData were scraping just academic sites, I doubt they’d reliably get those two items in their database.

Of course, if you’re an academic, email addresses can be used for even less savory reasons, perhaps even to provide a deceitful author with a mask to wear in front of editors. Two recent retractions described on Retraction Watch reveal how email addresses can be abused to get fraudulent papers through.

One Chinese researcher was caught fraudulently suggesting peer reviewers for his paper — the editors became suspicious when the email addresses for two of these were updated to Chinese accounts within minutes of each other, even though the suggested reviewers weren’t in China and lived on the other side of the world from one another. In another amazingly brazen case of deception involving email (but not poached emails):

1) Dr. Wei Jia Kong’s name was used as the corresponding author without his knowledge or consent. Furthermore, Dr. Cai Pengcheng was unaware of his status as an author of the manuscript.

2) A fake email address for Dr. Kong was constructed and used by the authors to intercept any information that would be sent to the corresponding author.

ORCID includes an email address as part of its record. Perhaps one of the benefits of that initiative will be an end to the days of openly posting email addresses in the HTML of scholarly articles.

Email addresses have a domain associated with them, and this is akin to the branding of the email and the person sending it. I would never think of sending a professional email to a peer using a gmail.com or verizon.net or comcast.net account. Poaching another person’s email address, posing as them, and leveraging their online brand is a form of identity theft.

Even though there are many new ways to communicate, email addresses remain our main way of communicating in writing, transmitting files, and maintaining our online brand. Journals should think carefully about how they handle email addresses. Publishers may be inadvertently feeding the email address harvesters. And they may want to make sure that the person on the other end of that email is who they claim to be.

(Hat tip to DB for the advice about <span> and {at} for managing email addresses in HTML.)

Kent Anderson

Kent Anderson is the CEO of RedLink and RedLink Network, a past-President of SSP, and the founder of the Scholarly Kitchen. He has worked as Publisher at AAAS/Science, CEO/Publisher of JBJS, Inc., a publishing executive at the Massachusetts Medical Society, Publishing Director of the New England Journal of Medicine, and Director of Medical Journals at the American Academy of Pediatrics. Opinions on social media or blogs are his own.

Discussion

6 Thoughts on "How Did They Get My Email Address? The Accidental and Plentiful Supply of Academic and Scientific Contacts"

Re: your last comment, some of my colleagues use gmail addresses as their primary account, due to dissatisfaction with the IT departments at their home institutions. If anybody is ignoring their emails because of that, they certainly haven’t noticed.

If you are an academic, chances are your email is all over the place. Mine you can get from both my department web sites, the university directory, published reports on the university web sites, university committee minutes published on the web etc.

Yes, it is a problem and generates tons of spam, including one or two requests a day to submit papers and/or review for conferences I’ve never heard of. Web-based scams abound, but anyone who has some form of web presence is at risk. Fortunately, spam filters work great and sop up most of it while rarely flagging legitimate mail.

Publishing author emails is important for the free exchange of information. I contact authors concerning their papers and have had plenty of people contact me concerning my papers for very legitimate reasons. If a publisher doesn’t want to give out author emails, all they have to do is set up a form for the purpose and pass on the information to the author without giving out their email address. Some publishers do that.

I, like I expect others do, also harvest data from publishing web sites for legitimate research purposes. I think I provided some data on PLoS One’s time from submission to acceptance to publication in response to a post you did 6 or 8 months ago, all harvested from the PLoS web site.

I am currently harvesting author names, emails and article titles from PLoS One and several other “mega journals” in order to conduct a survey of the authors concerning their motivations to publish in these journals. It has already received human subject approval.

My point is while it can be problematic, it is certainly not the only way faculty get their emails culled and publishing author contact information has some very legitimate uses as well.

I consider publishing in our journal to be a public act, equivalent to standing in Lafayette Square or Hyde Park and shouting from a soapbox. Unless an author has a need to protect their identity (e.g. political repression), their identity and email should be public.

Emails are harvested all kinds of ways and are almost impossible to hide. The only way to minimize spam is to use a good filter.

‘Every once in a while…’? Every day. Plus several conference-spam emails for those mega-conferences in Orlando and Dubai. Plus several dozen for scientific products I have no use for.
