At ITHAKA’s Next Wave conference, leaders from academic societies, libraries, publishers, and other organizations come together to focus on important strategic issues challenging higher education, specifically in the United States. Each year at this event, ITHAKA President, Kevin Guthrie, interviews someone who is at the leading edge, working to change organizations, systems or practices. On Dec 4, 2019, Kevin talked with Xiao-Li Meng, Professor of Statistics at Harvard University, about the increasingly central role data science is playing in research and teaching, the changes he sees on the horizon and is helping to foster, and the impetus behind the recently launched open access Harvard Data Science Review where he is Editor-in-Chief. The published interview below is based on notes taken during the discussion, compiled by my colleague Heidi McGregor, with edits from Kevin and Xiao-Li. A video of the interview is also available.
Guthrie: You’ve referred to data science as an ecosystem. What do you mean by that and why is that important?
Meng: I wrote that in Data Science: An Artificial Ecosystem, my first op-ed. I used the word “artificial” to signal that data science is both human-created and very intensively computer-based, like AI. Recently I attended an international dialog on AI. One speaker put a line on the screen that he said was used by a university to attract undergraduate students: “If you want to solve all the problems in the world, major in computer science.” My statistical ego got provoked immediately beyond six sigma. What about statistics, applied mathematics, operations research, and everything else?
What about the humanities?
Exactly. But then the speaker put up another line that the same university used for advertising their graduate school: “If you want to solve all the problems created by computer science, please enroll in a graduate program in the arts and sciences.” Whoever humored us with this very clever pairing really had a profound understanding of the complexity, enormity, and interconnectivity of data science. So many things we have today were created by the computer: without it we would not have the internet, and we would not have AI. But it has also created all kinds of problems that are all interconnected.
I talk about data science as an ecosystem for a couple of reasons, but the most important one is that the data science evolution is changing individual fields. Look at statistics, which has always been application-oriented, starting from agricultural experiments and genetics. Over 100 years, we have interacted with many other fields (e.g., econometrics, biostatistics, environmetrics, etc.). Throughout this long period of statistics being used in other fields, we statisticians all felt great about it. Then the data science revolution happened, and now not only do we feel great about it, we also feel very threatened as statisticians. Will we still be valuable as a profession 10, 20, 30 years from now? Because data science itself is going through profound change, if you don’t evolve with the ecosystem, you’ll go extinct.
In preparing for this conversation and talking with you, one of the things that stood out for me is that, unlike in other disciplinary fields, data is a content type, and in your ecosystem data science is like oxygen: everyone has or needs access to data. In founding the Harvard Data Science Review, your motto is “everything data science and data science for everyone.” Why do you say for everyone?
It’s almost a no-brainer. Look around at what data science has done and how all our lives are affected. When I say for everyone, I don’t mean just for all the research fields in universities. If you look around at what’s happening in industry, government, and NGOs, you will see how data science has become such a priority. Our second issue has an article by Nancy Potok, the Chief Statistician of the United States, looking at the impact of the Evidence Based Act, which requires three new positions in each major government agency: a chief data officer, a chief evaluation officer, and a senior statistical officer. That’s 72 positions across the 24 major agencies. Consider the impact these people will have on us. If we want to get data science right, and I am not shy to say that the intellectual mission of HDSR is to help define and share what data science is or should be, then we just have to include everyone, particularly on the pedagogical side.
People say everyone will have to have some data science understanding or capability in their education. Is that right and if so, how far in the future will that need to happen?
Yes, everyone should have some familiarity, but data science is not just about the traditional statistics, computer science, or STEM fields. If you think in that narrow way, you’d exclude a lot of people. We need to think about all the ways questions are being raised. There is now a whole area called “algorithm politics”, related to algorithm fairness, accountability, interoperability, accessibility, etc., and a lot of that is the job of legal scholars or philosophers. So there are lots of entry points to data science. The HDSR board literally ranges from legal scholars and philosophers all the way to quantum physicists. You don’t have to be the one doing the mathematics or statistics to be part of the data science community. It’s like wine: many people are good connoisseurs, and most of them don’t know the chemistry or how the wine is made, but they know enough to appreciate it.
So with respect to learning, you have written that curricular design is moving too slowly.
What I meant is that compared to the research progress on data science methods, the research on data science education is still in the age of the dinosaurs. Many universities are creating master’s-level data science programs. There are good reasons to start there, but show me a curriculum and I can tell you with good statistical confidence which species of dinosaur put its weight down on it, that is, whether it was created by statisticians, or computer scientists, or engineers, etc. That is OK for now, because things are happening too fast and everyone needs to contribute, and naturally we contribute in the ways we know how. But this is not a systematic, optimal data science education. The problem is that we all have a narrow view, myself included until I got involved with HDSR. Most of us think of data science as a new discipline, which for university administrations means a new department. But creating a new department of data science is the wrong way to go. You need to think of data science like science, like social science, like the humanities. You rarely hear of a “Department of Science” or “Department of Social Science”, but rather a School of Science or School of Social Science. So we need to ask how we will teach data science coherently, and what the right infrastructure to support it will be. I will give a shout out to the University of California at Berkeley for doing the right thing. They are creating a division of data science – it’s university wide, a new school – recognizing that data science permeates other fields and requires this scale. But overall we are really behind on infrastructure as well as on how to deliver pedagogically.
We’ve been doing research on data faculty in higher ed. One thing we hear is that the competition is intense. It’s hard to get and keep faculty and graduate students who are going to Google, Facebook, etc. These organizations have the large data sets and capacity. Are you concerned about brain drain from the academy to industry?
Like many of my colleagues, ten years ago I was very worried. Many of us looked at how many of our graduate students were going into universities to measure our value. And the students who didn’t follow suit, we labeled as pursuing “alternative careers”. I started to change my view when I became dean of a graduate school of arts and sciences. Leaving higher ed? Not so bad if they do. Think about it as part of the ecosystem. If you want to convince the public that higher ed is important, have your students out there in industry and government in leadership roles. For HDSR I need voices from industry and government, and I now know how to find them. Also, if lots of PhDs make lots of money, they can fund higher ed and create a stronger system long-term. I engage them and tell them: bring back the problems you want academics to work on. That’s part of the reason HDSR was established: to create a forum for all these voices and perspectives to come together.
You’re talking about change, in every which way. Change to how work is done. Change to how we organize. It’s just change, cubed. We have many librarians and publishers with us today. What’s the role of the library with respect to data science?
I have to say that after one and a half years working on HDSR, my respect for library science has gone up by orders of magnitude. Throughout human history, libraries have played the critical role of preserving, disseminating, and curating knowledge. And a lot of the things we do now are about data curation, data provenance, and the reproducibility and replicability of science. How do we get researchers to record what they do so that future scholars can use it, but in a way that is not burdensome to the researchers? Libraries have been working on these issues for centuries, and I certainly hope that we can all learn from library science. Data science is also changing library science itself, because of digital publications and open access, for example.
And your advice for those who lead societies or academic publishing organizations?
The first, most obvious thing is to hire someone who knows data. By knowing data, I don’t just mean someone who can program or analyze data. I mean someone who understands the enterprise of data, how data interact with people, and how we make decisions with evidence. Data is one form of convincing each other, like when our politicians present data. We rarely talk about how we got to the data, just as we don’t talk about how sausages are made, or about all the dirty assumptions behind it. If you hire someone who knows data, they will understand the nuance. In a way, all data science does is help us make a better educated guess using data.
At ITHAKA, we try to hire people who know data science. I think this is a problem disguised as a solution. We need you to make more of them.
Yes, that’s why I say we are so far behind on the education side!
What’s the one thing that most disturbs you about data science?
I think of the real estate slogan: location, location, location. What worries me and should worry you: selection, selection, selection. We select data and methods to prove our points, we select what to report, and journals obviously want to select the most impressive results to publish. Selection alone is not a problem. We always have to make selections; it’s how science moves forward. It becomes a problem when you report statistical significance or probability, which, simply put, is just how many things happened over how many things could have happened. The problem with selection is that it changes the denominator, so if you don’t take that into account, by making enough selections you can get almost any answer you want. Selection is not a problem when you can account for it in the reporting. But most of this is happening via machinery today, and it’s a black box. That’s why we read in the newspaper today that coffee is good for you, and tomorrow that coffee is bad for you. The research may even use the same dataset but come to different conclusions because the researchers select different pieces. This problem is not new, but what’s new is that we are now doing it at a massive scale, and hence the risk of false discovery is greatly increased.
Physical and biological sciences can be checked. Social sciences and policy making are much harder. When we can be misled by selection bias, it’s much harder to assess its negative consequences. Often it takes years, and by then it’s too late.
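Editor’s illustration (not part of the interview): the point about selection changing the denominator can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not anything presented at the event. It assumes a researcher runs several independent tests where no real effect exists, each at significance level alpha, and reports only the most impressive one. The function names, the 0.05 threshold, and the choice of 1, 5, and 20 tests are all hypothetical. Under these assumptions, the chance of at least one “significant” result is 1 - (1 - alpha)^m for m tests, and the simulation approximates the same quantity.

```python
import random


def chance_of_false_discovery(alpha: float, num_tests: int) -> float:
    """Probability that at least one of num_tests independent tests of a
    true null hypothesis falls below alpha (assumes independence)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_selected_result(alpha: float, num_tests: int,
                             trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated studies whose smallest p-value is below alpha.

    Under a true null hypothesis a p-value is uniform on [0, 1], so each
    test is modeled as one uniform draw. No real dataset is involved.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Selection step: keep only the most impressive of the tests tried.
        best_p = min(rng.random() for _ in range(num_tests))
        if best_p < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for m in (1, 5, 20):
        exact = chance_of_false_discovery(0.05, m)
        simulated = simulate_selected_result(0.05, m)
        print(f"{m:>2} tests: exact {exact:.3f}, simulated {simulated:.3f}")
```

With a single test the chance of a false discovery stays at alpha. With 20 selections and no correction it is roughly 64 percent under these assumptions, which is why reporting how many things were tried matters.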
So that’s your biggest worry. What’s the positive impact of data science? We are talking about this covering all our lives in society – tell us something good.
Professionally, I feel great, as being a statistician now gets a lot more respect and attention. People are excited about and interested in the work. More broadly, data science creates a new common language for all of us. It creates a platform for us to talk with each other about shared interests, and that’s good for humankind.
What’s your vision for Harvard Data Science Review in 10 years?
Ten is too slow. Five years! I want it to be the place for data science like Science or Nature for science, and New England Journal of Medicine for medical science. But we also want everyone to go there. We have research articles, perspectives articles, application articles, and education articles. We have a range of columns targeted at different types of readers, such as policy makers, industrial leaders, and K-12 students and their teachers and parents. We have pieces on AI, baseball, individualized medicine, policy, even Oscar movie predictions.
Yes, there is a fun one where you have machine learning algorithms predict whether a Beatles song is by Lennon or McCartney.
Following the interview, several attendees asked questions touching on peer-review, reproducibility, preservation, and a fundamental question: “Data has always been there in research. What’s really different?”
Meng concluded: Using data for research is definitely not new, but the ability to collect so many data and crack them is. The scale is very important. It’s creating a new form of investigation. Physics is a great example. It used to be there were three major approaches – theoretical, applied, simulation – but now, some physicists let the machine find patterns using algorithms and then try to understand if these patterns make sense. This is a new form of investigation, but it is also very dangerous, of course. As statisticians we always say: If you torture data enough, the data will confess. So then you need to know how much confession is real, and how much of it is just to get over the pain. So you have to be careful with this new scale and these approaches, but data is not a new phenomenon.
The complete interview, including attendee questions, can be viewed online.