The world’s growing interest in data science is undeniable. It is reflected in the increasing number of academic degree and certificate programs, the number of new jobs available for individuals with data science skills, and the rising migration of researchers with computational skills to the private sector. To channel our efforts, we focus on six major themes around which academic data science discussions coalesce:


How do we create and sustain long-term career trajectories for a new generation of scientists whose research depends crucially on the analysis of massive, noisy, and/or complex data and whose work may require substantial curation and/or development efforts? What about those scientists who focus entirely on building next-generation tools that others ultimately use to derive new science? These career paths are currently complicated by strong competition from industry, a tendency by universities to measure productivity using only traditional metrics (e.g., journal publication) that do not reflect the scholar’s full contribution, and a general failure to provide a sufficiently supportive and meaningful environment and culture.

We see persistent, complex, and deep-rooted challenges for the career paths of people whose skills, activities, and work patterns do not fit neatly into the roles and success metrics of traditional academia. Yet data science in research universities requires precisely the kind of complex, long-term interdisciplinary work, combined with methodological and engineering efforts, that leads to “low performance” under traditional metrics and to “slow progress” and “lack of fit” in existing career tracks.

Activities around this theme in the Data Science Environments include the following:

  • Defining positions and career tracks (e.g., data science faculty lines, data science fellows)
  • Promoting institutional policies that facilitate career development for data scientists
  • Building a recognizable data science community across scientific domains
  • Promoting mentorship and job satisfaction

Education and Training

Training in data science is needed at all levels in academic research, from the undergraduate student to the tenured faculty member. A major challenge to data science training is that it is inherently multidisciplinary, requiring training in computer science, statistics, applied mathematics, and one or more domain areas. Traditional university curricula do not provide this kind of broad data science training, and often data science courses are relegated to specific departments with no consideration of cross-disciplinary needs.

The Education Working Group is developing ways to use innovative teaching methods and formats to offer both formal and informal training in data science skills at undergraduate, graduate, and professional levels. Viewing education in its widest sense, the aim is to make data science technology and expertise far more accessible both within universities and beyond.

Tools and Software

Data-intensive science and data-driven discovery require an ecosystem of software tools that are openly available, easy to use, and translatable across domains. How can the development, hardening, sustaining, sharing, and integration of techniques into a reusable software infrastructure be recognized and incentivized? There are known obstacles to the development of software tools, including the fact that domain scientists are not specifically trained to develop and deliver the advanced software they require and that computer scientists and engineers have little incentive to harden, sustain, share, and integrate novel techniques into a reusable software infrastructure. We are working to remove these obstacles both within the Data Science Environments and in the broader academic community.

Specific goals around this theme include the following:

  • Deliver high-impact, usable, and generalizable software for science that does not “reinvent the wheel”
  • Promote open source software practices
  • Create, strengthen, and deepen connections between tool developers and investigators in data-driven domains
  • Promote “matchmaking” between methods/tools and substantive domains

Reproducibility and Open Science

The pace of knowledge acquisition in science is impeded by a lack of sharing of the research process in its entirety in addition to all of its outputs. This makes it difficult for researchers to build upon each other’s work and undoubtedly results in some “reinvention of the wheel.” Further, there is increasing scrutiny of publications reporting on “irreproducible results,” which can lead to misinformation and public distrust of science. As we transition to data-intensive scientific discovery, we have the opportunity to address these issues through software tools and practices that support the sharing, preservation, provenance tracking, and reproducibility of data, software, and scientific workflows.

As data-intensive scientific discovery becomes more common, reproducibility and open science become both more challenging and more important. Work around this theme includes identifying and promoting repositories for sharing data and workflows, developing techniques to query and analyze shared data and workflows that will facilitate reuse, and creating software tools to better support sharing and reproducibility.

Physical and Intellectual Space

Innovative data science is advanced at universities through the creation of high-quality physical spaces that successfully cultivate the “water cooler” effect (i.e., exchange and collaboration), raise the level of prestige for data science and scientists, and are adaptable to a range of activities that can promote data science research and learning. It is also essential that spaces be designed to bring together the data scientists who reside in academic units spread across our (large) campuses. This requires the spaces to be valuable as study, contemplation, development, and individual work areas.

Discussions around this theme address how to design common spaces that encourage collaboration both within and between universities in a setting where stakeholders wear multiple hats and often have their own labs to manage. What events are most successful at bringing in researchers from multiple domains? How can space provide both the means for collaboration and the proper setting for concentration on data-intensive science?

Data Science Studies

Data Science Studies is a multi-sited working group of cross-disciplinary researchers studying the sociocultural and organizational processes around the emerging practice of data science. Working within Science and Technology Studies, we use a variety of qualitative and computational methods to understand the changing scientific practices, knowledge infrastructures, and cultural values that are shaping the environment for data-intensive scientific research. This encompasses, but is not limited to, research on topics such as the cultural and institutional contexts for open science, open software, and open data; tool building and scientific workflow; new pedagogical models for data science education; how disciplines are adapting to the demands of data-intensive science; new epistemologies and social implications of data-intensive science; human-centered data science; and studies of sociotechnical data science ecosystems.
