The Data Science Revolution
The world is made up of data. Bits and bytes of information flow through nearly every aspect of our lives, from our internet habits to how our bodies function. But what does it mean to live in a world increasingly defined by data? Discover how Bucknellians are advancing the field of data science to better understand and unlock the potential of the data around us.
by Matt Jones
illustrations by Jon Krause
Through humanity’s scientific pursuit of knowledge, novel ways of seeing have been developed to reveal what was previously hidden from sight. The invention of assistive machines and technologies has given us the ability to view everything from the microscopic world of electrons and cells to the distant reaches of the universe. But it is perhaps the introduction of modern computing, and subsequently the internet, that has played the largest role in reconstructing how we see the world: as one made up of data.
In the same way that matter is the basic building block that undergirds the physical reality of the universe, data is the element that arranges matter into something meaningful to us: information that can be conveyed, interpreted and, with a little finessing, understood. While the concept of data is centuries old, the digital age has imbued the term with a new significance.
The advent of the electronic computer in the mid-20th century was soon followed by the arrival of data processing and data analysis, computer science, data mining and Big Data — disciplines and research paradigms that emerged in response to an existence more and more defined by bits and bytes of numerical code. The same can be said for data science, which is both a discipline unto itself and a reimagining of how to innovatively apply familiar methodologies and systems.
More importantly, data science is a way of seeing and thinking about the world, and a growing number of Bucknellians are already using data to build better businesses, extend lifespans, make the internet a safer place and even secure the future of the planet.
What Is Data Science?
“When you look from a distance, it allows you to see patterns you didn’t see before,” says Professor Song Chen, Chinese history.
“Data science is not the tools — it’s not AI; it’s not machine learning — it’s about how the tools are used to solve problems.”
First coined by literary historian and theorist Franco Moretti, distant reading applies computational methods to large bodies of literary data. Chen likens the difference between close reading and distant reading to the difference between viewing a location first from the ground and then from above, in an airplane. The bird’s-eye perspective gives the viewer the ability to read patterns within the landscape otherwise unobservable and untraceable from more immediate vantage points.
As a social historian, Chen finds distant reading a useful framework for prosopography, research that unveils patterns, connections and commonalities among individuals in a larger population.
“We’ll look at when they were born, the people that they studied with, the people they taught, how they entered government, what offices they held, where and from what time to what time, their marriage connections, their political connections, all sorts of stuff,” says Chen.
These connections can be used to illuminate how the relationships between individuals correspond to larger historical transformations across social, legal, political and economic institutions.
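For readers curious what that looks like in practice, here is a minimal sketch in Python using the networkx library. The scholars, teachers and marriage ties below are invented placeholders, not records from Chen's database; the point is only how individual biographies become a network whose structure can be read "from a distance."

```python
# A toy prosopography sketch: build a social network from biographical
# records and look for structural patterns across the whole group.
# The records below are invented placeholders, not historical data.
import networkx as nx

records = [
    {"person": "Scholar A", "teacher": "Master X", "in_laws": ["Scholar B"]},
    {"person": "Scholar B", "teacher": "Master X", "in_laws": ["Scholar C"]},
    {"person": "Scholar C", "teacher": "Master Y", "in_laws": []},
]

G = nx.Graph()
for r in records:
    G.add_edge(r["person"], r["teacher"], tie="studied_with")
    for relative in r["in_laws"]:
        G.add_edge(r["person"], relative, tie="marriage")

# Centrality scores hint at which figures knit the network together,
# the kind of pattern invisible when reading one biography at a time.
for person, score in sorted(nx.degree_centrality(G).items(),
                            key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```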
While distant reading was first developed to further literary studies, it’s a way of seeing that can benefit many fields of study, including environmental science and geography, medicine and health care, and marketing and merchandising.
Protecting the Planet
Data that accurately quantifies the number of trees in a given area is especially useful for helping government agencies and private-sector stakeholders decide how to assess and implement protection and restoration efforts.
Ertel’s work is housed primarily within a field of artificial intelligence known as computer vision, which uses machine learning algorithms to recognize and analyze visual imagery. Put succinctly: She teaches computers how to see trees. To do that, she relies on massive visual datasets of high-resolution images gathered by the European Space Agency’s Copernicus Sentinel-2 imaging mission. These kinds of bird’s-eye-view images can give researchers a general understanding of the state of the world’s forests, but Ertel’s restoration work in non-forest landscapes requires an even greater level of precision and detail.
“Being able to count trees from satellite imagery is really complicated because you can imagine trees often have canopies that overlap, and so being able to detect individual trees is a really difficult exercise from just the imagery itself,” she says.
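A simplified sketch of the problem Ertel describes appears below: compute a vegetation index (NDVI) from the red and near-infrared bands of a scene, threshold it and count connected canopy patches. The band arrays and threshold here are synthetic stand-ins, not Sentinel-2 data, and real pipelines rely on trained computer-vision models; the sketch only shows why overlapping canopies are hard to separate.

```python
# Toy canopy counting from two synthetic "satellite" bands.
# Real work on Sentinel-2 imagery uses trained computer-vision models;
# this only illustrates why overlapping canopies are hard to count.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
red = rng.uniform(0.10, 0.15, size=(100, 100))   # stand-in red band
nir = rng.uniform(0.05, 0.10, size=(100, 100))   # stand-in near-infrared band
nir[20:35, 20:35] = 0.6                          # fake canopy 1
nir[30:45, 28:43] = 0.6                          # fake canopy 2 (overlaps canopy 1)

ndvi = (nir - red) / (nir + red + 1e-9)          # vegetation index
canopy_mask = ndvi > 0.4                         # crude "is vegetation" threshold
labels, n_patches = ndimage.label(canopy_mask)   # connected canopy patches

# The two overlapping canopies collapse into a single patch, so a naive
# count underestimates the number of trees.
print(f"detected canopy patches: {n_patches}")
```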
Redefining Medicine
“We’ve been able to take the field of computer vision to look at slides of people’s tumors and start digging into what’s happening within the tumor itself and the area surrounding the tumor,” says Justin Johnson ’01, executive director of data science at AstraZeneca.
Johnson’s team is primarily concerned with leveraging data science to develop transformative medicines that lead to better patient treatment outcomes, with a vision of eventually eliminating disease. One of the first steps in creating a picture of an oncology patient lies in medical imaging, such as MRI and CT scans, which produce images of the body’s interior that clinical professionals can then analyze to develop a plan for medical intervention.
Training models on Big Data annotated by pathologists can improve the scalability, reproducibility and efficiency of our ability to analyze images using AI. This can help to detect subtle patterns earlier, reduce human bias and alleviate the workload of pathologists so they can focus on more complex cases.
The goal then is to transfer the highly specialized training of doctors, who can interpret the visual data of medical images, to an AI that can do the same thing repeatedly. The accuracy of these AI models relies heavily on the right data, the right oversight, and the ability to explain what the AI is doing to build trust and ensure these models are used correctly.
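As an illustration of that training step, the sketch below fits a classifier to pathologist-labeled examples and checks it on held-out cases. The synthetic feature table stands in for the image features a real digital-pathology model would learn directly from slide tiles, and the labels are invented, not clinical data.

```python
# Toy version of "learn from pathologist-annotated examples."
# In practice, models learn features from whole-slide image tiles;
# here, synthetic numeric features stand in for those image features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
n = 500
features = rng.normal(size=(n, 8))                           # stand-in tile features
labels = (features[:, 0] + features[:, 1] > 0).astype(int)   # 1 = "tumor", 0 = "normal" (synthetic)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out performance is one small piece of the oversight and
# explainability the article mentions; real diagnostic models
# require far more extensive validation.
print(classification_report(y_test, model.predict(X_test)))
```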
In Johnson’s case, he and his team use multimodal data from different sources to construct the most comprehensive picture of a patient, improving disease understanding and allowing for more refined ways to model and design patient-centric clinical trials.
“Everything from clinical trial outcomes to radiomics and digital pathology data to their genomic data — we’re trying to get this data organized so that we can start building models to understand different outcomes,” says Johnson. “The more high-quality data you have, the more you can refine your model. The better you can understand the disease, the better you can streamline trials and treatment.”
Historically, methods of drug discovery relied on clinical trials in which one group of participants was given a new drug while the control group was either given a placebo or placed under a standard treatment. The most obvious drawback of this model is that the control group doesn’t receive potentially critical medical treatment, a fact that often serves as a barrier to participant recruitment.
Where data science offers new possibilities for improving patient outcomes is in the use of synthetic control groups in clinical trials. Instead of patients being assigned to the control group, AI models can use real-world data collected from a wide range of other sources, such as electronic health records, disease registries and historical clinical trial data, to create a synthetic control group.
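One simple way to picture building a synthetic control arm is covariate matching: for each trial patient, find the most similar patient in the real-world data. The sketch below does nearest-neighbor matching on two invented covariates; actual synthetic control methods involve far more rigorous statistical adjustment and regulatory oversight.

```python
# Toy "synthetic control arm" via nearest-neighbor matching on covariates.
# The patient tables are invented; real approaches require much more
# rigorous adjustment and governance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
# Columns: age, baseline biomarker (both synthetic).
trial_patients = rng.normal(loc=[60, 1.0], scale=[8, 0.3], size=(50, 2))
real_world_patients = rng.normal(loc=[62, 1.1], scale=[10, 0.4], size=(5000, 2))

matcher = NearestNeighbors(n_neighbors=1).fit(real_world_patients)
_, matched_idx = matcher.kneighbors(trial_patients)

# Each trial patient is paired with the most similar historical patient,
# forming a comparison group without withholding treatment from anyone.
synthetic_control = real_world_patients[matched_idx.ravel()]
print(synthetic_control.shape)  # (50, 2)
```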
“We can use AI to generate a synthetic control arm based on data already available in the public domain, or on what we have internally, so that patients can be matched to the therapy best suited for them. We can really start pushing people toward therapy that will be of immediate benefit,” says Johnson. “Data is the lifeblood of what we’re doing. I think it’s the oil of the health care industry right now.”
Comparing oil and data offers an interesting perspective. Both are resources that can be refined, but the similarities mostly end there. Unlike crude oil, data lacks uniformity and consistent value. While data can be processed to extract insights and knowledge, those are concepts, not tangible products. Instead, what matters with data is how those insights are applied to solve specific problems in different contexts.
The Business of Data
“Data without context means nothing,” says Martin Gilliard ’99, co-founder and CEO of Arteli, a predictive analytics platform that uses artificial intelligence and machine learning to optimize physical retail spaces, which includes everything from furniture to apparel stores. “Tons of people have data, but if they don’t understand how to apply it, then it adds no value.”
Gilliard’s work is primarily concerned with helping physical stores understand and apply data insights to optimize operations and increase revenue. To do this, he uses predictive analytics, a combination of machine learning and statistical analysis, to uncover patterns within large datasets that can be leveraged to make predictions about future behaviors.
“We start with what products should actually be in the store. And what that means is not just what the consumers want, but where margin is made. We help them understand how pricing should be done, inventory quantity, inventory replenishment and how the store should be designed,” says Gilliard. “We don’t impact the number of people that walk in the store, but the economics of what happens once they’re inside.”
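A bare-bones sketch of that kind of predictive analytics is below: fit a model on past sales, price and a seasonal signal, then forecast demand at candidate prices. The data and features are invented, and a production system such as Arteli's would draw on far more signals than this toy example.

```python
# Toy demand forecast: predict weekly units sold from price and seasonality.
# All numbers are synthetic; this only illustrates the pattern-to-prediction idea.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
weeks = np.arange(104)                                   # two years of weekly history
price = 20 + rng.normal(0, 1.5, size=weeks.size)         # stand-in price history
season = np.sin(2 * np.pi * weeks / 52)                  # yearly seasonality signal
units = 200 - 4 * price + 30 * season + rng.normal(0, 5, size=weeks.size)

X = np.column_stack([price, season])
model = GradientBoostingRegressor(random_state=0).fit(X, units)

# Forecast next week's demand (week 104) at two candidate prices.
next_season = np.sin(2 * np.pi * 104 / 52)
for candidate_price in (19.0, 22.0):
    forecast = model.predict([[candidate_price, next_season]])[0]
    print(f"price ${candidate_price:.2f} -> about {forecast:.0f} units")
```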
“In a physical store, you’re dealing with things that historically have never been digital and things that have never been measured,” says Gilliard. “So the opportunity to innovate is huge, but the complexity is also huge, which is probably why most companies haven’t even jumped into it.”
It’s not just data about product inventory, purchase history and pricing that Gilliard has to consider. There are also macroeconomic factors such as inflation, the impacts of which are often unequally distributed across various geographic locations, that shape how consumers interact with products. Ideally, predictive analytics can examine the relationship between these variables to discern patterns that can be used to inform businesses about how consumers will respond to new conditions and future trends based on past experiences.
“Where I think the biggest innovation will happen is actually in the operations of the business,” Gilliard says, noting that businesses can use data to optimize all steps of the supply chain. “There’s so much within the operational part of retail that AI will be able to supplement and do a lot faster and better in the future.”
User Beware
“Data is never perfect,” says Moores. “In fact, much of the data that is indiscriminately collected is likely to be bad.”
Bad data can mean a number of things, though the term typically refers to data that is inaccurate, fraudulent or biased in some way. A model trained on flawed data can generate faulty insights that lead to misinformed decisions, a principle more succinctly captured with the maxim “garbage in, garbage out.”
“One of the fundamental tasks data scientists face is they have all this noisy data, and they need some way to find patterns out of it and develop measures from it,” says Colin Henry ’11, a post-doctoral research fellow in the Program on Extremism at George Washington University.
As a former Data Science for Social Good Fellow at Vanderbilt University who studies online extremism and hate speech, Henry primarily works with text from the internet. Specifically, he uses a massive dataset of more than 200 million individual posts from platforms such as YouTube, Facebook, Telegram, Gab, 4chan and Parler to train artificial intelligence models to detect and identify hate speech.
The primary challenge of teaching an AI model how to recognize hate speech lies in the fact that the already murky boundaries of the concept get further eroded within the vast network of online communities that populate the internet. For instance, while certain words can often be categorized as hate speech, the contexts in which they are deployed are harder to define, such as in instances when users engage in counterspeech or seek to reclaim terms that have been co-opted by hate groups.
“It’s a really nuanced process,” says Henry, one that is guided by machine learning, a subset of AI that uses algorithms to teach computers how to extract patterns from collected data to improve specific tasks. Henry’s research uses supervised learning, a form of machine learning that relies on human intervention to identify and label important data to better train machines. For instance, online comments deemed hateful are classified into specific categories, including hate based on religion, sex and gender, race, ethnicity, nationality and antisemitism. “We go through and mark these individually. We’ve had people doing this for a couple years, so we have close to 10,000 hand-coded elements. Once you have that, then you can train all kinds of models to do this classification for you.”
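A minimal sketch of that supervised-learning workflow: hand-labeled posts train a text classifier, which can then label new posts automatically. The handful of posts and labels below are invented placeholders; Henry's team works with roughly 10,000 hand-coded examples and far more capable models.

```python
# Toy supervised text classification in the spirit of the workflow described:
# humans label examples, a model learns to reproduce those labels at scale.
# The example posts and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "this group should not be allowed in our country",
    "had a great time at the community picnic today",
    "people like that are ruining everything",
    "the new library hours are really convenient",
]
labels = ["hateful", "not_hateful", "hateful", "not_hateful"]  # hand-coded by annotators

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(posts, labels)

# Once trained on thousands of hand-labeled posts, the model can
# classify new posts far faster than human annotators could.
print(classifier.predict(["everyone is welcome at the town meeting"]))
```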
The ability to detect hate speech with AI models, in concert with an analysis of social media platform terms of service agreements, has allowed Henry to derive insights about the connections between specific categories of hate language and the potential for users to be deplatformed from an online community.
One thing that becomes clear when conducting a broad survey of data science is that the discipline is still taking shape, in part because it is growing — at an exponential rate no less — at the same time that its practitioners are attempting to define the broad contours of its capabilities and limitations. However, it is precisely this indeterminacy that signals the field’s potential and guarantees that Bucknellians will continue to be at the forefront of an evolving, dynamic field.