The Data Science Revolution
The world is made up of data. Bits and bytes of information flow through nearly every aspect of our lives, from our internet habits to how our bodies function. But what does it mean to live in a world increasingly defined by data? Discover how Bucknellians are advancing the field of data science to better understand and unlock the potential of the data around us.
by Matt Jones
illustrations by Jon Krause
Through humanity’s scientific pursuit of knowledge, novel ways of seeing have been developed to reveal what was previously hidden from sight. The invention of assistive machines and technologies has given us the ability to view everything from the microscopic world of electrons and cells to the distant reaches of the universe. But it is perhaps the introduction of modern computing, and subsequently the internet, that has played the largest role in reconstructing how we see the world: as one made up of data.
In the same way that matter is the basic building block that undergirds the physical reality of the universe, data is the element that arranges matter into something meaningful to us: information that can be conveyed, interpreted and, with a little finessing, understood. While the concept of data is centuries old, the digital age has imbued the term with a new significance.
The advent of the electronic computer in the mid-20th century was soon followed by the arrival of data processing and data analysis, computer science, data mining and Big Data — disciplines and research paradigms that emerged in response to an existence more and more defined by bits and bytes of numerical code. The same can be said for data science, which is both a discipline unto itself and a reimagining of how to innovatively apply familiar methodologies and systems.
More importantly, data science is a way of seeing and thinking about the world, and a growing number of Bucknellians are already using data to build better businesses, extend lifespans, make the internet a safer place and even secure the future of the planet.
What Is Data Science?
“When you look from a distance, it allows you to see patterns you didn’t see before,” says Professor Song Chen, Chinese history.
“Data science is not the tools — it’s not AI; it’s not machine learning — it’s about how the tools are used to solve problems.”
First coined by literary historian and theorist Franco Moretti, distant reading applies computational methods to large bodies of literary data. Chen likens the difference between close reading and distant reading to the difference between viewing a location first from the ground and then from above, in an airplane. The bird’s-eye perspective gives the viewer the ability to read patterns within the landscape otherwise unobservable and untraceable from more immediate vantage points.
As a social historian, Chen finds distant reading a useful framework for prosopography, research that unveils patterns, connections and commonalities among individuals in a larger population.
“We’ll look at when they were born, the people that they studied with, the people they taught, how they entered government, what offices they held, where and from what time to what time, their marriage connections, their political connections, all sorts of stuff,” says Chen.
These connections can be used to illuminate how the relationships between individuals correspond to larger historical transformations across social, legal, political and economic institutions.
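For readers curious what that looks like in practice, here is a minimal sketch in Python using the networkx library. The scholars, teachers and marriage ties below are invented placeholders, not records from Chen's database; the point is only how individual biographies become a network whose structure can be read "from a distance."

```python
# A toy prosopography sketch: build a social network from biographical
# records and look for structural patterns across the whole group.
# The records below are invented placeholders, not historical data.
import networkx as nx

records = [
    {"person": "Scholar A", "teacher": "Master X", "in_laws": ["Scholar B"]},
    {"person": "Scholar B", "teacher": "Master X", "in_laws": ["Scholar C"]},
    {"person": "Scholar C", "teacher": "Master Y", "in_laws": []},
]

G = nx.Graph()
for r in records:
    G.add_edge(r["person"], r["teacher"], tie="studied_with")
    for relative in r["in_laws"]:
        G.add_edge(r["person"], relative, tie="marriage")

# Centrality scores hint at which figures knit the network together,
# the kind of pattern invisible when reading one biography at a time.
for person, score in sorted(nx.degree_centrality(G).items(),
                            key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```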
While distant reading was first developed to further literary studies, it’s a way of seeing that can benefit many fields of study, including environmental science and geography, medicine and health care, and marketing and merchandising.
Protecting the Planet
Data that accurately quantifies the number of trees in a given area is especially useful for helping government agencies and private-sector stakeholders decide how to assess and implement protection and restoration efforts.
Ertel’s work is housed primarily within a field of artificial intelligence known as computer vision, which uses machine learning algorithms to recognize and analyze visual imagery. Put succinctly: She teaches computers how to see trees. To do that, she relies on massive visual datasets of high-resolution images gathered by the European Space Agency’s Copernicus Sentinel-2 imaging mission. These kinds of bird’s-eye-view images can give researchers a general understanding of the state of the world’s forests, but Ertel’s restoration work in non-forest landscapes requires an even greater level of precision and detail.
“Being able to count trees from satellite imagery is really complicated because you can imagine trees often have canopies that overlap, and so being able to detect individual trees is a really difficult exercise from just the imagery itself,” she says.
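A simplified sketch of the problem Ertel describes appears below: compute a vegetation index (NDVI) from the red and near-infrared bands of a scene, threshold it and count connected canopy patches. The band arrays and threshold here are synthetic stand-ins, not Sentinel-2 data, and real pipelines rely on trained computer-vision models; the sketch only shows why overlapping canopies are hard to separate.

```python
# Toy canopy counting from two synthetic "satellite" bands.
# Real work on Sentinel-2 imagery uses trained computer-vision models;
# this only illustrates why overlapping canopies are hard to count.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
red = rng.uniform(0.10, 0.15, size=(100, 100))   # stand-in red band
nir = rng.uniform(0.05, 0.10, size=(100, 100))   # stand-in near-infrared band
nir[20:35, 20:35] = 0.6                          # fake canopy 1
nir[30:45, 28:43] = 0.6                          # fake canopy 2 (overlaps canopy 1)

ndvi = (nir - red) / (nir + red + 1e-9)          # vegetation index
canopy_mask = ndvi > 0.4                         # crude "is vegetation" threshold
labels, n_patches = ndimage.label(canopy_mask)   # connected canopy patches

# The two overlapping canopies collapse into a single patch, so a naive
# count underestimates the number of trees.
print(f"detected canopy patches: {n_patches}")
```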
Redefining Medicine
“We’ve been able to take the field of computer vision to look at slides of people’s tumors and start digging into what’s happening within the tumor itself and the area surrounding the tumor,” says Justin Johnson ’01, executive director of data science at AstraZeneca.
Johnson’s team is primarily concerned with leveraging data science to develop transformative medicines that lead to better patient treatment outcomes, with a vision of eventually eliminating disease. One of the first steps in creating a picture of an oncology patient lies in medical imaging, such as MRI and CT scans, which produce images of the body’s interior that clinical professionals can then analyze to develop a plan for medical intervention.
Training models on Big Data annotated by pathologists can improve the scalability, reproducibility and efficiency of our ability to analyze images using AI. This can help to detect subtle patterns earlier, reduce human bias and alleviate the workload of pathologists so they can focus on more complex cases.
The goal then is to transfer the highly specialized training of doctors, who can interpret the visual data of medical images, to an AI that can do the same thing repeatedly. The accuracy of these AI models relies heavily on the right data, the right oversight, and the ability to explain what the AI is doing to build trust and ensure these models are used correctly.
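As an illustration of that training step, the sketch below fits a classifier to pathologist-labeled examples and checks it on held-out cases. The synthetic feature table stands in for the image features a real digital-pathology model would learn directly from slide tiles, and the labels are invented, not clinical data.

```python
# Toy version of "learn from pathologist-annotated examples."
# In practice, models learn features from whole-slide image tiles;
# here, synthetic numeric features stand in for those image features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
n = 500
features = rng.normal(size=(n, 8))                           # stand-in tile features
labels = (features[:, 0] + features[:, 1] > 0).astype(int)   # 1 = "tumor", 0 = "normal" (synthetic)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out performance is one small piece of the oversight and
# explainability the article mentions; real diagnostic models
# require far more extensive validation.
print(classification_report(y_test, model.predict(X_test)))
```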
In Johnson’s case, he and his team use multimodal data from different sources to construct the most comprehensive picture of a patient, improving disease understanding and allowing for more refined ways to model and design patient-centric clinical trials.
“Everything from clinical trial outcomes to radiomics and digital pathology data to their genomic data — we’re trying to get this data organized so that we can start building models to understand different outcomes,” says Johnson. “The more high-quality data you have, the more you can refine your model. The better you can understand the disease, the better you can streamline trials and treatment.”
Historically, methods of drug discovery relied on clinical trials in which one group of participants was given a new drug while the control group was either given a placebo or placed under a standard treatment. The most obvious drawback of this model is that the control group doesn’t receive potentially critical medical treatment, a fact that often serves as a barrier to participant recruitment.
Where data science offers new possibilities for improving patient outcomes is in the use of synthetic control groups in clinical trials. Instead of patients being assigned to the control group, AI models can use real-world data collected from a wide range of other sources, such as electronic health records, disease registries and historical clinical trial data, to create a synthetic control group.
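One simple way to picture building a synthetic control arm is covariate matching: for each trial patient, find the most similar patient in the real-world data. The sketch below does nearest-neighbor matching on two invented covariates; actual synthetic control methods involve far more rigorous statistical adjustment and regulatory oversight.

```python
# Toy "synthetic control arm" via nearest-neighbor matching on covariates.
# The patient tables are invented; real approaches require much more
# rigorous adjustment and governance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
# Columns: age, baseline biomarker (both synthetic).
trial_patients = rng.normal(loc=[60, 1.0], scale=[8, 0.3], size=(50, 2))
real_world_patients = rng.normal(loc=[62, 1.1], scale=[10, 0.4], size=(5000, 2))

matcher = NearestNeighbors(n_neighbors=1).fit(real_world_patients)
_, matched_idx = matcher.kneighbors(trial_patients)

# Each trial patient is paired with the most similar historical patient,
# forming a comparison group without withholding treatment from anyone.
synthetic_control = real_world_patients[matched_idx.ravel()]
print(synthetic_control.shape)  # (50, 2)
```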
“We can use AI to generate a synthetic control arm based on data already available in the public domain, or on what we have internally, so that patients can be matched to the therapy best suited for them. We can really start pushing people toward therapy that will be of immediate benefit,” says Johnson. “Data is the lifeblood of what we’re doing. I think it’s the oil of the health care industry right now.”
Comparing oil and data offers an interesting perspective. Both are resources that can be refined, but the similarities mostly end there. Unlike crude oil, data lacks uniformity and consistent value. While data can be processed to extract insights and knowledge, those are concepts, not tangible products. Instead, what matters with data is how those insights are applied to solve specific problems in different contexts.
The Business of Data
“Data without context means nothing,” says Martin Gilliard ’99, co-founder and CEO of Arteli, a predictive analytics platform that uses artificial intelligence and machine learning to optimize physical retail spaces, which includes everything from furniture to apparel stores. “Tons of people have data, but if they don’t understand how to apply it, then it adds no value.”
Gilliard’s work is primarily concerned with helping physical stores understand and apply data insights to optimize operations and increase revenue. To do this, he uses predictive analytics, a combination of machine learning and statistical analysis, to uncover patterns within large datasets that can be leveraged to make predictions about future behaviors.
“We start with what products should actually be in the store. And what that means is not just what the consumers want, but where margin is made. We help them understand how pricing should be done, inventory quantity, inventory replenishment and how the store should be designed,” says Gilliard. “We don’t impact the number of people that walk in the store, but the economics of what happens once they’re inside.”
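A bare-bones sketch of that kind of predictive analytics is below: fit a model on past sales, price and a seasonal signal, then forecast demand at candidate prices. The data and features are invented, and a production system such as Arteli's would draw on far more signals than this toy example.

```python
# Toy demand forecast: predict weekly units sold from price and seasonality.
# All numbers are synthetic; this only illustrates the pattern-to-prediction idea.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
weeks = np.arange(104)                                   # two years of weekly history
price = 20 + rng.normal(0, 1.5, size=weeks.size)         # stand-in price history
season = np.sin(2 * np.pi * weeks / 52)                  # yearly seasonality signal
units = 200 - 4 * price + 30 * season + rng.normal(0, 5, size=weeks.size)

X = np.column_stack([price, season])
model = GradientBoostingRegressor(random_state=0).fit(X, units)

# Forecast next week's demand (week 104) at two candidate prices.
next_season = np.sin(2 * np.pi * 104 / 52)
for candidate_price in (19.0, 22.0):
    forecast = model.predict([[candidate_price, next_season]])[0]
    print(f"price ${candidate_price:.2f} -> about {forecast:.0f} units")
```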
“In a physical store, you’re dealing with things that historically have never been digital and things that have never been measured,” says Gilliard. “So the opportunity to innovate is huge, but the complexity is also huge, which is probably why most companies haven’t even jumped into it.”
It’s not just data about product inventory, purchase history and pricing that Gilliard has to consider. There are also macroeconomic factors such as inflation, the impacts of which are often unequally distributed across various geographic locations, that shape how consumers interact with products. Ideally, predictive analytics can examine the relationship between these variables to discern patterns that can be used to inform businesses about how consumers will respond to new conditions and future trends based on past experiences.
“Where I think the biggest innovation will happen is actually in the operations of the business,” Gilliard says, noting that businesses can use data to optimize all steps of the supply chain. “There’s so much within the operational part of retail that AI will be able to supplement and do a lot faster and better in the future.”
User Beware
“Data is never perfect,” says Moores. “In fact, much of the data that is indiscriminately collected is likely to be bad.”
Bad data can mean a number of things, though the term typically refers to data that is inaccurate, fraudulent or biased in some way. A model trained on flawed data can generate faulty insights that lead to misinformed decisions, a principle more succinctly captured with the maxim “garbage in, garbage out.”
“One of the fundamental tasks data scientists face is they have all this noisy data, and they need some way to find patterns out of it and develop measures from it,” says Colin Henry ’11, a post-doctoral research fellow in the Program on Extremism at George Washington University.
As a former Data Science for Social Good Fellow at Vanderbilt University who studies online extremism and hate speech, Henry primarily works with text from the internet. Specifically, he uses a massive dataset of more than 200 million individual posts from platforms such as YouTube, Facebook, Telegram, Gab, 4chan and Parler to train artificial intelligence models to detect and identify hate speech.
The primary challenge of teaching an AI model how to recognize hate speech lies in the fact that the already murky boundaries of the concept get further eroded within the vast network of online communities that populate the internet. For instance, while certain words can often be categorized as hate speech, the contexts in which they are deployed are harder to define, such as in instances when users engage in counterspeech or seek to reclaim terms that have been co-opted by hate groups.
“It’s a really nuanced process,” says Henry, one that is guided by machine learning, a subset of AI that uses algorithms to teach computers how to extract patterns from collected data to improve specific tasks. Henry’s research uses supervised learning, a form of machine learning that relies on human intervention to identify and label important data to better train machines. For instance, online comments deemed hateful are classified into specific categories, including hate based on religion, sex and gender, race, ethnicity, nationality and antisemitism. “We go through and mark these individually. We’ve had people doing this for a couple years, so we have close to 10,000 hand-coded elements. Once you have that, then you can train all kinds of models to do this classification for you.”
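A minimal sketch of that supervised-learning workflow: hand-labeled posts train a text classifier, which can then label new posts automatically. The handful of posts and labels below are invented placeholders; Henry's team works with roughly 10,000 hand-coded examples and far more capable models.

```python
# Toy supervised text classification in the spirit of the workflow described:
# humans label examples, a model learns to reproduce those labels at scale.
# The example posts and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "this group should not be allowed in our country",
    "had a great time at the community picnic today",
    "people like that are ruining everything",
    "the new library hours are really convenient",
]
labels = ["hateful", "not_hateful", "hateful", "not_hateful"]  # hand-coded by annotators

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(posts, labels)

# Once trained on thousands of hand-labeled posts, the model can
# classify new posts far faster than human annotators could.
print(classifier.predict(["everyone is welcome at the town meeting"]))
```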
The ability to detect hate speech with AI models, in concert with an analysis of social media platform terms of service agreements, has allowed Henry to derive insights about the connections between specific categories of hate language and the potential for users to be deplatformed from an online community.
One thing that becomes clear when conducting a broad survey of data science is that the discipline is still taking shape, in part because it is growing — at an exponential rate no less — at the same time that its practitioners are attempting to define the broad contours of its capabilities and limitations. However, it is precisely this indeterminacy that signals the field’s potential and guarantees that Bucknellians will continue to be at the forefront of an evolving, dynamic field.