If you were on the city streets of Brooklyn or the Bronx in the past five years, you may have inadvertently crossed paths with a data science class. You probably didn't realize it because the classes looked nothing like a traditional classroom. Teenagers from the neighborhood wandered around in small groups. They were outfitted with tablets, pens, paper, cameras and maps. They periodically took pictures on the street, entered bodegas, chatted with passersby in Spanish or English, and entered information on their tablets.
These young people were attending their regularly scheduled school classes as well as participating in a project called "Local Lotto" created by math educator Laurie Rubel, the Civic Data Design Lab at MIT, and the Center for Urban Pedagogy in New York City. Local Lotto was designed to teach place-based statistics and data analysis to high school seniors and community college students. The learning goals for this curriculum are tied to standard mathematical concepts.1 For example, teachers want to introduce ratios and probability as well as expose students to more advanced concepts like spatial data analysis, combinatorics and mathematical modeling.
The conventional way to do this is to line students up in chairs facing a teacher, talk directly about the mathematical and numerical concepts, and then give students some problem sets. For example, Harvard's Introduction to Data Science classroom, pictured below, looks a lot more like the data science class we might all be expecting. It takes place in a standard large lecture hall with auditorium seating. Students listen to lectures, and then they turn in problem sets individually in Jupyter Notebooks.
But Local Lotto had different aspirations – Laurie Rubel, the math educator who helped design it, is a leader in something called "mathematics for spatial justice". She and the other organizers wanted students to learn probability in relationship to issues in their everyday lives, in their everyday neighborhoods, in dialogue with people in their community. Specifically, they wanted young learners to come up with a data-driven argument to the question: "Is the lottery good or bad for your neighborhood?"
In New York State, like other US states that operate lotteries, the proceeds from sales of lottery tickets are directed towards state revenues. Lottery ticket purchases correlate with socioeconomic status. Low-wage workers buy more tickets than their higher-earning counterparts. The proceeds are not allocated back geographically to the districts where the sales were made. Because of this, scholars have argued for decades that the lottery system is a form of unfair, regressive taxation – essentially a "poverty tax" – whereby low-income, Black and Latinx neighborhoods are taxed more because they play more, but do not receive a proportionate share of the public benefits that follow.
The Local Lotto curriculum is organized around 10-15 standard class sessions. Learners start talking about the lottery and probability by playing chance-based games and creating large, colored charts of ratios. Then they really dive into probabilities by computing and considering the jackpot games, like Sweet Millions. The State of New York runs an advertising campaign for these games with the slogan, "Hey, you never know…" And they advertise Sweet Millions as "your best chance from the New York Lottery to win a million for just a buck." The best chance turns out to be about 1 in 4 million, and a whole class session is devoted to learners coming up with creative interpretations of 4 million that relate to their lives. For example, you would have to drink 15,444 Arizona ice teas to consume 4 million calories; or wait 61 years for 4 million M subway trains to pass you on the platform. After these sessions, learners create infographic posters that respond to the lottery's "you never know" slogan.
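The learners' interpretations of 4 million can be run backwards as a quick check on the figures quoted above. The sketch below is illustrative only: the variable names are our own, and it assumes 365-day years and trains passing at an even rate around the clock, neither of which the text specifies.

```python
# Figures quoted in the text
FOUR_MILLION = 4_000_000
ICE_TEAS = 15_444        # cans needed to reach 4 million calories
YEARS_ON_PLATFORM = 61   # years needed for 4 million M trains to pass

# Assumption (not from the text): 365-day years, trains evenly spaced day and night
calories_per_can = FOUR_MILLION / ICE_TEAS
trains_per_day = FOUR_MILLION / (YEARS_ON_PLATFORM * 365)
minutes_between_trains = 24 * 60 / trains_per_day

print(f"Implied calories per can: {calories_per_can:.0f}")
print(f"Implied trains per day: {trains_per_day:.0f}")
print(f"Implied minutes between trains: {minutes_between_trains:.1f}")
```

Under those assumptions, the quoted figures imply roughly 259 calories per can and a train about every eight minutes, which is the kind of sanity check a learner could make before putting a number on a poster.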
At this point in the module, students leave the classroom with the goal of collecting data about how people in their neighborhood experience the lottery, which is how you might have encountered them on the streets of New York. They map stores that sell lottery tickets. They record interviews with shopkeepers and ticket-buyers on their tablets and geolocate them on their maps. They take pictures of lottery advertising. Afterwards, students present their results to the classroom – including themes and quotes that emerge from different perspectives such as consumers, people who choose not to buy lottery tickets, and shopkeepers.
With this introduction to participatory mapping and qualitative data analysis, the module now turns entirely to mapping the impacts of the lottery spatially. Learners examine choropleth maps of median income in New York City, make ratio tables, and examine how the State's profits from the lottery are distributed spatially to see whether they correlate with spending or with median family income (there is no correlation). Finally, learners prepare their culminating data-driven argument – an opinion story with evidence from their mathematical analysis, their interviews and their spatial analysis. Is the lottery good for their neighborhood?
The Local Lotto approach worked – in the sense that it led to greater understanding of the mathematical concepts with students who had previously been struggling in math. For example, before one iteration of Local Lotto, only 2 of 47 learners were able to use mathematics to find the correct number of possible combinations in a state lottery game where they choose six numbers from a set of forty. Later, almost half (21 of 47) were successfully able to calculate the number of combinations. The Local Lotto approach also worked in the sense that it produced deep engagement with the subject matter. One student shared that what he learned was “something new that could help me in my local environment, in my house actually,” and that he now tries to convince his mother to spend less money on the lottery by “showing her my math book and all the work.” The female native Spanish speakers in the class, who didn't often participate in classroom discussion, became essential translators during the participatory mapping module in order to interview neighbors. And students went on to teach other teachers about the curriculum, both locally and nationally. As Rubel describes, "We brought them to Brooklyn College to present to faculty there. We had a group of New York City school kids, and each kid led a small group of faculty from lots of disciplines, showing them how to read the maps and how to interpret some of the data. That was neat." The students subsequently presented at Math for America and a national conference on math and social justice in San Francisco.
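The counting problem in that assessment can be checked in a few lines. This is a minimal sketch, assuming that the order of the six numbers does not matter and that no number can repeat. The text does not name the game, though the result is in line with the "about 1 in 4 million" odds quoted earlier for Sweet Millions.

```python
import math

NUMBERS_IN_POOL = 40
NUMBERS_CHOSEN = 6

# Combinations without repetition: n! / (k! * (n - k)!)
by_formula = math.factorial(NUMBERS_IN_POOL) // (
    math.factorial(NUMBERS_CHOSEN)
    * math.factorial(NUMBERS_IN_POOL - NUMBERS_CHOSEN)
)

# The standard library computes the same quantity directly (Python 3.8+)
combinations = math.comb(NUMBERS_IN_POOL, NUMBERS_CHOSEN)
assert by_formula == combinations

print(f"{combinations:,} possible tickets")  # 3,838,380 possible tickets

# Share of the 47 learners who solved the problem before and after the module
before_share = 2 / 47
after_share = 21 / 47
print(f"Before: {before_share:.0%}, after: {after_share:.0%}")
```

A single ticket therefore has a 1 in 3,838,380 chance of matching all six numbers.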
What's different about the Local Lotto approach to teaching data analysis and statistical concepts? And why should people outside of math education be paying very, very close attention to learning experiments like this that disrupt the status quo of data science education?
Teaching and technology have a couple of things in common. To start with, both are associated with utopian social aspirations. Horace Mann stated famously in 1848, "Education, then, beyond all other devices of human origin, is a great equalizer of the conditions of men—the balance wheel of the social machinery." Mann is credited as one of the first advocates of universal public education and was quoted as recently as the Obama administration to illustrate its commitment to poverty reduction through public education. Likewise, information technology has been the subject of utopian speculation. If only everyone had access to the new tools of information, the thinking goes, then perhaps our social ills would be fixed. We can see this in the discourse around the digital divide, or the Open Data Movement, or in specific examples like the One Laptop per Child program, where it was imagined that one way to alleviate poverty and inequality was through giving every child in the Global South a laptop.
The democratic aspirations around both teaching and technology are fascinating to explore in historical relief. While it is hard for those of us with privilege to see our own blind spots right now, it is from a historical distance that we can begin to perceive the ways in which people's attempts to "do good" with the innovations of their day ended up reproducing a status quo in which the people in power are still on top. We know in retrospect, for example, that Mann was serious about the "man" part of his quote. That is to say, education was to be an equalizer of the conditions of men, but not women. In public speeches, Mann argued that men and women could not be treated as equals in the education system because their anatomy is different, "...there is not one single organ in structure, position and function alike in man and woman, and therefore there can be no equality between the sexes." So, the radical part of Mann's social imaginary for the time was that he imagined an education system which treated all white, Anglo-Saxon, Christian men, regardless of class background, as worthy of education. But women, people of color, immigrants, disabled people and others remained excluded from the equalizing.
Likewise, consider the One Laptop per Child project (OLPC), conceived by Nicholas Negroponte, a white man based at MIT. The bold idea is to transform global education by distributing low-cost laptops to kids. At the Techonomy conference in 2010, Negroponte responded to some of his critics, "One of the things people told me about technology, particularly about laptops, in the beginning was 'Nicholas, you can't give a kid a laptop that's connected and walk away.' Well, you know what, you can. You actually can. And we have found that kids in the remotest parts of the world, when given that connected laptop, like some of the kids in these pictures, not only teach themselves how to read and write, but most importantly, and this we found in Peru first, they teach their parents how to read and write." Unfortunately, evidence from deployments of OLPC in various countries has not borne out the theory of change that laptops lead to literacy. But what is crystal clear in OLPC is the way in which "doing good" reinscribes existing power relationships. The white men at MIT are cast in the part of the magnanimous benefactors and the children in Africa, Asia, and Latin America in the role of victims that need their help (in the form of cheap computers).
So, teaching and technology have utopian social aspirations, but those are inextricably linked to the people who are doing the imagining, including their gender, class, colonial status and race position in society. Note how neither Mann nor Negroponte envisioned a mutual exchange – they did not imagine that those they were helping might have something to teach them. The utopian imagining done by those in power almost always stops short of mutual transformation or sharing power. Why?
The other thing that teaching and technology have in common is that they are both what feminist computer scientist Lynette Kvasny would call "sites of social reproduction." Feminist theories of social reproduction demonstrate how supposedly neutral places and practices are actually ways to maintain and secure an existing, unequal social order, based in patriarchy, racism and other forms of exclusion. When we frame marginalized people as in need of help from the dominant group, she writes, then "[h]umanity is stolen from historically disadvantaged people as they come to be seen as have-nots, the unemployed, and the urban poor." Although the goal of "doing good" or "helping" or "democratizing" might be well-intentioned, that first act of imagining others in a deficit position in relation to yourself repeats the cycle of domination and oppression. Access to education and technology becomes a way to socialize "those people" into a given social order, without challenging the very basis of the order's existence.
What does this mean practically in the data science classroom? Imagine teaching as a way to model the world. The world that is modeled in the Harvard data science classroom is threefold. First, elite men lead. Women make up less than a third of computer science and statistics faculty.2 In all the publicly available syllabi for CS109, no woman has ever been the lead professor for the course. Second, data science is abstract and technical. Steps like cleaning and wrangling data are depicted as solely technical conundrums and there is little to no discussion of the social context, ethics, values, or politics of data. Third, the goal of learning data science is modeled as individual mastery of technical concepts and skills. The teachers impart the technical knowledge via lecture and students complete assignments and quizzes individually.
Beginning courses such as CS109 play a key role in introducing learners to the concerns of a field (and by extension, what the field is not concerned with). Becoming socialized into the CS109 model of the world means that one sets aside any concern with the social and political, with justice and fairness, with values and motivations. As such, it is no wonder that people who have been socialized into this world order are terrible at creating inclusive visions of how technology might actually be used for collective benefit. They have never been taught a vocabulary for seeing and addressing how gaps in power, privilege and justice relate to technical and informatic fields. This model of teaching might be called "The Horace Man-Factory Model of Data Science" or "Let A Thousand One Laptops Per Child Bloom."
While we have been picking on Harvard and MIT, the issue is really not this single class at this single institution. Rather it is that the Man-Factory model of teaching data science is the dominant model across the hundreds of courses now running in high schools, institutions of higher education and on MOOCs like Coursera and Udacity.3 Data science courses turn a healthy profit, so institutions have been racing to set up online Masters' and certificate programs in order to capitalize on the demand. What gets lost in the process, and in academia's prioritization of research over teaching, are larger questions about power, ethics and values and how those are modeled in teaching practices.
Paulo Freire, the Brazilian educator who wrote the now-classic Pedagogy of the Oppressed in 1969, stated it this way, "Education either functions as an instrument which is used to facilitate integration of the younger generation into the logic of the present system and bring about conformity or it becomes the practice of freedom, the means by which men and women deal critically and creatively with reality and discover how to participate in the transformation of their world." Freire called status quo education "the banking model" in which teachers deposit knowledge into students' heads, which are conceived of as empty vessels. In contrast, the feminist pedagogy of bell hooks draws from Freire to assert that if learning is to be a practice of freedom then it must be a two-way street – a process of mutual transformation. Indeed, there is much that those in power need to unlearn if they seek to challenge the status quo. bell hooks commented that, "In my books I try to show how much my work is influenced by what students say in the classroom, what they do, what they express to me...This is one of the primary differences between education as a practice of freedom and the conservative banking system which encourages professors to believe deep down in the core of their being that they have nothing to learn from their students."
So, while there is an emerging focus on data ethics and accountability in research, we need to turn that same focus now to teaching. Teaching is particularly high stakes as a site of social reproduction right now because of the sheer number of people who are working their way through these newly spawned courses and programs in data science. Additionally, due to the rapid pace of change in database technologies, machine learning libraries, and visualization packages, even the most accomplished professionals have to be teaching and learning at all times.
For the remainder of this chapter, we take you on a tour of what it looks like to teach data science with an intersectional feminist lens. Walking back through the prior chapters in this book, we outline how the feminist principles that we discussed – like valuing multiple voices or resisting binary thinking or embracing emotion – apply to teaching. Luckily, none of us have to imagine new things from scratch because projects like Local Lotto are already disrupting the status quo and modeling more emancipatory alternatives to the Man-Factory.
How might we teach a data science that is grounded in values of equity and co-liberation?
In The Power Chapter, we detailed how data and its products (like maps and algorithms) can be used to secure power or to contest power. Our examples showed how it matters deeply who is doing the mapping – Gwendolyn Warren mapped the deaths of Black children killed by white commuters so her community could demand justice whereas the Federal Home Owners Loan Corporation mapped Black and brown residents so that they could systematically deny them bank loans. In general, those who wield their data from a position of power tend to use that technology to preserve a status quo in which they are on top. This is true even when the people in power think of themselves as being anti-racist and anti-sexist because "privilege is blind to those who have it." We named this a "privilege hazard" and argued that it should be a key consideration in the data ethics toolbox. Perhaps unsurprisingly, the people with the most privilege are also the people who argue most stridently against embedding specifically named and detailed values in technology. For example, according to Safiya Noble it took Google until 2013 to start suppressing derogatory and pornographic characterizations of Black women in autocomplete search suggestions. Here we might remind you again that their board is 82% white men.
How might we teach a data science that is grounded in values of equity and co-liberation? As data ethicist Anna Lauren Hoffmann has written, "Most important, engineers and data scientists need to listen to and engage with affected communities." Listening and engaging is the first step towards co-liberation. And the only way to work respectfully with those most affected by a problem is to develop a sophisticated understanding of structural oppression and how your own identity factors into that. While computer science offers no help in navigating these waters, emerging technical design frameworks do. For example, the equityXdesign framework that we discussed in The Numbers Don't Speak for Themselves retools IDEO's human-centered design process with an explicit focus on oppression and deliberately centers equity as a core value. Importantly, the framework was developed by three African American women who have a powerful vision for ending racism: "Racism and inequity are products of design. They can be redesigned."
Coming out of the field of interaction design, Jill Dimond & Thomas Smyth have developed a concept called "anti-oppressive design," which focuses on creating interactive systems that strive to end one or more forms of oppression. The framework consists of a series of questions and practices that teams can use in order to prioritize which work to take on, make governance decisions for the organization, and assess whether the work that was done met their core values around ending oppression. Dimond herself used the anti-oppressive design framework to create Hollaback, a platform and movement for change that supports survivors of gender-based sexual harassment on the street. At a larger scale, Una Lee, Wesley Taylor, Victoria Barnett, Ebony Dumas, Carlos (L05) Garcia, and Sasha Costanza-Chock are organizing a movement for "design justice" that advocates for design with ten explicitly stated values. The first of these is: "We use design to sustain, heal, and empower our communities, as well as to seek liberation from exploitative and oppressive systems."
How might these emerging frameworks be made manifest in teaching about data science? The simplest answer is to start making contact and building relationships with the people most affected. Let's look back to Local Lotto, where the high school students were trying to answer a question with data: Is the lottery good or bad for your neighborhood? The group could have remained in the classroom, learning about probability concepts and working with data sets about family income and lottery winners, and made their estimation. But the organizers made an explicit choice to value the voices and experiences of neighborhood residents, as data. These interactions sparked new potential lines of inquiry for both the learners and the teachers in the project. As Laurie Rubel describes it, "by talking to people on the street, multiple groups discovered that some people traveled to other parts of the city to buy their lottery tickets because they felt like more winners are going to happen in these other spots. Like, they go to the Upper East Side to buy their tickets. That's an interesting idea. Are there more winners on the Upper East Side?" This was not originally a question that either students or teachers would have asked.
The choice to venture outside the classroom reveals two important things about data science that are not typically emphasized in standard curricula: 1) Quantitative data requires context - The data may appear to say one thing, but what do the people say? What are their attitudes and beliefs about the lottery? What new lines of inquiry might be discovered by young people in the neighborhood talking to older people? And 2) Building social cohesion - Rubel and her associates repeatedly describe the importance of cultivating sociopolitical consciousness in the learners. We can think about the relationships initiated in the interviewing process as contributing to the social fabric of the neighborhood – knitting residents together through conversations about shared interests. These relationships may be deepened and cultivated for later mobilization, say in the form of a concrete demand or community action.
There have also been experiments in situated, partnered ethics in higher education classrooms. For example, in 2014, Sasha Costanza-Chock ran a semester of their Co-Design Studio course at MIT about the topic of data and surveillance. In this course, MIT undergrad and grad students work in partnership with an outside organization to co-design technologies around a critical issue. This is MIT, so Costanza-Chock could have easily partnered with the Department of Defense or Google or AT&T for the course. But instead of brand name partners, "we were interested in how to amplify the already existing work against the harmful impacts of surveillance, so we partnered with organizations that are leading that fight based on the lived experience of their communities." Costanza-Chock prioritized working with organizations led by people from the community. Partners included the Detention Watch Network, which works to challenge injustices in the U.S. immigration detention and deportation system, and Transition House, a domestic violence prevention organization that was interested in developing technologies to protect survivors from interpersonal surveillance by their partners. Students initiated relationships with and learned about the perspectives of groups whom they otherwise never would have come into contact with. They built technologies like SpideyApp – an Android-based Stingray4 detector – and graphics and media like the Surveillance Self-Defense Guide as products of these new relationships.
Racial justice educator Chris A. Miller emphasizes the importance of contact as the first step in fighting oppression. What this means is that relationship-building between people of different backgrounds is the first step in any quest to use technology for co-liberation, whether that is across differences in race, gender, technology literacy, age, profession, ability status or other. If we take that to its logical conclusion, then there can be no "data for good" and no "ethical AI" without contact, relationship-building and trust-building between systems designers and the people with the least power in the system.
How might we teach a data science that names and values the labor of all those involved?
In Show Your Work, we explored the many forms of labor involved in data work, from the individuals and groups that serve as the source of the data, to those who collect and process it, to those who analyze it and put it on display. We explained why naming all of these forms of labor, especially those we cannot see, is a feminist act. We also outlined how the current landscape of work, with its work-from-any-place, at-any-time attitude, lends itself to additional forms of immaterial labor. These perpetuate the exploitation and inequality that we already see in other aspects of twenty-first century global life. Identifying these myriad forms of invisible labor is the first step in acknowledging how the visualizations and other data analyses that we see rely upon the work of many hands.
Thinking about invisible labor can also help point to the forms of work that are harder to quantify, and therefore visualize, because they involve emotional outlays rather than physical or financial ones. The work of caring for a sick relative, as documented in the Atlas of Caregiving, for example, involves not only keeping track of medication schedules and doctor visits, but also helping to bear the burden of the illness, and projecting calm in moments of medical crisis. There is a version of care work involved in data science as well--for instance, the work of Te Whakakaokao, the Ngā Upoko Tukutuku Reo Māori Working Group, which is responsible for designing the library subject headings that enable members of New Zealand’s Māori community to locate information about their history and culture in the National Library. Like care-giving, this work is performed for the benefit of others, to enable the creation of future knowledge.
How might we teach a data science that names and values the labor of all those involved in the process? And how might we become more attuned to the invisible forms of labor, like care work, that would otherwise be overlooked? We can start by naming all of the people involved in our own projects, and in the courses about data that we teach. Where did we find the datasets that we work with? Is there information on those websites, or in those communities, about who contributed to the dataset? How was the dataset processed, and by whom? Have those people since left the project? If so, can we record their contributions and their names? Similar questions can be asked about the texts and activities that we assign in our classes. By whom were they authored? What topics do they engage? If there are activities that we “borrowed” from other courses, have we acknowledged the teachers whose work we employ?
This approach to crediting intellectual labor derives from feminist practices of citation, but it is not limited to academic contexts alone. Think of it as resistance to what Sara Ahmed calls screening techniques, a concept that describes how “certain bodies take up spaces by screening out the existence of others.” When bodies are screened out, they don’t appear--let alone have their contributions recognized by others. This is not always intentional, but it is unfortunately self-perpetuating. To borrow another example from Ahmed, it’s like sinking into a leather armchair that is comfortable because it’s molded to the shape of your body over time. You probably wouldn’t notice how the chair would be uncomfortable for those who haven’t spent time sitting in it-- those with different bodies, and with demands on their time.
Even in classrooms outfitted with the most rigid of plastic seats, we can still work to create a more comfortable intellectual space in the room. We can include more women and people of color among the scholars whose work we assign; and we can include more projects relating to women and people of color among the examples we discuss. Brian Croxall suggests that we should think about how we might “fork” our classes--a metaphor he borrows from version control software-- so that we can acknowledge the intellectual labor of the scholars and teachers that our own classes rest upon. We can also acknowledge the intellectual labor (and other forms of work) that our projects rest upon by naming the people who performed that work, and by working hard, ourselves, to ensure that any invisible labor is better accounted for.
In data science, the names of these people, and the work they perform, are not always easy to locate--if they can be located at all. But taking steps to document all of the people who worked on a particular project at the time that it is taking place can help to ensure that a record of that work remains after the project has been completed. In fact, this is among the four core principles that comprise the Collaborators’ Bill of Rights, a document developed by an interdisciplinary team of librarians, staff technologists, scholars, and postdoctoral fellows in 2011, in response to the proliferation of types of positions, at widely divergent ranks, that were being enlisted on scholarly digital projects. More recently, at UCLA, a team of eleven students and faculty members worked together to author the Student Collaborators’ Bill of Rights. Supplementing the original document with ten additional principles, the student version emphasizes the importance of empowering students to “make critical decisions about the intellectual design of a project or a portion of a project,” and of crediting them accordingly.
In distinguishing between the intellectual opportunities offered by collaborative digital projects, and the mechanical work that is also required, the Student Collaborators’ Bill of Rights draws attention to the importance of affording students the space to grow as project leads. But at Northeastern University, in Boston, literature professor Elizabeth Maddock Dillon takes another approach. She includes assignments involving complex data processing tasks in her courses, so that students can also appreciate the intellectual labor involved in tasks that would otherwise seem purely mechanical. For instance, in her Literature and Digital Diversity course, a text encoding exercise-- a required part of the process for converting unstructured text into structured data-- presents students with the lines from Shakespeare’s The Tempest spoken by Caliban, the only native inhabitant of the island where the play takes place, and asks students to think about how they might make note of any colonialist language in their markup scheme.
The Colored Conventions Project (CCP), directed by P. Gabrielle Foreman, at the University of Delaware, and run by a large team of students, staff, and faculty, seeks to address issues of labor at the level of data entry and content creation, as well as in the actual dataset that the CCP is working to create: a corpus of meeting minutes from the nineteenth-century Colored Conventions, events in which Black Americans, fugitive and free, gathered to strategize about how to achieve legal, social, economic, and educational justice. Justifiably wary of the free labor of crowd-sourcing, the CCP asks its teaching partners to sign a Memo of Understanding (MoU) before contributing to the project. The MoU makes explicit the importance of keeping track of and crediting any student contributions to the project. (In fact, there is a second MoU that students complete, which asks them to share their contact information, should they feel comfortable doing so, so that they can be named on the CCP site).
In addition, the MoU asks that teaching partners address issues of labor in the dataset itself. Because the dataset is derived from the conventions’ meeting minutes, which tended to record the official convention participants and the discussions they initiated, it does not sufficiently acknowledge the contributions of the women who were often in attendance, albeit in an unofficial capacity; or those who worked in the boarding-houses where the male delegates stayed during the conventions, enabling their participation; or those who stayed home altogether, taking care of children and housework, ensuring that their husbands and sons could attend. To address this disparity in the dataset, the MoU asks that all instructors introduce a woman involved in the conventions, such as a wife, daughter, sister, or fellow church member, alongside every male delegate who is named. (A growing body of information about these women is housed on the CCP website). As the MoU explains, “This is our shared commitment to recovering a convention movement that includes women’s activism and presence—even though it’s largely written out of the minutes themselves.”
The issue of invisible labor in data science is significant, but it can seem difficult to address only because it has gone unacknowledged for so long. Taking simple steps to keep track of the participants who contribute to a project, and crediting them in the end result, can contribute to a visible record of the work that data analyses and visualizations rest upon. That information is, after all, data. And data, as we know, can be a powerful tool for combating the inequities we encounter in the world--and in our own workplaces, labs and classrooms as well.
How might we teach a data science that honors context?
In The Numbers Don't Speak for Themselves, we outlined how data do not always represent what they appear to, particularly when it comes to data about women and marginalized groups. As Donna Haraway argues, knowledge is never absolute, but always situated in a social, cultural, historical and material context. Untangling and investigating how it is that datasets are products of those contexts can help us understand the ways in which power and privilege may be obscuring the truth. The collection environment – or data setting, as Yanni Loukissas suggests we call it – may have power imbalances, marketing hype, social stigma or incentives at cross purposes that complicate how and whether data are complete and representative. Sexual assault information on college campuses, for example, is self-reported by higher ed institutions whose bottom line is directly threatened by reporting high rates. Likewise, data published online in spreadsheets and through APIs often lack robust metadata, including the reason and purpose they are collected as well as the limitations (ethical, social and technical) of what they can and should be used for. Lacking this context for orientation, strangers in the data set run the risk of getting things entirely wrong or actually doing harm by filling in the missing information with their own biases and assumptions.
How might we teach a data science that honors context? Situating data for learners is a particular challenge, since the conventions of both spreadsheets and precise graphics make them appear objective. If data is never "raw", but always cooked through a process that involves people, institutions, politics and processes, then we must give newcomers tools and vocabulary to examine the cooking process. This has not been the norm in statistics education, which has tended to teach data sets as illustrations of theoretical methods. As Chris Wild and Maxine Pfannkuch write, "The arid, context-free landscape on which so many examples used in statistics teaching are built ensures that large numbers of students never even see, let alone engage in, statistical thinking." And even in data science learning, where there is more value placed on "real-world", "messy" examples, deep interrogation of the business purpose and political factors of a data set is not the norm. A data librarian at a Boston-based research university recently grumbled to Catherine that he has to support hundreds of students in developing a basic understanding of what data are and where they come from. For the first assignment in their introduction to data science class, the professor gives them a spreadsheet of several hundred thousand rows of Boston health inspections and tells them to "find something interesting." At a loss for how to get started, the students show up on the data librarian's doorstep.
Honoring context and situated knowledge would proceed in the exact opposite direction – newcomers would be taught that you cannot look forwards towards new insights from data until you look backwards at the data setting. One example of a learning activity that tries to do this is called "Asking Questions". Learners use WTFcsv (part of the Databasic.io suite of tools, co-developed by Catherine and Rahul Bhargava) to learn how to get started with basic spreadsheet analysis. True to its name, WTFcsv is a simple online tool designed to help people understand WTF is going on with their CSV5 file. It takes each column from a spreadsheet and characterizes patterns in the data across that column.
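To make the column-by-column idea concrete, here is a minimal sketch in Python of the kind of summary such a tool might compute. It is not WTFcsv's actual code: the function name `summarize_columns`, the choice of summary statistics, and the sample rows are all assumptions invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, invented for illustration only.
SAMPLE = """name,sex,age,survived
Ada,female,29,1
Ben,male,,0
Cal,male,41,0
Dee,female,35,1
"""

def summarize_columns(csv_text, top_n=3):
    """Return a per-column summary: inferred kind, missing count, basic patterns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    if not rows:
        return summary
    for column in rows[0].keys():
        # Treat absent or blank cells as missing values.
        values = [(row[column] or "").strip() for row in rows]
        present = [v for v in values if v != ""]
        info = {"missing": len(values) - len(present)}
        # A column counts as numeric only if every non-missing cell parses.
        try:
            numbers = [float(v) for v in present]
            is_numeric = bool(numbers)
        except ValueError:
            is_numeric = False
        if is_numeric:
            info["kind"] = "number"
            info["min"] = min(numbers)
            info["max"] = max(numbers)
        else:
            info["kind"] = "text"
            info["most_common"] = Counter(present).most_common(top_n)
        summary[column] = info
    return summary

for column, info in summarize_columns(SAMPLE).items():
    print(column, info)
```

Even a summary this small prompts questions that the file cannot answer by itself: why one age is missing, who decided that sex has two values, and what a 1 in the last column meant to the person who recorded it.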
Rather than being instructed to "find a story" immediately in their data, learners are charged with asking good questions to pursue as lines of inquiry. They break into small groups, choose one of the sample data sets to work with, examine WTFcsv's summary visualizations and brainstorm questions that they want to ask the data. Facilitators encourage learners to use the visualizations to generate many types of questions, including context questions ("What's the source of this data? Why did they collect this information? Who uses it?"), ethical questions ("Is it ok to publish people's full names? How did sex end up as a binary variable?"), quality questions ("Is this data complete? How was it acquired?"), data formatting questions ("What does the 'Parch' column mean?"), as well as data analysis questions ("Did women survive at a higher rate than men?"). After 10 minutes of brainstorming questions and sources for related data, learners are asked to select the most interesting question to share back with the group. The debrief conversation focuses on how developing rich questions usually means that you need a lot more information than what is contained in the dataset itself – you may need to do background research, consult with domain experts or seek other data sources in order to take the next step with a dataset. Encouraging many types of questions, including questions about trust in the source, missing data, and data formatting, helps learners start to connect the data back to the institutional and historical context where it was collected, emphasizing that those things also matter deeply to any meaning that comes from patterns observed in the data.
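The different question types are easier to keep apart with a small worked example. The sketch below, in Python, answers the data analysis question about survival rates while also reporting how many rows had to be set aside, which is where the quality question surfaces. The rows are hypothetical and invented for illustration, and the function name `survival_rates` is our own; this is one possible way to do the calculation, not part of the activity itself.

```python
from collections import defaultdict

# Hypothetical rows, invented for illustration; not real passenger records.
PASSENGERS = [
    {"sex": "female", "survived": "1"},
    {"sex": "female", "survived": "0"},
    {"sex": "female", "survived": "1"},
    {"sex": "male", "survived": "0"},
    {"sex": "male", "survived": "1"},
    {"sex": "male", "survived": ""},   # outcome not recorded
    {"sex": "", "survived": "1"},      # sex not recorded
]

def survival_rates(rows):
    """Return (survival rate per recorded sex, rows skipped for missing values)."""
    counts = defaultdict(lambda: [0, 0])  # sex -> [survivors, total]
    skipped = 0
    for row in rows:
        sex, outcome = row["sex"], row["survived"]
        # Rows with a blank sex or an unrecognized outcome cannot be counted.
        if sex == "" or outcome not in ("0", "1"):
            skipped += 1
            continue
        counts[sex][0] += int(outcome)
        counts[sex][1] += 1
    rates = {sex: survivors / total for sex, (survivors, total) in counts.items()}
    return rates, skipped

rates, skipped = survival_rates(PASSENGERS)
print(rates, "rows skipped:", skipped)
```

The number of skipped rows matters as much as the rates. Deciding whether those rows can safely be ignored is a question about the data setting, and the code cannot answer it.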
Beyond developing learners' muscles to connect datasets back to their context, this activity also models a process that consists of iterative question generation in dialogue with others about factors outside of the dataset. Think about this in contrast to the "find something interesting" assignment where students are individually charged with coming up with creative insights from a large-ish data set with no tools, background information, or other scaffolding. The latter models a process in which data science starts with data, domain knowledge is not necessary, all of the answers are imagined to be contained within the single dataset itself, and it is one individual's job to find them.
Other practices are being developed for learners who are at more advanced stages of their learning process. Miriam Posner requires that learners in her digital humanities class at UCLA interview domain experts about their data set before they do any analysis. The data biography, used by Heather Krause, a data scientist and educator, is another emerging context and data verification tool. Prior to beginning the analysis process, Krause asks people, particularly journalists, working with data to write a short history of a particular data set and answer five basic questions: Where did it come from? Who collected it? When? How was it collected? Why was it collected? Krause advocates using data biographies as a first step in understanding the origin story of the data set because, as she says, "you need to treat data with as much care as you would treat any source in any journalism project." In her online tutorial about data biographies, she describes how data about violence against women in Malawi appears to improve dramatically from one year to the next, but in fact these variations are due to the data collection being undertaken by two different organizations employing two different methodologies. Undertaking a data biography can reveal these inconsistencies, places in the data pipeline where disparate data were combined, and power imbalances.
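One way to picture a data biography is as a small structured record kept alongside the data set. The sketch below, in Python, is our own illustration and not Krause's template: the class `DataBiography`, its field names, and the helper `methodology_changed` are assumptions, and the two example records use placeholder values rather than details from the Malawi case.

```python
from dataclasses import dataclass, fields

@dataclass
class DataBiography:
    """One record per data set; each field answers one of the questions Krause lists."""
    where_from: str = ""
    who_collected: str = ""
    when_collected: str = ""
    how_collected: str = ""
    why_collected: str = ""

    def unanswered(self):
        """Names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

def methodology_changed(earlier, later):
    """True when two releases differ in who collected the data or how."""
    return (earlier.who_collected != later.who_collected
            or earlier.how_collected != later.how_collected)

# Placeholder biographies for two yearly releases of the same indicator.
year_one = DataBiography(
    where_from="national statistics portal",
    who_collected="Organization A",
    when_collected="year one",
    how_collected="household survey",
    why_collected="program monitoring",
)
year_two = DataBiography(
    where_from="national statistics portal",
    who_collected="Organization B",
    when_collected="year two",
    how_collected="administrative reports",
)

print(methodology_changed(year_one, year_two))
print(year_two.unanswered())
```

A check like this does not replace the written history, but it flags the two situations the paragraph above warns about: a comparison across releases collected in different ways, and a question that nobody has answered yet.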
Another emerging practice that attempts to better situate data in context is the development of data user guides. Bob Gradeck, manager of the Western Pennsylvania Regional Data Center, started writing data user guides because he got the same questions over and over again about popular data sets he was managing, like property data and 311 resident reports in Pittsburgh. Reports Gradeck, "It took us some time to learn tips and tricks... I wanted to take the stuff that was in my head and put it out there with additional context, so other data users didn't have to do it from scratch." Data user guides are simple written documents that contain a narrative portrait of a data set. They describe, among other things: the purpose and application of the data; the history, format and standards; the organizational context; other analyses and stories that have used the data set; and the limitations and ethical implications of the data set. While it is more of a commitment than a data biography and takes up more classroom time, writing a data user guide impresses upon learners just how much background context and complexity there is to uncover about even the most seemingly simple data. To date, groups in several different learning situations have been assigned to write data user guides with promising results: graduate students in Digital Media at Georgia Tech, undergraduates in a data visualization course at Emerson College, and fifty librarians participating in an online course called "Civic Data Ambassadors."6
These are promising experiments and practices, but if we aspire to honoring context in a more systemic way, it is important to pose the question, "Who is good at context?" While most data science education has tended to be situated in departments of Computer Science, nobody would think of computer scientists as having great conceptual, theoretical or practical tools for understanding the social and political environment. It's simply not what the field has been concerned with. Social scientists, humanists, ethnographers, psychologists and designers all have more robust ways of navigating and understanding context. What this points to is modeling a transdisciplinary approach to data science education where it isn't "owned" by a single discipline, but rather taught in studio-form and grounded in the topical subject matter of the data.
How might we teach a data science that is not about individual mastery but about arriving at shared meaning?
Modeling a data collection and analysis process that embraces many voices and perspectives leads us back to the chapter Unicorns, Janitors, Ninjas, Wizards and Rock Stars. While many metaphors that are used in the popular media for data scientists promote an image of a lone wizard (man) who dominates and tames unruly data to extract "intelligence" (for his corporate employer), there are powerful counter narratives embodied in feminist-led projects like the Anti-Eviction Mapping Project and GoBoston 2030. What the latter get right is that they start with the idea that traditional methods of institutional data collection and analysis are not working for everyone equally. So, they organize a process to center the perspectives of marginalized groups and value knowledge from distinct standpoints. Which is to say that process matters. A lot. And participatory processes are inherently messy and multivocal, hard to tame and dominate like rows and columns. And that's ok.
How might we teach a data science that is not about individual mastery but about arriving at shared meaning? Unfortunately, many data science courses and workshops plant individual learners in front of computer screens and walk them through technical trainings in R, D3, Excel, or Tableau. What this models to learners is a world in which data science is primarily a technical endeavor isolated from social circumstances, and success is defined by one's individual mastery of tools.7 This does wonders to perpetuate the newcomer perception of data science as highly exclusionary male wizardry.
What are some alternatives to this learning model? In Detroit, three community technology groups have banded together to produce something called Data DiscoTechs. These drop-in, peer-to-peer learning events create a space where people can discover technology together.8 Each Data DiscoTech features a variety of stations including topics like data scraping, data visualization and an introduction to social media data. The Detroit Community Technology Project emphasizes that it's important that stations are staffed by volunteers from the community, "Participants learn at their own pace and from people who understand the context of their neighborhoods and communities." Data DiscoTechs connect open government data to issues that the community cares about and that have immediate relevance for people's everyday lives. For example, the Transit Justice station taught people how to make heatmaps that combined transit and survey data in order to analyze bus routes in Detroit in terms of quality and availability, and then propose alternatives.
While there are plenty of computers and software programs in view at a Data DiscoTech, there is also a lot of engaged conversation, collaborative problem-solving and laughing going on. There are kids, teenagers, and older adults asking questions and teaching things to each other. As one of the organizers states, "A Data DiscoTech not only creates an opportunity to demystify technology and data but it also creates an opportunity to build relationships." It's worth a pause here to reflect on that last bit: an opportunity to build relationships. Most data science educators imagine that they are educating individuals and would likely not put "building relationships" as a top learning goal for their work. But perhaps they should. Because when is a data-driven project wholly conceived, executed and implemented by a single person? Like, never.
Is data literacy a set of individual skills where learners graduate to being technical ninjas or could it be a collective muscle – learned and practiced in context and in community? Could it be both? Learning experiments like Data DiscoTechs simultaneously build individual skills and build social cohesion – the relational infrastructure of the community to address its own data challenges. They model to newcomers that you don't have to be a wizard to work with data, and you also don't have to hire an expensive outside wizard who totally doesn't get your needs and concerns. Rather, you build trust and relationships with guides in your community which can be mobilized when needed for deeper collaborations in the future. Another advantage of prioritizing relationship-building relates back to the feminist concept of design from the margins that we discussed in Unicorns, Janitors, Ninjas, Wizards and Rock Stars. As Kimberly Seals Allers, women's health advocate, says, "Whatever the question, the answer is in the community." People in a community know its problems, intimately, and they know which phenomena go uncounted, underreported or neglected by institutions in power – such as the deaths of Black children killed by white commuters in Detroit mapped by Gwendolyn Warren. They also know what the harmful impacts of data are for their people. Building trust across power differentials increases the chances that people at the margins could engage authentically and help educate those in power, particularly the ones that aspire to "doing good" with data.
Building the capacity of organizations and communities to work with data is what led Rahul Bhargava to pen a blog post titled "You Don't Need a Data Scientist, You Need a Data Culture." In it, he describes the Data Culture Project, co-developed with Catherine, to scale data literacy across an organization. Many of the challenges nonprofit and community-based organizations face in making effective use of data have less to do with tech skills and more to do with organizational process and culture.
For example, many nonprofits collect lots of data but don't actually use it. Or data gets siloed in the IT department and seen as "the tech people's job". Or departments engage in separate data analysis efforts where they could be creatively pooling efforts. Or staff are alienated by KPIs9 and dashboards and don't see the relevance for their everyday work.
The answer to these challenges is not that everyone needs more spreadsheet training but rather that there needs to be a more effective participatory process around how data is used in the organization. The Data Culture Project addresses this through a free, self-service curriculum with monthly activities. For example, after his organization led its staff in the Sketch a Story activity, Michael Smith from the Telluride Foundation reported that, "One of my colleagues came back after the session to share a 'breakthrough' on how to use the tool to analyze our program content. I also overheard our Communications/Marketing team discussing how they could use it to analyze interview and social media data." In the process of learning a new concept, like quantitative text analysis, staff also build relationships across silos.
There's that building relationships thing again. Prioritizing relational infrastructure and multiple voices in the data learning process helps learners implicitly understand feminist standpoint theory, even if they don't address it in those terms. Meaning, there is never one singular possible interpretation of a set of data (that one discovers by sitting at a computer for long enough) but rather a better or worse collective process informed by data to arrive at shared meaning.
How might we teach a data science that addresses the politics of and the absences in counting and measuring?
In What Gets Counted Counts, we discussed how the data we collect, and the categories we place them into, matter deeply for the analyses that can then be performed. Once categories have been established, it can be nearly impossible to go back and look for information that has been left out. Because standard data collection categories, like gender, are often derived from existing social categories, it is absolutely essential that we question those categories and the assumptions that underlie them, before we translate them into the categories we use to collect our data. As the example of the gender binary makes clear, binary distinctions are also often secretly hierarchies, with one category on top and the other on the bottom--to say nothing of the people, like Maria Munir, who are excluded from the classification system altogether.
Counting quickly gets complicated, as Munir’s story shows, but most data science courses present datasets as if they just dropped from the sky. In these cases, the teaching environment is modeling a world in which the role of the data scientist is that of a pure technician. Somebody else asks the research questions, somebody else convinces the institution that the project is worthy, somebody else allocates the resources, somebody else designs the data entry and somebody else does the data collection. Then the data scientist enters the picture. If this is the way it works, then we argue that the "data scientist" shouldn't be elevated with the "scientist" bit at the end, but rather revert back to the less sexy "analyst" who doesn't ask hard questions and dutifully does the computing handed down by management.
What this prevents the teacher and learners from doing is having a productive discussion about what actually gets counted and measured, and what does not. It prevents them from discussing the thorny problems involved in collecting data about hard-to-measure phenomena that matter deeply for the well-being of women and people of color – like sexual harassment, domestic violence, discrimination in the healthcare system, police killings, hate crimes, indigenous land use, food deserts. The list goes on. It prevents the learners from discussing institutional ethics and responsibility – whose job is it to measure maternal mortality? Whose job is it to mobilize that data to do something? It prevents the class from discussing values and consent – when is it unethical to count and measure something? When do you walk away from the institution that hired you?
How might we teach a data science that addresses the politics of and the absences in counting and measuring? Educators can start by refusing to model "Plop Data Science" where the learner is repeatedly plopped into the middle of the process with a data set, with the implied takeaway that "someone else" will always figure out that hard stuff that comes before the data set is collected. At Georgia Tech, for example, Lauren teaches a data visualization class that relies upon data from the U.S. Census. But before examining the dataset, students learn about the history of the census, its political origins, and its change over time. She explains how, in the census, even just counting the number of people in the country is political; before emancipation, white residents of the slave-holding South wanted as many representatives as possible in Congress, but they didn’t want to acknowledge that Black southerners counted in equal measure to themselves. Hence the notorious “three-fifths compromise,” which counted each enslaved person as three-fifths of a person. This allowed slaveholder interests to dominate the US government until the Civil War.
Another way to show how counting matters is to have the students themselves learn basic methods of counting and measuring and collecting. This does not have to exhaustively cover all possible methods but can focus on tuning learners into the right questions to ask. For example, Catherine and Rahul Bhargava have a learning activity called Paper Spreadsheet designed to introduce newcomers to basic ideas of data collection as well as the limitations of what data can and cannot represent about the world. In Paper Spreadsheet, learners fill out a row of information about themselves on a large, colored piece of paper, including their name, hometown, number of siblings and color of their shirt. In the ensuing discussion, the facilitator asks questions about what the data does represent about the people in the class and what it does not represent, leading the learners towards questions about representation and ethics. For example, the column "color of your shirt" leads learners to enter in a single color, but often their shirt is multicolored or patterned. The facilitator can point out that data collection is reductive, by definition and by design, but that it's important to be attentive to when more complexity is necessary to answer the questions at hand. What do we miss out on if we only have one column about shirt color? That resolution might be fine for a basic characterization of people in the room, but completely insufficient for a fashion designer to create new shirt designs. Likewise, the facilitator can ask the learners questions about ethics, consent and privacy, such as "When would you refuse to input your data into this spreadsheet? What columns would be invasive to collect, such as your sexuality or mental health status? Does it matter who is doing the measuring?"
Case study discussions can allow learners to go further and unpack some of the ways that structural oppression shows up in counting and measuring. For example, Anna Lauren Hoffmann taught an undergraduate course called Data & Ethics at Berkeley for two years that used a case-based approach to introduce learners to ethical issues across the whole pipeline of data processing. Students read and discussed cases based on high-profile examples featured in news reports and op-eds. In one example around data collection, learners read about the launch of Apple's Health app in 2014. The app initially enabled the tracking of health data ranging from calories consumed to heart rate to blood alcohol content. A notable absence, quickly seized upon by many female tech writers, was the lack of ability to track menstrual cycles, something that women have been doing for centuries longer than people have taken chromium supplements. In Hoffmann's class, she used this case as a way to discuss the power of default choices made by designers and engineers, and whose bodies were included and excluded based on those defaults. She also invited students into the design process – "we would use white boards to draw alternative fitness tracking dashboards for different populations." Students imagined what a fitness dashboard for people with scoliosis might look like, for example, or for migrant agricultural laborers. In the case of the latter, the student designers included information about hours of sunlight, alerts for when to take water breaks and connected the laborers to ways to document workplace abuse and take political action.
Beyond discussions of case studies, learners might be introduced to participating in open source communities, or other volunteer groups, whose explicit purpose is to collect data – like OpenStreetMap or New York’s Homeless Outreach Population Estimate.
Counting is complicated, but it’s easy to forget when you’re handed a dataset and told “ready, set, go.” A data science that calls attention to the decisions made when counting, and shows how those decisions impact the questions that can then be asked, leads to more truthful or accurate answers with respect to the dataset at hand. In addition, it can help to shine a light on new questions worth asking--questions that students who are trained to think hard about the source of their sources will be well-equipped to begin to explore.
How might we teach a data science that equally values ethics, emotions and reason?
In the chapter On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints, we contrasted the Periscopic gun deaths visualization with a graphic by the Washington Post about active shooters to initiate a conversation about emotion in data science and visualization. We outlined how contemporary Western thinking about data has evolved from a flawed model of a "master stereotype" where what is perceived as rational and objective is valued more than that which is perceived as emotional and subjective. The master stereotype would say that emotions cloud judgement and distance amplifies objectivity. But a feminist perspective challenges everything about that master stereotype. Emotions don't cloud judgement – they produce deep engagement and incentive to learn. Patricia Hill Collins, for example, describes an ideal knowledge situation as one in which "neither ethics nor emotions are subordinated to reason."
So, how might we teach a data science that equally values ethics, emotions and reason? The question sounds abstract, but it might be as simple as reconsidering the subject matter of the data that you teach with.
Data analysis techniques are often discussed as though the subject matter of the data is interchangeable and neutral. Many teaching examples use so-called "classic" data sets like mtcars – a dataset about different features of cars from a 1974 edition of Motor Trend magazine. Each row of data contains measures for horsepower, miles per gallon, # of cylinders, and so on. But who cares about measuring car efficiency? Mechanics, salespeople, car companies, and people that like cars – disproportionately dudes. The authors of this book – not so much. Valuing emotion, in this case, would mean ensuring that the choice of teaching data has some cultural and emotional proximity to the teacher and to the learners – i.e. they have some reason to care about the subject matter, they have some ground truth experiential knowledge of the data, and they have some emotional or ethical investment in asking questions of that data. In the case of Local Lotto, that proximity is literal and geographic - the learners are from the neighborhood where they are collecting data about lottery usage.
In another example, Tahir Hemphill is an artist who founded the Rap Research Lab (RRL), an after school program for Black, Hispanic, immigrant, female, and transgender youth. RRL teaches data analysis using a vast database of hip hop lyrics which Hemphill assembled called the Rap Almanac. RRL learners produce sophisticated research questions, analyses and visualizations, which are informed by their deep knowledge of hip hop culture. Rather than seeing the youth's emotional connection to hip hop lyrics as something that makes them "biased", Hemphill's project uses that intimate experiential knowledge to bridge their investment in learning data analysis techniques. The youth are already experts in the experience and context of hip hop music, so they can mobilize that existing expertise to ask interesting questions of the data. Emotional and ethical proximity to the subject matter is an asset – it is a form of insider knowledge which makes for better and different questions than those asked by strangers in the data set. Indeed, Helen Kennedy and her research team at the University of Sheffield found in 2016 that the subject matter of a visualization, and its relation to the viewer's everyday life, was a key factor in people's engagement with it. People invested more time and energy in looking at visualizations whose subject matter mattered to them.
Valuing emotion and ethics as equal to reason may also mean that educators could expand the output forms that they show as examples and that they structure into assignments. Valuing emotions leads us to ask – why stop at a web visualization or a statistical model as "proof" of learning? How about sculptures, murals, quilts, VR experiences, sonifications, and other "visceralizations" of data? And valuing ethics leads us to ask - what is the most appropriate form of output given an analyst's obligation to communicate results to a given community, most especially the community who is represented in the data? If you think back to the case of the Groundwork Somerville data analysis discussed in Unicorns, Janitors, Ninjas, Wizards, and Rock Stars, the output took the form of a data mural. This large-scale analog painting, situated in a working garden, communicated the youths' analysis of food security data in an accessible, novel format and helped the sponsoring organization build awareness and solidarity around its mission in the community.
Valuing emotions and ethics may represent a shift for more technical data science educators, who may not be used to thinking about data as creative communication, nor have the social-relational muscles to think about multiple audiences and stakeholders. In these cases, educators could benefit from partnerships with media, art and design educators, whose fields are built on experimenting with these questions.
Justice is a Journey
We opened this chapter with the example of the Local Lotto project. The organizers did many innovative things in this project – they situated data analysis as an issue of cultural relevance for the learners, they built relationships between learners and neighbors, they valued the voices and experiences of the community, they gave multiple opportunities for creative communication with data – all while teaching important concepts around spatial data analysis and probability. But Local Lotto also has its failures and open questions when it comes to achieving its stated ethical goals around spatial justice. In this, it resembles every other project in the entire world, including those undertaken by you, the readers, and us, the authors.
The organizers of Local Lotto wrote a paper for the Harvard Educational Review in 2016 in which they reflected on the successes and failures of the project. While there is much evidence to point to success, including the test scores and learners' engagement with the material, they note several oversights and issues to expand on in the next version of the curriculum. First, and importantly, they noted that the collaborating teachers and course designers on the project were white and Asian whereas the youth in the classes were predominantly Latinx and Black. While the narrative of the course designers focused primarily on income inequality and did not engage race, they write that "the students consistently surfaced race." Since race and ethnicity were not part of the teaching material, the teachers felt that they did not have the experience or background to discuss it explicitly, and would deflect those conversations. The organizers are now taking steps to explicitly integrate discussions about race. They also plan to include race, ethnicity and age data in the next version of the curriculum, because "youth, and in this case youth of color, have different understandings about racial boundaries; theirs are differently nuanced and scaled than affluent, White, or adult perspectives."
Another question the course designers pose to themselves again has to do with the identity of who is in the classroom and who is leading the classroom (primarily white or Asian outsiders to the neighborhood and the community). The organizers describe "limited but recurring instances of resistance from students" to the project's central thrust around investigating income inequality in the lottery. This resistance from youth learners might be summarized like this: "You have no business coming into my world and telling my people that we are doing things wrong and that we should use your tools to do things right." While this was not the course designers' purpose, it is easy to see how a curriculum taught by outsiders, focused on income inequality, could be seen as passing judgement on people in the neighborhood and perpetuating a deficit view of low-income people. As in, "If only they knew what was good for them, they would not buy lottery tickets. We will use our tools of privilege to teach them." In reflecting on these unintended and possibly harmful results, the course designers determined that their steps for the next iteration would include revised maps and visualizations that did not paint such a simple narrative about income inequality, as well as connecting students with people in their community who are actively working on issues of income inequality. Which is to say, they will work harder next time to build relationships between the youth and their community.
So, is the Local Lotto project a feminist failure? Or an admirable achievement for equity-focused learning? It is both of these things. What is important is to understand justice as a journey and intersectional feminism as a set of tools to apply along the way. While one might master commands in R or write a badass Python scraper, there is no such mastery of feminism because you are (hopefully!) always in a state of unlearning your own privilege and encountering new social and political differences that challenge your prior worldview and make you extremely uncomfortable. Discomfort, shame and failure are par for the course. Python scrapers might be frustrating but will never make you confront your own sexism. What becomes important in working towards justice is what Donna Haraway calls "staying with the trouble" – having the courage to keep going when the work is difficult and fuzzy and you and your people and your institutions are a major part of the problem. One of the biggest strengths of Local Lotto is the courage of the project creators to publicly and reflexively interrogate themselves and their process, to detail their failures as well as their commitments to doing better on the next iteration. Justice is a journey and the most important part of it is that you stay with the trouble (and hopefully cause some trouble, yourself).